Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed
to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.
However, this approach won't work once component digests are stored in scylla_metadata,
as replacing a component like Statistics will require atomically updating both the component
and scylla_metadata with the new digest, which is impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten (and scylla metadata, if update_sstable_id is true)
- Copies the original sstable components pointer and the recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates a new sstable_id and rewrites it
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
If any step of the process fails, the new sstable is deleted automatically on node startup, because its
temporary TOC is still present.
This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata,
this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.
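For reference, the earlier temporary-file-and-rename approach mentioned at the top of this message can be sketched roughly as follows. This is only an illustration under stated assumptions: rewrite_via_temp_file, read_file and the ".tmp" suffix are hypothetical names, not ScyllaDB code, and a real implementation would also flush the data to disk before renaming.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: replace the contents of final_path by writing a
// sibling temporary file first and then renaming it over the final name.
inline void rewrite_via_temp_file(const fs::path& final_path, const std::string& contents) {
    fs::path tmp = final_path;
    tmp += ".tmp";  // hypothetical temporary-name convention
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << contents;
    }  // the stream is closed here, before the rename
    // A single rename replaces exactly one file; it cannot update two
    // files (component + metadata) as one atomic step.
    fs::rename(tmp, final_path);
}

// Hypothetical helper used only to inspect the result.
inline std::string read_file(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}
```

After a crash, a leftover ".tmp" file can simply be removed, which is why the old scheme was easy to recover from but does not extend to a two-file update.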
C++20 introduced two new attributes--likely and unlikely--that
function as a built-in replacement for __builtin_expect implemented
in various compilers. Since they make code easier to read and are
an integral part of the language, there's no reason not to use them
instead.
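As an illustration of the kind of replacement involved, the two functions below express the same branch hint in both styles. The function names are made up for this sketch, and the first variant assumes a GCC- or Clang-compatible compiler.

```cpp
#include <cassert>

// Older style: compiler-specific builtin (GCC/Clang).
inline int checked_div_builtin(int a, int b) {
    if (__builtin_expect(b == 0, 0)) {
        return 0;  // rare path: division by zero is reported as 0 here
    }
    return a / b;
}

// C++20 style: standard attribute on the rarely taken branch.
inline int checked_div_attr(int a, int b) {
    if (b == 0) [[unlikely]] {
        return 0;
    }
    return a / b;
}
```

Both variants compute the same result; only the way the hint is spelled differs.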
Closes scylladb/scylladb#24786
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:
WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db
This warning contains the information about the size of the partition,
but it does not contain the number of rows written. This can be
confusing: when the warning is triggered by the row count exceeding its
threshold while the partition size is below its own threshold, the
message still shows only the partition size in bytes, leading the user
to believe that the partition size triggered the warning, when in
reality it was the row count. See #20125
This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.
The warning for large partitions and row counts will now look like this:
WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db
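A minimal sketch of how such a size description could be composed is shown below. The helper names and the exact field layout are assumptions inferred from the sample log line above; this is not the actual try_record() code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: builds the text placed inside the parentheses.
inline std::string partition_size_desc(uint64_t bytes, uint64_t rows) {
    return std::to_string(bytes) + " bytes/" + std::to_string(rows) + " rows";
}

// Hypothetical helper: assembles the message body after "large_data - ".
// `subject` stands for the "test/test" part of the sample line.
inline std::string large_partition_warning(const std::string& subject,
                                           const std::string& size_desc,
                                           const std::string& sstable_name) {
    return "Writing large partition " + subject + ": (" + size_desc + ") to " + sstable_name;
}
```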
Closes scylladb/scylladb#22010
To allow safe plug and unplug of the system_keyspace.
This patch follows up on 917fdb9e53
(more specifically, f9b57df471).
Just keeping a shared_ptr<system_keyspace> doesn't prevent
the system_keyspace shards from being stopped, whereas the `pluggable`
interface allows outstanding async calls to be drained safely
on shutdown, before the system_keyspace is stopped.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
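The idea of an assertion that does not depend on NDEBUG can be sketched as below. ALWAYS_ASSERT_SKETCH is a hypothetical name used only for illustration; the real SCYLLA_ASSERT() in utils/assert.hh may be implemented differently.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical macro: unlike assert(), it is not compiled out when NDEBUG
// is defined, so the condition is always evaluated and always checked.
#define ALWAYS_ASSERT_SKETCH(cond)                                            \
    do {                                                                      \
        if (!(cond)) {                                                        \
            std::fprintf(stderr, "%s:%d: assertion failed: %s\n",             \
                         __FILE__, __LINE__, #cond);                          \
            std::abort();                                                     \
        }                                                                     \
    } while (0)

// Returns how many times the asserted expression was evaluated; the side
// effect would be lost with a plain assert() in an NDEBUG build.
inline int evaluations_of_condition() {
    int n = 0;
    ALWAYS_ASSERT_SKETCH(++n == 1);
    return n;
}
```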
[1] 66ef711d68
Closes scylladb/scylladb#20006
Previously, writing into system.large_partitions was done by calling
record_large_partition(). In order to write different data based on
the cluster feature flag, another level of indirection was added by
calling _record_large_partitions which is initialized to a lambda
which calls internal_record_large_partitions(). This function does
not record the values of the two new columns (dead_rows and
range_tombstones). After the cluster feature flag becomes true,
_record_large_partitions is set to a lambda which calls
internal_record_large_partitions_all_data() which records the values
of the two new columns.
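The indirection described above can be sketched as a std::function member that is swapped when the feature becomes enabled. This is a simplified stand-in: the class name and on_feature_enabled() are hypothetical, and the written row is modelled as a plain list of column names rather than a real query.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified model of the handler described above.
class large_partition_recorder_sketch {
    using record_fn = std::function<std::vector<std::string>()>;
    record_fn _record_large_partitions;

    // Writes only the columns known to older nodes.
    static std::vector<std::string> internal_record_large_partitions() {
        return {"partition_size", "rows"};
    }
    // Also writes the two new columns.
    static std::vector<std::string> internal_record_large_partitions_all_data() {
        return {"partition_size", "rows", "dead_rows", "range_tombstones"};
    }

public:
    large_partition_recorder_sketch()
        : _record_large_partitions([] { return internal_record_large_partitions(); }) {}

    // Hypothetical hook, called once the cluster feature becomes true.
    void on_feature_enabled() {
        _record_large_partitions = [] { return internal_record_large_partitions_all_data(); };
    }

    // Public entry point: always goes through the current indirection.
    std::vector<std::string> record_large_partition() { return _record_large_partitions(); }
};
```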
When issuing warnings about partitions with the number of rows above a configured threshold,
the large partitions handler does not take into consideration the number of range tombstone
markers in the total rows count. This fix adds the number of range tombstone markers to the
total number of rows and saves this total in system.large_partitions.rows (if it is above
the threshold). It also adds a new column range_tombstones to the system.large_partitions
table which only contains the number of range tombstone markers for the given partition.
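The counting rule described above amounts to the following arithmetic. The struct and function names are hypothetical and serve only to illustrate the fix.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-partition counters.
struct partition_counts_sketch {
    uint64_t rows;              // rows written for the partition
    uint64_t range_tombstones;  // range tombstone markers written
};

// Total that is compared against the threshold and stored in the rows column.
inline uint64_t total_rows(const partition_counts_sketch& c) {
    return c.rows + c.range_tombstones;
}

// True when the total is above the configured rows-count threshold.
inline bool exceeds_rows_threshold(const partition_counts_sketch& c, uint64_t threshold) {
    return total_rows(c) > threshold;
}
```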
This PR fixes the first part of issue #13968
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes #12858
The large_data_handler's way of updating the system keyspace differs from other code.
Instead of using a dedicated helper on the system_keyspace's side, it executes
the insertion query directly with the help of qctx.
Now that the large_data_handler has the weak system keyspace reference, it can
execute queries on _it_ rather than on the qctx.
Just like in the previous patch, it needs to keep the system_keyspace weak
reference alive until the query's future resolves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execute local requests via
query_processor. The latter needs the former via compaction manager and
large data handler: database depends on both, and these two need to
insert their entries into system keyspace.
To cut this loop, the compaction manager and large data handler both get
a weak reference on the system keyspace. Once system keyspace starts, it
activates this reference via the database call. When system keyspace is
shut down on stop, it deactivates the reference.
Technically the weak reference is implemented by marking the system_keyspace
object as async_sharded_service, and the "reference" in question is the
shared_from_this() pointer. When compaction manager or large data
handler need to update a system keyspace's table, they both hold an
extra reference on the system keyspace until the entry is committed,
thus making sure that system_keyspace doesn't stop from under their feet. At
the same time, unplugging the reference on shutdown makes sure that no
new entry updates will appear and the system_keyspace will eventually be
released.
It's not a C++ classical reference, because system_keyspace starts after
and stops before database.
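A synchronous, much simplified sketch of the plug/unplug idea is given below. It uses plain std::shared_ptr instead of Seastar's sharded services, and all type and method names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for system_keyspace.
struct system_keyspace_sketch {
    std::vector<std::string> entries;
};

// Hypothetical stand-in for the compaction manager / large data handler.
class entry_writer_sketch {
    std::shared_ptr<system_keyspace_sketch> _sys_ks;  // empty until plugged

public:
    void plug(std::shared_ptr<system_keyspace_sketch> ks) { _sys_ks = std::move(ks); }
    void unplug() { _sys_ks.reset(); }

    // Returns false when no system keyspace is plugged (before start or
    // after shutdown); otherwise records the entry.
    bool record(const std::string& entry) {
        auto ks = _sys_ks;  // extra reference held for the duration of the update
        if (!ks) {
            return false;
        }
        ks->entries.push_back(entry);
        return true;
    }
};
```

Once unplug() has been called, no new update can obtain a reference, so the last outstanding copy going away releases the object.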
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
make the various large data thresholds live-updateable
and construct the observers and updaters in
cql_table_large_data_handler to dynamically update
the base large_data_handler class threshold members.
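The observer pattern referred to above can be illustrated with the following minimal sketch. It does not use ScyllaDB's actual config or observer types; updateable_value_sketch and handler_sketch are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical minimal "live-updateable" config value.
class updateable_value_sketch {
    uint64_t _value;
    std::vector<std::function<void(uint64_t)>> _observers;

public:
    explicit updateable_value_sketch(uint64_t v) : _value(v) {}
    uint64_t get() const { return _value; }
    void observe(std::function<void(uint64_t)> cb) { _observers.push_back(std::move(cb)); }
    void set(uint64_t v) {
        _value = v;
        for (auto& cb : _observers) {
            cb(v);  // notify every registered observer of the new value
        }
    }
};

// Hypothetical handler that caches the threshold and keeps it current.
// The observer captures `this`, so the handler must outlive the config
// value (or be unregistered) and must not be copied or moved.
class handler_sketch {
    uint64_t _partition_threshold_bytes;

public:
    explicit handler_sketch(updateable_value_sketch& cfg)
        : _partition_threshold_bytes(cfg.get()) {
        cfg.observe([this](uint64_t v) { _partition_threshold_bytes = v; });
    }
    bool is_large_partition(uint64_t bytes) const { return bytes > _partition_threshold_bytes; }
};
```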
Fixes #11685
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the large_collection_detection cluster feature is enabled,
select the internal_record_large_cells_and_collections method
to record the large collection cell, storing also the collection_elements
column.
We want to do that only when the cluster feature is enabled
to facilitate rollback in case rolling upgrade is aborted,
otherwise system.large_cells won't be backward compatible
and will have to be deleted manually.
Delete the sstable from system.large_cells if it contains
elements_in_collection above threshold.
Closes #11449
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For recording collection_elements of large_collections when
the large_collection_detection feature is enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Detect large_collections when the number of collection_elements
is above the configured threshold.
Next step would be to record the number of collection_elements
in the system.large_cells table, when the respective
cluster feature is enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
Add "rows" field to system.large_partitions. Add partitions to the
table when they are too large or have too many rows.
Fixes #9506
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #9577
Since the actual deletion of the large data entries
is done in the background, and we don't capture the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink and it will be released
as soon as maybe_delete_large_data_entries returns.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace schema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
This is not just a direct flip to a variable with the negated Boolean
value. When created, a large_data_handler is not considered to be
running, the user has to call start() before it can be used.
The advantage of doing this is that if initialization fails and a
database is destructed before the large_data_handler is started, the
assert
    database::stop() {
        assert(!_large_data_handler->running());
is not triggered.
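The state change can be sketched as follows. The class is a hypothetical reduction of large_data_handler to just the running flag, and safe_to_stop_database() merely mirrors the assertion quoted above.

```cpp
#include <cassert>

// Hypothetical reduction of the handler to its running state.
class large_data_handler_sketch {
    bool _running = false;  // a freshly created handler is not running

public:
    void start() { _running = true; }
    void stop() { _running = false; }
    bool running() const { return _running; }
};

// Mirrors the condition asserted in database::stop().
inline bool safe_to_stop_database(const large_data_handler_sketch& h) {
    return !h.running();
}
```

A handler that was never started therefore satisfies the assertion without an explicit stop().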
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they took the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
This patch adds a warning for situations where the row count of a
partition grows beyond a configured threshold. Through the
warning, users can be made aware of possible data modeling problems.
The threshold is initially set to 100,000 rows.
Tests: unit (dev)
Message-Id: <20190528075612.GA24671@shenzou.localdomain>
The path leading to the issue was:
The sstable name is allocated and passed to maybe_delete_large_data_entries by reference
    auto name = sst->get_filename();
    return large_data_handler.maybe_delete_large_data_entries(*sst->get_schema(), name, sst->data_size());
A future is created with a reference to it
    large_partitions = with_sem([&s, &filename, this] {
        return delete_large_data_entries(s, filename, db::system_keyspace::LARGE_PARTITIONS);
    });
The semaphore blocks.
The filename is destroyed.
delete_large_data_entries is called with a destroyed filename.
The reason this did not reproduce trivially in a debug build was that
the sstable itself was in the stack and the destructed value was read
as an internal value, and so asan had nothing to complain about.
Unfortunately we also had no tests that the entry in
system.large_rows was actually deleted.
This patch passes the name by value. It might create up to 3 copies of
it. If that is too inefficient it can probably be avoided with a
do_with in maybe_delete_large_data_entries.
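The by-value fix can be illustrated with a small model in which a plain queue of deferred callables stands in for the blocked semaphore. All names here are hypothetical and no Seastar futures are involved.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deferred-work queue standing in for "the semaphore blocks".
struct deferred_queue_sketch {
    std::vector<std::function<std::string()>> pending;

    // Runs every queued task and returns their results in order.
    std::vector<std::string> run_all() {
        std::vector<std::string> results;
        for (auto& task : pending) {
            results.push_back(task());
        }
        pending.clear();
        return results;
    }
};

// The filename is taken and captured by value, so the queued task owns its
// own copy and remains valid after the caller's string is destroyed.
inline void schedule_delete(deferred_queue_sketch& q, std::string filename) {
    q.pending.push_back([filename = std::move(filename)] {
        return "delete entries for " + filename;
    });
}
```

Capturing `&filename` instead would leave the task with a dangling reference once the caller returns, which is the failure sequence listed above.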
Fixes #4335
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this change the futures returned by large_data_handler will not
normally wait for entries to be written to system.large_rows or
system.large_partitions.
We use a semaphore to bound how far behind system.large_* table updates
can get.
This should avoid delaying sstable writes in the common case, which
is more relevant once we warn of large cells since the default
threshold will be just 1MB.
Note that there is no ordering between the various maybe_record_* and
maybe_delete_large_data_entries requests. This means that we can end
up with a stale entry that is only removed once the TTL expires.
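The bounding behaviour can be modelled, in a single-threaded and heavily simplified way, as a backlog with a fixed limit. The class below is a hypothetical illustration, not the semaphore-based code from the patch, and it assumes a limit of at least one.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical single-threaded model of bounded background updates.
class bounded_background_sketch {
    std::size_t _max_pending;  // assumed to be >= 1
    std::deque<std::function<void()>> _pending;

public:
    explicit bounded_background_sketch(std::size_t max_pending) : _max_pending(max_pending) {}

    // Queues an update without waiting for it. Only when the backlog is
    // full does the caller wait, modelled here by running the oldest
    // update first. Returns true if the caller had to wait.
    bool submit(std::function<void()> update) {
        bool waited = false;
        if (_pending.size() >= _max_pending) {
            run_one();
            waited = true;
        }
        _pending.push_back(std::move(update));
        return waited;
    }

    // Runs the oldest queued update, if any.
    void run_one() {
        if (_pending.empty()) {
            return;
        }
        auto f = std::move(_pending.front());
        _pending.pop_front();
        f();
    }

    // Runs everything that is still queued.
    void drain() {
        while (!_pending.empty()) {
            run_one();
        }
    }

    std::size_t pending() const { return _pending.size(); }
};
```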
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These will use a member semaphore variable in a followup patch, so they
cannot be const.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This should have been changed in the patch
db: stop the commit log after the tables during shutdown
But unfortunately I missed it then.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We had almost identical error handling for large_partitions and
large_rows. Refactor in preparation for large_cells.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This renames it to record_large_partitions, which matches
record_large_rows. It also changes the signature to be closer to
record_large_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The code for deleting entries from system.large_partitions was almost
a duplicate from the code for deleting entries from system.large_rows.
This patch unifies the two, which also improves the error message when
we fail to delete entries from system.large_partitions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This allows for system.large_partitions to be updated if a large
partition is found while writing the last sstables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It now records large rows when they are first written to an sstable
and removes them when the sstable is deleted.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The implementations of large_data_handler should only be called if
large_data_handler hasn't been stopped yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These functions will record into tables in a followup patch, so they
will need to return a future.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>