scylladb

Author	SHA1	Message	Date
Taras Veretilnyk	51c345aaf6	sstables: add new rewrite component mechanism for safe sstable component rewriting Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup. However, this approach won't work once component digests are stored in scylla_metadata, as replacing a component like Statistics will require atomically updating both the component and scylla_metadata with the new digest—impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it. - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to temporary toc persistence. This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.	2026-02-26 22:38:55 +01:00
Dawid Mędrek	a151944fa6	treewide: Replace __builtin_expect with (un)likely C++20 introduced two new attributes--likely and unlikely--that function as a built-in replacement for __builtin_expect implemented in various compilers. Since it makes code easier to read and it's an integral part of the language, there's no reason to not use it instead. Closes scylladb/scylladb#24786	2025-07-03 13:34:04 +03:00
Ferenc Szili	96267960f8	logging: Add row count to large partition warning message When writing large partitions, that is: partitions with size or row count above a configurable threshold, ScyllaDB outputs a warning to the log: WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db This warning contains the information about the size of the partition, but it does not contain the number of rows written. This can lead to confusion because in cases where the warning was written because of the row count being larger than the threshold, but the partition size is below the threshold, the warning will only contain the partition size in bytes, leading the user to believe the warning was output because of the partition size, when in reality it was the row count that triggered the warning. See #20125 This change adds a size_desc argument to cql_table_large_data_handler::try_record(), which will contain the description of the size of the object written. This method is used to output warnings for large partitions, row counts, row sizes and cell sizes. This change does not modify the warning message for row and cell sizes, only for partition size and row count. The warning for large partitions and row counts will now look like this: WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db Closes scylladb/scylladb#22010	2025-06-26 12:25:38 +02:00
Benny Halevy	fba88bdd62	database, compaction_manager, large_data_handler: use pluggable<system_keyspace> To allow safe plug and unplug of the system_keyspace. This patch follows up on `917fdb9e53` (more specifically, `f9b57df471`): just keeping a shared_ptr<system_keyspace> doesn't prevent stopping the system_keyspace shards, while using the `pluggable` interface allows safe draining of outstanding async calls on shutdown, before stopping the system_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:27:23 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Ferenc Szili	90634b419c	sstable: added cluster feature for dead rows and range tombstones Previously, writing into system.large_partitions was done by calling record_large_partition(). In order to write different data based on the cluster feature flag, another level of indirection was added by calling _record_large_partitions, which is initialized to a lambda that calls internal_record_large_partitions(). This function does not record the values of the two new columns (dead_rows and range_tombstones). After the cluster feature flag becomes true, _record_large_partitions is set to a lambda that calls internal_record_large_partitions_all_data(), which records the values of the two new columns.	2024-05-02 11:49:46 +02:00
Ferenc Szili	b06af5b2b9	sstable: write dead_rows count to system.large_partitions	2024-05-02 11:49:10 +02:00
Ferenc Szili	98bec4e02a	sstable: large data handler needs to count range tombstones as rows When issuing warnings about partitions with the number of rows above a configured threshold, the large partitions handler does not take into consideration the number of range tombstone markers in the total rows count. This fix adds the number of range tombstone markers to the total number of rows and saves this total in system.large_partitions.rows (if it is above the threshold). It also adds a new column range_tombstones to the system.large_partitions table which only contains the number of range tombstone markers for the given partition. This PR fixes the first part of issue #13968 It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.	2024-04-22 15:24:18 +02:00
Avi Kivity	69a385fd9d	Introduce schema/ module Schema related files are moved there. This excludes schema files that also interact with mutations, because the mutation module depends on the schema. Those files will have to go into a separate module. Closes #12858	2023-02-15 11:01:50 +02:00
Pavel Emelyanov	b1f4273f0d	large_data_handler: Use local system_keyspace to update entries The l._d._h.'s way to update system keyspace is not like in other code. Instead of a dedicated helper on the system_keyspace's side it executes the insertion query directly with the help of qctx. Now when the l._d._h. has the weak system keyspace reference it can execute queries on _it_ rather than on the qctx. Just like in previous patch, it needs to keep the sys._k.s. weak reference alive until the query's future resolves. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Pavel Emelyanov	f9b57df471	database: Plug/unplug system_keyspace There's a circular dependency between system_keyspace and database. The former needs the latter because it needs to execute local requests via query_processor. The latter needs the former via compaction manager and large data handler; database depends on both, and these too need to insert their entries into system keyspace. To cut this loop the compaction manager and large data handler both get a weak reference on the system keyspace. Once system keyspace starts it activates this reference via the database call. When system keyspace is shut down on stop, it deactivates the reference. Technically the weak reference is implemented by marking the system_k.s. object as async_sharded_service, and the "reference" in question is the shared_from_this() pointer. When compaction manager or large data handler need to update a system keyspace's table, they both hold an extra reference on the system keyspace until the entry is committed, thus making sure that sys._k.s. doesn't stop from under their feet. At the same time, unplugging the reference on shutdown makes sure that no new entry updates will appear and the system_k.s. will eventually be released. It's not a classical C++ reference, because system_keyspace starts after and stops before database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Benny Halevy	2c4ff71d2b	db: large_data_handler: dynamically update config thresholds make the various large data thresholds live-updateable and construct the observers and updaters in cql_table_large_data_handler to dynamically update the base large_data_handler class threshold members. Fixes #11685 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:53:40 +03:00
Benny Halevy	46ebffcc93	db/large_data_handler: cql_table_large_data_handler: record large_collections When the large_collection_detection cluster feature is enabled, select the internal_record_large_cells_and_collections method to record the large collection cell, storing also the collection_elements column. We want to do that only when the cluster feature is enabled to facilitate rollback in case rolling upgrade is aborted, otherwise system.large_cells won't be backward compatible and will have to be deleted manually. Delete the sstable from system.large_cells if it contains elements_in_collection above threshold. Closes #11449 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:10 +03:00
Benny Halevy	3f8bba202f	db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler For recording collection_elements of large_collections when the large_collection_detection feature is enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:10 +03:00
Benny Halevy	dc4e7d8e01	db/large_data_handler: cql_table_large_data_handler: move ctor out of line Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:09 +03:00
Benny Halevy	6dadca2648	db/large_data_handler: maybe_record_large_cells: consider collection_elements Detect large_collections when the number of collection_elements is above the configured threshold. Next step would be to record the number of collection_elements in the system.large_cells table, when the respective cluster feature is enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:05 +03:00
Benny Halevy	a107f583fd	db/large_data_handler: get the collection_elements_count_threshold Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:11 +03:00
Benny Halevy	244df07771	large_data_handler: use only basename to identify the sstable SSTables may be created in one directory (e.g. staging) and be removed from another directory (base table dir, or quarantine if scrub moved them there), so identify the sstable by its unique component basename rather than the full path. Fixes #10075 Test: unit(dev) DTest: wide_rows_test.py (w/ https://github.com/scylladb/scylla-dtest/pull/2606) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214131923.1468870-1-bhalevy@scylladb.com>	2022-02-14 17:57:49 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Michael Livshin	a7511cf600	system keyspace: record partitions with too many rows Add "rows" field to system.large_partitions. Add partitions to the table when they are too large or have too many rows. Fixes #9506 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9577	2021-11-14 14:25:18 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Benny Halevy	8cebe7776f	large_data_handler: maybe_delete_large_data_entries: accept shared_sstable Since the actual deletion of the large data entries is done in the background, and we don't capture the shared_sstable, we can safely pass it to maybe_delete_large_data_entries when deleting the sstable in sstable::unlink and it will be released as soon as maybe_delete_large_data_entries returns. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	f7d0ae3d10	large_data_handler: maybe_delete_large_data_entries: move out of line It is called on the cold path, when the sstable is deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	8ab053bd44	large_data_handler: expose methods to get threshold To be used for keeping large_data statistics in sstable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Benny Halevy	dd7422a713	large_data_handler: indicate recording of large data entries Return true from the maybe_{record,log}_* methods if a large data record or log entry were emitted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Benny Halevy	873107821b	large_data_handler: move constructor out of line No need for it to be inlined. Also, add debug logging to the large data handler options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Pavel Emelyanov	4fa12f2fb8	header: De-bloat schema.hh The header sits in many other headers, but there's a handy schema_fwd.hh that's tiny and contains needed declarations for other headers. So replace schema.hh with schema_fwd.hh in most of the headers (and remove completely from some). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200303102050.18462-1-xemul@scylladb.com>	2020-03-03 11:34:00 +01:00
Rafael Ávila de Espíndola	5d4671526c	db: Replace large_data_handler::_stopped with _running This is not just a direct flip to a variable with the negated Boolean value. When created, a large_data_handler is not considered to be running, the user has to call start() before it can be used. The advantage of doing this is that if initialization fails and a database is destructed before the large_data_handler is started, the assert database::stop() { assert(!_large_data_handler->running()); is not triggered. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-02-04 21:15:44 -08:00
Rafael Ávila de Espíndola	33dfe34f78	db: Move nop_large_data_handler constructor out-of-line Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-02-04 21:12:01 -08:00
Rafael Ávila de Espíndola	e99a225f25	db: Move large_data_handler::stop out-of-line Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-02-04 21:11:49 -08:00
Pavel Solodovnikov	2f442f28af	treewide: add const qualifiers throughout the code base	2019-11-26 02:24:49 +03:00
Botond Dénes	fddd9a88dd	treewide: silence discarded future warnings for legit discards This patch silences those future discard warnings where it is clear that discarding the future was actually the intent of the original author, and they took the necessary precautions (handling errors). The patch also adds some trivial error handling (logging the error) in some places that were lacking it but otherwise look ok. No functional changes.	2019-08-26 18:54:44 +03:00
Juliana Oliveira	fd83f61556	Add a warning for partitions with too many rows This patch adds a warning option to the user for situations where rows count may get bigger than initially designed. Through the warning, users can be aware of possible data modeling problems. The threshold is initially set to '100,000'. Tests: unit (dev) Message-Id: <20190528075612.GA24671@shenzou.localdomain>	2019-06-06 19:48:57 +03:00
Rafael Ávila de Espíndola	c8da28a3eb	Allow large_data_handler to be stopped twice This will be used in a testcase. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-21 10:47:23 -07:00
Rafael Ávila de Espíndola	e7749e7aee	large_data_handler: Fix a use after destruction The path leading to the issue was: The sstable name is allocated and passed to maybe_delete_large_data_entries by reference auto name = sst->get_filename(); return large_data_handler.maybe_delete_large_data_entries(*sst->get_schema(), name, sst->data_size()); A future is created with a reference to it large_partitions = with_sem([&s, &filename, this] { return delete_large_data_entries(s, filename, db::system_keyspace::LARGE_PARTITIONS); }); The semaphore blocks. The filename is destroyed. delete_large_data_entries is called with a destroyed filename. The reason this did not reproduce trivially in a debug build was that the sstable itself was in the stack and the destructed value was read as an internal value, and so asan had nothing to complain about. Unfortunately we also had no tests that the entry in system.large_rows was actually deleted. This patch passes the name by value. It might create up to 3 copies of it. If that is too inefficient it can probably be avoided with a do_with in maybe_delete_large_data_entries. Fixes #4335 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-20 09:30:42 -07:00
Rafael Ávila de Espíndola	63251b66c1	db: Record large cells Fixes #4234. Large cells are now recorded in system.large_cells. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:04 -07:00
Rafael Ávila de Espíndola	8b4ae95168	large_data_handler: Run large data recording in parallel With this change the futures returned by large_data_handler will not normally wait for entries to be written to system.large_rows or system.large_partitions. We use a semaphore to bound how far behind system.large_* table updates can get. This should avoid delaying sstable writes in the common case, which is more relevant once we warn of large cells since the default threshold will be just 1MB. Note that there is no ordering between the various maybe_record_* and maybe_delete_large_data_entries requests. This means that we can end up with a stale entry that is only removed once the TTL expires. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:04 -07:00
Rafael Ávila de Espíndola	54b856e5e4	large_data_handler: propagate a future out of stop() stop() will close a semaphore in a followup patch, so it needs to return a future. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:04 -07:00
Rafael Ávila de Espíndola	989ab33507	large_data_handler: Remove const from a few functions These will use a member semaphore variable in a followup patch, so they cannot be const. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:04 -07:00
Rafael Ávila de Espíndola	5fcb3ff2d7	db: don't use _stopped directly This gives flexibility in how it is implemented. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:04 -07:00
Rafael Ávila de Espíndola	a17a936882	large_data_handler: assert it is not used after stop() This should have been changed in the patch db: stop the commit log after the tables during shutdown But unfortunately I missed it then. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:04 -07:00
Rafael Ávila de Espíndola	f3089bf3d1	db: refactor a try_record helper We had almost identical error handling for large_partitions and large_rows. Refactor in preparation for large_cells. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:19:02 -07:00
Rafael Ávila de Espíndola	d7f263d334	db: Rename (maybe_)?update_large_partitions This renames it to record_large_partitions, which matches record_large_rows. It also changes the signature to be closer to record_large_rows. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:16:04 -07:00
Rafael Ávila de Espíndola	f254664fe6	db: refactor large data deletion code The code for deleting entries from system.large_partitions was almost a duplicate from the code for deleting entries from system.large_rows. This patch unifies the two, which also improves the error message when we fail to delete entries from system.large_partitions. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-12 13:16:04 -07:00
Rafael Ávila de Espíndola	16ed9a2574	db: stop the commit log after the tables during shutdown This allows for system.large_partitions to be updated if a large partition is found while writing the last sstables. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-03-05 18:04:51 -08:00
Rafael Ávila de Espíndola	25f81cf3e3	Populate system.large_rows. It now records large rows when they are first written to an sstable and removes them when the sstable is deleted. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-02-26 15:56:42 -08:00
Rafael Ávila de Espíndola	b7fd03d0fd	Don't call record_large_rows if stopped The large_data_handler implementations should only be called if large_data_handler hasn't been stopped yet. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-02-26 15:46:21 -08:00
Rafael Ávila de Espíndola	0c401f56f8	Add a delete_large_rows_entries method to large_data_handler This will be responsible for removing large rows from system.large_rows. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-02-26 15:46:21 -08:00
Rafael Ávila de Espíndola	81a21ea425	db::large_data_handler::(maybe_)?record_large_rows: Return future<> instead of void These functions will record into tables in a followup patch, so they will need to return a future. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-02-26 15:46:21 -08:00
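
Commit e7749e7aee above describes a use after destruction: a deferred task captured the sstable filename by reference, the semaphore blocked, and the filename was destroyed before the task ran. The fix was to pass the name by value. The following is a minimal sketch of that lifetime pattern in plain standard C++, not the Seastar code from the patch. The names `deferred_task`, `make_delete_task` and `enqueue_for` are hypothetical, and a `std::function` queue stands in for work waiting on the semaphore.

```cpp
// Sketch only: plain C++ standing in for the Seastar code; all names are hypothetical.
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for work that is queued (e.g. behind a semaphore) and runs later.
using deferred_task = std::function<std::string()>;

// Fixed pattern: the filename is taken by value and moved into the closure,
// so the task owns its own copy for as long as it stays queued.
deferred_task make_delete_task(std::string filename) {
    return [filename = std::move(filename)] {
        return "delete entries for " + filename;
    };
}

// Queues a task while the caller's string is alive, then returns the queue
// after that string has already been destroyed.
std::vector<deferred_task> enqueue_for(const std::string& basename) {
    assert(!basename.empty()); // precondition of this sketch
    std::vector<deferred_task> queue;
    {
        std::string name = basename + "-Data.db"; // destroyed at scope exit
        queue.push_back(make_delete_task(name));  // copies name into the task
        // The buggy variant would capture by reference, [&name] { ... },
        // and dangle once this scope ends and the task runs later.
    }
    return queue;
}
```

Running a task returned by `enqueue_for` after the local string is gone is safe here because each closure holds its own copy. As the commit message notes, passing by value costs extra copies; moving the string into the closure, as above, is one way to keep that cost down.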


56 Commits