scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-30 05:07:05 +00:00

Author	SHA1	Message	Date
Avi Kivity	0900a88884	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. There is already a `_conns_cpu_concurrency_semaphore` in `generic_server` that limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start (cherry picked from commit `c762425ea7`)	2025-09-07 13:38:33 +03:00
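The offloading idea in the merge above can be illustrated with a minimal Python sketch. It is an analogy only, not the Scylla implementation: an asyncio event loop stands in for the reactor, a single-worker `ThreadPoolExecutor` stands in for the shared alien thread, and `hashlib.pbkdf2_hmac` stands in for `crypt_r`. The names `hash_with_salt` and `check_password` as written here, the thread name prefix, and the iteration count are all assumptions made for illustration.

```python
import asyncio
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

# One shared, named worker thread (stand-in for the alien thread).
_hash_worker = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pw-hash")


def hash_with_salt(password: str, salt: bytes) -> bytes:
    """CPU-heavy hash; PBKDF2-SHA512 is only a stand-in for crypt_r."""
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt, 50_000)


async def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Verify a password without running the hash on the event loop thread."""
    loop = asyncio.get_running_loop()
    # The event loop stays free to serve other tasks while the worker hashes.
    digest = await loop.run_in_executor(_hash_worker, hash_with_salt, password, salt)
    # Constant-time comparison of the two digests.
    return hmac.compare_digest(digest, expected)
```

With a single worker, hash requests queue up behind each other, which mirrors the contention trade-off the commit message discusses.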
Taras Veretilnyk	ddb7c8ea12	keys: from_nodetool_style_string don't split single partition keys Users with single-column partition keys that contain colon characters were unable to use certain REST APIs and 'nodetool' commands, because the API split key by colon regardless of the partition key schema. Affected commands: - 'nodetool getendpoints' - 'nodetool getsstables' Affected endpoints: - '/column_family/sstables/by_key' - '/storage_service/natural_endpoints' Refs: #16596 - This does not fully fix the issue, as users with compound keys will face the issue if any column of the partition key contains a colon character. Closes scylladb/scylladb#24829 Closes scylladb/scylladb#25565	2025-09-01 15:36:56 +03:00
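A minimal Python sketch of the parsing rule described in the commit above, assuming a hypothetical helper `split_nodetool_key` that is told how many partition key columns the schema has. The real function is C++ and its signature is not shown in this log; the component-count check is also an assumption.

```python
def split_nodetool_key(key: str, partition_key_columns: int) -> list[str]:
    """Split a nodetool-style key string into per-column components.

    Hypothetical helper: a single-column partition key is returned whole,
    so colons inside the value are preserved. Compound keys are still split
    on every colon, which is the remaining limitation tracked by #16596.
    """
    if partition_key_columns < 1:
        raise ValueError("a partition key has at least one column")
    if partition_key_columns == 1:
        return [key]
    parts = key.split(":")
    if len(parts) != partition_key_columns:
        # Assumed behaviour for illustration: reject a mismatched count.
        raise ValueError(
            f"expected {partition_key_columns} components, got {len(parts)}"
        )
    return parts
```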
Calle Wilund	fe87af4674	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719 (cherry picked from commit `cc9eb321a1`) Closes scylladb/scylladb#25756	2025-08-30 18:50:47 +03:00
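The "use temporaries" approach from the commit above can be sketched in Python. This is an illustration under assumptions, not the commitlog code: `Segment`, `SegmentManager` and `release_log` are hypothetical names, and the re-entrant call is simulated by having `release()` call back into the manager.

```python
class Segment:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.unused = False
        self.released = False

    def release(self):
        # Guard against the double free described in the commit message.
        assert not self.released, "segment released twice"
        self.released = True
        # Releasing a segment may re-enter the manager.
        self.manager.discard_unused_segments()


class SegmentManager:
    def __init__(self):
        self.segments = []
        self.release_log = []

    def discard_unused_segments(self):
        # Step 1: update the container only. No callbacks run here, so the
        # list is never modified while it is being iterated.
        doomed = [s for s in self.segments if s.unused]
        self.segments = [s for s in self.segments if not s.unused]
        # Step 2: release from the temporary. A re-entrant call sees a
        # consistent list that no longer contains the doomed segments.
        for segment in doomed:
            self.release_log.append(segment.name)
            segment.release()
```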
Ernest Zaslavsky	8a017834a0	s3_client: add memory fallback in `chunked_download_source` Introduce fallback logic in `chunked_download_source` to handle memory exhaustion. When memory is low, feed the `deque` with only one uncounted buffer at a time. This allows slow but steady progress without getting stuck on the memory semaphore. Fixes: https://github.com/scylladb/scylladb/issues/25453 Fixes: https://github.com/scylladb/scylladb/issues/25262 Closes scylladb/scylladb#25452 (cherry picked from commit `dd51e50f60`) Closes scylladb/scylladb#25511	2025-08-15 13:30:38 +03:00
Ernest Zaslavsky	c70ba8384e	s3_client: make memory semaphore acquisition abortable Add `abort_source` to the `get_units` call for the memory semaphore in the S3 client, allowing the acquisition process to be aborted. Fixes: https://github.com/scylladb/scylladb/issues/25454 Closes scylladb/scylladb#25469 (cherry picked from commit `380c73ca03`) Closes scylladb/scylladb#25499	2025-08-14 10:34:28 +02:00
Nikos Dragazis	26174a9c67	test: kmip: Fix segfault from premature destruction of port_promise `kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP server for a particular Boost test case. The function runs the server as an external process and uses a thread to parse the port from the server's logs. The thread communicates the port to the main thread via a promise. The current implementation has a bug where the thread may set a value to the promise after its destruction, causing a segfault. This happens when the server does not start within 20 seconds, in which case the port future throws and the stack unwinding machinery destroys the port promise before the thread that writes to it. Fix the bug by declaring the promise before the cleanup action. The bug has been encountered in CI runs on slow machines, where the PyKMIP server takes too long to create its internal tables (due to slow fdatasync calls from SQLite). This patch does not improve CI stability - it only ensures that the error condition is properly reflected in the test output. This patch is not a backport. The same bug has been fixed in master as part of a larger rewrite of the `kmip_test_helper()` (see `722e2bce96`). Refs #24747, #24842. Fixes #24574. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#25030	2025-08-06 11:59:40 +03:00
Ernest Zaslavsky	ccdc98c8f0	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading. (cherry picked from commit `e4ebe6a309`)	2025-08-06 00:50:36 +00:00
Ernest Zaslavsky	bd79ae3826	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred. (cherry picked from commit `68855c90ca`)	2025-08-06 00:50:36 +00:00
Avi Kivity	a8193bd503	Merge '[Backport 2025.3] transport: remove throwing protocol_exception on connection start' from Dario Mirovic `protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future. This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future. There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` with returning it. Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance. In order to see if there is a measurable difference, a modified version of the `test_protocol_version_mismatch` Python test is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. The first test run has been executed on the current code, with throwing protocol exceptions. The second test run has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test. 
Testing Build: `release` Test file: `test/cqlpy/test_protocol_exceptions.py` Test name: `test_protocol_version_mismatch` (modified for mass connection requests) Test arguments: ``` max_attempts=100'000 num_parallel=10 ``` Throwing `protocol_exception` results: ``` real=1:26.97 user=10:00.27 sys=2:34.55 cpu=867% real=1:26.95 user=9:57.10 sys=2:32.50 cpu=862% real=1:26.93 user=9:56.54 sys=2:35.59 cpu=865% real=1:26.96 user=9:54.95 sys=2:32.33 cpu=859% real=1:26.96 user=9:53.39 sys=2:33.58 cpu=859% real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862% # average ``` Returning `protocol_exception` as `result_with_exception` or an exceptional future: ``` real=1:18.46 user=9:12.21 sys=2:19.08 cpu=881% real=1:18.44 user=9:04.03 sys=2:17.91 cpu=869% real=1:18.47 user=9:12.94 sys=2:19.68 cpu=882% real=1:18.49 user=9:13.60 sys=2:19.88 cpu=883% real=1:18.48 user=9:11.76 sys=2:17.32 cpu=878% real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879% # average ``` This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567. Refs: #24567 Fixes: #25271 This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting. 
* (cherry picked from commit `7aaeed012e`) * (cherry picked from commit `30d424e0d3`) * (cherry picked from commit `9f4344a435`) * (cherry picked from commit `5390f92afc`) * (cherry picked from commit `4a6f71df68`) Parent PR: #24738 Closes scylladb/scylladb#25117 * github.com:scylladb/scylladb: test/cqlpy: add cpp exception metric test conditions transport/server: replace protocol_exception throws with returns utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception transport/server: avoid exception-throw overhead in handle_error test/cqlpy: add protocol_exception tests	2025-08-05 14:16:14 +03:00
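The gains can be recomputed from the averaged timings quoted in the merge message above. The Python sketch below only parses the `m:ss.xx` stamps from the two `# average` rows and computes the relative reduction; the helper names are made up for illustration.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'm:ss.xx' stamp, as printed in the report, to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def reduction_percent(before: str, after: str) -> float:
    """Relative time reduction, in percent, going from `before` to `after`."""
    b, a = to_seconds(before), to_seconds(after)
    return (b - a) / b * 100


# (throwing, returning) averages copied from the report above.
AVERAGES = {
    "real": ("1:26.95", "1:18.47"),
    "user": ("9:56.85", "9:10.91"),
    "sys": ("2:34.11", "2:18.77"),
}

GAINS = {name: reduction_percent(*pair) for name, pair in AVERAGES.items()}

if __name__ == "__main__":
    for name, gain in GAINS.items():
        print(f"{name}: {gain:.1f}% less time")
```

On these averages the reductions come out at roughly 9.8% (real), 7.7% (user) and 10.0% (sys).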
Nikos Dragazis	257ebbeca9	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995 (cherry picked from commit `2656fca504`) Closes scylladb/scylladb#25300	2025-08-02 17:12:05 +03:00
Piotr Dulikowski	ba70b39486	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 (cherry picked from commit `2bb800c004`)	2025-07-31 15:13:57 +00:00
Dario Mirovic	1078a1f03a	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567 Fixes: #25271 (cherry picked from commit `9f4344a435`)	2025-07-30 21:35:15 +02:00
Pavel Emelyanov	7c04619ecf	Merge '[Backport 2025.3] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot] Fixes #24574 * Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert * Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free. - (cherry picked from commit `ee98f5d361`) - (cherry picked from commit `8d37e5e24b`) Parent PR: #24633 Closes scylladb/scylladb#24772 * github.com:scylladb/scylladb: encryption_at_rest_test: Add exception handler to ensure proxy stop encryption: Ensure stopping timers in provider cache objects	2025-07-24 16:35:53 +03:00
Pavel Emelyanov	53637fdf61	Merge '[Backport 2025.3] storage: add `make_data_or_index_source` to the storages' from Scylladb[bot] Add `make_data_or_index_source` to the storages to utilize the new S3-based data source, which should improve restore performance * Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. * Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage`, which just creates a `data_source` from a file, and for `s3_storage`, which creates a (maybe) decrypting source from s3 make_download_source. This change should improve performance when reading large objects from S3 and should not affect anything for the `filesystem_storage` Fixes: https://github.com/scylladb/scylladb/issues/22458 - (cherry picked from commit `211daeaa40`) - (cherry picked from commit `7e5e3c5569`) - (cherry picked from commit `0de61f56a2`) - (cherry picked from commit `8ac2978239`) - (cherry picked from commit `dff9a229a7`) - (cherry picked from commit `8d49bb8af2`) Parent PR: #23695 Closes scylladb/scylladb#25016 * github.com:scylladb/scylladb: sstables: Start using `make_data_or_index_source` in `sstable` sstables: refactor readers and sources to use coroutines sstables: coroutinize futurized readers sstables: add `make_data_or_index_source` to the `storage` encryption: refactor key retrieval encryption: add `encrypted_data_source` class	2025-07-21 18:05:53 +03:00
Dawid Mędrek	10a9ced4d1	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 (cherry picked from commit `20d0050f4e`)	2025-07-21 11:43:49 +00:00
Ernest Zaslavsky	549d139e84	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`. (cherry picked from commit `8d49bb8af2`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	243ba1fb66	encryption: add `encrypted_data_source` class Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. NOTE: The wrapped source MUST read from offset 0; `encrypted_data_source` assumes it is. Co-authored-by: Calle Wilund <calle@scylladb.com> (cherry picked from commit `211daeaa40`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	873c8503cd	s3_test: Add s3_client test for non-retryable error handling Introduce a test that injects a non-retryable error and verifies that the chunked download source throws an exception as expected. (cherry picked from commit `acf15eba8e`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	dbf4bd162e	s3_test: Add trace logging for default_retry_strategy Introduce trace-level logging for `default_retry_strategy` in `s3_test` to improve visibility into retry logic during test execution. (cherry picked from commit `a5246bbe53`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	54db6ca088	s3_client: Stop retries in chunked download source Disable retries for S3 requests in the chunked download source to prevent duplicate chunks from corrupting the buffer queue. The response handler now throws an exception to bypass the retry strategy, allowing the next range to be attempted cleanly. This exception is only triggered for retryable errors; unretryable ones immediately halt further requests. (cherry picked from commit `d2d69cbc8c`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	c748a97170	s3_client: Add test for Content-Range fix Introduce a test that accurately verifies the Content-Range behavior, ensuring the previous fix is properly validated. (cherry picked from commit `ec59fcd5e4`)	2025-07-13 13:17:14 +00:00
Benny Halevy	f78a352a29	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. Since it's a reference-counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2c0bafb934`)	2025-07-07 09:38:17 +03:00
Benny Halevy	b647dbd547	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2b2cfaba6e`)	2025-07-07 09:34:10 +03:00
Benny Halevy	0e7d3b4eb9	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-07 09:30:01 +03:00
Calle Wilund	46e3794bde	encryption_at_rest_test: Add exception handler to ensure proxy stop If boost test is run such that we somehow except even in a test macro such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy used, causing a use after free. (cherry picked from commit `8d37e5e24b`)	2025-07-02 10:13:08 +00:00
Botond Dénes	ee6d7c6ad9	test/boost/memtable_test: only inject error for test table Currently the test indiscriminately injects failures into the flushes of any table, via the IO extension mechanism. The tests want to check that the node correctly handles the IO error by self-isolating, however the indiscriminate IO errors can have unintended consequences when they hit raft, leading to disorderly shutdown and failure of the tests. Testing raft's resiliency to IO errors is of course worth doing, but it is not the goal of this particular test, so to avoid the fallout, the IO errors are limited to the test tables only. Fixes: https://github.com/scylladb/scylladb/issues/24637 Closes scylladb/scylladb#24638	2025-06-30 10:08:49 +03:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions. 
Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Avi Kivity	48d9f3d2e3	Merge 'mutation: check key of inserted rows' from Botond Dénes Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. Fixes: https://github.com/scylladb/scylladb/issues/24506 Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters. Closes scylladb/scylladb#24497 * github.com:scylladb/scylladb: mutation: check key of inserted rows compound: optimize is_full() for single-component types	2025-06-29 18:10:17 +03:00
Lakshmi Narayanan Sreethar	279253ffd0	utils/big_decimal: fix scale overflow when parsing values with large exponents The exponent of a big decimal string is parsed as an int32, adjusted for the removed fractional part, and stored as an int32. When parsing values like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32 limit, and since the scale is stored as an int32, it overflows and wraps around, losing the value. This patch fixes that by parsing the exponent as an int64 value and then adjusting it for the fractional part. The adjusted scale is then checked to see if it is still within int32 limits before storing. An exception is thrown if it is not within the int32 limits. Note that strings with exponents that exceed the int32 range, like `0.01E2147483650`, were previously not parseable as a big decimal. They are now accepted if the final adjusted scale fits within int32 limits. For the above value, unscaled_value = 1 and scale = -2147483648, so it is now accepted. This is in line with how Java's `BigDecimal` parses strings. Fixes: #24581 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24640	2025-06-26 15:29:28 +03:00
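The parsing rule in the commit above can be sketched in Python. This is a simplified illustration, not the `utils/big_decimal` code: Python's unbounded integers stand in for the int64 intermediate, and `parse_big_decimal` and its error message are hypothetical. It reproduces both examples from the commit message.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def parse_big_decimal(text: str) -> tuple[int, int]:
    """Return (unscaled_value, scale) such that value = unscaled * 10**-scale."""
    mantissa, _, exp_text = text.upper().partition("E")
    # Wide intermediate: the exponent itself may be outside the int32 range.
    exponent = int(exp_text) if exp_text else 0
    sign = -1 if mantissa.startswith("-") else 1
    int_part, _, frac_part = mantissa.lstrip("+-").partition(".")
    unscaled = sign * int((int_part + frac_part) or "0")
    # Adjust for the removed fractional part first, then range-check.
    scale = len(frac_part) - exponent
    if not INT32_MIN <= scale <= INT32_MAX:
        raise ValueError(f"scale {scale} does not fit in int32")
    return unscaled, scale
```

The key point is the order of operations: the adjustment happens in the wide type, and only the final scale is checked against the int32 limits.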
Avi Kivity	947906e6fd	Merge 'Make uuid sstable generations mandatory' from Benny Halevy Before we can eradicate the numerical sstable generations, this series completes https://github.com/scylladb/scylladb/issues/20337 by disabling the use of numerical sstable generations where we can and making sure the feature is never disabled. Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables. Refs #24248 * Enhancement. No backport required Closes scylladb/scylladb#24554 * github.com:scylladb/scylladb: feature_service: never disable UUID_SSTABLE_IDENTIFIERS test: sstable_move_test: always use uuid sstable generation test: sstable_directory_test: always use uuid sstable generation sstables: sstable_generation_generator: set last_generation=0 by default test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation test: lib: test_env: always use uuid sstable generation test: sstable_test: always use uuid sstable generation test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation test: sstable_compaction_test: always use uuid sstable generation	2025-06-26 12:25:38 +02:00
Pavel Emelyanov	0f5b358c47	test: Use test sched groups, not database ones Some tests want to switch between sched groups. For that there's cql-test-env facility to create and use them. However, there's a test that uses replica::database as sched groups provider, which is not nice. Fix it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24615	2025-06-26 12:25:38 +02:00
Botond Dénes	edc2906892	test/boost/sstable_datafile_test: add test for corrupt data * create a table with random schema * generate data: random mutations + one row with bad key * write data to sstable * check that only good data is written to sstable * check that the bad data was saved to system.corrupt_data	2025-06-25 08:41:29 +03:00
Botond Dénes	ab96c703ff	mutation: check key of inserted rows Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to work with the changes: it has to make the schema use compact storage, otherwise the non-full changes used by this test are rejected by the new checks. Fixes: https://github.com/scylladb/scylladb/issues/24506	2025-06-23 09:38:45 +03:00
Nadav Har'El	85c19d21bb	Merge 'cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Karol Nowacki cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) ---------- = 192 bytes (Maximum allowed name length) This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). 
This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly. Fixes #4480 Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release. Closes scylladb/scylladb#24500 * github.com:scylladb/scylladb: cql, schema: Extend name length limit from 48 to 192 bytes replica: Remove unused keyspace::init_storage()	2025-06-22 17:41:10 +03:00
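The length budget in the merge message above can be written out as a short Python calculation. The constant and function names are illustrative, not identifiers from the Scylla source; the numbers and the `_scylla_cdc_log` suffix come from the commit message.

```python
FILESYSTEM_NAME_MAX = 255        # common limit for one path component, in bytes
UUID_CHARS = 32                  # 32-character UUID string in the directory name
SEPARATOR = 1                    # the '-' between table name and UUID
CDC_LOG_SUFFIX = "_scylla_cdc_log"
RESERVED = 15                    # kept free for future extensions

MAX_NAME_LENGTH = (
    FILESYSTEM_NAME_MAX - UUID_CHARS - SEPARATOR - len(CDC_LOG_SUFFIX) - RESERVED
)


def longest_directory_name(table_name_bytes: int) -> int:
    """Length of the CDC log table directory for a table name of this size."""
    return table_name_bytes + len(CDC_LOG_SUFFIX) + SEPARATOR + UUID_CHARS


def validate_name(name: str) -> None:
    """Reject names whose UTF-8 encoding exceeds the budget (illustrative check)."""
    size = len(name.encode("utf-8"))
    if size > MAX_NAME_LENGTH:
        raise ValueError(f"name is {size} bytes, limit is {MAX_NAME_LENGTH}")
```

A name at the 192-byte limit yields a 240-byte CDC log directory name, which leaves the 15 reserved bytes unused below the 255-byte filesystem limit.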
Avi Kivity	770b91447b	Merge 'memtable: ensure _flushed_memory doesn't grow above total_memory' from Michał Chojnowski `dirty_memory_manager` tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. `dirty_memory_manager` controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should not be greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in `~flush_memory_accounter` which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the `mutation_cleaner` later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in `mutation_cleaner` intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. Fixes scylladb/scylladb#21413 Backport candidate because it fixes a crash that can happen in existing stable branches. Closes scylladb/scylladb#21638 * github.com:scylladb/scylladb: memtable: ensure _flushed_memory doesn't grow above total memory usage replica/memtable: move region_listener handlers from dirty_memory_manager to memtable	2025-06-22 11:19:25 +03:00
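The accounting rule described in this merge can be sketched in isolation. This is a minimal model, not ScyllaDB's code: the struct `flush_accounting` and all of its member names are hypothetical, and only the arithmetic (reader estimate minus the positive drop in `total_memory`, clamped at zero for the "spooled" amount) is taken from the message above.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of the accounting described in the commit message.
// None of these names come from the ScyllaDB source tree.
struct flush_accounting {
    int64_t total_at_flush_start;  // total_memory when the flush began
    int64_t total_memory;          // current total_memory of the memtable region
    int64_t reader_estimate = 0;   // bytes the flush reader believes it has flushed

    // Reader's estimate, decreased by the drop in total_memory since the
    // beginning of the flush (only if that drop is positive).
    int64_t flushed_memory() const {
        const int64_t decrease = std::max<int64_t>(0, total_at_flush_start - total_memory);
        return reader_estimate - decrease;
    }

    // Amount reported as "spooled": never negative.
    int64_t spooled() const {
        return std::max<int64_t>(0, flushed_memory());
    }

    // "Unspooled" memory: total usage minus what is reported as spooled.
    int64_t unspooled() const {
        return total_memory - spooled();
    }
};
```

With `total_at_flush_start = 100` and `reader_estimate = 60`, a later drop of `total_memory` to 30 makes `flushed_memory()` negative, yet `spooled()` stays at 0 and `unspooled()` stays non-negative, which is the property the assertion is meant to protect.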
Michał Chojnowski	7d551f99be	replica/memtable: move region_listener handlers from dirty_memory_manager to memtable The memtable wants to listen for changes in its `total_memory` in order to decrease its `_flushed_memory` in case some of the freed memory has already been accounted as flushed. (This can happen because the flush reader sees and accounts even outdated MVCC versions, which can be deleted and freed during the flush). Today, the memtable doesn't listen to those changes directly. Instead, some calls which can affect `total_memory` (in particular, the mutation cleaner) manually check the value of `total_memory` before and after they run, and they pass the difference to the memtable. But that's not good enough, because `total_memory` can also change outside of those manually-checked calls -- for example, during LSA compaction, which can occur anytime. This makes memtable's accounting inaccurate and can lead to unexpected states. But we already have an interface for listening to `total_memory` changes actively, and `dirty_memory_manager`, which also needs to know it, does just that. So what happens e.g. when `mutation_cleaner` runs is that `mutation_cleaner` checks the value of `total_memory` before it runs, then it runs, causing several changes to `total_memory` which are picked up by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of `total_memory` and passes the difference to `memtable`, which corrects whatever was observed by `dirty_memory_manager`. To allow memtable to modify its `_flushed_memory` correctly, we need to make `memtable` itself a `region_listener`. Also, instead of the situation where `dirty_memory_manager` receives `total_memory` change notifications from `logalloc` directly, and `memtable` fixes the manager's state later, we want only the memtable to listen for the notifications and pass them, already adjusted accordingly, to the manager, so that there are no intermediate wrong states.
This patch moves the `region_listener` callbacks from the `dirty_memory_manager` to the `memtable`. It's not intended to be a functional change, just a source code refactoring. The next patch will be a functional change enabled by this.	2025-06-20 11:42:30 +02:00
Łukasz Paszkowski	a9a53d9178	compaction_manager: cancel submission timer on drain The `drain` method cancels all running compactions and moves the compaction manager into the disabled state. To move it back to the enabled state, the `enable` method shall be called. This, however, throws an assertion error, as the submission timer is not cancelled and re-enabling the manager tries to arm the already armed timer. Thus, cancel the timer when calling the `drain` method to disable the compaction manager. Fixes https://github.com/scylladb/scylladb/issues/24504 All versions are affected. So it's a good candidate for a backport. Closes scylladb/scylladb#24505	2025-06-20 11:33:49 +03:00
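The failure mode and the fix can be illustrated with a toy model. Everything here is an assumption made for illustration: `toy_timer` and `toy_compaction_manager` are hypothetical types, and a thrown `std::logic_error` stands in for the assertion failure mentioned in the message.

```cpp
#include <stdexcept>

// Hypothetical stand-in for a timer that refuses to be armed twice.
struct toy_timer {
    bool armed = false;

    void arm() {
        if (armed) {
            // Stand-in for the assertion failure described in the commit message.
            throw std::logic_error("timer is already armed");
        }
        armed = true;
    }

    void cancel() { armed = false; }
};

// Hypothetical manager showing why drain() has to cancel the timer.
struct toy_compaction_manager {
    toy_timer submission_timer;
    bool enabled = false;

    void enable() {
        submission_timer.arm();  // fails if a previous drain() left the timer armed
        enabled = true;
    }

    void drain() {
        enabled = false;
        submission_timer.cancel();  // the fix: leave the timer disarmed
    }
};
```

In this model, removing the `cancel()` call from `drain()` makes a second `enable()` throw, which mirrors the enable, drain, enable sequence from the linked issue.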
Karol Nowacki	4577c66a04	cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-byte limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally, we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) = 192 bytes (maximum allowed name length). This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly.	2025-06-18 14:08:38 +02:00
Benny Halevy	ecc7272a07	test: sstable_move_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	49ca442e7c	test: sstable_directory_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	15bee9f232	sstables: sstable_generation_generator: set last_generation=0 by default Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	079c5fe5e3	test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	0310a03de6	test: sstable_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	b00b805da6	test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config Which is `true` by default anyhow. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	f644c5896f	test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	bfa0bb78f9	test: sstable_compaction_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Michał Chojnowski	27f66fb110	test/boost/mutation_reader_test: fix a use-after-free in `test_fast_forwarding_combined_reader_is_consistent_with_slicing` The contract in mutation_reader.hh says: ``` // pr needs to be valid until the reader is destroyed or fast_forward_to() // is called again. future<> fast_forward_to(const dht::partition_range& pr) { ``` `test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates this by passing a temporary to `fast_forward_to`. Fix that. Fixes scylladb/scylladb#24542 Closes scylladb/scylladb#24543	2025-06-17 19:30:50 +03:00
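The lifetime contract quoted in this commit can be shown with a toy reader that, as an assumption for illustration, keeps a pointer to the range it was given. `toy_partition_range`, `toy_reader` and `forward_and_check` are hypothetical names and do not reflect the real `mutation_reader` implementation.

```cpp
// Hypothetical half-open range [start, end) standing in for dht::partition_range.
struct toy_partition_range {
    int start;
    int end;
};

// Hypothetical reader that stores a pointer to the range instead of copying it,
// which is why the caller must keep the range alive.
class toy_reader {
    const toy_partition_range* _pr = nullptr;
public:
    // pr needs to stay valid until the reader is destroyed or
    // fast_forward_to() is called again.
    void fast_forward_to(const toy_partition_range& pr) { _pr = &pr; }

    bool contains(int key) const {
        return _pr && key >= _pr->start && key < _pr->end;
    }
};

bool forward_and_check(int key) {
    toy_reader reader;
    // Wrong: reader.fast_forward_to(toy_partition_range{0, 10});
    // The temporary is destroyed at the end of that statement, so the
    // reader would be left holding a dangling pointer.
    toy_partition_range pr{0, 10};  // named object that outlives every use of the reader
    reader.fast_forward_to(pr);
    return reader.contains(key);
}
```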
Pavel Emelyanov	b0766d1e73	Merge 's3_client: Refactor `range` class for state validation' from Ernest Zaslavsky Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow and invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333 No backport needed since this problem is related only to this change https://github.com/scylladb/scylladb/pull/23880 Closes scylladb/scylladb#24312 * github.com:scylladb/scylladb: s3_client: headers cleanup s3_client: Refactor `range` class for state validation	2025-06-17 10:34:55 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Ernest Zaslavsky	9ad7a456fe	s3_client: Refactor `range` class for state validation Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow and invalid states, and ensures the object size does not exceed the 5TiB limit in S3.	2025-06-16 16:02:24 +03:00
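A validate-on-every-modification range might look roughly like the sketch below. This is not the `range` class from `s3_client`: the class name `checked_range`, its interface, and the choice of exceptions are all assumptions; only the 5 TiB object size limit and the goal of rejecting overflow and invalid states come from the message.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// S3 maximum object size: 5 TiB.
constexpr uint64_t s3_max_object_size = 5ULL << 40;

// Hypothetical range whose state is validated on construction and on every change.
class checked_range {
    uint64_t _offset = 0;
    uint64_t _length = 0;

    static void validate(uint64_t offset, uint64_t length) {
        if (length > s3_max_object_size) {
            throw std::invalid_argument("length exceeds the 5 TiB S3 object limit");
        }
        if (offset > std::numeric_limits<uint64_t>::max() - length) {
            throw std::overflow_error("offset + length overflows");
        }
        if (offset + length > s3_max_object_size) {
            throw std::invalid_argument("range ends beyond the 5 TiB S3 object limit");
        }
    }

public:
    checked_range(uint64_t offset, uint64_t length) {
        validate(offset, length);
        _offset = offset;
        _length = length;
    }

    // Setters validate first, so a rejected change leaves the object untouched.
    void set_offset(uint64_t offset) {
        validate(offset, _length);
        _offset = offset;
    }

    void set_length(uint64_t length) {
        validate(_offset, length);
        _length = length;
    }

    uint64_t offset() const { return _offset; }
    uint64_t length() const { return _length; }
};
```

Validating before assigning means a failed modification cannot leave the object in a half-updated state.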

3982 Commits