Commit Graph

3982 Commits

Author SHA1 Message Date
Avi Kivity
0900a88884 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start

(cherry picked from commit c762425ea7)
2025-09-07 13:38:33 +03:00
Taras Veretilnyk
ddb7c8ea12 keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.

Closes scylladb/scylladb#24829

Closes scylladb/scylladb#25565
2025-09-01 15:36:56 +03:00
Calle Wilund
fe87af4674 commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).

Closes scylladb/scylladb#25719

(cherry picked from commit cc9eb321a1)

Closes scylladb/scylladb#25756
2025-08-30 18:50:47 +03:00
Ernest Zaslavsky
8a017834a0 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452

(cherry picked from commit dd51e50f60)

Closes scylladb/scylladb#25511
2025-08-15 13:30:38 +03:00
Ernest Zaslavsky
c70ba8384e s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469

(cherry picked from commit 380c73ca03)

Closes scylladb/scylladb#25499
2025-08-14 10:34:28 +02:00
Nikos Dragazis
26174a9c67 test: kmip: Fix segfault from premature destruction of port_promise
`kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP
server for a particular Boost test case. The function runs the server as
an external process and uses a thread to parse the port from the
server's logs. The thread communicates the port to the main thread via
a promise.

The current implementation has a bug where the thread may set a value
to the promise after its destruction, causing a segfault. This happens
when the server does not start within 20 seconds, in which case the port
future throws and the stack unwinding machinery destroys the port
promise before the thread that writes to it.

Fix the bug by declaring the promise before the cleanup action.

The bug has been encountered in CI runs on slow machines, where the
PyKMIP server takes too long to create its internal tables (due to slow
fdatasync calls from SQLite). This patch does not improve CI stability -
it only ensures that the error condition is properly reflected in the
test output.

This patch is not a backport. The same bug has been fixed in master as
part of a larger rewrite of the `kmip_test_helper()` (see 722e2bce96).

Refs #24747, #24842.
Fixes #24574.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#25030
2025-08-06 11:59:40 +03:00
Ernest Zaslavsky
ccdc98c8f0 s3_creds: Make reload unconditional
Assume that any caller invoking `reload` intends to refresh credentials.
Remove conditional logic that checks for expiration before reloading.

(cherry picked from commit e4ebe6a309)
2025-08-06 00:50:36 +00:00
Ernest Zaslavsky
bd79ae3826 s3_creds: Add test exposing credentials renewal issue
Add a test demonstrating that renewing credentials does not update
their expiration. After requesting credentials again, the expiration
remains unchanged, indicating no actual update occurred.

(cherry picked from commit 68855c90ca)
2025-08-06 00:50:36 +00:00
Avi Kivity
a8193bd503 Merge '[Backport 2025.3] transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567
Fixes: #25271

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

* (cherry picked from commit 7aaeed012e)

* (cherry picked from commit 30d424e0d3)

* (cherry picked from commit 9f4344a435)

* (cherry picked from commit 5390f92afc)

* (cherry picked from commit 4a6f71df68)

Parent PR: #24738

Closes scylladb/scylladb#25117

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-08-05 14:16:14 +03:00
Nikos Dragazis
257ebbeca9 test: Use in-memory SQLite for PyKMIP server
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.

This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).

Fix this by configuring the server to use an in-memory SQLite.

Fixes #24842.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#24995

(cherry picked from commit 2656fca504)

Closes scylladb/scylladb#25300
2025-08-02 17:12:05 +03:00
Piotr Dulikowski
ba70b39486 qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
(cherry picked from commit 2bb800c004)
2025-07-31 15:13:57 +00:00
Dario Mirovic
1078a1f03a utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
Fixes: #25271
(cherry picked from commit 9f4344a435)
2025-07-30 21:35:15 +02:00
Pavel Emelyanov
7c04619ecf Merge '[Backport 2025.3] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot]
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.

- (cherry picked from commit ee98f5d361)

- (cherry picked from commit 8d37e5e24b)

Parent PR: #24633

Closes scylladb/scylladb#24772

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-24 16:35:53 +03:00
Pavel Emelyanov
53637fdf61 Merge '[Backport 2025.3] storage: add make_data_or_index_source to the storages' from Scylladb[bot]
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance

* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it  for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`

Fixes: https://github.com/scylladb/scylladb/issues/22458

- (cherry picked from commit 211daeaa40)

- (cherry picked from commit 7e5e3c5569)

- (cherry picked from commit 0de61f56a2)

- (cherry picked from commit 8ac2978239)

- (cherry picked from commit dff9a229a7)

- (cherry picked from commit 8d49bb8af2)

Parent PR: #23695

Closes scylladb/scylladb#25016

* github.com:scylladb/scylladb:
  sstables: Start using `make_data_or_index_source` in `sstable`
  sstables: refactor readers and sources to use coroutines
  sstables: coroutinize futurized readers
  sstables: add `make_data_or_index_source` to the `storage`
  encryption: refactor key retrieval
  encryption: add `encrypted_data_source` class
2025-07-21 18:05:53 +03:00
Dawid Mędrek
10a9ced4d1 cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

(cherry picked from commit 20d0050f4e)
2025-07-21 11:43:49 +00:00
Ernest Zaslavsky
549d139e84 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.

(cherry picked from commit 8d49bb8af2)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
243ba1fb66 encryption: add encrypted_data_source class
Introduce the `encrypted_data_source` class that wraps an existing data
source to read and decrypt data on the fly using block encryption. Also add
unit tests to verify correct decryption behavior.
NOTE: The wrapped source MUST read from offset 0, `encrypted_data_source` assumes it is

Co-authored-by: Calle Wilund <calle@scylladb.com>
(cherry picked from commit 211daeaa40)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
873c8503cd s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.

(cherry picked from commit acf15eba8e)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
dbf4bd162e s3_test: Add trace logging for default_retry_strategy
Introduce trace-level logging for `default_retry_strategy` in
`s3_test` to improve visibility into retry logic during test
execution.

(cherry picked from commit a5246bbe53)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
54db6ca088 s3_client: Stop retries in chunked download source
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.

This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.

(cherry picked from commit d2d69cbc8c)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
c748a97170 s3_client: Add test for Content-Range fix
Introduce a test that accurately verifies the Content-Range
behavior, ensuring the previous fix is properly validated.

(cherry picked from commit ec59fcd5e4)
2025-07-13 13:17:14 +00:00
Benny Halevy
f78a352a29 token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
2025-07-07 09:38:17 +03:00
Benny Halevy
b647dbd547 token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clean and destroy
the token_data_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
2025-07-07 09:34:10 +03:00
Benny Halevy
0e7d3b4eb9 token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
2025-07-07 09:30:01 +03:00
Calle Wilund
46e3794bde encryption_at_rest_test: Add exception handler to ensure proxy stop
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.

(cherry picked from commit 8d37e5e24b)
2025-07-02 10:13:08 +00:00
Botond Dénes
ee6d7c6ad9 test/boost/memtable_test: only inject error for test table
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.

Fixes: https://github.com/scylladb/scylladb/issues/24637

Closes scylladb/scylladb#24638
2025-06-30 10:08:49 +03:00
Avi Kivity
b33dd2bd7d Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

Closes scylladb/scylladb#24492

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-06-29 18:18:36 +03:00
Avi Kivity
48d9f3d2e3 Merge 'mutation: check key of inserted rows' from Botond Dénes
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.

Fixes: https://github.com/scylladb/scylladb/issues/24506

Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.

Closes scylladb/scylladb#24497

* github.com:scylladb/scylladb:
  mutation: check key of inserted rows
  compound: optimize is_full() for single-component types
2025-06-29 18:10:17 +03:00
Lakshmi Narayanan Sreethar
279253ffd0 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640
2025-06-26 15:29:28 +03:00
Avi Kivity
947906e6fd Merge 'Make uuid sstable generations mandatory' from Benny Halevy
Before we can eradicate the numerical sstable generations,
This series completes https://github.com/scylladb/scylladb/issues/20337
by disabling the use of numerical sstable generations where we can
and making sure the feature is never disabled.

Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables.

Refs #24248

* Enhancement.  No backport required

Closes scylladb/scylladb#24554

* github.com:scylladb/scylladb:
  feature_service: never disable UUID_SSTABLE_IDENTIFIERS
  test: sstable_move_test: always use uuid sstable generation
  test: sstable_directory_test: always use uuid sstable generation
  sstables: sstable_generation_generator: set last_generation=0 by default
  test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
  test: lib: test_env: always use uuid sstable generation
  test: sstable_test: always use uuid sstable generation
  test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
  test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
  test: sstable_compaction_test: always use uuid sstable generation
2025-06-26 12:25:38 +02:00
Pavel Emelyanov
0f5b358c47 test: Use test sched groups, not database ones
Some tests want to switch between sched groups. For that there's
cql-test-env facility to create and use them. However, there's a test
that uses replica::database as sched groups provider, which is not nice.
Fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24615
2025-06-26 12:25:38 +02:00
Botond Dénes
edc2906892 test/boost/sstable_datafile_test: add test for corrupt data
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data
2025-06-25 08:41:29 +03:00
Botond Dénes
ab96c703ff mutation: check key of inserted rows
Make sure the keys are full prefixes as it is expected to be the case
for rows. At severeal occasions we have seen empty row keys make their
ways into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.

The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this tests are rejected by the
new checks.

Fixes: https://github.com/scylladb/scylladb/issues/24506
2025-06-23 09:38:45 +03:00
Nadav Har'El
85c19d21bb Merge 'cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Karol Nowacki
cql, schema: Extend name length limit from 48 to 192 bytes

    This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
    The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
    and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
    This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

    The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
    When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
    For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
    The directory name for this log table becomes the longest possible representation.
    Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
    To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
      255 bytes (common filesystem limit for a path component)
    -  32 bytes (for the 32-character UUID string)
    -   1 byte  (for the '-' separator)
    -  15 bytes (for the '_scylla_cdc_log' suffix)
    -  15 bytes (reserved for future use)
    ----------
    = 192 bytes (Maximum allowed name length)
    This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

    This patch also updates/adds all associated tests to validate the new 192-byte limit.
    The documentation has been updated accordingly.

Fixes #4480

Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release.

Closes scylladb/scylladb#24500

* github.com:scylladb/scylladb:
  cql, schema: Extend name length limit from 48 to 192 bytes
  replica: Remove unused keyspace::init_storage()
2025-06-22 17:41:10 +03:00
Avi Kivity
770b91447b Merge 'memtable: ensure _flushed_memory doesn't grow above total_memory' from Michał Chojnowski
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

Closes scylladb/scylladb#21638

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-22 11:19:25 +03:00
Michał Chojnowski
7d551f99be replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
2025-06-20 11:42:30 +02:00
Łukasz Paszkowski
a9a53d9178 compaction_manager: cancel submission timer on drain
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer, when calling the drain method to disable
the compaction manager.

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505
2025-06-20 11:33:49 +03:00
Karol Nowacki
4577c66a04 cql, schema: Extend name length limit from 48 to 192 bytes
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
  255 bytes (common filesystem limit for a path component)
-  32 bytes (for the 32-character UUID string)
-   1 byte  (for the '-' separator)
-  15 bytes (for the '_scylla_cdc_log' suffix)
-  15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.
2025-06-18 14:08:38 +02:00
Benny Halevy
ecc7272a07 test: sstable_move_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
49ca442e7c test: sstable_directory_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
15bee9f232 sstables: sstable_generation_generator: set last_generation=0 by default
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
079c5fe5e3 test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
0310a03de6 test: sstable_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
b00b805da6 test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
Which is `true` by default anyhow.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
f644c5896f test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
bfa0bb78f9 test: sstable_compaction_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Michał Chojnowski
27f66fb110 test/boost/mutation_reader_test: fix a use-after-free in test_fast_forwarding_combined_reader_is_consistent_with_slicing
The contract in mutation_reader.hh says:

```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
    future<> fast_forward_to(const dht::partition_range& pr) {
```

`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.

Fix that.

Fixes scylladb/scylladb#24542

Closes scylladb/scylladb#24543
2025-06-17 19:30:50 +03:00
Pavel Emelyanov
b0766d1e73 Merge 's3_client: Refactor range class for state validation' from Ernest Zaslavsky
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333

No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880

Closes scylladb/scylladb#24312

* github.com:scylladb/scylladb:
  s3_client: headers cleanup
  s3_client: Refactor `range` class for state validation
2025-06-17 10:34:55 +03:00
Avi Kivity
cd79a8fc25 Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).

Fixes #24513
2025-06-16 22:38:12 +03:00
Ernest Zaslavsky
9ad7a456fe s3_client: Refactor range class for state validation
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.
2025-06-16 16:02:24 +03:00