Commit Graph

4016 Commits

Author SHA1 Message Date
Benny Halevy
dee0d7ffbf locator: tablets: get rid of synchronous mutate_tablet_map
It is currently used only by tests that could very well
do with mutate_tablet_map_async.

This will simplify the following patch to prevent
accidental copy of the tablet_map, provding explicit
clone/clone_gently methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-22 15:03:02 +03:00
Ernest Zaslavsky
408aa289fe treewide: Move misc files to utils directory
As requested in #22114, moved the files and fixed other includes and build system.

Moved files:
- interval.hh
- Map_difference.hh

Fixes: #22114

This is a cleanup, no need to backport

Closes scylladb/scylladb#25095
2025-07-21 11:56:40 +03:00
Piotr Dulikowski
7fd97e6a93 Merge 'cdc: Forbid altering columns of CDC log tables directly' from Dawid Mędrek
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.

Closes scylladb/scylladb#25008

* github.com:scylladb/scylladb:
  cdc: Forbid altering columns of inactive CDC log table
  cdc: Forbid altering columns of CDC log tables directly
2025-07-21 09:31:00 +02:00
Avi Kivity
3dfdcf7d7a Merge 'transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

Closes scylladb/scylladb#24738

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-07-20 17:42:30 +03:00
Dario Mirovic
9f4344a435 utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
2025-07-17 16:40:02 +02:00
Botond Dénes
20693edb27 Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).

In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface.
Later, we will add BTI indexes which will also implement this interface.

This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.

Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.

No backports needed, this is a preparation for new functionality.

Closes scylladb/scylladb#25000

* github.com:scylladb/scylladb:
  sstables: add sstable::make_index_reader() and use where appropriate
  sstables/mx: in readers, use abstract_index_reader instead of index_reader
  sstables: in validate(), use abstract_index_reader instead of index_reader where possible
  test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
  sstables/index_reader: introduce abstract_index_reader
  sstables/index_reader: extract a prefetch_lower_bound() method
2025-07-17 14:32:08 +03:00
Michał Chojnowski
4e4a4b6622 sstables: add sstable::make_index_reader() and use where appropriate
If we add multiple index implementations, users of index readers won't
easily know which concrete index reader type is the right one to construct.

We also don't want pieces of code to depend on functionality specific to
certain concrete types, if that's not necessary.

So instead of constructing the readers by themselves, they can use a helper
function, which will return an abstract (virtual) index reader.
This patch adds such a function, as a method of `sstable`.
2025-07-17 10:32:57 +02:00
Dawid Mędrek
20d0050f4e cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643
2025-07-16 15:35:48 +02:00
Avi Kivity
c762425ea7 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start
2025-07-16 13:15:54 +03:00
Andrzej Jackowski
b20aa7b5eb auth: require scheme as parameter for generate_salt
This is a refactoring commit that changes the `generate_salt` function
to require a password hashing scheme as a parameter. This change is
motivated by the upcoming removal of support for obsolete password
hashing schemes and removal of `identify_best_supported_scheme()`
function.

Ref. scylladb/scylladb#24524
2025-07-15 20:26:39 +02:00
Botond Dénes
a26b6a3865 Merge 'storage: add make_data_or_index_source to the storages' from Ernest Zaslavsky
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance

* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it  for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`

No backport needed since it enhances functionality which has not been released yet

fixes: https://github.com/scylladb/scylladb/issues/22458

Closes scylladb/scylladb#23695

* github.com:scylladb/scylladb:
  sstables: Start using `make_data_or_index_source` in `sstable`
  sstables: refactor readers and sources to use coroutines
  sstables: coroutinize futurized readers
  sstables: add `make_data_or_index_source` to the `storage`
  encryption: refactor key retrieval
  encryption: add `encrypted_data_source` class
2025-07-15 13:32:13 +03:00
Ernest Zaslavsky
8d49bb8af2 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.
2025-07-15 10:10:23 +03:00
Botond Dénes
26f135a55a Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixure not generate TLS noise + remove boost::process' from Calle Wilund
Fixes #24873

In KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors.

This just makes sure `release`  closes the connection if it neither retains or caches it.

Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that look like actual errors.
Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs.

This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency.

Closes scylladb/scylladb#24874

* github.com:scylladb/scylladb:
  encryption_test: Make PyKMIP run under seastar::experimental::process
  test/lib: Add wrapper helper for test process fixtures
  kmip_host: Close connections properly if dropped by pool being full
  encryption_at_rest_test: Do port check using TLS
2025-07-15 06:55:34 +03:00
Calle Wilund
722e2bce96 encryption_test: Make PyKMIP run under seastar::experimental::process
Removes the requirement of boost::process, and all its non-seastar-ness.
Hopefully also makes the IO and shutdown handling a bit more reliable.
2025-07-14 12:18:16 +00:00
Calle Wilund
0fe8836073 encryption_at_rest_test: Do port check using TLS
If we connect using just a socket, and don't terminate connection
nicely, we will get annoying errors in PyKMIP log. These distract
from real errors. So avoid them.
2025-07-14 08:31:02 +00:00
Avi Kivity
6fce817aa8 Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531

Closes scylladb/scylladb#24886

[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]

* github.com:scylladb/scylladb:
  test: add type creation to test_snapshot
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-07-13 20:47:55 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Benny Halevy
0e455c0d45 utils: clear_gently: add support for sets
Since set and unordered_set do not allow modifying
their stored object in place, we need to first extract
each object, clear it gently, and only then destroy it.

To achieve that, introduce a new Extractable concept,
that extracts all items in a loop and calls clear_gently
on each extracted item, until the container is empty.

Add respective unit tests for set and unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24608
2025-07-13 12:30:45 +03:00
Marcin Maliszkiewicz
15b4db47c7 storage_service: always wake up load balancer on update tablet metadata
Lack of wakeup is error-prone, as it relies on a wakeup occurring
elsewhere.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
19bc6ffcb0 replica: make truncate_table_on_all_shards get whole schema from table_shards
Before for views and indexes it was fetching base schema from db (and
couple other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.

Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
2f840e51d1 service: pull out update_tablet_metadata from migration_listener
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
fa157e7e46 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
847d7f4a3a service: simplify load_tablet_metadata and update_tablet_metadata
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)

- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
c2cd02272a replica: db: split drop_table into steps
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.

We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()

prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.

We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.

Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
2025-07-10 10:40:43 +02:00
Pawel Pery
eadbf69d6f vector_store_client: implement ANN API
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It implements a functionality for ANN search request to a vector-store
service. It sends request, receive response and after parsing it returns
the list of primary keys.

It adds json parsing functionality specific for the HTTP ANN API.

It adds a hardcoded http request timeout for retrieving response from
the Vector Store service.

It also adds an automatic boost test of the ANN search interface, which
uses a mockup http server in a background to simulate vector-store
service.

It adds a documentation for HTTP API protocol used used for ANN
functionality.

Fixes: VS-47
2025-07-09 11:54:51 +02:00
Pawel Pery
1f797e2fcd vector_store_client: implement ip addr retrieval from dns
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It implements functionality for refreshing ip address of the
vector-store service dns name and creating a new HTTP client with that
address. It also provides cleanup of unused http clients. There are
hardcoded intervals for dns refresh and old http clients cleanup, and
timeout for requesting new http client.

This patch introduces two background tasks - for dns resolving
task and for cleanup old http clients.

It adds unit tests for possible dns refreshing issues.

Reference: VS-47
Fixes: VS-45
2025-07-09 11:54:51 +02:00
Pawel Pery
7bf53fc908 vector_store_client: implement initial vector_store_client service
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.

For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.

This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.

Reference: VS-47 VS-45
2025-07-08 16:29:55 +02:00
Benny Halevy
2c0bafb934 token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 15:07:31 +03:00
Benny Halevy
2b2cfaba6e token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clean and destroy
the token_data_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 15:07:31 +03:00
Benny Halevy
e0a19b981a token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 14:22:20 +03:00
Ernest Zaslavsky
211daeaa40 encryption: add encrypted_data_source class
Introduce the `encrypted_data_source` class that wraps an existing data
source to read and decrypt data on the fly using block encryption. Also add
unit tests to verify correct decryption behavior.
NOTE: The wrapped source MUST read from offset 0, `encrypted_data_source` assumes it is

Co-authored-by: Calle Wilund <calle@scylladb.com>
2025-07-06 09:18:39 +03:00
Pavel Emelyanov
fa0077fb77 Merge 'S3 chunked download source bug fixes' from Ernest Zaslavsky
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle

No need to backport anything since this class in not used yet

Closes scylladb/scylladb#24657

* github.com:scylladb/scylladb:
  s3_test: Add s3_client test for non-retryable error handling
  s3_test: Add trace logging for default_retry_strategy
  s3_client: Fix edge case when the range is exhausted
  s3_client: Fix indentation in try..catch block
  s3_client: Stop retries in chunked download source
  s3_client: Enhance test coverage for retry logic
  s3_client: Add test for Content-Range fix
  s3_client: Fix missing negation
  s3_client: Refine logging
  s3_client: Improve logging placement for current_range output
2025-07-02 14:45:10 +03:00
Avi Kivity
dfaed80f55 Merge 'types: add byte-comparable format support for native cql3 types' from Lakshmi Narayanan Sreethar
This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except `counter` type as that is not comparable. The byte-comparable format is a pre-requisite for implementing the trie based index format for our sstables(https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md

Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR.

Refs https://github.com/scylladb/scylladb/issues/19407

New feature - backport not required.

Closes scylladb/scylladb#23541

* github.com:scylladb/scylladb:
  types/comparable_bytes: add testcase to verify compatibility with cassandra
  types/comparable_bytes: support variable-length natively byte-ordered data types
  types/comparable_bytes: support decimal cql3 types
  types/comparable_bytes: introduce count_digits() method
  types/comparable_bytes: support uuid and timeuuid cql3 types
  types/comparable_bytes: support varint cql3 type
  types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
  types/comparable_bytes: introduce encode/decode_varint_length
  types/comparable_bytes: support float and double cql3 types
  types/comparable_bytes: support date, time and timestamp cql3 types
  types/comparable_bytes: support bigint cql3 type
  types/comparable_bytes: support fixed length signed integers
  types/comparable_bytes: support boolean cql3 type
  types: introduce comparable_bytes class
  bytes_ostream: overload write() to support writing from FragmentedView
  docs: fix minor typo in docs/dev/cql3-type-mapping.md
2025-07-02 11:58:32 +03:00
Avi Kivity
1e0b015c8b Merge 'cql3: Represent create_statement using managed_bytes' from Dawid Mędrek
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.

Fixes scylladb/scylladb#24018

Backport: all of the supported versions are affected, so we want to backport the changes there.

Closes scylladb/scylladb#24151

* github.com:scylladb/scylladb:
  cql3/description: Serialize only rvalues of description
  cql3: Represent create_statement using managed_string
  cql3/statements/describe_statement.cc: Don't copy descriptions
  cql3: Use managed_bytes instead of bytes in DESCRIBE
  utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
2025-07-01 21:59:38 +03:00
Lakshmi Narayanan Sreethar
5f5a8cf54c types/comparable_bytes: add testcase to verify compatibility with cassandra 2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
6c1853a830 types/comparable_bytes: support variable-length natively byte-ordered data types
The following cql3 data types - ascii, blob, duration, inet, and text -
are natively byte-ordered in their serialized forms. To encode them into
a byte-comparable format, zeros are escaped, and since these types have
variable lengths, the encoded form is terminated in an escaped state to
mark its end.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
5c77d17834 types/comparable_bytes: support decimal cql3 types
The decimal cql3 type is internally stored as a scale and an unscaled
integer. To convert them into a byte comparable format, they are first
normalized into a base-100 exponent and a mantissa that lies in [0.01, 1)
and then encoded into a byte sequence that preserves the numerical order.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
832236d044 types/comparable_bytes: introduce count_digits() method
Implemented a method `count_digits()` to return the number of significant
digits in a given boost::multiprecision:cpp_int. This is required to
convert big_decimal to a byte comparable format.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
a00c5d3899 types/comparable_bytes: support uuid and timeuuid cql3 types
The uuid type values are composed of two fixed-length unsigned integers:
an msb and an lsb. The msb contains a version digit, which must be
pulled first in a byte-comparable representation. For version 1 uuids,
in addition to extracting the version digit first, the msb must be
rearranged to make it byte comparable. The lsb is written as is.

For the timeuuid type, the msb is handled simliar to the version 1 uuid
values. The lsb however is treated differently - the sign bits of all
bytes are inverted to preserve the legacy comparison order, which
compared individual bytes as signed values.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar
4592b9764c types/comparable_bytes: support varint cql3 type
Any varint value less than 7 bytes is encoded using the signed long
encoding format and remaining values are all encoded using the full form
encoding :

  <signbyte><length as unsigned integer - 7><7 or more bytes>,

where <signbyte> is 00 for negative numbers and FF for positive ones,
and the length's bytes are inverted if the number is negative (so that
longer length sorts smaller).

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
ad45a19373 types/comparable_bytes: introduce encode/decode_varint_length
The length of a varint value is encoded separately as an unsigned
variable-length integer. For negative varint values, the encoded bytes
are flipped to ensure that longer lengths sort smaller. This patch
implements both encoding and decoding logic for varint lengths and will
be used by the subsequent patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
7af153c237 types/comparable_bytes: support float and double cql3 types
The sign bit is flipped for positive values to ensure that they are
ordered after negative values. For negative values, all the bytes are
inverted, allowing larger negative values to be ordered before smaller
ones.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
0145c1d705 types/comparable_bytes: support date, time and timestamp cql3 types
Both the date and time cql3 types are internally unsigned fixed length
integers. Their serialized form is already byte comparable, so the
encoder and decoder return the serialized bytes as it is.

The timestamp type is encoded using the fixed length signed integer
encoding.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
b6ff3f5304 types/comparable_bytes: support bigint cql3 type
The bigint type, internally implemented as a long data type, is encoded
using a variable-length encoding similar to UTF-8. This enables a
significant amount of space to be saved when smaller numbers are
frequently used, while still permitting large values to be efficiently
encoded.

The first bit of the encoding represents the inverted sign (i.e., 1 for
positive, 0 for negative), followed by length encoded as a sequence of
bits matching the inverted sign. This is then followed by a differing
bit (except for 9-byte encodings) and the bits of the number's two's
complement.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
c0d25060bd types/comparable_bytes: support fixed length signed integers
To encode fixed-length signed integers in a byte-comparable format, the
first bit of each value is inverted. This ensures that negative numbers
are ordered before positive ones during comparison. This patch adds
support for the data types : byte_type (tinyint), short_type (smallint),
and int32_type (int). Although long_type (bigint) is a fixed length
integer type, it has different byte comparable encoding and will be
handled separately in another patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
8572afca2b types/comparable_bytes: support boolean cql3 type
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
74c556a33d types: introduce comparable_bytes class
This patch implements a new class, `comparable_bytes`, designed to
implement methods for converting data values to and from byte-comparable
formats. The class stores the comparable bytes as `managed_bytes` and
currently provides the structure for all required methods. The actual
logic for converting various data types will be implemented in subsequent
patches.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar
e4c7cb7834 bytes_ostream: overload write() to support writing from FragmentedView
Overloaded write() method to support writing a FragmentedView into
bytes_ostream. Also added a testcase to verify the implementation.
The new helper will be used by the byte_comparable implementation
during the encode/decode process.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-07-01 22:19:07 +05:30
Ernest Zaslavsky
acf15eba8e s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
a5246bbe53 s3_test: Add trace logging for default_retry_strategy
Introduce trace-level logging for `default_retry_strategy` in
`s3_test` to improve visibility into retry logic during test
execution.
2025-07-01 18:45:17 +03:00