We don't want tests to create the concrete `index_reader` directly. We
would like them to be able to test both sstables which use
`index_reader`, and those which will use the planned new index implementation.
So we will let the tests construct an abstract_index_reader and pass it
to the index_reader_assertions, which will be able to assert the requested
properties on various implementations as it wants.
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.
The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches.
Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.
In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running
test/alternator/run --runveryslow \
test_query.py::test_query_large_page_small_rows
reports in the log:
oversized allocation: 573440 bytes.
After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.
The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.
Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.
Fixes#23535
The stalls caused by large allocations was seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the nomal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport.
Closesscylladb/scylladb#24480
* github.com:scylladb/scylladb:
alternator: clean up by co-routinizing
alternator: avoid spamming the log when failing to write response
alternator: clean up and simplify request_return_type
alternator: avoid oversized allocation in Query/Scan
Fixes#24998
Helper routine translating input_stream buffers to single lines
did not loop over current buffer state, leading to only the first
line being sent to end listener.
Rewrote to use range iteration instead. Nicer.
Closesscylladb/scylladb#24999
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes#24807Fixes#24925Closesscylladb/scylladb#24929
* github.com:scylladb/scylladb:
streaming: Avoid deadlock by running view checks in a separate scheduling group
service: migration_manager: Run group0 barrier in gossip scheduling group
repair: Speed up ranges calculation when small table optimization is on
Normally, during bootstrap, in repair_service::bootstrap_with_repair, we
need to calculate which range to sync data from carefully for the new
node. With small table optimization on, we pass a single full range and
all peer nodes to row level repair to sync data with. Now that we only
need to pass a single range and full peers, there is no need to calculate
the ranges and peers in repair_service::bootstrap_with_repair and drop
it later. The calculation takes time which slows down bootstrap, e.g.,
```
Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - bootstrap_with_repair: started with
keyspace=system_distributed_everywhere, nr_ranges=23809
Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]:
sync data for keyspace=system_distributed_everywhere, status=started,
reason=bootstrap, small_table_optimization=true
```
The range calculation took 15 seconds for system_distributed_everywhere
table.
To fix, the ranges calculation is skipped if small table optimization is
on for the keyspace.
Before:
cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s
After:
cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s
A 15% improvement to bootstrap 30 node cluster was observed.
Fixes#24817Closesscylladb/scylladb#24901
* github.com:scylladb/scylladb:
repair: Speed up ranges calculation when small table optimization is on
test: Add test_boot_nodes.py
The tests cover a variety of scenarios, including:
* Authentication with client secrets, client certificates, and IMDS.
* Valid and invalid encryption options in the configuration and table
schema.
* Common error conditions such as insufficient permissions, non-existent
keys and network errors.
All tests run against a local mock server by default. A subset of the
tests can also against real Azure services if properly configured. The
tests that support real Azure services were kept to a minimum to cover
only the most basic scenarios (success path and common error
conditions).
Running the tests with real resources requires parameterizing them with
env vars:
* ENABLE_AZURE_TEST - set to non-zero (1/true) to run Azure tests (enabled by default)
* ENABLE_AZURE_TEST_REAL - set to non-zero (1/true) to run against real Azure services
* AZURE_TENANT_ID - the tenant where the principals live
* AZURE_USER_1_CLIENT_ID - the client ID of user1
* AZURE_USER_1_CLIENT_SECRET - the secret of user1
* AZURE_USER_1_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user1
* AZURE_USER_2_CLIENT_ID - the client ID of user2
* AZURE_USER_2_CLIENT_SECRET - the secret of user2
* AZURE_USER_2_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user2
* AZURE_KEY_NAME - set to <vault_name>/<keyname>
User1 is assumed to have permissions to wrap/unwrap using the given key.
User2 is assumed to not have permissions for these operations.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The Azure Key Provider depends on three Azure services:
- Azure Key Vault
- IMDS
- Entra STS
To enable local testing, introduce a mock server that offers all the
needed APIs from these services. The server also offers an error
injection endpoint to configure a particular service to respond with
some error code for a number of consecutive requests.
The server is integrated as a 3rd party service in test.py.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.
In this commit, we're restricting those operations. We also provide two
validation tests.
One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.
Fixesscylladb/scylladb#24643
Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling these parameters: `--random-seed` and `--x-log2-compaction-groups`
Since this code affected with this issue in 2025.3 and this is only framework change, backport for that version needed.
Fixes: https://github.com/scylladb/scylladb/issues/24927Closesscylladb/scylladb#24928
* https://github.com/scylladb/scylladb:
test.py: add bypassing x_log2_compaction_groups to boost tests
test.py: add bypassing random seed to boost tests
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.
This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.
Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.
Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.
Fixes https://github.com/scylladb/scylladb/issues/24524
Backport not needed, as it is a new feature.
Closesscylladb/scylladb#24924
* github.com:scylladb/scylladb:
main: utils: add thread names to alien workers
auth: move passwords::check call to alien thread
test: wait for 3 clients with given username in test_service_level_api
auth: refactor password checking in password_authenticator
auth: make SHA-512 the only password hashing scheme for new passwords
auth: whitespace change in identify_best_supported_scheme()
auth: require scheme as parameter for `generate_salt`
auth: check password hashing scheme support on authenticator start
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
Analysis of customer stalls showed that the `detail::hash_with_salt`
function, called from `passwords::check`, often blocks the reactor.
This function internally uses the `crypt_r` function from an external
library to compute password hashes, which is a CPU-intensive operation.
To prevent such reactor stalls, this commit moves the
`passwords::check` call to a dedicated alien thread. This thread is
created at system startup and is shared by all shards.
Within the alien thread, an `std::mutex` synchronizes access between
the thread and the shards. While this could theoretically cause
frequent lock contentions, in practice, even during connection storms,
the number of new connections per second per shard is limited
(typically hundreds per second). Additionally, the
`_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not
too many connections are processed at once.
Fixesscylladb/scylladb#24524
test_service_level_api tests create a new session and wait for all
clients to authenticate. However, the check that all connections are
authenticated is done by verifying that there are no connections
with the username 'anonymous', which is insufficient if new connections
have not yet been listed.
To avoid test failures, this commit introduces an additional check that
verifies all expected clients are present in the system.clients table
before proceeding with the test.
This is a refactoring commit that changes the `generate_salt` function
to require a password hashing scheme as a parameter. This change is
motivated by the upcoming removal of support for obsolete password
hashing schemes and removal of `identify_best_supported_scheme()`
function.
Ref. scylladb/scylladb#24524
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance
* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`
No backport needed since it enhances functionality which has not been released yet
fixes: https://github.com/scylladb/scylladb/issues/22458Closesscylladb/scylladb#23695
* github.com:scylladb/scylladb:
sstables: Start using `make_data_or_index_source` in `sstable`
sstables: refactor readers and sources to use coroutines
sstables: coroutinize futurized readers
sstables: add `make_data_or_index_source` to the `storage`
encryption: refactor key retrieval
encryption: add `encrypted_data_source` class
To discover what tests are included into combined_tests, pytest check this at
the very beginning. In the case if combined_tests binary is missing, it will
fail discovery and will not run test, even when it was not included into
combined_tests. This PR changes behavior, so it will not fail when
combined_tests is missing and only fail in case someone tries to run test from
it.
Closesscylladb/scylladb#24761
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.
For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.
In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (`only_rmw_uses_lwt`) while the manual test/alternator/run used a different write isolation mode (`always_use_lwt`). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year.
But then, discussion on #24442 started: Instead of picking one mode or the other, don't we need test both modes? In fact, all four modes?
The honest answer is that running **all tests** with **all combinations of options** is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had **some** tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters.
Fixes#24442
No need to backport this patch - it's just adding more tests and changing developer-only test behavior.
Closesscylladb/scylladb#24493
* github.com:scylladb/scylladb:
test/alternator: make "run" script use only_rmw_uses_lwt
test/alternator: improve tests for write isolation modes
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
responding before it applied mutations from any of the writes.
We fix the test by not expecting reads with CL=ONE to return a row.
We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.
Fixesscylladb/scylladb#22967
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closesscylladb/scylladb#23518
Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways:
1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will effect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash.
2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable streams any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters).
No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names.
Fixes#24598Closesscylladb/scylladb#24717
* github.com:scylladb/scylladb:
alternator: lower maximum table name length to 192
alternator: don't crash when adding Streams to long table name
alternator: split length limit for regular and auxiliary tables
alternator: avoid needlessly validating table name
Fixes#24873
In KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors.
This just makes sure `release` closes the connection if it neither retains or caches it.
Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that look like actual errors.
Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs.
This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency.
Closesscylladb/scylladb#24874
* github.com:scylladb/scylladb:
encryption_test: Make PyKMIP run under seastar::experimental::process
test/lib: Add wrapper helper for test process fixtures
kmip_host: Close connections properly if dropped by pool being full
encryption_at_rest_test: Do port check using TLS
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.
Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.
In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running
test/alternator/run --runveryslow \
test_query.py::test_query_large_page_small_rows
reports in the log:
oversized allocation: 573440 bytes.
After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.
The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.
Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.
Fixes#23535
Adds a wrapper for seastar::experimental::process, to help
use external process fixtures in unit test. Mainly to share
concepts such as line reading of stdout/err etc, and sync
the shutdown of these. Also adds a small path searcher to
find what you want to run.
If we connect using just a socket, and don't terminate connection
nicely, we will get annoying errors in PyKMIP log. These distract
from real errors. So avoid them.
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.
Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531Closesscylladb/scylladb#24886
[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]
* github.com:scylladb/scylladb:
test: add type creation to test_snapshot
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
Since set and unordered_set do not allow modifying
their stored object in place, we need to first extract
each object, clear it gently, and only then destroy it.
To achieve that, introduce a new Extractable concept,
that extracts all items in a loop and calls clear_gently
on each extracted item, until the container is empty.
Add respective unit tests for set and unordered_set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#24608
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes: #24807
Vector Store service is a http server which provides vector search index and an ANN (Approximate Nearest Neighbor) functionality. Vector Store retrieves metadata & data from Scylla about indexes using CQL protocol & CDC functionality. Scylla will request ann search using http api.
Commits for the patch:
- implement initial `vector_store_client` service. It adds also a parameter `vector_store_uri` to the scylla.
- refactor sequential_producer as abortable
- implement ip addr retrieval from dns. The uri for Vector Store must contains dns name, this commit implements ip addr refreshing functionality
- refactor primary_key as a top-level class. It is needed for the forward declaration of a primary_key
- implement ANN API. It implements a core ANN search request functionality, adds Vector Store HTTP API description in docs/protocols.md, and implements automatic boost tests with mocked http server for checking error conditions.
New feature, should not be backported.
Fixes: VECTOR-47
Fixes: VECTOR-45
-~-
Closesscylladb/scylladb#24331
* github.com:scylladb/scylladb:
vector_store_client: implement ANN API
cql3: refactor primary_key as a top-level class
vector_store_client: implement ip addr retrieval from dns
utils: refactor sequential_producer as abortable
vector_store_client: implement initial vector_store_client service
Before for views and indexes it was fetching base schema from db (and
couple other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.
Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)
- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.
We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()
prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.
We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.
Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.
Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.
However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
stopped.
To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.
Fixesscylladb/scylladb#24857Fixesscylladb/scylladb#24828Closesscylladb/scylladb#24896
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.
It implements a functionality for ANN search request to a vector-store
service. It sends request, receive response and after parsing it returns
the list of primary keys.
It adds json parsing functionality specific for the HTTP ANN API.
It adds a hardcoded http request timeout for retrieving response from
the Vector Store service.
It also adds an automatic boost test of the ANN search interface, which
uses a mockup http server in a background to simulate vector-store
service.
It adds a documentation for HTTP API protocol used used for ANN
functionality.
Fixes: VS-47
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.
It implements functionality for refreshing ip address of the
vector-store service dns name and creating a new HTTP client with that
address. It also provides cleanup of unused http clients. There are
hardcoded intervals for dns refresh and old http clients cleanup, and
timeout for requesting new http client.
This patch introduces two background tasks - for dns resolving
task and for cleanup old http clients.
It adds unit tests for possible dns refreshing issues.
Reference: VS-47
Fixes: VS-45
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.
It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.
For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.
This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.
Reference: VS-47 VS-45
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.
Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.
backport to relevant versions since the issue can cause node shutdown to hang
Fixesscylladb/scylladb#24599Closesscylladb/scylladb#24595
* github.com:scylladb/scylladb:
test: test_batchlog_manager: batchlog replay includes cdc
test: test_batchlog_manager: test batch replay when a node is down
batchlog_manager: set timeout on writes
batchlog_manager: abort writes on shutdown
batchlog_manager: create cancellable write response handler
storage_proxy: add write type parameter to mutate_internal
Skip removing any artifacts when -s provided between test.py invocation.
Logs from the previous run will be overridden if tests were executed one
more time. Fox example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if tests are passed
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run
Currently we create a view for every index, however
for currently supported custom index classes (vector_index)
that work is redundant, as we store the index in the external
service.
This patch adds a way for custom indexes to choose whether to
create a view when creating the index and makes it so that
for vector indexes the view is not created.
Currently, to describe an index we look at
a corresponding view. However for custom indexes
the view may not exist (as we are removing the views
from vector indexes). This commit adds a way for a custom
index class to override the default describing logic
and provides such an override for the vector_index
class.
Several audit test issues caused test failures, and in the result, almost all of audit syslog tests were marked with xfail.
This patch series enables the syslog audit tests, that should finally pass after the following fixes are introduced:
- bring back commas to audit syslog (scylladb#24410 fix)
- synchronize audit syslog server
- fix parsing of syslog messages
- generate unique uuid for each line in syslog audit
- allow audit logging from multiple nodes
Fixes: scylladb/scylladb#24410
Test improvements, no backport required.
Closesscylladb/scylladb#24553
* github.com:scylladb/scylladb:
test: audit: use automatic comparators in AuditEntry
test: audit: enable syslog audit tests
test: audit: sort new audit entries before comparing with expected ones
test: audit: check audit logging from multiple nodes
test: audit: generate unique uuid for each line in syslog audit
test: audit: fix parsing of syslog messages
test: audit: synchronize audit syslog server
docs: audit: update syslog audit format to the current one
audit: bring back commas to audit syslog
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.
This is done in order to verify that it works currently as expected and
doesn't break in the future.