scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Andrzej Jackowski	9c9371511f	test: add API tests for client_routes endpoints Ref: scylladb/scylla-enterprise#5699	2025-12-15 17:46:14 +01:00
Andrzej Jackowski	2e80997630	test: add `timeout` parameter to `delete` in RESTClient The parameter was missing and is needed to implement a test case later in this patch series.	2025-12-15 17:44:48 +01:00
Andrzej Jackowski	1143acaf5b	test: allow json_body in send Needed to test `/v2/connection_metadata` endpoints that receive JSON input. Ref: scylladb/scylla-enterprise#5699	2025-12-15 17:44:48 +01:00
Nadav Har'El	4e106b9820	test/cqlpy: remove unused variables Copilot detected a few cases of cqlpy tests setting a variable which they don't use. In all the cases in this patch, we can just remove the variable. Although the AI found all these unused variables, I verified each case carefully before changing it in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 18:11:04 +02:00
Nadav Har'El	64d9c370ee	test/alternator: remove unnecessary duplicate statement copilot noticed that test/alternator/test_scan.py had a duplicate statement (call to full_scan()). It doesn't break the test, but also adds nothing but confusion - so let's just remove it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 18:07:45 +02:00
Nadav Har'El	a3959fe3db	test/alternator: remove unused variable assignments Copilot noticed that in many of the Alternator tests, we have some unnecessary assignments. For example, in a few places, we use the idiom: with pytest.raises(...): ret = ... The "ret=" part is unnecessary, as this test expects the statement to fail (hence the raises()), and ret is never assigned. The assignment was only there because we copied this statement from another place in the test, which does expect the statement to pass and wants to validate the returned value. So we should just drop the "ret=" from these tests. Another common occurrence is that we used the idiom response = table.do_something() without checking the response and with no intention to check it (either we know it will work, or we just want to check it doesn't throw). So we can drop the "response=" here too. All of the unused variables in this patch were discovered by Copilot, but I reviewed each of them carefully myself and prepared this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 18:07:05 +02:00
Nadav Har'El	4fa4f40712	test/cqlpy: use unique partition in test It is traditional to use a unique (or random) partition key in cqlpy tests, to allow multiple tests to share the same table and make the test suite a bit faster. One of the tests, test_multi_column_relation_desc, set up a unique key "k", but then forgot to use it and used partition key 0 instead. Fix the test to use this k. This problem was spotted by Copilot, who saw the unused variable k. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 17:08:51 +02:00
Yauheni Khatsianevich	07867a9a0d	test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency	2025-12-15 14:32:30 +01:00
Nadav Har'El	ccacea621f	test/cqlpy: fix flaky test test_view_in_system_tables The cqlpy test test_materialized_view.py::test_view_in_system_tables checks that the system table "system.built_views" can inform us that a view has been built. This test was flaky, starting to fail quite often recently, and this patch fixes the problem in the test. For historic reasons this test began by calling a utility function wait_for_view_built() - which uses a different system table, system_distributed.view_build_status, to wait until the view was built. The test then immediately tries to verify that also system.built_views lists this view. But there is no real reason why we could assume - or want to assume - that these two tables are updated in this order, or how much time passed between the two tables being changed. The authors of this test already acknowledged there is a problem - they included a hack purporting to be a "read barrier" that claimed to solve this exact problem - but it seems it doesn't, or at least no longer does after recent changes to the view builder's implementation. The solution is simple - just remove the call to wait_for_view_built() and the "hack" after it. We should just wait in a loop (until a timeout) for the system table that we really wanted to check - system.built_views. It's as simple as that. No need for any other assumptions or hacks. Fixes #27296 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27626	2025-12-15 15:29:08 +03:00
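The fix described in `ccacea621f` amounts to polling the table the test actually cares about until a deadline passes. A minimal sketch of such a loop follows; the helper name `wait_for` and its parameters are hypothetical and are not taken from the cqlpy utilities, and the predicate would in practice wrap a query on `system.built_views`.

```python
import time


def wait_for(predicate, timeout=60.0, interval=0.1):
    """Call predicate() repeatedly until it returns a truthy value.

    Hypothetical helper: raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "is the view listed in system.built_views yet?".
attempts = {"count": 0}


def view_listed():
    attempts["count"] += 1
    return attempts["count"] >= 3  # becomes true on the third poll


wait_for(view_listed, timeout=5.0, interval=0.01)
```

Polling the target table directly avoids any assumption about the order in which the two system tables are updated.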
Nadav Har'El	f287484f4d	test/cqlpy: rename test with duplicate name When translating Cassandra's test validation/operations/CreateTest.java I accidentally used the same name for two tests, resulting in the first of them never being run. Let's fix the name of the second of the two to be the real name it had in the original Cassandra test. After this patch pytest reports 16 tests in this file, instead of 15 before this patch. The previously-ignored test was correct, and it now passes in both Scylla and Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 14:19:24 +02:00
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be a barrier that is impossible to bypass, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount. 
Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
Nadav Har'El	a9442e6d56	test/cqlpy: rename test with duplicate name When translating Cassandra's test validation/operations/DeleteTest.java I accidentally used the same name for two tests, resulting in the first of them never being run. Let's fix the name of the second of the two to be the real name it had in the original Cassandra test. After this patch pytest reports 52 tests in this file, instead of 51 before this patch. The previously-ignored test was correct, and it now passes in both Scylla and Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 12:02:59 +02:00
Michael Litvak	b9ec1180f5	alternator: require rf_rack_valid_keyspaces when creating index When creating an alternator table with tablets, if it has an index, LSI or GSI, require the config option rf_rack_valid_keyspaces to be enabled. The option is required for materialized views in tablets keyspaces to function properly and avoid consistency issues that could happen due to cross-rack migrations and pairing switches when RF-rack validity is not enforced. Currently the option is validated when creating a materialized view via the CQL interface, but it's missing from the alternator interface. Since alternator indexes are based on materialized views, the same check should be added there as well. Fixes scylladb/scylladb#27612 Closes scylladb/scylladb#27622	2025-12-15 10:36:57 +02:00
Michał Hudobski	12483d8c3c	vector_search: throw an error when we restrict primary in vector search We currently allow restrictions on single column primary key, but we ignore the restriction and return all results. This can confuse the users. We change it so such a restriction will throw an error and add a test to validate it. Fixes: VECTOR-331 Closes scylladb/scylladb#27143	2025-12-15 09:45:56 +02:00
Alex	d21faab9dc	test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.	2025-12-14 18:46:06 +02:00
Nadav Har'El	c06e63daed	Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski This patch series contains the following changes: - Incorporation of `crypt_sha512.c` from musl into our codebase - Conversion of `crypt_sha512.c` to C++ and coroutinization - Coroutinization of `auth::passwords::check` - Enabling use of `__crypt_sha512` originated from `crypt_sha512.c` for computing SHA 512 passwords of length <=255 - Addition of yielding in the aforementioned hashing implementation. The alien thread was a solution for reactor stalls caused by indivisible password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524). However, because there is only one alien thread, overall hashing throughput was reduced (see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this, the alien‑thread solution is reverted, and a hashing implementation with yielding is introduced in this patch series. Before this patch series, ScyllaDB used SHA-512 hashing provided by the `crypt_r` function, which in our case meant using the implementation from the `libxcrypt` library. Adding yielding to this `libxcrypt` implementation is problematic, both due to licensing (LGPL) and because the implementation is split into many functions across multiple files. In contrast, the SHA-512 implementation from `musl libc` has a more permissive license and is concise, which makes it easier to incorporate into the ScyllaDB codebase. The performance of this solution was compared with the previous implementation that used one alien thread and the implementation after the alien thread was reverted. 
The results (median) of `perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters are as follows: - Alien thread: 41.5 new connections/s per shard - Reverted alien thread: 244.1 new connections/s per shard - This commit (yielding in hashing): 198.4 new connections/s per shard The roughly 20% performance deterioration compared to the old implementation without the alien thread comes from the fact that the new hashing algorithm implemented in `utils/crypt_sha512.cc` performs an expensive self-verification and stack cleanup. On the other hand, with smp=10 the current implementation achieves roughly 5x higher throughput than the alien thread. In addition, due to yielding added in this commit, the algorithm is expected to provide similar protection from stalls as the alien thread did. In a test that in parallel started a cassandra-stress workload and created thousands of new connections using python-driver, the values of `scylla_reactor_stalls_count` metric were as follows: - Alien thread: 109 stalls/shard total - Reverted alien thread: 13186 stalls/shard total - This commit (yielding in hashing): 149 stalls/shard total Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms` values were: - Alien thread: 1087 ms/shard total - Reverted alien thread: 72839 ms/shard total - This commit (yielding in hashing): 1623 ms/shard total To summarize, yielding during hashing computations achieves similar throughput to the old solution without the alien thread but also prevents stalls similarly to the alien thread. Fixes: scylladb/scylladb#26859 Refs: scylladb/scylla-enterprise#5711 No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion. 
Closes scylladb/scylladb#26860 * github.com:scylladb/scylladb: test/boost: add too_long_password to auth_passwords_test test/boost: add same_hashes_as_crypt_r to auth_passwords_test auth: utils: add yielding to crypt_sha512 auth: change return type of passwords::check to future auth: remove code duplication in verify_scheme test/boost: coroutinize auth_passwords_test utils: coroutinize crypt_sha512 utils: make crypt_sha512.cc to compile utils: license: import crypt_sha512.c from musl to the project Revert "auth: move passwords::check call to alien thread"	2025-12-14 14:01:01 +02:00
Raphael S. Carvalho	a0a7941eb1	test: Add reproducer for split vs intra-node migration race This is a problem caught after removing split from add_sstable_and_update_cache(), which was used by intra node migration when loading new sstables into the destination shard. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	e3b9abdb30	test: Verify split failure on behalf of repair during critical disk utilization Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	bc772b791d	test: boost: Add failure_when_adding_new_sstable_test Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	77a4f95eb8	test: Add reproducer for split vs incremental repair race condition Refs #26041. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:16 -03:00
Raphael S. Carvalho	2dae0a7380	compaction: Preserve state of input sstable in maybe_split_new_sstable() This is crucial with MVs, since the splitting must preserve the state of the original sstable. We want the sstable to be in staging dir, so it's excluded when calculating the diff for performing pushes to view replicas. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	1fdc410e24	Rename maybe_split_sstable() to maybe_split_new_sstable() Since the function must only be used on new sstables, it should be renamed to something that conveys this restriction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	1a077a80f1	sstables: Allow storage::snapshot() to leave destination sstable unsealed Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	c10486a5e9	test: Verify unsealed sstable can be compacted This is crucial for splitting before sealing the sstable produced by repair. This way, unsplit sstables won't be left on disk sealed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Cezar Moise	7c8ab3d3d3	test.py: add pid to ServerInfo Adding pid info to servers allows matching coredumps with servers Other improvements: - When replacing just some fields of ServerInfo, use `_replace` instead of building a new object. This way it is agnostic to changes to the Object - When building ServerInfo from a list, the types defined for its fields are not enforced, so ServerInfo(*list) works fine and does not need to be changed if fields are added or removed.	2025-12-12 15:11:03 +02:00
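The two points about `ServerInfo` in `7c8ab3d3d3` rely on standard named-tuple behavior. The sketch below illustrates it with a hypothetical three-field `ServerInfo`; the real class in test.py may have different fields.

```python
from typing import NamedTuple


class ServerInfo(NamedTuple):
    # Hypothetical field set, for illustration only.
    server_id: int
    ip_addr: str
    pid: int


info = ServerInfo(server_id=1, ip_addr="127.0.0.1", pid=4242)

# _replace() copies the tuple and changes only the named fields, so this
# call keeps working if more fields are later added to ServerInfo.
moved = info._replace(ip_addr="127.0.0.2")

# Type annotations are not enforced at runtime: building from a list of
# strings succeeds and the values stay strings.
from_list = ServerInfo(*["7", "127.0.0.3", "999"])
```

Note that `ServerInfo(*values)` still requires the list to contain one value per field without a default.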
copilot-swe-agent[bot]	77ee7f3417	Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" This reverts commit `8192f45e84`. The merge exposed a bug where truncate (via drop) fails and causes Raft errors, leading to schema inconsistencies across nodes. This results in test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist' errors. The specific problematic change was in commit `19b6207f` which modified truncate_table_on_all_shards to set use_sstable_identifier = true. This causes exceptions during truncate that are not properly handled, leading to Raft applier fiber stopping and nodes losing schema synchronization.	2025-12-12 03:55:13 +00:00
Botond Dénes	e7ca52ee79	Merge 'api: storage_service/tablets/repair: disable incremental repair by default' from Benny Halevy Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Backport to 2025.4 where `611918056a` was introduced Closes scylladb/scylladb#27530 * github.com:scylladb/scylladb: api: storage_service/tablets/repair: disable incremental repair by default docs: nodetool-commands: cluster: repair: fix incremental-mode example	2025-12-11 15:23:09 +02:00
Benny Halevy	c8cff94a5a	api: storage_service/tablets/repair: disable incremental repair by default Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:21 +02:00
Avi Kivity	24264e24bb	Revert "repair: Add tablet repair progress report support" This reverts commit `faad0167d7`. It causes a regression in test_two_tablets_concurrent_repair_and_migration_repair_writer_level in debug mode (with ~5%-10% probability). Fixes #27510. Closes scylladb/scylladb#27560	2025-12-11 12:18:11 +02:00
Nadav Har'El	b3b0860e7c	test/alternator: add reproducer for bug with storing invalid values This patch adds a reproducer for a long-known bug, #8070, where Alternator can store invalid values which are just blindly stored as JSON, and we will only see the failure when reading the item back - and either the client will fail to parse it, or sometimes even Alternator's own code (e.g., FilterExpression) will fail to parse it. The right behavior is to fail the write - not the read. The included test checks writing different kinds of invalid values using PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB, but fail on Alternator so marked as "xfail". Refs #8070. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:58:22 +02:00
Nadav Har'El	db15c212a6	test/alternator: reproducer for issue 27375 This patch adds a reproducer for issue #27375, where even with alternator_streams_increased_compatibility set to true, if an attribute is set to the same value it had but using a different JSON representation - an Alternator Streams event is unduly produced. For example, if a map {'dog': 1, 'cat': 2} is changed to {'cat': 2, 'dog': 1}, this non-change should not be reported. The new test added in this patch passes on DynamoDB (an event is not generated) but fails on Alternator (an event is generated), so the new test is marked with xfail. Refs #27375. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:34:19 +02:00
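The "non-change" in `db15c212a6` can be illustrated by comparing parsed values rather than serialized text. This is only a sketch of the idea on plain JSON, assuming a hypothetical helper `same_value`; it is not how Alternator compares items, and it ignores DynamoDB's typed attribute encoding.

```python
import json


def same_value(old_json: str, new_json: str) -> bool:
    """Return True if two JSON documents decode to equal values.

    Hypothetical helper: dict comparison in Python ignores key order,
    so reordered maps compare equal. Lists remain order-sensitive.
    """
    return json.loads(old_json) == json.loads(new_json)


old = '{"dog": 1, "cat": 2}'
new = '{"cat": 2, "dog": 1}'
unchanged = same_value(old, new)          # same map, different text
changed = not same_value(old, '{"dog": 1, "cat": 3}')
```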
Nadav Har'El	3595941020	utils/rjson: fix error messages from rjson::parse() rjson::parse() when parsing JSON stored in a chunked_content (a vector of temporary buffers) failed to initialize its byte counter to 0, resulting in garbage positions in error messages like: Parsing JSON failed: Missing a name for object member. at 1452254 These error messages were most noticeable in Alternator, which parses JSON requests using a chunked_content, and reports these errors back to the user. The fix is trivial: add the missing initialization of the counter. The patch also adds a regression test for this bug - it sends JSON that is corrupt at position 1, and expects to see "at 1" and not some large random number. Fixes #27372 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:17:01 +02:00
Nadav Har'El	102516a787	test/alternator: extract get_signed_request() to util.py get_signed_request() started in test_manual_requests.py as a way to sign a manually-created DynamoDB-API request - for sending requests that boto3 can't. Over time, we started to use this function in additional test files, and it's about time to move it to util.py - which is more natural to import from multiple files. This patch also adds a new function, manual_request(), which combines get_signed_request() and actually sending the request via requests.post(). New tests should prefer it, because it's easier to use. We'll use the new function in tests that we add in the next patches. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:16:42 +02:00
Dario Mirovic	f545ed37bc	test: dtest: audit_test.py: fix audit error log detection `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py` has an insert statement that is expected to fail. Dtest environment uses `FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails means we expect 6 error audit logs. The test failed because `create keyspace ks` failed once, then succeeded on retry. It allowed the test to proceed properly, but the last part of the test that expects exactly 6 failed queries actually had 7. The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed queries, counting only the query expected to fail. If other queries fail with successful retry, it's fine. If other queries fail without successful retry, the test will fail, as it should in such situations. They are not related to this expected failed insert statement. Fixes #27322 Closes scylladb/scylladb#27378	2025-12-11 10:17:07 +03:00
Calle Wilund	8c4ac457af	test::cluster::dtest::tools::files: Remove file This contained only one routine; `corrupt_file`, which is highly problematic, and not used. If you want to "corrupt" a file, it should be done controlled, not at random.	2025-12-10 15:37:04 +01:00
Calle Wilund	e48170ca8e	commitlog_replay: Handle fully corrupt files same as partial corruption. Fixes #26744 If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category. However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up. The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed the test due to scylla not starting up, instead of losing data as expected. Changed test to consistently cause this exact error instead.	2025-12-10 15:37:04 +01:00
Andrzej Jackowski	11ad32c85e	test/boost: add too_long_password to auth_passwords_test The test documents the current behavior of hashing algorithms that fail if the passphrase has 512 bytes or more. Moreover, it documents the behavior of the current bcrypt implementation that compares only the first 72 bytes of the password. Although we don't typically use bcrypt for password hashing, it is possible to insert such a hash using `CREATE ROLE ... WITH HASHED PASSWORD ...`. Refs: scylladb/scylladb#26842	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	4c8c9cd548	test/boost: add same_hashes_as_crypt_r to auth_passwords_test The test verifies that the old and new implementation of SHA-512 hashing returns exactly the same values. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	4ffdb0721f	auth: change return type of passwords::check to future Introduce a new `passwords::hash_with_salt_async` and change the return type of `passwords::check` to `future<bool>`. This enables yielding during password computations later in this patch series. The old method, `hash_with_salt`, is marked as deprecated because new code should use the new `hash_with_salt_async` function. We are not removing `hash_with_salt` now to reduce the regression risk of changing the hashing implementation—at least the methods that change persistent hashes (CREATE, ALTER) will continue to use the old hashing method. However, in the future, `hash_with_salt` should be entirely removed. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	11eca621b0	test/boost: coroutinize auth_passwords_test This commit prepares `auth_passwords_test` for using coroutines, because later in this patch series `auth::passwords::check` and other similar functions will return Seastar futures. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	5afcec4a3d	Revert "auth: move passwords::check call to alien thread" The alien thread was a solution for reactor stalls caused by indivisible password‑hashing tasks (scylladb/scylladb#24524). However, because there is only one alien thread, overall hashing throughput was reduced (see, e.g., scylladb/scylla-enterprise#5711). To address this, the alien‑thread solution is reverted, and a hashing implementation with yielding will be introduced later in this patch series. This reverts commit `9574513ec1`.	2025-12-10 15:36:09 +01:00
Calle Wilund	9b5f3d12a3	test::pylib::suite::base: Split options.name test specifier only once For some arcane reason, we split the optional test pattern given to test.py twice across '::' to get the file + case specifiers later given to pytest etc. This means that for a test with a class group (such as some migrated dtests), we cannot really specify the exact test to run (pattern <file>::<class>::test). Simply splitting only on first '::' fixes this. Should not affect any other tests.	2025-12-10 15:35:12 +01:00
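The difference described in `9b5f3d12a3` comes down to splitting on the first `::` only. A minimal sketch, assuming a hypothetical helper `split_test_name` and made-up test names (this is not the code in test.py):

```python
def split_test_name(name: str):
    """Split '<file>::<case>' on the first '::' only.

    Hypothetical helper: everything after the first separator, including
    any '<class>::<test>' part, is kept intact as the case specifier.
    """
    file_name, sep, case = name.partition("::")
    return file_name, (case if sep else None)


with_class = split_test_name("audit_test::TestAudit::test_insert")
file_only = split_test_name("audit_test")
```

Here `with_class` keeps `TestAudit::test_insert` as a single case specifier, which can then be passed on to pytest unchanged.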
Pavel Emelyanov	a3ca4fccef	object_storage: Create s3 client with "extended" endpoint name For this, add the s3::client::make(endpoint, ...) overload that accepts endpoint in proto://host:port format. Then it parses the provided url and calls the legacy one, that accepts raw host string and config with port, https bit, etc. The generic object_storage_endpoint_param no longer needs to carry the internal s3::endpoint_config, the config option parsing changes respectively. Tests, that generate the config files, and docs are updated. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:47 +03:00
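The `proto://host:port` endpoint form introduced in `a3ca4fccef` can be decomposed into the host, port and https bit that the legacy overload takes. The sketch below shows the idea in Python; the function name `parse_endpoint` and the default ports are assumptions, and the actual parsing in `s3::client::make` is C++ and may behave differently.

```python
from urllib.parse import urlsplit


def parse_endpoint(url: str):
    """Split 'proto://host:port' into (host, port, use_https).

    Hypothetical helper. Falling back to 443/80 when the port is omitted
    is an assumption of this sketch, not documented ScyllaDB behavior.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"expected http(s)://host[:port], got {url!r}")
    use_https = parts.scheme == "https"
    port = parts.port or (443 if use_https else 80)
    return parts.hostname, port, use_https


secure = parse_endpoint("https://s3.example.com:9000")
plain = parse_endpoint("http://localhost")
```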
Pavel Emelyanov	5953a89822	test: Add named constants for test_get_object_store_endpoints endpoint names Instead of hard-coded 'a' and 'b' here and there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:46 +03:00
Nadav Har'El	8822c23ad4	Merge 'test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Dario Mirovic …eshold The initial problem: Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened. Test logic: These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests. The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values. Note that the exceptions are counted per shard, not per code path. The new problem: The occasional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12. There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted. Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths. For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. 
It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run). Still, to make "background exceptions" occurence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At the first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold. Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again. Fixes #27247 Fixes #27325 Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions is requested. Closes scylladb/scylladb#27412 * github.com:scylladb/scylladb: test: cqlpy: test_protocol_exceptions.py: enable debug exception logging test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold	2025-12-10 10:53:30 +02:00
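The threshold reasoning in the commit message above can be illustrated with a small sketch. This is not the code from test_protocol_exceptions.py: the `ExceptionBudget` class and its method names are hypothetical, and only the values of `run_count` and `cpp_exception_threshold` and the figure of 12 background exceptions come from the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExceptionBudget:
    """Hypothetical model of the pass/fail rule described in the commit message."""
    run_count: int
    cpp_exception_threshold: int

    def passes(self, observed_exceptions: int) -> bool:
        # The test passes when no more than the threshold number of
        # C++ exceptions was counted over all runs.
        return observed_exceptions <= self.cpp_exception_threshold

    def regression_floor(self) -> int:
        # A regressed code path throws at least once per run, so a
        # regression yields at least run_count exceptions.
        return self.run_count


OLD = ExceptionBudget(run_count=100, cpp_exception_threshold=10)
NEW = ExceptionBudget(run_count=200, cpp_exception_threshold=20)

BACKGROUND = 12  # typical unrelated exception count quoted in the message

print(OLD.passes(BACKGROUND))              # old settings: spurious failure
print(NEW.passes(BACKGROUND))              # new settings: tolerated
print(NEW.passes(NEW.regression_floor()))  # a real regression is still caught
```

Under these assumptions the gap between the threshold and the regression floor stays wide (20 versus 200), which is why raising the threshold does not weaken the test.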
Nadav Har'El	95e303faf3	Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack. In this series we remove no longer needed code and reorganize the needed code for better clarity. After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines. Fixes https://github.com/scylladb/scylladb/issues/26313 Closes scylladb/scylladb#27383 * github.com:scylladb/scylladb: mv: replace the simple/complex rack-aware pairing with exact rack matching mv: split out vnode pairing code from get_view_natural_endpoint mv: unify self-pairing and rack-aware pairing into one bool mv: remove the workaround for left nodes when sending view updates	2025-12-09 13:19:13 +02:00
Nadav Har'El	8ba595e472	Merge 'alternator: fix batch writes during intranode tablet migrations' from Petr Gusev Scylla implements `LWT` in the `storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false. The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such a broken invariant results in an `on_internal_error` in `storage_proxy::cas`. Clients of `storage_proxy::cas` are expected to check `cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation. This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem. 
Fixes: scylladb/scylladb#27353 backport: needs to be backported to 2025.4 Closes scylladb/scylladb#27396 * github.com:scylladb/scylladb: alternator/executor.cc: eliminate redundant dk copy alternator/executor.cc: release cas_shard on the original shard alternator/executor.cc: move shard check into cas_write alternator/executor.cc: make cas_write a private method alternator/executor.cc: make do_batch_write a private method alternator/executor.cc: fix indent test_alternator: add test_alternator_invalid_shard_for_lwt	2025-12-09 11:25:15 +02:00
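The difference between a single `this_shard()` check and the repeated check described above can be modelled with a toy simulation. This is a sketch in Python, not the C++ in alternator/executor.cc: `scripted_owner`, `run_with_single_check`, `run_on_owning_shard` and the `max_hops` bound are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple


def scripted_owner(owners: List[int]) -> Callable[[], int]:
    """Return a function that reports the owning shard on each call.

    Each call stands in for constructing a fresh cas_shard; the owner
    changes between calls to mimic a tablet advancing in its migration.
    The last entry repeats forever.
    """
    remaining = list(owners)

    def owner_of() -> int:
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return owner_of


def run_with_single_check(start: int, owner_of: Callable[[], int]) -> Tuple[int, int]:
    """Check ownership once, jump once; return (shard we ran on, owner seen there)."""
    owner = owner_of()
    landed = start if owner == start else owner
    return landed, owner_of()


def run_on_owning_shard(start: int, owner_of: Callable[[], int], max_hops: int = 16) -> List[int]:
    """Keep jumping until the freshly observed owner equals the current shard."""
    visited = [start]
    current = start
    for _ in range(max_hops):  # max_hops is an assumption, only to bound the toy loop
        owner = owner_of()     # analogous to creating a new cas_shard here
        if owner == current:   # analogous to cas_shard.this_shard()
            return visited
        current = owner        # analogous to submitting to cas_shard.shard()
        visited.append(current)
    raise RuntimeError("owner kept changing")


# The tablet moves from shard 1 to shard 2 while the request is in flight.
print(run_with_single_check(0, scripted_owner([1, 2])))  # lands on 1, but owner is 2
print(run_on_owning_shard(0, scripted_owner([1, 2])))    # hops 0 -> 1 -> 2
```

In the single-check variant the request ends up on a shard that no longer owns the tablet, which corresponds to the broken invariant the message describes. The looping variant only stops once the two agree.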
Petr Gusev	3a865fe991	alternator/executor.cc: move shard check into cas_write This change ensures that if cas_shard points to a different shard, the executor will continue issuing shard jumps until cas_shard.this_shard() returns true. The commit simply moves the this_shard() check from the parallel_for_each lambda into cas_write, with minimal functional changes. We enable test_alternator_invalid_shard_for_lwt since now it should pass. Fixes scylladb/scylladb#27353	2025-12-09 10:21:01 +01:00
Pavel Emelyanov	fb32e1c7ee	Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky Refactor the way we decide whether an sstable belongs to a tablet, fully or partially, to simplify the flow and make it more readable. Also extract the logic and make it testable, and add tests to cover the changes. The change is purely aesthetic, no need to backport. Closes scylladb/scylladb#27101 * github.com:scylladb/scylladb: streaming: remove unnecessary lambda creating sstable token range streaming: simplify get_sstables_for_tablets logic streaming: switch to range-based for loop streaming: drop sstable skip microoptimization in tablet loop streaming: replace reverse iterators with reverse view in sstables scan streaming: return from get_sstables_for_tablets earlier streaming: add get_sstables_by_tablet_range tests test,sstables: add helper to set sstable first and last keys streaming: refactor get_sstables_for_tablets to make it accessible streaming: refactor get_sstables_for_tablets to make it testable streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic	2025-12-09 10:53:57 +03:00
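The "fully or partially" distinction mentioned in the message above can be pictured as an interval comparison. The sketch below is an assumption-laden illustration, not the logic of `get_sstables_for_tablets`: it treats tokens as plain integers, uses inclusive bounds on both ends, and the `Overlap` enum and `classify` function are hypothetical.

```python
from enum import Enum


class Overlap(Enum):
    NONE = "none"
    PARTIAL = "partial"
    FULL = "full"


def classify(sst_first: int, sst_last: int, tablet_first: int, tablet_last: int) -> Overlap:
    """Classify how an sstable's [first, last] token span relates to a tablet's range.

    Bounds are inclusive here for simplicity; the real range semantics may differ.
    """
    if sst_last < tablet_first or sst_first > tablet_last:
        return Overlap.NONE      # the spans do not intersect at all
    if tablet_first <= sst_first and sst_last <= tablet_last:
        return Overlap.FULL      # the sstable lies entirely inside the tablet
    return Overlap.PARTIAL       # the sstable straddles a tablet boundary


print(classify(10, 20, 0, 100))    # entirely inside the tablet
print(classify(90, 110, 0, 100))   # crosses the upper boundary
print(classify(101, 120, 0, 100))  # entirely outside
```

A pure function of this shape is easy to unit-test in isolation, which matches the stated aim of extracting the filtering logic and making it testable.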
Patryk Jędrzejczak	b6895f0fa7	test: make test_broken_bootstrap faster This change makes the test ~20 s faster. It's a forgotten follow-up: https://github.com/scylladb/scylladb/pull/18927#discussion_r1627331946 Closes scylladb/scylladb#27445	2025-12-09 09:25:42 +02:00

11801 Commits