The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.
Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, doesn't matter if they
are part of a co-located group and have a base table.
We replace all_tables with new methods that can be used for each of the
cases.
Fixes#24574
* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.
Closesscylladb/scylladb#24633
* github.com:scylladb/scylladb:
encryption_at_rest_test: Add exception handler to ensure proxy stop
encryption: Ensure stopping timers in provider cache objects
Move 3rd party services starting under `try` clause to avoid situation that main process is collapses without going stopping services.
Without this, if something wrong during start it will not trigger execution exit artifacts, so the process will stay forever.
This functionality in 2025.2 and can potentially affect jobs, so backport needed.
Closesscylladb/scylladb#24734
* github.com:scylladb/scylladb:
test.py: use unique hostname for Minio
test.py: Catch possible exceptions during 3rd party services start
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.
As a part of the porting process, remove unused imports and
markers, remove non-next_gating tests and tests marked with
`required_features("!consistent-topology-changes")` marker.
Remove `test_permissions_caching` test because it's too
flaky when running using test.py
Also, make few time execution optimizations:
- remove redundant `time.sleep(10)`
- use smaller timeouts for CQL sessions
Enable the test in suite.yaml (run in dev mode only)
Make `wait_for_any_log()` function to work closer to the original
dtest's version: use `ScyllaLogFile.grep()` method instead of
the usage of `ScyllaNode.wait_log_for()` with a small timeout to
have at least one try to find.
Also, add `max_count` argument to `.grep()` method for the
optimization purpose.
Technically, `new_node()`'s `bootstrap` parameter used to mark a node
as a seed if it's False. In test.py, seeds parameter passed on start of
a node, so, save it as `ScyllaNode.bootstrap` attribute to use in
`ScyllNode.start()` method.
Modify ManagerClient.server_update_config() method to change
multiple config options in one call in addition to one `key: value`
pair. All internal machinery converted to get a values dict as a
parameter. Type hints were adjusted too.
This patch adds tests reproducing issue #24581, where Scylla incorrectly
parsed "decimal"-type literals in CQL with very high exponents, near or
above the 32-bit limit.
For example, 1.1234e-2147483647 was incorrectly read as 1.1234E+2147483649,
while it should be (as we explain in comments in the test) an error.
The tests in this patch failed (in multiple checks) before #24581 was
fixed, and pass after it was fixed.
These tests all pass on Cassandra 3, confirming our understanding on the
limits of "decimal" to be correct. But they fail on Cassandra 4 and 5 due
to a regression https://issues.apache.org/jira/browse/CASSANDRA-20723
in Cassandra, that mistakenly limited "decimal" exponents to just 309.
Refs #24581
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#24646
In audit tests, UnixDatagramServer is used to receive audit logs.
This commit introduces a synchronization between the logs receiver and
a function that reads already received logs. Without this, there was
a race condition that resulted in test failures (e.g., audit logs were
missing during assertion check).
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.
Fixes: https://github.com/scylladb/scylladb/issues/24637Closesscylladb/scylladb#24638
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations:
1. A table name must be up to 222 characters.
2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters.
3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters.
The first patch documents these existing limitations, improves their testing, and fixes a tiny bug found by one of the tests (where UpdateTable adding a GSI's limit testing is off by one).
The second patch unfortunately shows with a reproducer (issue #24598) this limit of 222 is problematic and we may need to lower it: If a user creates a table of length 222 and then enables Alternator streams, Scylla shuts down on an IO error. This will need to be fixed later, but at least this patch properly documents the existing behavior.
No need to backport this patch - it is a very minor improvement that it is unlikely users care about and there is no potential for harm.
Closesscylladb/scylladb#24597
* github.com:scylladb/scylladb:
test/alternator: reproducer for streams bug with long table name
alternator: improve, document and test table/index name lengths
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
Closesscylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.
Fixes: https://github.com/scylladb/scylladb/issues/24506
Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.
Closesscylladb/scylladb#24497
* github.com:scylladb/scylladb:
mutation: check key of inserted rows
compound: optimize is_full() for single-component types
Originally (since commit c3da9f2), Alternator's functional test suite
(test/alternator) ran "always_use_lwt" write isolation mode. The original
thinking was that we need to exercise this more difficult mode and it's
the most important mode. This mode was originally chosen in
test/alternator/run.
However, starting with commit 76a766c (a year ago), test.py no longer
runs test/alternator/run. Instead, it runs Scylla itself, and the options
for running Scylla appear in test/alternator/suite.yaml, and accidentally
the write isolation mode only_rmw_uses_lwt was chosen there.
The purpose of this patch is to reconcile this difference and use the
same mode in test.py (which CI is using) and test/alternator/run (which
is only used by some developers, during development).
I decided to have this patch change test/alternator/run to use
only_rmw_uses_lwt. As noted above, this is anyway how all Alternator
tests have been running in CI in the past year (through test.py).
Also, the mode only_rmw_uses_lwt makes running the Alternator test
suite slightly faster (52 seconds instead of 58 seconds, on my laptop)
which is always nice for developers.
This patch changes nothing for testing in CI - only manual runs through
test/alternator/run are affected.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, we had in test_condition_expression.py and
test_update_expression.py some rudimentary tests that the different
write isolation modes behave as expected. Basically, we wanted to test
that read-modify-write (RMW) operations are recognized and forbidden
in forbid_rmw mode, but work correctly in the three other modes.
We only check non-concurrent writes, so the actual write isolation is
NOT checked, just the correctness of non-concurrent writes.
However, since these tests were split across several files, and many
of the tests just ran other existing tests in different write isolation
modes, it was hard to see what exactly was being tested, and what was
missed. And indeed we missed checking some RMW operations, such as
requests with ReturnValues, requests with the older Expected or
AttributeUpdates (only the newer ConditionExpression and UpdateExpression
were tested), and ADD and DELETE operations in UpdateExpression.
So this patch replaces the existing partial tests with a new test file
test_write_isolation.py dedicated to testing all kinds of RMW operations
in one place, and how they don't work in forbid_rmw and do work in
the other modes. Writing all these tests in one place made it easier
to create a really exhaustive test of all the different operations and
optional parameters, and conversely - make sure that we don't test
*unnecessary* things such as different ConditionExpression expressions
(we already have 1800 lines of tests for ConditionExpression, and the
actual content of the condition is unrelated to write isolation modes).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The two tests in this patch reproduce issue #24598: When enabling
Alternator streams on an Alternator table with a very long name,
such as the maximum allowed name length 222, the result is an
I/O error and a Scylla shutdown.
The two tests are currently marked "skip", otherwise they would
crash the Scylla being tested.
Refs #24598
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255
characters each, Alternator currently has different (and lower)
limitations:
1. A table name must be up to 222 characters.
2. For a GSI, the sum of the table's and GSI's name length, plus 1,
must be up to 222 characters.
3. For an LSI, the sum of the table's and LSI's name length, plus 2,
must be up to 222 characters.
These specific limitations were never documented, so in this patch we
add this information to docs/alternator/compatibility.md.
Moreover, these limitations where only partially tested, so in this patch
we add testing for more cases that we forgot to check - such as length
of LSI names (only GSI were checked before this patch), or adding a
GSI to an existing table. It is important to check all these corner
cases because there is a risk that if we attempt to create a table
without checking its length, we can end up with an I/O error that brings
down Scylla.
In one case - UpdateTable adding a GSI to an existing table - the new
test exposed a trivial bug: Because UpdateTable wants to verify the new
GSI doesn't have the same name as an existing LSI, it mistakenly applied
the LSI's length name limit instead of the GSI's name length limit,
which is one byte less than it should be. So this patch fixes this
trivial bug as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Audit component defines `audit` logger which it uses only for `error` and `info` logs,
regarding `audit` module initialization and errors during audit log writing.
This change introduces `debug` level logs on the happy path of audit log writes.
Fixes: https://github.com/scylladb/scylladb/issues/23773
No backport needed - this is a small quality-of-life improvement.
Closesscylladb/scylladb#24658
* github.com:scylladb/scylladb:
audit: change audit test logger level to `debug`
audit: introduce debug level logs on happy path
Audit module tests should show the `debug` level messages.
This change makes audit_test.py `audit` module log level to `debug`.
Closesscylladb/scylladb#23773
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write a 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).
Fixes: #24330Closesscylladb/scylladb#24403
Add run ID for process output file to be not overwritten in the next case: first run failed, second passed. They are using the same name, so the second run will overwrite and delete the file. This will help to investigate in case of C++ test fails
Add attaching Scylla log files to allure report in case test failed. This is an alternative for link in JUnit report that exists in CI. That change will help to investigate the cluster tests fails. Example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).
Backport is not needed, this is only framework enhancements
Closesscylladb/scylladb#24677
* github.com:scylladb/scylladb:
test.py: Attach node logs in allure report in case of fail
test.py: Add run id to the boost output file
This patchset fixes regression introduced by 7e749cd848 when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user.
Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.
If server is started without cluster quorum we'll skip creating role or password.
Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2
Closesscylladb/scylladb#24451
* github.com:scylladb/scylladb:
test: auth_cluster: add test for password reset procedure
auth: cache roles table scan during startup
test: auth_cluster: add test for replacing default superuser
test: pylib: add ability to specify default authenticator during server_start
test: pylib: allow rolling restart without waiting for cql
auth: split auth-v2 logic for adding default superuser password
auth: split auth-v2 logic for adding default superuser role
auth: ldap: fix waiting for underlying role manager
auth: wait for default role creation before starting authorizer and authenticator
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.
This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.
Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.
Fixes: #24581
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#24640
To avoid overwriting the output tests adding the run id to it.
Previously, when first repeat failed and the second passes, because the
are using the same name for the output, it will be overwritten and
deleted since the second repeat passed
Waiting for CQL requires default superuser being present
in db. In some cases we may delete it and still want to do
rolling restart. Additionally if we need CQL we may want to
wait after restart is complete (once, and not for each node).
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is <host, shard> pair) processes two tablets at a time.
The new algorithm can be summarized as follows:
1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space.
2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each
artition range (in such cases, the partition range is assigned to the appropriate tablet_replica).
In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.
In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.
In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.
Additionally, shard_id is added to mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensure that the balance is preserved, even during events such as tablet splits.
This patch series:
- Refactors a bit mapreduce service, to facilitate having two algorithm versions (one for vnodes and one for tablets).
- Implements tablet-aware dispatching algorithm.
- Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
- Adds test_long_query_timeout_erm to verify the new functionality.
Fixes: scylladb#21831
No backport, as it is rather new feature than a bugfix.
Closesscylladb/scylladb#24383
* github.com:scylladb/scylladb:
mapreduce: add missing comma and space in mapreduce_request operator<<
mapreduce: add shard_id_hint to mapreduce request
test: add test_long_query_timeout_erm
mapreduce: add tablet-aware dispatching algorithm
storage_proxy: make storage_proxy::is_alive public
mapreduce: remove _shared_token_metadata from mapreduce_service
mapreduce: move dispatching logic to dispatch_to_vnodes
mapreduce: remove underscores from variable names
mapreduce: move req_with_modified_pr handling to a new function
mapreduce: change next_vnode lambda to get_next_partition_range function
Before we can eradicate the numerical sstable generations,
This series completes https://github.com/scylladb/scylladb/issues/20337
by disabling the use of numerical sstable generations where we can
and making sure the feature is never disabled.
Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables.
Refs #24248
* Enhancement. No backport required
Closesscylladb/scylladb#24554
* github.com:scylladb/scylladb:
feature_service: never disable UUID_SSTABLE_IDENTIFIERS
test: sstable_move_test: always use uuid sstable generation
test: sstable_directory_test: always use uuid sstable generation
sstables: sstable_generation_generator: set last_generation=0 by default
test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
test: lib: test_env: always use uuid sstable generation
test: sstable_test: always use uuid sstable generation
test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
test: sstable_compaction_test: always use uuid sstable generation
Some tests want to switch between sched groups. For that there's
cql-test-env facility to create and use them. However, there's a test
that uses replica::database as sched groups provider, which is not nice.
Fix it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24615
In ed3e4f33fd we introduced new connection throttling feature which is controlled by uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken, this patch fixes it.
When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer.
Backport: to 2025.2 where this feature was added
Fixes: https://github.com/scylladb/scylladb/issues/24557Closesscylladb/scylladb#24484
* github.com:scylladb/scylladb:
test: add test for live updates of generic server config
utils: don't allow do discard updateable_value observer
generic_server: fix connections semaphore config observer
This test verifies the effectiveness of the mechanism for releasing ERM
introduced in this patch series. In test scenario, during processing of
a query in mapreduce service, reads are intentionally blocked by
an injected error. However, when table uses tablets, ERM is now often
released by the mapreduce service, so the topology is not blocked to the
end of the request. As a result, it is possible to add a new node
before the query finishes.
Refs. scylladb#21831
test_dict_memory_limit trains new dictionaries and checks (via metrics)
that the old dictionaries are appropriately cleaned up.
The problem is that the cleanup is asynchronous (because the lifetimes
are handled by foreign_ptr, which sends the destructor call
to the owner shard asynchronously), so the metrics might be
checked a few milliseconds before the old dictionary is cleaned up.
The dict lifetimes are lazy on purpose, the right thing to do is
to just let the test retry the check.
Fixesscylladb/scylladb#24516Closesscylladb/scylladb#24526
In issue #6527 it was suggested that a zero-token node (a.k.a coordinator-
only node, or data-less node) could serve as a topology-aware Alternator
load balancer - requests could be sent to it and they will be forwarded to
the right node.
This feature was implemented, but we never tested that it actually works
for Alternator requests. So this patch tests this by starting a 5-node
cluster with 4 regular nodes and one zero-token node, and testing that
requests to the zero-token node work as expected.
It is important to know that this feature does indeed work as expected,
and also to have a regression test for it so the feature doesn't break
in the future.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23114
Before this change, `mapreduce_service` used `_shared_token_metadata`
to get the topology. However, the token was used in a part of the code
that already had its own ERM with its own metadata token. Moreover,
as mapreduce_service's token and ERM's token are not guaranteed to be
the same, inconsistencies could occur.
Therefore, this commit removes `_shared_token_metadata` and its usage.
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.
Wait until at least one child of a root repair task is created.
Fixes: #24556.
Closesscylladb/scylladb#24560
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
BostFacade and UnitFacade saving the logs only when test failed,
ignoring the -s parameter that should allow save logs on success. This
PR adding checking this parameter.
Closesscylladb/scylladb#24596