In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns ready cql. However, get_all_tablet_replicas
uses the cql reference from manager that isn't ready. Wait for cql.
Fixes: #26328Closesscylladb/scylladb#26349
This PR migrates limits tests from dtest to this repository.
One reason is that there is an ongoing effort to migrate tests from dtest to here.
Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.
Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.
scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)
Fixes#25097
This is a migration of existing tests to this repository. No need for backport.
Closesscylladb/scylladb#26077
* github.com:scylladb/scylladb:
test: dtest: limits_test.py: test_max_cells log level
test: dtest: limits_test.py: make the tests work
test: dtest: test_limits.py: remove test that are not being migrated
test: dtest: copy unmodified limits_test.py
Generate view updates from a pending base replica if it's a reading
replica, i.e. it's in the last stage of transition write_both_read_new
before becoming the new base replica.
Previously we didn't generate view updates on a pending replica. The
problem with that is that when a base token is migrated from one replica
B1 to another B2, at one stage we generate view updates only from B1,
then at the next stage we generate view updates only from B2. During
this transition, it can happen that for some write neither B1 nor B2
generate view update, because each one sees the other as the base
replica.
We fix this by generating view updates from both base replicas in the
phase before the transition. We can generate view updates on the pending
replica in this case, even if it requires read-before-write, because
it's in a stage where it contains all data and serves reads.
Fixes https://github.com/scylladb/scylladb/issues/24292
backport not needed - the issue mostly affects MV with tablets which is still experimental
Closesscylladb/scylladb#25904
* github.com:scylladb/scylladb:
test: mv: test view update during topology operations
mv: generate view updates on both shards in intranode migration
mv: generate view updates on pending replica
We would like to have an additional service level
available for users of the Vector Store service,
which would allow us to de/prioritize vector
operations as needed. To allow that, we increase
the number of scheduling groups from 19 to 20
and adjust the related test accordingly.
Closesscylladb/scylladb#26316
The handler in question when called for tablets-enabled keyspace, returns ranges that are inconsistent with those from system.tablets. Like this:
system.tablets:
```
TabletReplicas(last_token=-4611686018427387905, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 0)])
TabletReplicas(last_token=-1, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)])
TabletReplicas(last_token=4611686018427387903, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 1), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)])
TabletReplicas(last_token=9223372036854775807, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 1), ('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0)])
```
range_to_endpoint_map:
```
{'key': ['-9069053676502949657', '-8925522303269734226'], 'value': ['127.110.40.2', '127.110.40.3']}
{'key': ['-8925522303269734226', '-8868737574445419305'], 'value': ['127.110.40.2', '127.110.40.3']}
...
{'key': ['-337928553869203886', '-288500562444694340'], 'value': ['127.110.40.1', '127.110.40.3']}
{'key': ['-288500562444694340', '105026475358661740'], 'value': ['127.110.40.1', '127.110.40.3']}
{'key': ['105026475358661740', '611365860935890281'], 'value': ['127.110.40.1', '127.110.40.3']}
...
{'key': ['8307064440200319556', '9117218379311179096'], 'value': ['127.110.40.2', '127.110.40.1']}
{'key': ['9117218379311179096', '9125431458286674075'], 'value': ['127.110.40.2', '127.110.40.1']}
```
Not only the number of ranges differs, but also separating tokens do not match (e.g. tokens -2 and 0 belong to different tablets according to system.tablets, but fall into the same "range" in the API result).
The source of confusion is that despite storage_service::get_range_to_address_map() is given correct e.r.m. pointer from the table, it still uses token_metadata::sorted_token() to work with. The fix is -- when the e.r.m. is per-table, the tokens should be get from token_metadata's tablet_map (e.g. compare this to storage_service::effective_ownership() -- it grabs tokens differently for vnodes/tables cases).
This PR fixes the mentioned problem and adds validation test. The test also checks /storage_service/describe_ring endpoint that happens to return correct set of values.
The API is very ancient, so the bug is present in all versions with tablets
Fixes#26331Closesscylladb/scylladb#26231
* github.com:scylladb/scylladb:
test: Add validation of data returned by /storage_service endpoints
test,lib: Add range_to_endpoint_map() method to rest client
api: Indentation fix after previous patches
storage_service: Get tablet tokens if e.r.m. is per-table
storage_service,api: Get e.r.m. inside get_range_to_address_map()
storage_service: Calculate tokens on stack
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.
This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.
(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).
The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.
This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.
Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 70 18.0 18 128
500001 1 1 647 19.0 19 132
0 1000000 1 748 15.0 15 116
0 500000 2 372 29.0 29 284
0 250000 4 227 56.0 56 504
0 125000 8 116 106.0 106 928
0 62500 16 67 195.0 195 1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 51 5.1 5 20
500001 1 1 64 5.3 5 20
0 1000000 1 679 4.0 4 16
0 500000 2 492 8.0 8 88
0 250000 4 804 16.0 16 232
0 125000 8 409 31.0 31 516
0 62500 16 97 54.0 54 1056
```
Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
External tests:
I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.
New functionality, no backport needed.
Closesscylladb/scylladb#26215
* github.com:scylladb/scylladb:
test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
test/cluster: add test_bti_index.py
test: prepare bypass_cache_test.py for `ms` sstables
sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
db/config: expose "ms" format to the users via database config
test: in Python tests, prepare some sstable filename regexes for `ms`
sstables: add `ms` to `all_sstable_versions`
test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
test/lib/index_reader_assertions: skip some row index checks for BTI indexes
test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
test/resource: add `ms` sample sstable files for relevant tests
test/boost/sstable_compaction_test: prepare for `ms` sstables.
test/boost/index_reader_test: prepare for `ms` sstables
test/boost/bloom_filter_tests: prepare for `ms` sstables
test/boost/sstable_datafile_test: prepare for `ms` sstables
test/boost/sstable_test: prepare for `ms` sstables.
sstables: introduce `ms` sstable format version
tools/scylla-sstable: default to "preferred" sstable version, not "highest"
sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
sstables/mx: make Index and Summary components optional
sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
sstables: implement an alternative way to rebuild bloom filters for sstables without Index
utils/bloom_filter: add `add(const hashed_key&)`
sstables: adapt estimated_keys_for_range to sstables without Summary
sstables: make `sstable::estimated_keys_for_range` asynchronous
sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
replica/database: add table::estimated_partitions_in_range()
sstables/mx: implement sstable::has_partition_key using a regular read
sstables: use BTI index for queries, when present and enabled
sstables/mx/writer: populate BTI index files
sstables: create and open BTI index files, when enabled
sstables: introduce Partition and Rows component types
sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
`SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables.
We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now.
Fixes https://github.com/scylladb/scylladb/issues/26258
backports: not needed (a new feature)
Closesscylladb/scylladb#26284
* github.com:scylladb/scylladb:
cql_test_env.cc: log exception when callback throws
lwt: prohibit for tablet-based views and cdc logs
tablets: disallow chains of colocated tables
database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
The test looks at metrics to confirm whether queries
hit the row cache, the index cache or the disk,
depending on various settings.
BIG index readers use a two level, read-through index cache,
where the higher layer stores parsed "index pages"
of Index.db, while the lower layer is a cache of
raw 4kiB file pages of Index.db.
Therefore, if we want to count index cache hits,
the appropriate metric to check in this case is
`scylla_sstables_index_page_hits",
which counts hits in the higher layer.
This is done by the test.
However, BTI index readers don't have an equivalent of the higher cache
layer. Their cache only stores the raw 4 kiB pages, and the hits are
counted in `scylla_sstables_index_page_cache_hits`. (The same metric
is incremented by the lower layer of the BIG index cache).
Before this commit, the test would fail with `ms` sstables,
because their reads don't increment `scylla_sstables_index_page_hits`.
In this commit we adapt the test so that it instead checks
`scylla_sstables_index_page_cache_hits` for `ms` sstables.
add new test cases checking view consistency when writing to a table
with MV and generating view updates while data is migrated.
one case has tablet migrations while writing to the table. The other
case does the equivalent for vnode keyspaces - it adds a new node.
The tests reproduce issue scylladb/scylladb#24292
Before this patch, the test `cluster/test_alternator::test_localnodes_joining_nodes` was one of the slowest tests in the test/cluster framework, taking over two minutes to run.
As comments in the test already acknowledged, there was no good reason why this test had to be so slow. The test needed to, intentionally, boot a server which took a long time (2 minutes) to fail its boot. But it didn't really need to wait for this failure - the right thing to do was to just kill the server at the end of the test. But we just didn't have the test-framework API to do it. So in this series, the first patch introduces the missing API, and the second patch uses it to fix test_localnodes_joining_nodes to kill the (unsuccessfully) booting server.
After this patch, the test takes just 7 seconds to run.
This is a test speedup only, so no real need to backport it - old release anyway get fewer test runs and the latency of these runs is less important.
Closesscylladb/scylladb#25312
* github.com:scylladb/scylladb:
test/cluster: greatly speed up test_localnodes_joining_nodes
test/pylib: add the ability to stop currently-starting servers
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.
Refs #25097
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.
Disable it for `debug`, `dev`, and `release` mode.
Refs #25097
As a part of the porting process remove unused markers.
Explicitly enable auto snapshots for the test, as they are required for it.
Enable the test in suite.yaml (run in dev mode only)
Closesscylladb/scylladb#26248
* github.com:scylladb/scylladb:
test.py: dtest: make cfid_test.py run using test.py As a part of the porting process remove unused markers.
test.py: dtest: copy unmodified cfid_test.py
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes#25195.
Closesscylladb/scylladb#26003
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
SELECT commands with SERIAL consistency level are historically allowed
for vnode-based views, even though they don't provide linearizability
guarantees. We prohibit LWTs for tablet-based views, but preserve old
behavior for vnode-based view for compatibility. Similar logic is
applied to CDC log tables.
Fixesscylladb/scylladb#26258
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The test compares the ranges that are returned from /describe_ring and
/range_to_endpoint_map with the information obtained from system.tablets
and makes sure that
- the number of ranges
- the boundary tokens
- the target replicas (nodes only)
match.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test cluster/test_alternator::test_localnodes_joining_nodes takes
a whopping 2 minutes and 9 seconds to run before this patch. After this
patch, it takes just 7 seconds.
The slowness of this test was caused by booting a second node that hangs
during boot for 2 minutes, deliberately. We never intended for this boot
to finish (the whole point of this test is to run before it finishes),
but unfortunately had to wait for it to avoid all sort of nasty problems
with unwaited futures.
As comments already explained in the code, the solution to this problem
is to *kill* the server at the end of the test - after we kill it, we
can wait for it - this wait will very quickly notice that the server
addition failed, and not need to wait 2 minutes. But until the previous
patch, we had no API to find the server which is starting (not yet
running), or to kill it. After the previous patch, we do have such an
API, and can now use it, and see this test finish in 7 seconds instead
of 2 minutes and 9 seconds.
The commitlog in the tests with big mutations were corrupted by overwriting 10 chunks of 1KB with random data, which could not be enough due to randomness and the big size of the commitlog (~65MB).
- change `corrupt_file` to overwrite a based on a percentage of the file's size instead of fixed number of chunks
- fix typos
- cleanup comments for clarity
Closes: #25627Closesscylladb/scylladb#25979
* github.com:scylladb/scylladb:
test: cleanup big mutation commitlog tests
test: fix test_one_big_mutation_corrupted_on_startup
As a part of the porting process remove unused markers.
Explicitly enable auto snapshots for the test, as they are required for it.
Enable the test in suite.yaml (run in dev mode only)
After https://github.com/scylladb/scylladb/pull/22034, staging status of sstables streamed
via file streaming was ignored and view updates were never generated.
This patch fixes it and now staging sstables are registered to
`view_building_worker`. Then, the worker create view building tasks
for those sstables, so the view building coordinator can schedule them
once the tablet migration is finished.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4572
This fix affects only views on tablets, which are still experimental, so no backport needed.
Closesscylladb/scylladb#25776
* github.com:scylladb/scylladb:
test/test_view_building_coordinator: add reproducer for file streaming
streaming/stream_blob: register staging sstables to process them
When a node is being replaced, it enters a "left" state while still
owning tokens. Before this patch, this is also the time when we start
draining hints targeted to this node, so the hints may get sent before
the token ownership gets migrated to another replica, and these hints
may get lost.
In this patch we postpone the hint draining for the "left" nodes to
the time when we know that the target nodes no longer hold ownership
of any tokens - so they're no longer referenced in topology. I'm
calling such nodes "released".
Before this change, when we were starting draining hints, we knew the
IP addresses of the target nodes. We lose this information after entering
"left" stage, so when draining hints after a node is "released", we
can't drain the hints targeted to a specific IP instead of host_id.
We may have hints targeted to IPs if the migration rom IP-based to
host_ID-based hints didn't happen yet. The migration happens
when enabling a cluster feature since 2024.2.0, so such hints can
only exist if we perform a direct upgrade from a version before
2024.2.0 to a version that has this change (2025.4.0+).
To avoid losing hints completely when such an upgrade is combined
with a node removal/replacement, we still drain hints when the node
enters a "left" state and the migration of hints to host_id wasn't
performed yet. For these drains, the problematic scenario can't
occur because it only affects tablets, and when upgrading from
a version before 2024.2.0, no tablets can exist yet.
If we perform such a drain, we no longer need to drain hints when
entering the "released" state, so we only drain when entering that
state if the migration was already completed.
With this setup, we'll always drain hints at least once when a node
is leaving. However, if the migration to host_ids finishes between
entering the "left" state and the "released" state, we'll attempt
to drain the hints twice. This shouldn't be problem though because
each `drain_for()` is performed with the `_drain_lock` and after
a `hint_endpoint_manger` is drained, it's removed, so we won't try
to drain it twice.
This patch also includes a test for verifying that hints are properly
replayed after a node replace.
Fixes https://github.com/scylladb/scylladb/issues/24980Closesscylladb/scylladb#24981
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.
The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.
Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
- new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.
Remove nightly mark.
Fixes: #26148.
Closesscylladb/scylladb#26209
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.
Enabled only for rf_rack_valid_keyspaces because such setups guarantee
that we will not need (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.
Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to reballance the cluster (plan making
only, sum of all iterations, no streaming):
Before: 16 min 25 s
After: 0 min 25 s
Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):
testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)
The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.
After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.
Fixes#26016Closesscylladb/scylladb#26017
* github.com:scylladb/scylladb:
test: perf: perf-load-balancing: Add parallel-scaleout scenario
test: perf: perf-load-balancing: Convert to tool_app_template
tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
The test reproduces issue scylladb/scylla-enterprise#4572.
Before the fix, file-streaming didn't respect staging status of a
sstable and view updates weren't generated, leading to base-view
inconsistency.
The test creates the inconsistency in the view, triggers file-streaming
of staging sstables and verifies that the updates are generated.
`test_long_query_timeout_erm` is slow because it has many parameterized
variants, and it verifies timeout behavior during ERM operations, which
are slow by nature.
This change speeds the test up by roughly 3× (319s -> 114s) by:
- Removing two of the five scenarios that were near duplicates.
- Shortening timeout values to reduce waiting time.
- Parallelizing waiting on server_log with asyncio.TaskGroup().
The two removed scenarios (`("SELECT", True, False)`,
`("SELECT_WHERE", True, False)`) were near duplicates to
`("SELECT_COUNT_WHERE", True, False)` scenario, because all three
scenarios use non-mapreduce query and triggers basically the same
system behavior. It is sufficient to keep only one of them, so the test
verifies three cases:
- One with nodes shutdown
- One with mapreduce query
- One with non-mapreduce query
Fixes: scylladb/scylla#24127Closesscylladb/scylladb#25987
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.
This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.
This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.
This patch also includes a test for this exact scenario, which is enforced by an injection.
Fixes https://github.com/scylladb/scylladb/issues/21492Closesscylladb/scylladb#24396
* github.com:scylladb/scylladb:
mv: handle mismatched base/view replica count caused by RF change
mv: save the nodes used for pairing calculations for later reuse
mv: move the decision about simple rack-aware pairing later
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.
To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.
A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.
Fixesscylladb/scylladb#24806Closesscylladb/scylladb#26057
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.
Enabled only for rf_rack_valid_keyspaces because such setups guarantee
that we will not need (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.
Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):
Before: 16 min 25 s
After: 0 min 25 s
Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):
Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)
The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.
After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.
Fixes#26016
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).
All three tests perform two rolling restarts, but the latter ones
already have the workaround.
Fixes#26005Closesscylladb/scylladb#26056
`line_to_row` is a test function that converts `syslog` audit log to
the format of `table` audit log so tests can use the same checks
for both types of audit. Because `syslog` audit doesn't have `date`
information, the field was filled with the current date. This behavior
broke the tests running at 23:59:59 because `line_to_row` returned
different results on different days.
Fixes: scylladb/scylladb#25509Closesscylladb/scylladb#26101
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.
This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.
This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.
This patch also includes a test for this exact scenario, which is enforced by an injection.
Fixes https://github.com/scylladb/scylladb/issues/21492
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixes https://github.com/scylladb/scylladb/issues/22989
backport not needed - the issue is probably not common and there's a workaround
Closesscylladb/scylladb#25790
* github.com:scylladb/scylladb:
test: mv: add a test for view build interrupt during registration
view_builder: register view on all shards atomically
In multi DC setup, tablet load balancer might generate multiple migrations of the same tablet_id but only one is actually commited to the `system.tablets` table.
This PR moved abortion of view building tasks from the same start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started.
The PR also includes a reproducer test.
Fixesscylladb/scylladb#25912
View building coordinator hasn't been released yet, so no backport is needed.
Closesscylladb/scylladb#26144
* github.com:scylladb/scylladb:
test/test_view_building_coordinator: add reproducer
topology_coordinator: abort view building a bit later in case of tablet migration
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
Adds a test which reproduces the issue described
in scylladb/scylladb#25912.
The test creates a situation where a single tablet is replicated across
multiple DCs / racks, and all those tablet replicas are eligible for
migration. The tablet load balancer is unpaused at that moment which
currently causes it to attempt to generate multiple migrations for
different tablet replicas of the same tablet. Before the fix for #25912,
this used to confuse the view build coordinator which would react to
each migration attempt, pausing view building work for each tablet
replica for which there was an attempt to migrate but only unpausing it
for the tablet replica that was actually migrated. After the fix, the
view build coordinator only reacts to the migration that has "won" so
the test successfully passes.
test_streaming_deadlock_removenode starts 10240 inserts at once,
overloading a node. Due to that test fails with timeout.
Limit inserts concurrency.
Fixes: #25945.
Closesscylladb/scylladb#26102
In the current scenario, the shard receiving the remove node REST api request performs condional lock depending on whether raft is enabled or not. Since non-zero shard returns false for `raft_topology_change_enabled()`, the requests routed to non zero shards are prone to this lock which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft enabled nodes.
This pr modifies the conditional lock logic and orchestrates the remove node execution logic directly to the shard0, hence the `raft_topology_change_enabled()` is now checked on the shard0 and execution is performed accordingly.
Earlier, `storage_service::find_raft_nodes_from_hoeps` code threw error upon observing any non topology member present in ignore_nodes. Since we are performing concurrent remove node operations, the timing can lead to one node being fully removed before the other node remove op begins processing which can lead to runtime error in storage_service::find_raft_nodes_from_hoeps. This error throw was added to prevent users from putting random non existent nodes in ignore_nodes list. Hence made changes in that function to account for already removed nodes and ignore those nodes instead of throwing error.
A test is also added to confirm the new behaviour, where concurrent remove node operations are now being performed seamlessly.
This pr doesn't fix a critical bug. No need to backport it.
Fixes: scylladb/scylladb#24737Closesscylladb/scylladb#25713
* https://github.com/scylladb/scylladb:
raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters.
storage_service: remove assumptions and checks for ignore_nodes to be normal.
initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136Closesscylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411