The order of entries in the ERROR_INJECTIONS list determines test
repeatability for a given random seed.
To allow removing error injections without affecting the order of the
remaining ones, removed injections are now renamed with a "REMOVED_"
prefix instead of being deleted.
This ensures they are ignored by the tests, while the sequence of active
injections—and thus test reproducibility—remains unchanged.
Similarly to view builder, view building worker needs to be drained
in `storage_service::do_drain()`.
Storage service drain is happening at the same beginning of shutdown
procedure. Before this patch, the worker was still building views
after the storage service was drained and this caused errors like:
`Error applying view update to (named_gate_closed_exception)` and
`locator::no_such_tablet_map`.
Fixesscylladb/scylladb#25908Closesscylladb/scylladb#25984
In the DynamoDB API, when "a" is a list attribute, a[999] returns the
1000th element. But if the list isn't that long (e.g., it only has 5
elements), a[999] returns nothing - it's not an error.
But it turns out that when the index is so long that it can't even be
parsed as an integer, e.g., 99999999999999, DynamoDB does report an
error:
Invalid ProjectionExpression: List index is not within the
allowable range; index: [99999999999999]
Before this patch, Alternator also returned an error in this case,
with the right type (ValidationException), but with a strange low-level
error text:
Failed parsing ProjectionExpression 'a[99999999999999]':
std::out_of_range (stoi)
The problem was that the code (in alternator/expressions.g) ran stoi()
without converting its std::out_of_range exception to a better user-facing
message. We do this in this patch, and the error message now looks like:
Failed parsing ProjectionExpression 'a[99999999999999]':
list index out of integer range
This patch also includes a test reproducing this error, which passes
on DynamDB and on Alternator it fails before this patch and passes with
the patch.
Fixes#25947
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25951
Currently, if a new sstable is created during repair/streaming,
we invalidate its whole token range in cache. If the sstable
is sparse, we unnecessarily clear too much data.
Modify cache invalidation, so that only the partitions present
in the sstable are cleared.
To check whether a partition is present in the sstable, we use bloom
filters. Bloom filters may return false positives and show that
an sstable contains a partition, even though it does not. Due to that
we may invalidate a bit more than we need to, but the cache will be
in valid state.
An issue arises when we do not invalidate two consecutive partitions
that are continuous. The sstable may contain a token that falls
between these partitions, breaking the continuity. To check that, we
would need to scan sstable index. However, such a change would
noticeably complicate the invalidation, both performance and code.
In this change, sstable index reader isn't used. Instead, the continuity
flag is unset for all scanned partitions. This comes at a cost of
heavier reads, as we will need to verify continuity when reading more
than one partition from cache.
Fixes: https://github.com/scylladb/scylladb/issues/9136.
Closesscylladb/scylladb#25996
If a bloom filter was built with a bad partition estimate, it is rebuilt
right before the sstable is sealed. The rebuild is already skipped if
the current bitset size results in a false-positive rate within 75%–125%
of the configured value.
This patch adds additional conditions to prevent rebuilds when the
savings are minimal. It also skips rebuilding for garbage collected
sstables, since they will be dropped soon anyway.
Also updated and added more test cases to cover these new criteria for
bloom filter rebuilds.
Fixes#25464Fixes#25468
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#25968
Since creating the vector index does not lead to creation of a view table [#24438] (whose version info had been logged in `system_schema.scylla_tables`) we lacked the information about the version of the index.
The solution we arrived at is to add the version as a field in options column of `system_schema.indexes`.
It requires few changes and seems unintruitive for existing infrastructure.
This patch implements the solution described above.
Refs: VECTOR-142
Closesscylladb/scylladb#25614
* github.com:scylladb/scylladb:
cqlpy/test_vector_index: add vector index version test
vector_index, index_prop_defs: add version to index options
create_index_statement: rename `validator` to `custom_index_factory`
custom index: rename `custom_index_option_name`
vector_index: rename `supported_options` to `vector_index_options`
The test started hitting #21779 recently. We deflake it in this commit
by disabling the tablet load balancing before dropping the keyspace at
the end of the test.
We still have to understand why the test started hitting #21779, so we
keep #25938 open.
Refs #25938
The test was flaky only on master, so no backport needed.
Closesscylladb/scylladb#25975
* github.com:scylladb/scylladb:
test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup
test: deflake test_restart_leaving_replica_during_cleanup
The test started hitting #21779 recently. We deflake it in this commit
by disabling the tablet load balancing before dropping the keyspace at
the end of the test.
We still have to understand why the test started hitting #21779, so we
keep #25938 open.
Refs #25938
We don't have such a test, and we could add a group 0 quorum requirement
on the restart path by mistake.
A new test, no backport.
Closesscylladb/scylladb#25623
The interface suggests the whole sstable cleanup is aborted with
'nodetool stop CLEANUP', but it is currently stopping only the
ongoing cleanup task, and the compaction manager will retry the
task since the error is not propagated all the way back to the
caller. With raft topology, the coordinator should retry it though
since cleanup became mandatory with automatic cleanup. So it's
only fixing the usage where cleanup is issued manually.
The stop exception is only propagated to the caller of cleanup.
When stopping tasks during shutdown, the exception is swallowed
and the error only returned to the caller.
Fixes#20823.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24996
The test is flaky since it tries to corrupt the commitlog
in a non-deterministic way that sometimes allows
the tested mutation to escape and be replayed anyhow.
refs: #25627Closesscylladb/scylladb#25950
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.
Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.
The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same.
transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved.
cql3 module changes do the same as transport server module.
Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.
Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.
Built and run on `release` target.
<details>
```
throughput:
mean= 36946.04 standard-deviation=1831.28
median= 37515.49 median-absolute-deviation=1544.52
maximum=39748.41 minimum=28443.36
instructions_per_op:
mean= 108105.70 standard-deviation=965.19
median= 108052.56 median-absolute-deviation=53.47
maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
mean= 70065.73 standard-deviation=2328.50
median= 69755.89 median-absolute-deviation=1250.85
maximum=92631.48 minimum=66479.36
⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85%
```
```
throughput:
mean= 40718.30 standard-deviation=2237.16
median= 41194.39 median-absolute-deviation=1723.72
maximum=43974.56 minimum=34738.16
instructions_per_op:
mean= 117083.62 standard-deviation=40.74
median= 117087.54 median-absolute-deviation=31.95
maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
mean= 58777.43 standard-deviation=1225.70
median= 58724.65 median-absolute-deviation=776.03
maximum=64740.54 minimum=55922.58
⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83%
```
```
throughput:
mean= 37107.91 standard-deviation=1698.58
median= 37185.53 median-absolute-deviation=1300.99
maximum=40459.85 minimum=29224.83
instructions_per_op:
mean= 108345.12 standard-deviation=931.33
median= 108289.82 median-absolute-deviation=55.97
maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
mean= 70333.79 standard-deviation=2247.71
median= 69985.47 median-absolute-deviation=1212.65
maximum=92219.10 minimum=65881.72
⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85%
```
```
throughput:
mean= 38353.12 standard-deviation=1806.46
median= 38971.17 median-absolute-deviation=1365.79
maximum=41143.64 minimum=32967.57
instructions_per_op:
mean= 117270.60 standard-deviation=35.50
median= 117268.07 median-absolute-deviation=16.81
maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
mean= 57256.00 standard-deviation=1039.17
median= 57341.93 median-absolute-deviation=634.50
maximum=61993.62 minimum=54670.77
⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83%
```
This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.
Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.
Update:
I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.
run count: 5 times before and 5 times after the patch
duration: 300 seconds
Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22
Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.
</details>
Built and run on `release` target.
<details>
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 14910.90 standard-deviation=477.72
median= 14956.73 median-absolute-deviation=294.16
maximum=16061.18 minimum=13198.68
instructions_per_op:
mean= 659591.63 standard-deviation=495.85
median= 659595.46 median-absolute-deviation=324.91
maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
mean= 213301.49 standard-deviation=2724.27
median= 212768.64 median-absolute-deviation=1403.85
maximum=225837.15 minimum=208110.12
⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 93345.45 standard-deviation=4499.00
median= 93915.52 median-absolute-deviation=2764.41
maximum=104343.64 minimum=79816.66
instructions_per_op:
mean= 65556.11 standard-deviation=97.42
median= 65545.11 median-absolute-deviation=71.51
maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
mean= 34160.75 standard-deviation=803.02
median= 33927.16 median-absolute-deviation=453.08
maximum=39285.19 minimum=32547.13
⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 206982.18 standard-deviation=15894.64
median= 208893.79 median-absolute-deviation=9923.41
maximum=232630.14 minimum=127393.34
instructions_per_op:
mean= 35983.27 standard-deviation=6.12
median= 35982.75 median-absolute-deviation=3.75
maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
mean= 17374.87 standard-deviation=985.06
median= 17140.81 median-absolute-deviation=368.86
maximum=26125.38 minimum=16421.99
⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 16198.26 standard-deviation=902.41
median= 16094.02 median-absolute-deviation=588.58
maximum=17890.10 minimum=13458.74
instructions_per_op:
mean= 659752.73 standard-deviation=488.08
median= 659789.16 median-absolute-deviation=334.35
maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
mean= 216070.70 standard-deviation=3491.26
median= 215320.37 median-absolute-deviation=1678.06
maximum=232396.48 minimum=209839.86
⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 97067.79 standard-deviation=2637.79
median= 97058.93 median-absolute-deviation=1477.30
maximum=106338.97 minimum=87457.60
instructions_per_op:
mean= 65695.66 standard-deviation=58.43
median= 65695.93 median-absolute-deviation=37.67
maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
mean= 34300.20 standard-deviation=704.66
median= 34143.92 median-absolute-deviation=321.72
maximum=38203.68 minimum=33427.46
⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 223495.91 standard-deviation=6134.95
median= 224825.90 median-absolute-deviation=3302.09
maximum=234859.90 minimum=193209.69
instructions_per_op:
mean= 35981.41 standard-deviation=3.16
median= 35981.13 median-absolute-deviation=2.12
maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
mean= 17482.26 standard-deviation=281.82
median= 17424.08 median-absolute-deviation=143.91
maximum=19120.68 minimum=16937.43
⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99%
```
</details>
Fixes: #24567
This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.
Closesscylladb/scylladb#25408
* github.com:scylladb/scylladb:
test/cqlpy: add protocol exception tests
test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
transport: replace `make_frame` throw with return result
cql3: remove throwing `protocol_exception`
transport: replace throw in validate_utf8 with result_with_exception_ptr return
transport: replace throwing protocol_exception with returns
utils: add result_with_exception_ptr
test/cqlpy: add unknown compression algorithm test case
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25506/
Next part: plugging the BTI index readers and writers into sstable readers and writers.
The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.
This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md.
Caveats:
1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change.
2. These readers and writers are intended to be compatible with Cassandra, but I did *NOT* do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers.
3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that the my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now.
No backports needed, new functionality.
Closesscylladb/scylladb#25626
* github.com:scylladb/scylladb:
test/manual: add bti_cassandra_compatibility_test
test/lib/random_schema: add some constraints for generated uuid and time/date values
test/lib/random_utils: add a variant of get_bytes which takes an `engine&`
test/boost: add bti_index_test
sstables/writer: add an accessor for the current write position in Data.db
sstables/trie: introduce bti_index_reader
sstables/trie: add bti_partition_index_writer.cc
sstables/trie: add bti_row_index_writer.cc
utils/bit_cast: add a new overload of write_unaligned()
sstables/trie: add trie_writer::add_partial()
sstables/consumer: add read_56()
sstables/trie: make bti_node_reader::page_ptr copy-constructible
sstables: extract abstract_index_reader from index_reader.hh to its own header
sstables/trie: add an accessor to the file_writer under bti_node_sink
sstables/types: make `deletion_time::operator tombstone()` const
sstables/types: add sstables::deletion_time::make_live()
sstables/trie: fix a special case in max_offset_from_child
sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding
sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this PR.
Fixes#25809
This PR improves CI stability and changes only tests, so it should be
backported to all supported branches.
Closesscylladb/scylladb#25927
* github.com:scylladb/scylladb:
test: cluster: deflake consistency checks after decommission
test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
Test if the index version is the same as the base table version before
the index was created.
Test if recreating the index with the same parameters changes the version.
Test if altering the base table does not change the version.
Test if the user cannot specify the index version option by themself.
Adds a heavy test which tests compatibility of BTI index files
between Cassandra and Scylla.
It's composed from a C++ part, used to read and write BTI files
with Scylla's readers and writers,
and a Python part which uses a Cassandra node and the C++
executable to make them read and write each other's files.
The stages of the test are:
1. Use the C++ part to generate a random BIG sstable, and matching
BTI index files.
2. Import the BIG files into Cassandra, let it generate its own
BTI index files.
3. Read both Scylla's BTI and Cassandra's BTI index files using
the C++ part. Check that they return the right positions
and tombstones for each partition and row.
4. Sneakily swap Cassandra's BTI files for Scylla's BTI files,
and query Cassandra (via CQL) for each row. Check that each query returns
the right result.
Not much can be inferred about the index via CQL queries,
so the check we are doing on Cassandra is relatively weak.
But in conjunction with the checks done on the Scylla part,
it's probably good enough.
The test is weird enough, and with heavy-enough dependencies
(it uses a podman container to run the Cassandra) that
ith has been put in test/manual.
To run the test, build
`build/$build_mode/test/manual/bti_cassandra_compatibility_test_g`,
and run `python test/manual/bti_cassandra_compatibility_test.py`.
Note: there's a lot of things that could go wrong in this test.
(E.g. file permission issues or port mapping issues due to the container
usage, incompatibilities between the Python driver and the random CQL
values generated by generate_random_mutations, etc).
I hope it works everywhere, but I only tested it on my machine,
running it inside the dbuild container.
We are moving away from integer generations, so stop using them.
Also drop the --generation command-line parameter, UUID generations
don't have be provided by the caller, because random UUIDs will not
collide with each other. To help the caller still know what generation
the output sstable has (previously they provided it via --generation),
print the generation to stdout.
Closesscylladb/scylladb#25166
We have an xfailing test test_secondary_index.py::test_limit_partition
which reproduces a Scylla bug in LIMIT when scanning a secondary index
(Refs #22158). The point of such a reproducer is to demonstrate the bug
by passing on Cassandra but failing on Scylla - yet this specific test
doesn't pass on Cassandra because it expects the wrong 3 out of 4
results to be returned:
The test begins with LIMIT 1 and sees the first result is (2,1), so we
expect when we increase the LIMIT to 3 to see more results from the same
partition (2) - and yet the test mistakenly expected the next results
to come from partition 1, which is not a reasonable expectation, and
doesn't happen in Cassandra (I checked both Cassandra 5 and 4).
After this patch, the test passes on Cassandra (I tried 4 and 5), and
continues to fail on Scylla - which returns 4 rows despite the LIMIT 3.
Note that it is debatable whether this test should insist at all
on which 3 items are returned by "LIMIT 3" - In Cassandra the ordering of
a SELECT with a secondary index is not well defined (see discussion in
Refs #23392). So an alternative implementation of this test would be
to just check that LIMIT 3 returns 3 items without insisting which:
# In Cassandra the ordering of a SELECT with a secondary index is not
# defined (see discussion in #23392), so we don't know which three
# results to expect - just that it must be a 3-item subset.
rows = list(rs)
assert len(rows) == 3
assert set(rows).issubset({(1,1), (1,2), (2,1), (2,2)})
However, as of yet, I did not modify this test to do this. I still
believe there is value in secondary index scans having the same order
as a scan without a secondary index has - and not an undefined order,
and if both Scylla and Cassandra implement that in practice, it's
useful for tests to validate this so we'll know if this guarantee is
ever broken.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25676
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.
Fixes https://github.com/scylladb/scylladb/issues/25655Closesscylladb/scylladb#25720
The test could trigger gossiper API calls before the API was properly registered, causing intermittent 404 errors. Previously the test waited for the "init - starting gossiper" log, but this appears before API registration completes.
Add explicit wait for gossiper API registration to ensure the endpoint is available before making requests, eliminating test flakiness.
Fixes: scylladb/scylladb#25582
No backport needed: Issue only observed in master so far.
Closesscylladb/scylladb#25583
* https://github.com/scylladb/scylladb:
test: improve async execution in test_long_join
test: fix race condition in test_long_join
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this
commit.
Fixes#25809
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't
handle such a case; it only handles cases where the token ring is
updated later. We fix this in this commit.
We rely on the new implementation of
`wait_for_token_ring_and_group0_consistency` in the following commit to
fix flakiness of some tests.
We also update the obsolete docstring in this commit.
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.
To fix this, let's make split compaction always work with the state when it started, not a global state.
Fixes#24153.
All 2025.* versions are vulnerable, so fix must be backported to them.
Closesscylladb/scylladb#25690
* github.com:scylladb/scylladb:
replica: Fix split compaction when tablet boundaries change
replica: Futurize split_compaction_options()
We modify the logic to make sure that all of the keyspaces that the test
creates are RF-rack-valid. For that, we distribute the nodes across two
DCs and as many racks as the provided replication factor.
That may have an effect on the load balancing logic, but since this is
a performance test and since tablet load balancing is still taking place,
it should be acceptable.
This commit also finishes work in adjusting perf tests to pass with
the `rf_rack_valid_keyspaces` configuration option enabled. The remaining
tests either don't attempt to create keyspaces or they already create
RF-rack-valid keyspaces.
We don't need to explicitly enable the configuration option. It's already
enabled by default by `cql_test_config`. The reason why we haven't run into
any issue because of that is that performance tests are not part of our CI.
Fixesscylladb/scylladb#25127Closesscylladb/scylladb#25728
Add another level of verbosity: quiet.
Before this it was used as a default one, but it provides not enough information.
These changes should be coupled with pytest-sugar plugin to have an intended information for each level.
Invoke the pytest as a module, instead of a separate process, to get access to the terminal to be able to it interactively.
Framework change only, so backporting in to 2025.3
Fixes: #25403Closesscylladb/scylladb#25698
* github.com:scylladb/scylladb:
test.py: add additional level of verbosity for output
test.py: start pytest as a module instead of subprocess
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.
Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:
- `regular`: This is the default mode. It performs a standard
incremental repair, processing only unrepaired sstables and skipping
those that are already repaired. The repair state (`repaired_at`,
`sstables_repaired_at`) is updated.
- `full`: This mode forces the repair to process all sstables, including
those that have been previously repaired. This is useful when a full
data validation is needed without disabling the incremental repair
feature. The repair state is updated.
- `disabled`: This mode completely disables the incremental repair logic
for the current repair operation. It behaves like a classic
(pre-incremental) repair, and it does not update any incremental
repair state (`repaired_at` in sstables or `sstables_repaired_at` in
the system.tablets table).
The implementation includes:
- Adding the `incremental_mode` parameter to the
`/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.
Fixes#25605Closesscylladb/scylladb#25693
Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart.
Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.
This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.
Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621
Ref https://github.com/scylladb/scylla-enterprise/issues/5613
Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3.
Closesscylladb/scylladb#25849
* github.com:scylladb/scylladb:
gossiper: fix empty initial local node state
gossiper: add test for a race condition in start_gossiping
gossiper: check for a race condition in `do_apply_state_locally`
test/gossiper: add reproducible test for race condition during node decommission
The test could trigger gossiper API calls before the API was properly
registered, causing intermittent 404 errors. Previously the test waited
for the "init - starting gossiper" log, but this appears before API
registration completes.
Add explicit wait for gossiper API registration to ensure the endpoint
is available before making requests, eliminating test flakiness.
Fixes: scylladb/scylladb#25582
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.
Fixes: scylladb/scylladb#25831
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.
Test for scylladb/scylladb#25831
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change
1. adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
nodes that sent an empty host ID.
This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.
Fixes: scylladb/scylladb#25702Fixes: scylladb/scylladb#25621
Ref: scylladb/scylla-enterprise#5613
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to a just warning log. The
more strict can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.
Refs: scylladb/scylladb#25621Fixes: scylladb/scylladb#25721
Sometimes `vector_store_client_test_ann_request` test hangs up. It is hard to
reproduce.
It seems that the problem is that tests are unreliable in case of stalled
requests. This patch attaches a timer to the abort_source to ensure that
the test will finish with a timeout at least.
Fixes: VECTOR-150
Fixes: #25234Closesscylladb/scylladb#25301
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.
This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed neither by repairing the view or the base table.
To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.
This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will
arrive on the pending replica anyway in streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write
This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replias if the hint target is not the source fo the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.
We also add a test case reproducing the issue.
Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/19835Closesscylladb/scylladb#25590
Interval's copy and move constructors are full of branches since the two payload T:s are
optional and therefore have to be optionally-constructed. This can be eliminated for
trivially copyable types (like dht::token) by eliminating interval's user-defined special member
functions (constructors etc) in that special case.
In turn, this enables optimizations in the standard library (and our own containers) that
convert moves/copies of spans of such types into memcpy().
Minor optimization, not a candidate to backport.
Closesscylladb/scylladb#25841
* github.com:scylladb/scylladb:
test: nonwrapping_interval_test: verify an interval of tokens is trivial
interval: specialize interval_data<T> for trivial types
interval: split data members into new interval_data class
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where
split initiated by the sstable writer failed, split completed,
and the unsplit sstable is left in the table dir, causing
problems in the restart.
To fix this, let's make split compaction always work with
the state when it started, not a global state.
Fixes#24153.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
I want to write a test which generates a random table (random schema,
random data) and uses the Python driver to query it.
But it turns out that some values generated by test/lib/random_schema
can't be deserialized by the Python driver.
For example, it doesn't unknown uuid versions, dates before year 1
of after year 9999, or `time` values greater or equal to the number
of nanoseconds in a day.
AFAIK those "driver-illegal" values aren't particularly interesting
for tests which use `random_schema`, so we can just not generate
them.
Adds a fat unit test (integration test?) for BTI index writers and readers.
The test generates a small random dataset for the index writer, writes
it to a BTI file, and then attempts to run all possible (and legal)
sequences (up to a certain length) of index reader operations and check
that the results match the expectation (dictated by a "simple"
reference index reader).
It is currently the only test for this new feature, but I think it's reasonably
good. (I validated the coverage by looking at LLVM coverage reports
and by manually adding bugs in various places and checking that the test
catches it after a reasonably small number of runs).
(Note that in a later series, when we hook up BTI indexes
to Data.db readers/writers, the feature will also be indirectly
tested by the corresponding integration tests).
This is a randomized test. As such, its power grows with the number of runs.
In particular, a single run has a decently high probability of
not exercising parts of the code at all. (E.g. the generated dataset
might have no clustering keys).
Also this is a slow test. (With the default parameters,
~1s in release mode on my PC, several seconds in
debug mode. (And note that this is after a custom parameter
downsizing for debug mode, because this test is slowed extremely
badly by debug mode, due to the forced preemption after every future)).
For those two reasons, I'm not glad to include it in the test suite that runs in CI.
Instead of running it once in every CI run, I'd very much rather
have it run 10000 times after the tested feature is touched,
and before releases. Unfortunately we don't have a precedent for
something like that.
A `writer_node` contains a chain of multiple BTI nodes.
`writer_node::_node_size` is the size (in bytes) of the
entire chain.
But the parent of the `writer_node` wants to know the offset
to the rootmost node in the chain. This can be deduced from the
`writer_node::_transition_length` and the `writer_node::_node_size`.
But there's an error in the logic for that, for the special case
when there are two nodes in the chain.
The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if
*the leafmost node* is smaller than 16 bytes, which is
true if and only if `_node_size` is smaller than 19 bytes.
But the current logic compares `_node_size` against 16.
That's incorrect. This patch fixes that.
There was a test for this branch, but it was not good enough.
It only tested payloads of size 1 and 20, but the problem
is only revealed by payloads of size 13-14.
The test has been extended to cover all sizes between 1 and 20.
As far as I know, positions outside the clustering region are never
passed to sstable indexes. But since they are representable within
the argument type of lazy_comparable_bytes_from_clustering_position,
let's make it handle them.
Before the patch, user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not query a SELECT on the CDC table they created.
It was due to the fact, the SELECT permission was checked on the CDC log, and later it's "parent" - the keyspace, but not the base table, on which the user had SELECT permission automatically granted on CREATE.
This patch matches the behavior of querying the CDC log to the one implemented for Materialized Views:
1. No new permissions are granted on CREATE.
2. When querying SELECT, the permissions on base table SELECT are checked.
Fixes: https://github.com/scylladb/scylladb/issues/19798
Fixes: VECTOR-151
Closes scylladb/scylladb#25797
* github.com:scylladb/scylladb:
cqlpy/test_permissions: run the reproducer tests for #19798
select_statement: check for access to CDC base table
The branch inside the `if constexpr (debug)` contains a piece
of template code that doesn't typecheck properly.
(I used this code before committing it, but apparently
I let it become outdated when some changes around it
happened).
Fix that.