When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixes https://github.com/scylladb/scylladb/issues/22989
backport not needed - the issue is probably not common and there's a workaround
Closesscylladb/scylladb#25790
* github.com:scylladb/scylladb:
test: mv: add a test for view build interrupt during registration
view_builder: register view on all shards atomically
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.
This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.
This change must be included before the next release.
Otherwise, users will be affected by a breaking change.
References: VECTOR-187
Closesscylladb/scylladb#26033
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixesscylladb/scylladb#22989
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26136
When the monitor is started, the first disk utilization value is
obtained from the actual host filesystem and not from the fake
space source function.
Thus, register a fake space source function before the monitor
is started.
Fixes: https://github.com/scylladb/scylladb/issues/26036
Backport is not required. The test has been added recently.
Closesscylladb/scylladb#26054
initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136Closesscylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: introducing the new components, Partitions.db and Rows.db
This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller.
This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details.
The changes are for the sake of new and unreleased code. No backporting should be done.
Closesscylladb/scylladb#26075
* github.com:scylladb/scylladb:
sstables/mx/reader: remove mx::make_reader_with_index_reader
test/boost/bti_index_test: fix indentation
sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
sstables/trie: support reader_permit and trace_state properly
sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached
sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
sstables/trie/bti_index_reader: support BYPASS CACHE
test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate
sstables/trie: change the signature of bti_partition_index_writer::finish
sstables/bti_index: improve signatures of special member functions in index writers
streaming/stream_transfer_task: coroutinize `estimate_partitions()`
types/comparable_bytes: add a missing implementation for date_type_impl
sstables: remove an outdated FIXME
storage_service: delete `get_splits()`
sstables/trie: fix some comment typos in bti_index_reader.cc
sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
Before this change, new connections were handled in a default
scheduling group (`main`), because before the user is authenticated
we do not know which service level should be used. With the new
`sl:driver` service level, creation of new connections can be moved to
`sl:driver`.
We switch the service level as early as possible, in `do_accepts`.
There is a possibility, that `sl:driver` will not exist yet, for
instance, in specific upgrade cases, or if it was removed. Therefore,
we also switch to `sl:driver` after a connection is accepted.
Refs: scylladb/scylladb#24411
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is ensuring existence of `sl:driver` in new system and tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
According to the changes in Vector Store API (VECTOR-148) the `embedding` term
should be changed to `vector`. As `vector` term is used for STL class the
internal type or variable names would be changed to `vs_vector` (aka vector
store vector). This patch changes also the HTTP ann json request payload
according to the Vector Store API changes.
Fixes: VECTOR-229
Closesscylladb/scylladb#26050
As requested in #22099, moved the files and fixed other includes and build system.
Moved files:
- cache_temperature.hh
- cell_locking.hh
Fixes: #22099Closesscylladb/scylladb#25079
Upcoming changes in Seastar cause `rest::simple_send` to move the
`http::request` into `seastar::http::experimental::client::make_request`
when called multiple times. This leaves the original request in an
invalid state. Specifically, the `_version` field becomes empty,
causing request validation to fail. This patch ensures `version` is
explicitly set to prevent such failures.
Fixes: https://github.com/scylladb/scylladb/issues/26018Closesscylladb/scylladb#26066
compaction/scrub: register sstables for compaction before validation
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.
This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.
Fixes#23363
This reported scrub failure occurs on all versions that have the
checksum/digest validation feature for uncompressed sstables.
So, backport it to older versions.
Closesscylladb/scylladb#26034
* github.com:scylladb/scylladb:
compaction/scrub: register sstables for compaction before validation
compaction/scrub: handle exceptions when moving invalid sstables to quarantine
Read the CDC stream metadata from the internal system tables, and store
it in the cdc metadata data structures.
The metadata is stored in the tables as diffs which is more storage
efficient, but when in-memory we store it as full stream sets for each
timestamp. This is more useful because we need to be able to find a
stream given timestamp and token.
When `mx::make_reader` is used to construct an sstable reader,
it constructs its own index reader internally.
`mx::make_reader_with_index_reader` was originally added
as a variant of `mx::make_reader` which can be used to inject
a custom `index_reader` for testing that the mx Data reader
tolerates inexact indexes.
But now we want the ability to choose between BIG index readers
and BTI index readers if both are present. And at this point,
it seems to me that it makes sense to just construct the index
reader in the caller and pass it via argument to `mx::make_reader`
instead of putting the index selection inside it.
So that's what we do in this patch. And we remove `mx::make_reader_with_index_reader`
because it's no longer different from `mx::make_reader`.
Before this patch, `bti_index_reader::last_block_offset()` returns the
offset of the last block within the file.
But the old `index_reader::last_block_offset()` returns the offset
within the partition, and that's what the callers (i.e. reversed
sstable reader) expect.
Fix `bti_index_reader::last_block_offset()` (and the corresponding
comment and test) to match `index_reader::last_block_offset()`.
Before this patch, `reader_permit` taken by `bti_index_reader`.
wasn't actually being passed down to disk reads. In this patch,
we fix this FIXME by propagating the permit down to the I/O
operations on the `cached_file`.
Also, it didn't take `trace_state_ptr` at all.
In this patch, we add a `trace_state_ptr` argument and propagate
it down to disk reads.
(We combine the two changes because the permit and the trace state
are passed together everywhere anyway).
Before this patch, `bti_index_reader` doesn't have a good
way to implement BYPASS CACHE.
In this patch we add a way, similar to what `index_reader` does:
we allow the caller to pass in the `cached_file` via a shared pointer.
If the caller wants the loads done by the index reader to remain cached,
he can pass in the `cached_file` owned by the `sstable`, shared by all
caching index readers.
If the caller doesn't want the loads to remain cached, he can pass
in a fresh `cached_file` which will be privately owned by the index
reader, and will be evicted when the index reader dies.
Let's return `bti_partitions_db_footer` so that it can be directly
saved to `sstables::shareable_components` after the index write is
finished, without re-reading the footer from the file.
Let's take `const sstables::key&` arguments instead of
`disk_string_view<uint16_t>`, that's more natural.
This patch overrides the antlr3 function that allocates the missing
tokens that would eventually leak. The override stores these tokens in
a vector, ensuring memory is freed whenever the parser is destroyed.
Solution is copied from CQL implementation.
A unit test to reproduce the issue is added - leak would be reported
by ASAN, when running this test in debug mode - the test passed but
the leak is discovered when the test file exits.
Fixes#25878Closesscylladb/scylladb#25930
As discussed in https://github.com/scylladb/scylladb/pull/24606#discussion_r2281870939
clear_gently of shared pointers should release the wrapped
object reference and when the object's use_count reaches 1,
the object itself would be cleared_gently, before it's destroyed.
This behavior is similar to the way we clear gently containers
like arrays or vectors, and so it is extended in this patch
to smart pointers like unique_ptr and foreign_ptr.
The unit tests are adjusted respectively to expect the
smart pointers to be reset after clear_gently, plus
the use of `reset()` for `foreign_ptr<shared_ptr<>>` was
replaced by `clear_gently().get()` which now ensures the
reference to a shared object is released, and awaited for,
if it happens on a foreign owner shard, unlike reset of
a foreign_ptr that kicks off destroy of that shared object
in the background on the owner shard - causing flakiness.
Fixes#25723
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#25759
As requested in #22120, moved the files and fixed other includes and build system.
Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh
Fixes: #22120
This is a cleanup, no need to backport
Closesscylladb/scylladb#25105
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.
This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.
Fixes#23363
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Currently, if a new sstable is created during repair/streaming,
we invalidate its whole token range in cache. If the sstable
is sparse, we unnecessarily clear too much data.
Modify cache invalidation, so that only the partitions present
in the sstable are cleared.
To check whether a partition is present in the sstable, we use bloom
filters. Bloom filters may return false positives and show that
an sstable contains a partition, even though it does not. Due to that
we may invalidate a bit more than we need to, but the cache will be
in valid state.
An issue arises when we do not invalidate two consecutive partitions
that are continuous. The sstable may contain a token that falls
between these partitions, breaking the continuity. To check that, we
would need to scan sstable index. However, such a change would
noticeably complicate the invalidation, both performance and code.
In this change, sstable index reader isn't used. Instead, the continuity
flag is unset for all scanned partitions. This comes at a cost of
heavier reads, as we will need to verify continuity when reading more
than one partition from cache.
Fixes: https://github.com/scylladb/scylladb/issues/9136.
Closesscylladb/scylladb#25996
If a bloom filter was built with a bad partition estimate, it is rebuilt
right before the sstable is sealed. The rebuild is already skipped if
the current bitset size results in a false-positive rate within 75%–125%
of the configured value.
This patch adds additional conditions to prevent rebuilds when the
savings are minimal. It also skips rebuilding for garbage collected
sstables, since they will be dropped soon anyway.
Also updated and added more test cases to cover these new criteria for
bloom filter rebuilds.
Fixes#25464Fixes#25468
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#25968
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.
Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.
The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same.
transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved.
cql3 module changes do the same as transport server module.
Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.
Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.
Built and run on `release` target.
<details>
```
throughput:
mean= 36946.04 standard-deviation=1831.28
median= 37515.49 median-absolute-deviation=1544.52
maximum=39748.41 minimum=28443.36
instructions_per_op:
mean= 108105.70 standard-deviation=965.19
median= 108052.56 median-absolute-deviation=53.47
maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
mean= 70065.73 standard-deviation=2328.50
median= 69755.89 median-absolute-deviation=1250.85
maximum=92631.48 minimum=66479.36
⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85%
```
```
throughput:
mean= 40718.30 standard-deviation=2237.16
median= 41194.39 median-absolute-deviation=1723.72
maximum=43974.56 minimum=34738.16
instructions_per_op:
mean= 117083.62 standard-deviation=40.74
median= 117087.54 median-absolute-deviation=31.95
maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
mean= 58777.43 standard-deviation=1225.70
median= 58724.65 median-absolute-deviation=776.03
maximum=64740.54 minimum=55922.58
⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83%
```
```
throughput:
mean= 37107.91 standard-deviation=1698.58
median= 37185.53 median-absolute-deviation=1300.99
maximum=40459.85 minimum=29224.83
instructions_per_op:
mean= 108345.12 standard-deviation=931.33
median= 108289.82 median-absolute-deviation=55.97
maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
mean= 70333.79 standard-deviation=2247.71
median= 69985.47 median-absolute-deviation=1212.65
maximum=92219.10 minimum=65881.72
⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85%
```
```
throughput:
mean= 38353.12 standard-deviation=1806.46
median= 38971.17 median-absolute-deviation=1365.79
maximum=41143.64 minimum=32967.57
instructions_per_op:
mean= 117270.60 standard-deviation=35.50
median= 117268.07 median-absolute-deviation=16.81
maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
mean= 57256.00 standard-deviation=1039.17
median= 57341.93 median-absolute-deviation=634.50
maximum=61993.62 minimum=54670.77
⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83%
```
This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.
Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.
Update:
I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.
run count: 5 times before and 5 times after the patch
duration: 300 seconds
Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22
Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.
</details>
Built and run on `release` target.
<details>
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 14910.90 standard-deviation=477.72
median= 14956.73 median-absolute-deviation=294.16
maximum=16061.18 minimum=13198.68
instructions_per_op:
mean= 659591.63 standard-deviation=495.85
median= 659595.46 median-absolute-deviation=324.91
maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
mean= 213301.49 standard-deviation=2724.27
median= 212768.64 median-absolute-deviation=1403.85
maximum=225837.15 minimum=208110.12
⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 93345.45 standard-deviation=4499.00
median= 93915.52 median-absolute-deviation=2764.41
maximum=104343.64 minimum=79816.66
instructions_per_op:
mean= 65556.11 standard-deviation=97.42
median= 65545.11 median-absolute-deviation=71.51
maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
mean= 34160.75 standard-deviation=803.02
median= 33927.16 median-absolute-deviation=453.08
maximum=39285.19 minimum=32547.13
⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 206982.18 standard-deviation=15894.64
median= 208893.79 median-absolute-deviation=9923.41
maximum=232630.14 minimum=127393.34
instructions_per_op:
mean= 35983.27 standard-deviation=6.12
median= 35982.75 median-absolute-deviation=3.75
maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
mean= 17374.87 standard-deviation=985.06
median= 17140.81 median-absolute-deviation=368.86
maximum=26125.38 minimum=16421.99
⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 16198.26 standard-deviation=902.41
median= 16094.02 median-absolute-deviation=588.58
maximum=17890.10 minimum=13458.74
instructions_per_op:
mean= 659752.73 standard-deviation=488.08
median= 659789.16 median-absolute-deviation=334.35
maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
mean= 216070.70 standard-deviation=3491.26
median= 215320.37 median-absolute-deviation=1678.06
maximum=232396.48 minimum=209839.86
⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 97067.79 standard-deviation=2637.79
median= 97058.93 median-absolute-deviation=1477.30
maximum=106338.97 minimum=87457.60
instructions_per_op:
mean= 65695.66 standard-deviation=58.43
median= 65695.93 median-absolute-deviation=37.67
maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
mean= 34300.20 standard-deviation=704.66
median= 34143.92 median-absolute-deviation=321.72
maximum=38203.68 minimum=33427.46
⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 223495.91 standard-deviation=6134.95
median= 224825.90 median-absolute-deviation=3302.09
maximum=234859.90 minimum=193209.69
instructions_per_op:
mean= 35981.41 standard-deviation=3.16
median= 35981.13 median-absolute-deviation=2.12
maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
mean= 17482.26 standard-deviation=281.82
median= 17424.08 median-absolute-deviation=143.91
maximum=19120.68 minimum=16937.43
⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99%
```
</details>
Fixes: #24567
This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.
Closesscylladb/scylladb#25408
* github.com:scylladb/scylladb:
test/cqlpy: add protocol exception tests
test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
transport: replace `make_frame` throw with return result
cql3: remove throwing `protocol_exception`
transport: replace throw in validate_utf8 with result_with_exception_ptr return
transport: replace throwing protocol_exception with returns
utils: add result_with_exception_ptr
test/cqlpy: add unknown compression algorithm test case
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25506/
Next part: plugging the BTI index readers and writers into sstable readers and writers.
The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.
This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md.
Caveats:
1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change.
2. These readers and writers are intended to be compatible with Cassandra, but I did *NOT* do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers.
3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that the my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now.
No backports needed, new functionality.
Closesscylladb/scylladb#25626
* github.com:scylladb/scylladb:
test/manual: add bti_cassandra_compatibility_test
test/lib/random_schema: add some constraints for generated uuid and time/date values
test/lib/random_utils: add a variant of get_bytes which takes an `engine&`
test/boost: add bti_index_test
sstables/writer: add an accessor for the current write position in Data.db
sstables/trie: introduce bti_index_reader
sstables/trie: add bti_partition_index_writer.cc
sstables/trie: add bti_row_index_writer.cc
utils/bit_cast: add a new overload of write_unaligned()
sstables/trie: add trie_writer::add_partial()
sstables/consumer: add read_56()
sstables/trie: make bti_node_reader::page_ptr copy-constructible
sstables: extract abstract_index_reader from index_reader.hh to its own header
sstables/trie: add an accessor to the file_writer under bti_node_sink
sstables/types: make `deletion_time::operator tombstone()` const
sstables/types: add sstables::deletion_time::make_live()
sstables/trie: fix a special case in max_offset_from_child
sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding
sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.
Fixes https://github.com/scylladb/scylladb/issues/25655Closesscylladb/scylladb#25720
Sometimes `vector_store_client_test_ann_request` test hangs up. It is hard to
reproduce.
It seems that the problem is that tests are unreliable in case of stalled
requests. This patch attaches a timer to the abort_source to ensure that
the test will finish with a timeout at least.
Fixes: VECTOR-150
Fixes: #25234Closesscylladb/scylladb#25301
Adds a fat unit test (integration test?) for BTI index writers and readers.
The test generates a small random dataset for the index writer, writes
it to a BTI file, and then attempts to run all possible (and legal)
sequences (up to a certain length) of index reader operations and check
that the results match the expectation (dictated by a "simple"
reference index reader).
It is currently the only test for this new feature, but I think it's reasonably
good. (I validated the coverage by looking at LLVM coverage reports
and by manually adding bugs in various places and checking that the test
catches it after a reasonably small number of runs).
(Note that in a later series, when we hook up BTI indexes
to Data.db readers/writers, the feature will also be indirectly
tested by the corresponding integration tests).
This is a randomized test. As such, its power grows with the number of runs.
In particular, a single run has a decently high probability of
not exercising parts of the code at all. (E.g. the generated dataset
might have no clustering keys).
Also this is a slow test. (With the default parameters,
~1s in release mode on my PC, several seconds in
debug mode. (And note that this is after a custom parameter
downsizing for debug mode, because this test is slowed extremely
badly by debug mode, due to the forced preemption after every future)).
For those two reasons, I'm not glad to include it in the test suite that runs in CI.
Instead of running it once in every CI run, I'd very much rather
have it run 10000 times after the tested feature is touched,
and before releases. Unfortunately we don't have a precedent for
something like that.
A `writer_node` contains a chain of multiple BTI nodes.
`writer_node::_node_size` is the size (in bytes) of the
entire chain.
But the parent of the `writer_node` wants to know the offset
to the rootmost node in the chain. This can be deduced from the
`writer_node::_transition_length` and the `writer_node::_node_size`.
But there's an error in the logic for that, for the special case
when there are two nodes in the chain.
The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if
*the leafmost node* is smaller than 16 bytes, which is
true if and only if `_node_size` is smaller than 19 bytes.
But the current logic compares `_node_size` against 16.
That's incorrect. This patch fixes that.
There was a test for this branch, but it was not good enough.
It only tested payloads of size 1 and 20, but the problem
is only revealed by payloads of size 13-14.
The test has been extended to cover all sizes between 1 and 20.
As far as I know, positions outside the clustering region are never
passed to sstable indexes. But since they are representable within
the argument type of lazy_comparable_bytes_from_clustering_position,
let's make it handle them.
The branch inside the `if constexpr (debug)` contains a piece
of template code that doesn't typecheck properly.
(I used this code before committing it, but apparently
I let it become outdated when some changes around it
happened).
Fix that.
Allows testing using either local mock server (installed or using docker),
or real GCP project (not tested as of writing this).
v2: Try podman if docker unavail
v3: Ensure we check log output on fake-gcs, because when using podman, the
published port will be connectible even though the actual server is not
up yet.
v4: Use ephermal port forward in docker/podman to allow us running parallel
instances. Also adjust credentials and port finding in test.
v5: Re-ensure no parallel tests for this: We seem to time out in podman
trying to fetch image for X parallel tests
v6: Remove the ephermal port stuff. Because of course this does not work
with our podman-in-podman. Do brute-force port speculation instead.
v7: Up timeout for server start to allow docker pull.
v8: Fix string check error
v9: Add explicit docker image version
This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types.
This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md
Refs https://github.com/scylladb/scylladb/issues/19407
New feature - backport not required.
Closesscylladb/scylladb#25603
* github.com:scylladb/scylladb:
types/comparable_bytes: add compatibility testcases for collection types
types/comparable_bytes: update compatibility testcase to support collection types
types/comparable_bytes: support empty type
types/comparable_bytes: support reversed types
types/comparable_bytes: support vector cql3 type
types/comparable_bytes: support tuple and UDT cql3 type
types/comparable_bytes: support map cql3 type
types/comparable_bytes: support set and list cql3 types
types/comparable_bytes: introduce encode/decode_component
types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.
Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closesscylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
Fixes#25709
If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).
Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.
To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
Closesscylladb/scylladb#25719
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.
The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942Closesscylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
The `abstract_type::from_string()` method used to parse the input data
doesn't support collections yet. So the collection testdata will be
passed as JSON strings to the testcase. This patch updates the testcase
to adapt to this workaround.
Also, extended the testcase to verify that Scylla's implementation can
successfully decode the byte comparable output encoded by Cassandra.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
A reversed type is first encoded using the underlying type and then all
the bits are flipped to ensure that the lexicographical sort order is
reversed. During decode, the bytes are flipped first and then decoded
using the underlying type.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL vector type encoding is similar to the lists, where each element
is transformed into a byte-comparable format and prefixed with a
component marker. The sequence is terminated with a terminator marker to
indicate the end of the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>