When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixes https://github.com/scylladb/scylladb/issues/22989
backport not needed - the issue is probably not common and there's a workaround
Closesscylladb/scylladb#25790
* github.com:scylladb/scylladb:
test: mv: add a test for view build interrupt during registration
view_builder: register view on all shards atomically
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.
This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.
This change must be included before the next release.
Otherwise, users will be affected by a breaking change.
References: VECTOR-187
Closesscylladb/scylladb#26033
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixesscylladb/scylladb#22989
Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26136
initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136Closesscylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually processed).
Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards.
This patch fixes this by wrapping the pointer in `foreign_ptr` and obtains necessary informations (owner shard, last token) on the original shard (instead of on shard0).
Then all of those objects are put into freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards.
Fixes https://github.com/scylladb/scylladb/issues/25859
View building coordinator isn't present in any release yet, no backport needed.
Closesscylladb/scylladb#25832
* github.com:scylladb/scylladb:
db/view/view_building_worker: fix indent
db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`
db/view/view_building_worker: use table id in `register_staging_sstable_tasks()`
db/view/view_building_worker: move helper functions higher
Previously the sharded abort_sources was stopped at the end of `batch::do_work()`, which is working in parallel to view building worker main loop.
This leads to races because the worker may call `batch::abort()`, which access the abort_sources.
This patch solves this be changing `sharded<abort_source>` into `abort_source`.
Since now `batch::do_work()` is executed on tasks' shard, all abort source checks are also done on tasks' shard.
The only place where shard0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on tasks' shard exclusively.
Fixes https://github.com/scylladb/scylladb/issues/25805
Fixes https://github.com/scylladb/scylladb/issues/26045
View building coordinator hasn't been released yet, so no backport needed.
Closesscylladb/scylladb#26059
* github.com:scylladb/scylladb:
db/view/view_building_worker: fix indents
db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`
db/view/view_building_worker: execute entire `batch::do_work` on tasks shard
db/view/view_building_worker: store reference to sharded worker in batch
This commit extends sytem.scylla_local table with an additional
key/value pair that can be used later in this patch series to
keep an information that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.
A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in state machine shapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.
Refs: scylladb/scylladb#24411
Previously the sharded abort_sources was stopped at the end of batch::do_work(),
which is working in parallel to view building worker main loop.
This leads to races because the worker may call batch::abort(),
which access the abort_sources.
This patch solves this be changing `sharded<abort_source>` into
`abort_source`.
Since now `batch::do_work()` is executed on tasks' shard,
all abort source checks are also done on tasks' shard.
The only place where shard0 uses the abort source is `batch::abort()`,
but this method now does `smp::submit_to(replica.shard, [request abort])`,
so the abort source is used on tasks' shard exclusively.
Fixesscylladb/scylladb#25805Fixesscylladb/scylladb#26045
Change reference to view building worker in batch to sharded container.
In next commits, I'm going to execute `do_work()` exclusively on tasks
target shard and sharded reference will be more useful.
When a staging sstable is registered to view building worker,
it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually
processed).
Until now this was done using plain `sstables::shared_sstable`
(= `lw_shared_ptr`) which is not safe to be moved between shards.
This patch fixes this by wrapping the pointer in `foreign_ptr` and
obtains necessary informations (owner shard, last token) on the original
shard (instead of on shard0).
Then all of those objects are put into freshly introduced structure
`staging_sstable_task_info`, which can be safely moved between shards.
Fixesscylladb/scylladb#25859
Currently, during cache invaldation we check if we need to preempt
only after the partition gets invaldaited. This may lead to stalls
if we have a chain of filtered out partitions.
Check for preemption even if the partition does not get invaldated.
Refs: https://github.com/scylladb/scylladb/issues/9136.
Optimization; no backport
Closesscylladb/scylladb#26053
* github.com:scylladb/scylladb:
db: fix indentation
db: cache: consider preempting after each partition
Define two new virtual tables in system keyspace: cdc_timestamps and
cdc_streams. They expose the internal cdc metadata for tablets-enabled
keyspace to be consumed by users consuming the CDC log.
cdc_timestamps lists all timestamps for a table where a stream change
occured.
cdc_streams list additionally the current streams sets for each
table and timestamp, as well as difference - closed and opened streams -
from the previous stream set.
Add new group0-based tables in system keyspace to be used for cdc with tablets:
* cdc_streams_state - describing "base" state of CDC streams for each
table - an initial timestamp and a stream set.
* cdc_streams_history - describing following committed stream sets by
diffs (opened / closed streams) from the previous set.
The view building batch lives on shard0 but it might be doing
work on shard which owns the tablet replica.
Until now the batch data was accessed from multiple shards (shard0 and
where the batch was executed).
This patch fixes this by splitting tasks execution into:
- preparation which is always happening on shard0
- actual execution of the tasks on relevant shard, but all necessary
data is copied to the shard and batch object isn't accessed.
Fixes https://github.com/scylladb/scylladb/issues/25804
View building coordinator hasn't been released yet, so no backport needed.
Closesscylladb/scylladb#26058
* github.com:scylladb/scylladb:
db/view/view_building_worker: move try-catch outside `invoke_on()`
db/view/view_building_worker: split batch's data preparation and execution
The view building batch lives on shard0 but it might be doing
work on shard which owns the tablet replica.
Until now the batch data was accessed from multiple shards (shard0 and
where the batch was executed).
This patch fixes this by splitting tasks execution into:
- preparation which is always happening on shard0
- actual execution of the tasks on relevant shard, but all necessary
data is copied to the shard and batch object isn't accessed.
Fixesscylladb/scylladb#25804
As requested in #22120, moved the files and fixed other includes and build system.
Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh
Fixes: #22120
This is a cleanup, no need to backport
Closesscylladb/scylladb#25105
This change disables caching for raft log table due to the following reasons:
* Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls.
* Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls.
Fixesscylladb/scylladb#26027Closesscylladb/scylladb#26031
Currently, during cache invaldation we check if we need to preempt
only after the partition gets invaldaited. This may lead to stalls
if we have a chain of filtered out partitions.
Check for preemption even if the partition does not get invaldated.
Refs: #9136.
Currently, if a new sstable is created during repair/streaming,
we invalidate its whole token range in cache. If the sstable
is sparse, we unnecessarily clear too much data.
Modify cache invalidation, so that only the partitions present
in the sstable are cleared.
To check whether a partition is present in the sstable, we use bloom
filters. Bloom filters may return false positives and show that
an sstable contains a partition, even though it does not. Due to that
we may invalidate a bit more than we need to, but the cache will be
in valid state.
An issue arises when we do not invalidate two consecutive partitions
that are continuous. The sstable may contain a token that falls
between these partitions, breaking the continuity. To check that, we
would need to scan sstable index. However, such a change would
noticeably complicate the invalidation, both performance and code.
In this change, sstable index reader isn't used. Instead, the continuity
flag is unset for all scanned partitions. This comes at a cost of
heavier reads, as we will need to verify continuity when reading more
than one partition from cache.
Fixes: https://github.com/scylladb/scylladb/issues/9136.
Closesscylladb/scylladb#25996
`tags_extension` constructor unnecesarily takes `std::map` by const ref,
forcing a copy. This patch removes const ref for performance reasons.
Closesscylladb/scylladb#25977
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.
Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.
The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same.
transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved.
cql3 module changes do the same as transport server module.
Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.
Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.
Built and run on `release` target.
<details>
```
throughput:
mean= 36946.04 standard-deviation=1831.28
median= 37515.49 median-absolute-deviation=1544.52
maximum=39748.41 minimum=28443.36
instructions_per_op:
mean= 108105.70 standard-deviation=965.19
median= 108052.56 median-absolute-deviation=53.47
maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
mean= 70065.73 standard-deviation=2328.50
median= 69755.89 median-absolute-deviation=1250.85
maximum=92631.48 minimum=66479.36
⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85%
```
```
throughput:
mean= 40718.30 standard-deviation=2237.16
median= 41194.39 median-absolute-deviation=1723.72
maximum=43974.56 minimum=34738.16
instructions_per_op:
mean= 117083.62 standard-deviation=40.74
median= 117087.54 median-absolute-deviation=31.95
maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
mean= 58777.43 standard-deviation=1225.70
median= 58724.65 median-absolute-deviation=776.03
maximum=64740.54 minimum=55922.58
⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83%
```
```
throughput:
mean= 37107.91 standard-deviation=1698.58
median= 37185.53 median-absolute-deviation=1300.99
maximum=40459.85 minimum=29224.83
instructions_per_op:
mean= 108345.12 standard-deviation=931.33
median= 108289.82 median-absolute-deviation=55.97
maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
mean= 70333.79 standard-deviation=2247.71
median= 69985.47 median-absolute-deviation=1212.65
maximum=92219.10 minimum=65881.72
⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85%
```
```
throughput:
mean= 38353.12 standard-deviation=1806.46
median= 38971.17 median-absolute-deviation=1365.79
maximum=41143.64 minimum=32967.57
instructions_per_op:
mean= 117270.60 standard-deviation=35.50
median= 117268.07 median-absolute-deviation=16.81
maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
mean= 57256.00 standard-deviation=1039.17
median= 57341.93 median-absolute-deviation=634.50
maximum=61993.62 minimum=54670.77
⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83%
```
This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.
Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.
Update:
I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.
run count: 5 times before and 5 times after the patch
duration: 300 seconds
Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22
Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.
</details>
Built and run on `release` target.
<details>
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 14910.90 standard-deviation=477.72
median= 14956.73 median-absolute-deviation=294.16
maximum=16061.18 minimum=13198.68
instructions_per_op:
mean= 659591.63 standard-deviation=495.85
median= 659595.46 median-absolute-deviation=324.91
maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
mean= 213301.49 standard-deviation=2724.27
median= 212768.64 median-absolute-deviation=1403.85
maximum=225837.15 minimum=208110.12
⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 93345.45 standard-deviation=4499.00
median= 93915.52 median-absolute-deviation=2764.41
maximum=104343.64 minimum=79816.66
instructions_per_op:
mean= 65556.11 standard-deviation=97.42
median= 65545.11 median-absolute-deviation=71.51
maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
mean= 34160.75 standard-deviation=803.02
median= 33927.16 median-absolute-deviation=453.08
maximum=39285.19 minimum=32547.13
⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 206982.18 standard-deviation=15894.64
median= 208893.79 median-absolute-deviation=9923.41
maximum=232630.14 minimum=127393.34
instructions_per_op:
mean= 35983.27 standard-deviation=6.12
median= 35982.75 median-absolute-deviation=3.75
maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
mean= 17374.87 standard-deviation=985.06
median= 17140.81 median-absolute-deviation=368.86
maximum=26125.38 minimum=16421.99
⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 16198.26 standard-deviation=902.41
median= 16094.02 median-absolute-deviation=588.58
maximum=17890.10 minimum=13458.74
instructions_per_op:
mean= 659752.73 standard-deviation=488.08
median= 659789.16 median-absolute-deviation=334.35
maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
mean= 216070.70 standard-deviation=3491.26
median= 215320.37 median-absolute-deviation=1678.06
maximum=232396.48 minimum=209839.86
⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 97067.79 standard-deviation=2637.79
median= 97058.93 median-absolute-deviation=1477.30
maximum=106338.97 minimum=87457.60
instructions_per_op:
mean= 65695.66 standard-deviation=58.43
median= 65695.93 median-absolute-deviation=37.67
maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
mean= 34300.20 standard-deviation=704.66
median= 34143.92 median-absolute-deviation=321.72
maximum=38203.68 minimum=33427.46
⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 223495.91 standard-deviation=6134.95
median= 224825.90 median-absolute-deviation=3302.09
maximum=234859.90 minimum=193209.69
instructions_per_op:
mean= 35981.41 standard-deviation=3.16
median= 35981.13 median-absolute-deviation=2.12
maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
mean= 17482.26 standard-deviation=281.82
median= 17424.08 median-absolute-deviation=143.91
maximum=19120.68 minimum=16937.43
⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99%
```
</details>
Fixes: #24567
This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.
Closesscylladb/scylladb#25408
* github.com:scylladb/scylladb:
test/cqlpy: add protocol exception tests
test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
transport: replace `make_frame` throw with return result
cql3: remove throwing `protocol_exception`
transport: replace throw in validate_utf8 with result_with_exception_ptr return
transport: replace throwing protocol_exception with returns
utils: add result_with_exception_ptr
test/cqlpy: add unknown compression algorithm test case
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.
Fixes https://github.com/scylladb/scylladb/issues/25655Closesscylladb/scylladb#25720
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.
This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed neither by repairing the view or the base table.
To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.
This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will
arrive on the pending replica anyway in streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write
This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replias if the hint target is not the source fo the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.
We also add a test case reproducing the issue.
Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/19835Closesscylladb/scylladb#25590
Fixes#25683
Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.
Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).
Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.
Small unit test included.
Closesscylladb/scylladb#25699
The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.
This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.
This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.
Fixesscylladb/scylladb#25660
backport: this patch needs to be backported to all supported versions (2025.1/2/3).
Closesscylladb/scylladb#25658
* github.com:scylladb/scylladb:
storage_service: move get_host_id_to_ip_map to system_keyspace
system_keyspace: use peers cache in get_ip_from_peers_table
storage_service: move get_ip_from_peers_table to system_keyspace
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.
Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closesscylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
Fixes#25709
If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).
Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.
To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
Closesscylladb/scylladb#25719
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.
The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942Closesscylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
Replace throwing `protocol_exception` with returning it as a result
or an exceptional future in the transport server module. The goal is
to improve performance.
Most of the `protocol_exception` throws were made from
`fragmented_temporary_buffer` module, by passing `exception_thrower()`
to its `read*` methods. `fragmented_temporary_buffer` is changed so
that it now accepts an exception creator, not exception thrower.
`fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced
`fragmented_temporary_buffer_concepts::ExceptionThrower` and all
methods that have been throwing now return failed result of type
`utils::result_with_exception_ptr`. This change is then propagated to the callers.
The scope of this patch is `protocol_exception`, so commitlog just calls
`.value()` method on the result. If the result failed, that will throw the
exception from the result, as defined by `utils::result_with_exception_ptr_throw_policy`.
This means that the behavior of commitlog module stays the same.
transport server module handles results gracefully. All the caller functions
that return non-future value `T` now return `utils::result_with_exception_ptr<T>`.
When the caller is a function that returns a future, and it receives
failed result, `make_exception_future(std::move(failed_result).value())`
is returned. The rest of the callstack up to the transport server `handle_error`
function is already working without throwing, and that's how zero throws is
achieved.
Fixes: #24567
The option defines the threshold at which the defensive mechanisms
preventing nodes from running out of space, e.g. rejecting user
writes shall be activated.
Its default value is 98% of the disk capacity.
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.
The storage_service::on_change method can be called quite often
by the gossiper, see scylladb/scylla-enterprise#5613. In this commit
we introduce a temporal cache for system.peers so that we don't have
to go to the storage each time we need to resolve host_id -> ip.
We keep the cache only for a small amount of time to handle the
(unlikely) scenario when the user wants to update system.peers table
from CQL.
Fixesscylladb/scylladb#25660
We plan to add a cache to get_ip_from_peers_table in upcoming commits.
It's more convenient to do this from system_keyspace, since the only two
methods that mutate system.peers (remove_endpoint and update_peers_info)
are already there.
Fixes#25682
Refs scylla-enterprise#5580
If the truncation table is large in entries, we might create a
huge parallel execution, quite possibly consuming loads of resources
doing something quite trivial.
Limit concurrency to a small-ish number
Closesscylladb/scylladb#25678
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros
Closes#25182