Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.
It works with tablets, but is not safe. A concurrent replication process may bring back old data.
This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request.
Backporting is not needed.
Fixes#16411Closesscylladb/scylladb#19789
* github.com:scylladb/scylladb:
docs: docs: topology-over-raft: Document truncate_table request
storage_proxy: fix indentation and remove empty catch/rethrow
test: add tests for truncate with tablets
storage_proxy: use new TRUNCATE for tablets
truncate: make TRUNCATE a global topology operation
storage_service: move logic of wait_for_topology_request_completion()
RPC: add truncate_with_tablets RPC with frozen_topology_guard
feature_service: added cluster feature for system.topology schema change
system.topology_requests: change schema
storage_proxy: propagate group0 client and TSM dependency
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().
With commits ed7d352e7d and bb1867c7c7, we now have input streams for both compressed and uncompressed SSTables that provide seamless checksum and digest checking. The code for these was based on `validate_checksums()`, which implements its own validation logic over raw streams. This has led to some duplicate code.
This PR deduplicates the uncompressed case by modifying `validate_checksums()` to use a checksummed input stream instead of a raw stream. The same cannot be done for compressed SSTables though. The reason is that `validate_checksums()` needs to examine the whole data file, even if an invalid chunk is encountered. In the checksummed case we support that by offloading the error handling logic from the data source via a function parameter. In the compressed data source we cannot do that because it needs to return decompressed data and decompression may fail if the data are invalid.
This PR also enables `validate_checksums()` to partially verify SSTables with just the per-chunk checksums if the digest is missing.
In more detail, this PR consists of:
* Port of some integrity checks from `do_validate_uncompressed()` to the checksummed data source. It should now be able to detect corruption due to truncated or appended chunks (expected number of chunks is retrieved from the CRC component).
* Introduction of `error_handler` parameter in checksummed data source and `data_stream()`.
* Refactoring of `validate_checksums()`. The JSON response of `sstable validate-checksums` was also modified to report a missing digest.
* Tests for `validate_checksums()` against SSTables with truncated data, appended data, invalid digests, or no digest.
Refs #19058.
This PR is a hybrid of cleanup and feature. No backport is needed.
Closesscylladb/scylladb#20933
* github.com:scylladb/scylladb:
tools/scylla-sstable: Rename valid_checksums -> valid
test: Check validate_checksums() with missing digest
sstables: Allow validate_checksums() to report missing digests
sstables: Refactor validate_checksums() to use checksummed data stream
sstables: Add error_handler parameter to data_stream()
sstables: Add error handler in checksummed data source
sstables: Check for excessive chunks in checksummed data source
sstables: Check for premature EOF in checksummed data source
test: test_validate_checksums: Check SSTable with invalid digest
test: test_validate_checksums: Check SSTable with appended data
test: test_validate_checksums: Complement test for truncated SSTable
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).
The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.
Fixes: scylladb/scylladb#6403
perf-simple-query -- smp 1 -m 1G output
Before:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41291 insns/op, 24485 cycles/op, 0 errors)
62669.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41277 insns/op, 24695 cycles/op, 0 errors)
69172.12 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41326 insns/op, 24463 cycles/op, 0 errors)
56706.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41143 insns/op, 24513 cycles/op, 0 errors)
56416.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41186 insns/op, 24851 cycles/op, 0 errors)
throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70
After:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 40733 insns/op, 23145 cycles/op, 0 errors)
59283.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40624 insns/op, 23948 cycles/op, 0 errors)
70851.03 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40625 insns/op, 23027 cycles/op, 0 errors)
70549.61 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40650 insns/op, 23266 cycles/op, 0 errors)
68634.96 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40622 insns/op, 22935 cycles/op, 0 errors)
throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/
Tested mixed cluster manually.
"
* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
group0: drop unused field from replace_info struct
test: rename raft_address_map_test to address_map_test and move if from raft tests
raft_address_map: remove raft address map
topology coordinator: do not modify expire state for left/new nodes any more in raft address map
topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
group0: drop raft address map dependency from raft_rpc
group0: move raft_ticker_type definition from raft_address_map.hh
storage_service: do not update raft address map on gossiper events
group0: drop raft address map dependency from raft_server_with_timeouts
group0: move group0 upgrade code to host ids
repair: drop raft address map dependency
group0: remove unused raft address map getter from raft_group0
group0: drop raft address map from group0_state_machine dependency since it is not used there any more
group0: remove dependency on raft address map from group0_state_id_handler
gossiper: add get_application_state_ptr that searches by host_id
gossiper: change get_live_token_owners to return host ids
view: move view building to host id
hints: use host id to send hints
storage_proxy: remove id_vector_to_addr since it is no longer used
db: consistency_level: change is_sufficient_live_nodes to work on host ids
...
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.
in this change, we:
- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
`std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
to feed `std::views::transform()` a view range.
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
limitations:
there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21700
Raft address map is not use any longer to resolve addresses anyway, so
drop dependency on it from raft_ip_address_updater and rename it to
reflect that it is no longer raft address map specific.
The function looks up provided host id in gossip_address_map and throws
unknown_address if the mapping is not available. Otherwise it sends the
message by IP found.
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::find_if`.
in this change, we:
- replace `boost::find_if` with `std::ranges::find_if`
- remove all `#include <boost/range/algorithm/find_if.hpp>`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Our "sstring_view" is an historic alias for the standard std::string_view.
The test/ directory used this old alias in a few of random places, let's
change them to use the standard type name.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The later includes the former and in addition to `seastar::format()`,
`print.hh` also provides helpers like `seastar::fprint()` and
`seastar::print()`, which are deprecated and not used by scylladb.
Previously, we include `seastar/core/print.hh` for using
`seastar::format()`. and in seastar 5b04939e, we extracted
`seastar::format()` into `seastar/core/format.hh`. this allows us
to include a much smaller header.
In this change, we just include `seastar/core/format.hh` in place of
`seastar/core/print.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21574
In scylladb/scylladb#19745, view_builder was migrated to group0
and since then it is dependent on group0_service.
Because of this, group0_service should be initialized/destroyed
before/after view_builder.
Fixesscylladb/scylladb#20772
Co-authored-by: Dawid Mędrek <dawid.medrek@scylladb.com>
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.
Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::transform`.
in this change, we:
- replace `boost::transform` with `std::ranges::transform`
- update affected code to work with `std::ranges::transform`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21318
Fixesscylladb/scylladb#21159
When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.
Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.
Closesscylladb/scylladb#21379
To reduce the dependency load, replace use of boost ranges
with the std equivalent.
Files that lost the indirect boost dependency have it added as a
direct dependency.
Continue standardization on std::ranges. Since compound contains a custom
iterator, we first have to upgrade it to C++20 iterator concepts.
Cleanup / minor refactoring, so no backport.
Closesscylladb/scylladb#21320
* github.com:scylladb/scylladb:
compound: replace boost ranges with std ranges
compound: upgrade iterator to be an std::forward_iterator
before this change, these
[convenience libraries](https://www.gnu.org/software/automake/manual/html_node/Libtool-Convenience-Libraries.html)
were implicitly built as static libraries by default,
but weren't explicitly marked as STATIC in CMake. While this worked
with default settings, it could cause issues if `BUILD_SHARED_LIBS` is
enabled.
So before we are ready for building these components as shared
libraries, let's mark all convenience libraries as STATIC for
consistency and to prevent potential issues before we properly support
shared library builds.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21274
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to lookup end position based on upper
bound of the read, in anticipation for sharing the promoted index cache,
but it's not effective becasue it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes#10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closesscylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
Standardize on the standard range library.
The serialize_value(initializer_list) overload is disambiguated
not to call itself. Apparently it wasn't called before.
Since std::ranges::subrange does not provide operator==, replace
it with std::ranges::equals().
compound::iterator isn't far from a forward_iterator, and if we want
to use it with std::ranges, we have to upgrade it. This is because
std::ranges::subrange() only provides front() for forward ranges, and
we do use this front(). Boost apparently isn't as strict.
To make it a forward_range, we have to drop operator-> and make
operator* return a value (similar to std::views::tranform), since
forward iterators require that pointers and references be stable,
and this iterator returns a pointer to one of its members.
We also add an iterator_concept member to declare the compatibility
to std::ranges.
now that we are allowed to use C++23. we now have the luxury of using
`std::views::values`.
in this change, we:
- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
we cannot use `std::views::concat` yet. this helper is only
available in C++26.
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21265
No other usages of the former helper other than immediatelly followed by
the latter, no point in keepint it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No other usages of the former helper other than immediatelly followed by
the latter, no point in keepint it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction.
This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view udpate workload.
The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it.
* each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies
An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.
In the test above:
* without the fix, the latency of reads increases over 50s
* with the fix, the latency of reads stays below 20ms
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
The patch is not that small and it isn't fixing a regression, so no backports
Closesscylladb/scylladb#20887
* github.com:scylladb/scylladb:
test: add test for high view update concurrency causing bad_allocs
test: add test for high view update concurrency degrading read latency
mv: add a dedicated read concurrency semaphore for view update read before writes
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.
in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that we are allowed to use C++23. we now have the luxury of using
`std::views::keys`.
in this change, we:
- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21198
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of the view updates that is
currently unlimited. Because of that view updates may take up so much memory that
they we may run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semahpore
and is currently used only for user writes - is't used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
This includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace by what is needed.
Closesscylladb/scylladb#21187
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.
After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.
Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2
Closesscylladb/scylladb#21058
* github.com:scylladb/scylladb:
test/boost/mutation_test: add test for multishard permit safety
test/lib/reader_lifecycle_policy: add semaphore factory to constructor
test/lib/reader_lifecycle_policy: rename factory_function
repair/row_level: drop now unneeded release_base_resource() calls
readers/multishard: make multishard reader safe to create with admitted permits
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.
To reduce dependency load, use std ranges instead of boost ranges.
The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.
Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
fixes#20043Closesscylladb/scylladb#21020
* github.com:scylladb/scylladb:
tablets: Validate system.tablets update
group0_client: Introduce change validation
group0_client: Add shared_token_metadata dependency
This is sort of continuation of the previous patch. The partition key in
the registry is now table_id, not string, and is better called "owner",
not "location". This patch is s/location/owner/ over specific places
that include field name in the schema, argument names in registry
maintenance classes and tests accessing the selected row fields by name.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.
This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Describing S3 storage for an sstables nowadays has two options -- via
sstables registry entry and by using the direct prefix string. The
former is used when putting a keyspace on S3. In this case each sstable
has the corresponding entry in the system.sstables table. The latter is
used by "restore from object storage" code. In that case, sstables don't
have entries in the registry, but are accessed by a specific S3 object
path.
This patch reflects this difference by making s3_options::location be
variant of string prefix and table_id owner. The owner needs more
explanation, here it is.
Today, the system.sstables schema defines partition key to be "string
location" and clustering key to be "UUID generation". The partition key
is table's datadir string, but it's wrong to use it this way. Next
patches will change the partition key to be table's ID (there's table_id
type for it), and before doing it storage options must be prepared to
carry it onboard. This patch does it, but the table_id alternative of
the location is still unused, the rest of the code keeps using the
string location to reference a row in the registry table. Next patches
will eventually make use of the table_id value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema module (everything in schema/) is supposed to be towards the
leafs in the ScyllaDB inter-module dependency graph. In other words, it
should not depend on many other modules. On the other hand, almost the
entire codebase depends on the schema module itself.
Currently there is a circular dependency between schema and
replica::database, as the latter is a required argument for
schema::describe(). This is bad, not just because of the dependency mess
it introduces, but also because now schema::describe() can only be used
by code which has a reference to the database handy.
This patch breaks this circular dependency, by introducing the
schema_describe_helper interface and providing an implementation for it
in database.hh.
There is another circular dependency: schema <-> replica::table. This is
not addressed by this patch.
Closesscylladb/scylladb#20893
There are two issues in it. First, listing the registry with a consumer callback passes wrong argument to the consumer. Second, the primary key of the registry is wrong. Both issues don't show up, because existing tests that use mock don't read from it, only write. Tests that read from registry are python tests that start scylla and thus use real registry.
Closesscylladb/scylladb#20946
* github.com:scylladb/scylladb:
test: Use corrcet key in sstables registry mock
test: Pass entry status to mock registry consumer