Single-row reads from a large partition issue 64 KiB reads to the data file,
which equals the default span of a promoted index block in the data file.
If users want to increase the selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses the promoted index
to look up the start position of the read in the data file, but the end position
will in practice extend to the next partition, and the amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 I/Os of 32 KiB each.
There is already infrastructure to look up the end position based on the upper
bound of the read, added in anticipation of sharing the promoted index cache,
but it's not effective because it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. The lookup is non-populating on purpose, to avoid
extra index file I/O to read the upper bound; in case the upper bound is far enough
from the lower bound, that extra I/O would only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for a
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, and thus
don't incur extra I/O in cases where looking up the upper bound is not
worth it, that is, when the upper bound is far from the lower bound. If
the upper bound is near the lower bound, then warming up using the lower bound
will populate cached_promoted_index with blocks which allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where both bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same key: both bounds will likely lie in the same block,
and if not, binary search will bring adjacent blocks into the cache. Even
if the upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
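To make the mechanism above concrete, here is a small, self-contained toy model (illustration only; `pi_block` and `data_range` are made-up names, not the actual ScyllaDB promoted-index structures):
```
// Toy model of locating a data-file range via a promoted index; for
// illustration only -- pi_block/data_range are made-up names, not the actual
// ScyllaDB structures. Each block covers clustering keys >= first_key and
// starts at data_file_offset.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>

struct pi_block {
    int64_t first_key;
    uint64_t data_file_offset;
};

// Given the blocks brought into cache by warming up the lower-bound cursor,
// return the data-file range [start, end) covering keys in [lower, upper].
// If the upper bound falls past the cached blocks, fall back to partition_end.
std::pair<uint64_t, uint64_t>
data_range(const std::vector<pi_block>& cached, int64_t lower, int64_t upper,
           uint64_t partition_end) {
    auto after = [] (int64_t k, const pi_block& b) { return k < b.first_key; };
    auto start_it = std::upper_bound(cached.begin(), cached.end(), lower, after);
    uint64_t start = (start_it == cached.begin())
            ? cached.front().data_file_offset
            : std::prev(start_it)->data_file_offset;
    auto end_it = std::upper_bound(cached.begin(), cached.end(), upper, after);
    uint64_t end = (end_it == cached.end()) ? partition_end
                                            : end_it->data_file_offset;
    return {start, end};
}

int main() {
    // Blocks around the slice's lower bound, as warmed up by the cursor.
    std::vector<pi_block> cached = {{100, 0}, {200, 2048}, {300, 4096}};
    // Single-row read around key 250: both bounds land in the same block, so
    // the read covers one promoted-index block instead of the partition tail.
    auto [start, end] = data_range(cached, 250, 250, 1 << 20);
    std::cout << start << ".." << end << "\n"; // prints 2048..4096
}
```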
Fixes #10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64 KiB in total.
After the change, the second read issued 1 aio request worth of 2 KiB. That's because the promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closes scylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
In the current scenario, nodetool status doesn't display information regarding zero token nodes. For example, if 5 nodes are spun up by the administrator, of which 2 are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the "/storage_service/host_id" API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.
Fixes: scylladb/scylladb#19849
Fixes: scylladb/scylladb#17857
Closes scylladb/scylladb#20909
* github.com:scylladb/scylladb:
fix nodetool status to show zero-token nodes
test: move `wait_for_first_completed` to pylib/util.py
token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
before this change, we only record the exception returned
by `upload_file()` in the log message, but do not rethrow it. the exception
thrown by `upload_file()` is not propagated to its caller. instead, the
exceptional future is ignored on purpose -- we need to perform
the uploads in parallel. this is why the task is not marked as failed
even if some of the uploads performed by it fail.
in this change, we
- coroutinize `backup_task_impl::do_backup()`. strictly speaking,
this is not necessary for propagating the exception. but, in order
to ensure that the possible exception is captured before the
gate is closed, and to reduce the indentation, the teardown
steps are performed explicitly.
- in addition to noting down the exception in the logging message,
we also store it in a local variable, which is rethrown
before this function returns.
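a simplified sketch of the resulting pattern (Seastar coroutine style; `files_to_upload()` and `upload_file()` below are placeholder declarations, not the real backup task internals):
```
#include <exception>
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/log.hh>

std::vector<seastar::sstring> files_to_upload();          // placeholder
seastar::future<> upload_file(const seastar::sstring&);   // placeholder

seastar::future<> do_backup_sketch(seastar::gate& gate, seastar::logger& log) {
    std::exception_ptr ex;
    for (auto name : files_to_upload()) {
        // uploads run in parallel under the gate; a failure is logged and
        // remembered instead of aborting the loop
        (void)seastar::with_gate(gate, [&ex, &log, name] {
            return upload_file(name).handle_exception(
                    [&ex, &log, name] (std::exception_ptr e) {
                log.warn("upload of {} failed", name);
                if (!ex) {
                    ex = std::move(e);    // remember the first failure
                }
            });
        });
    }
    co_await gate.close();          // every upload has completed by now
    if (ex) {
        std::rethrow_exception(ex); // propagate, so the task is marked failed
    }
}
```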
Fixes scylladb/scylladb#21248
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21254
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.
Drop the error handling check; a successful repair is a sufficient
passing condition.
Fixes: #21167.
Closes scylladb/scylladb#21208
The skipped ranges should be multiplied by the number of tables.
Otherwise the finished ranges ratio will not reach 100%.
Fixes #21174
Closes scylladb/scylladb#21252
* github.com:scylladb/scylladb:
test: Add test_node_ops_metrics.py
repair: Make the ranges more consistent in the log
repair: Fix finished ranges metrics for removenode
now that we are allowed to use C++23, we have the luxury of using
`std::views::values`.
in this change, we:
- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
we cannot use `std::views::concat` yet. this helper is only
available in C++26.
this reduces the dependency on boost for better maintainability, and
leverages standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
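a minimal standalone illustration of the replacement (example code, not scylla sources):
```
#include <iostream>
#include <map>
#include <ranges>
#include <string>

int main() {
    std::map<std::string, int> latency_ms = {{"read", 3}, {"write", 5}};
    // previously: for (int v : latency_ms | boost::adaptors::map_values)
    for (int v : latency_ms | std::views::values) {
        std::cout << v << "\n";
    }
}
```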
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21265
Currently, running the `nodetool compactionhistory` command or using the rest api `curl -X GET --header "Accept: application/json" "http://localhost:10000/compaction_manager/compaction_history"` returns the compaction history without the `rows_merged` field.
The series computes rows merged during compaction and provides this information to users via both the nodetool command and the rest api. The `rows_merged` field contains information on merged clustering keys across multiple sstable files. For instance, when compacting two sstables of a table consisting of 7 rows where two rows are present in both sstables, the output would have the following format: {1: 5, 2: 2}.
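A minimal sketch of how such a histogram can be computed (a simplified model for illustration, not the actual compaction code):
```
// Simplified model of computing a rows_merged histogram (illustration only,
// not the actual compaction code). The key is the number of input sstables a
// row was merged from; the value is how many rows were merged from that many.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

int main() {
    // For each output row, how many input sstables contributed a version of it
    // (7 rows total, 2 of them present in both input sstables, as above).
    std::vector<int> sstables_per_row = {1, 1, 1, 1, 1, 2, 2};

    std::map<int, uint64_t> rows_merged;
    for (int n : sstables_per_row) {
        ++rows_merged[n];
    }
    for (const auto& [sstables, rows] : rows_merged) {
        std::cout << sstables << ": " << rows << "\n"; // prints "1: 5" and "2: 2"
    }
}
```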
No backport is required. It extends the existing compaction history output.
Fixes https://github.com/scylladb/scylladb/issues/666
Closes scylladb/scylladb#20481
* github.com:scylladb/scylladb:
test/rest_api: Add tests for compactionhistory
nodetool: Add rows merged stats into compactionhistory output
compaction: Update compaction history with collected histogram
compaction: Remove const qualifier from methods creating sstable readers
sstable_set: Add optional statistics to make_local_shard_sstable_reader
make_combined_reader: Add optional parameter, combined_reader_statistics
reader_selector: Extend with maximum reader count
mutation_fragment_merger: Create histogram while consuming mutation fragment batches
A worry was raised that an unprivileged user might be able to read the
system.roles table - which contains the Alternator secret keys (and also
CQL's hashed passwords). This patch adds tests that show that this worry
is unjustified - and acts as a regression test to ensure it never
becomes justified. The tests show that an unprivileged user cannot read
the system.roles table using either CQL or Alternator APIs.
More specifically, the two tests in this patch demonstrate that:
* The Alternator API does not allow an unprivileged user to read ANY system
table, unless explicitly granted permissions for that table.
* The CQL API whitelists (see service::client_state::has_access) specific
system tables - e.g., system_schema.tables - that are made readable to any
unprivileged user. But the system.auth table is NOT whitelisted in this
way - and is unreadable to unprivileged users unless explicitly granted
permissions on that table.
The new tests pass on both Scylla and Cassandra.
Refs #5206 (that issue is about removing the Alternator secret keys from
the roles table - but stealing CQL salted hashes is still pretty bad, so
it's good to know that unprivileged users can't read them).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#21215
While documenting materialized views in a new document (Refs #16569)
I encountered a few questions on how various CQL operations work on
a table that has views, and this patch contains tests that clarify their
answer - and can later guarantee that the answer doesn't unintentionally
change in the future. The questions that these tests answer are:
1. That TRUNCATE on a base table also TRUNCATEs its views. This is just
a basic test, with no attempt to reproduce issue #17635 (which is
about the truncation of the base and views not being atomic).
2. That DROP TABLE is *not allowed* on a base table that has views.
3. That DROP KEYSPACE is allowed, even if there are tables with views.
4. Test that ALTER TABLE tbl DROP is never allowed in Cassandra, but
allowed in some cases by Scylla
5. Test that ALTER TABLE tbl ADD is allowed, and "SELECT *" expands to
select the new column into the materialized view as well.
All the new tests pass on both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#21142
There's a `missing_summary_first_last_sane` test case that uses a very specific way of modifying an sstable -- it loads one from resources, then tries to "write" the loaded stuff elsewhere. For that it uses a special-purpose test::store() helper and a bunch of auxiliary ones from the same class. Those aux helpers are not used anywhere else and are also very specific to this test case, so it makes sense to keep this whole functionality in a single helper.
Closes scylladb/scylladb#21255
* github.com:scylladb/scylladb:
test: Squash test::change_generation_number() into test::store()
test: Squash test::change_dir() into test::store()
test: Coroutinize sstables::test::store()
This commit adds a new test case 'test_group_by_static_column_and_tombstones'
to verify the behavior of GROUP BY queries with static columns. The test is
adapted from Cassandra's test suite and aims to reproduce issue #21267.
Original, larger test:
cassandra_tests/validation/operations/select_group_by_test.py::testGroupByWithPaging()
Closes scylladb/scylladb#21270
In the current scenario, nodetool status doesn't display information
regarding zero token nodes. For example, if 5 nodes are spun up by the
administrator, of which 2 are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the "/storage_service/host_id"
API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
Robust topology tests are added, which spin up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.
The test `test_status_keyspace_joining_node` has been removed. This test is
based on a case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, nodes with "host_id=None"
go undetected. Since this case is impossible anyway, we can get rid of the test.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to the 6.2 version, since earlier versions don't support zero token nodes.
Fixes: scylladb/scylladb#19849
There are no other usages of the former helper other than ones immediately followed by
the latter, so there is no point in keeping it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are no other usages of the former helper other than ones immediately followed by
the latter, so there is no point in keeping it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Although OSS doesn't limit the number of created service levels, match the
enterprise limit to decrease divergence between OSS and enterprise in the
test.
Fixes scylladb/scylladb#21044
Closes scylladb/scylladb#21045
We introduce an auxiliary type representing a service level for making
it easier to adjust the tests in Enterprise. We move the responsibility
of producing create statements for service levels to the class, so we
only need to modify the code in one place when necessary.
All existing relevant tests have been adjusted to this change.
Closes scylladb/scylladb#21230
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB and potentially a crash.
The fix is to remove the aforementioned `for` loop altogether and rethrow the exception: the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event remains present
in the global requests queue and is therefore going to be executed as the
very next event.
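A minimal, self-contained illustration of the bug pattern and the fix (the `guard`, `add_entry` and `concurrent_modification` types below are simplified stand-ins, not the real raft code):
```
// Self-contained illustration of the bug; `guard` and `add_entry` are
// simplified stand-ins for the raft guard and API, not the real code.
#include <memory>
#include <stdexcept>

struct guard {
    std::unique_ptr<int> token = std::make_unique<int>(0); // null once moved-from
};

struct concurrent_modification : std::runtime_error {
    concurrent_modification() : std::runtime_error{"group0_concurrent_modification"} {}
};

void add_entry(guard g) {                  // consumes the guard by value
    if (!g.token) {
        throw std::logic_error{"used a moved-from guard"};
    }
    throw concurrent_modification{};       // simulate a concurrent group0 write
}

void alter_keyspace_buggy(guard g) {
    for (;;) {
        try {
            add_entry(std::move(g));       // first iteration moves the guard away...
            return;
        } catch (const concurrent_modification&) {
            // ...so the retry re-uses a moved-from guard on the next iteration
        }
    }
}

void alter_keyspace_fixed(guard g) {
    // no retry loop: let the exception propagate; the topology state machine
    // re-executes the rf_change event, which is still in the request queue
    add_entry(std::move(g));
}
```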
Note: refactor is implemented in the follow-up commit.
Fixes: scylladb/scylladb#21102
Should be backported to every 6.x branch, as it may lead to a crash.
Closes scylladb/scylladb#21121
* github.com:scylladb/scylladb:
test: add UT to test retrying ALTER tablets KEYSPACE
cql/tablets: fix indentation in `rf_change` event handler
cql/tablets: fix retrying ALTER tablets KEYSPACE
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable, leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
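A sketch of the check described above, with simplified stand-in types (`memtable_view` and `can_purge_tombstone` are illustrative, not the real ScyllaDB interfaces):
```
// Sketch of the purge check, with simplified types (not the real ScyllaDB
// interfaces): a tombstone may only be purged if no memtable holds the key
// with live data older than the maximum purgeable timestamp.
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using api_timestamp = int64_t;

struct memtable_view {
    // returns the minimum live timestamp of the key, if the key is present
    std::function<std::optional<api_timestamp>(int64_t key)> min_live_timestamp_of;
};

bool can_purge_tombstone(int64_t key,
                         api_timestamp max_purgeable_timestamp,
                         const std::vector<memtable_view>& memtables) {
    for (const auto& mt : memtables) {
        auto ts = mt.min_live_timestamp_of(key);
        if (ts && *ts < max_purgeable_timestamp) {
            // Live data in a memtable may still be shadowed by this tombstone;
            // purging it now could resurrect that data once the memtable is flushed.
            return false;
        }
    }
    return true;
}
```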
Fixes #20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closes scylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
* Add `--max-failures` flag to test.py, which will stop the execution after a given number of failures
* Helps with a "fail-fast" approach and can be used to improve CI speed, especially the 100-times run
* Adds the number of cancelled tests to both summary and junit xml. I did not include them in boost, since it does not contain any statistics.
* Removes unnecessary list creation in test.py
* Completely unrelated change, but it is small enough that I feel it can be included as part of this one. If this is an issue I can create separate PR for it
* Add `Test.started` property
* Helps with determining the current status of the Test and differentiating cancelled/not started tests.
* Add `Test.failed` and `Test.did_not_run` read-only computed properties
* Helper methods to determine status, instead of using `Test.success`, which does not tell the entire story
* Fix `ScyllaClusterManager.stop()` method, so it doesn't fail when run multiple times
* This happens when tasks are cancelled; not sure yet why, it is almost certainly unwanted behaviour, but this behaviour was already there and with this fix it no longer causes errors
I will use backport/None for now as it is a new feature.
Fixes https://github.com/scylladb/qa-tasks/issues/1714
Closes scylladb/scylladb#21098
* github.com:scylladb/scylladb:
test.py: Add option to fail after number of failures
test.py: Add started, failed and did_not_run properties to Test
test.py: Remove unnecessary list creation
test: lib: Fix ScyllaClusterManager.stop()
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of the view updates, which is currently unlimited. Because of that, view updates may take up so much memory that we run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semaphore and is currently used only for user writes - it's used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction.
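A schematic sketch of the semaphore routing described above (all names below are simplified stand-ins, not the actual database code):
```
// Schematic sketch of the semaphore selection; all names are simplified
// stand-ins, not the actual database code.
#include <cstddef>

struct reader_semaphore {
    std::size_t memory_capacity;   // admission, queueing etc. elided
};

struct read_semaphores {
    reader_semaphore user_reads;
    reader_semaphore view_update_reads;   // new: half the capacity of user reads

    explicit read_semaphores(std::size_t user_capacity)
        : user_reads{user_capacity}
        , view_update_reads{user_capacity / 2} {
    }

    // A base-table read done only to build a view update (read-before-write)
    // is admitted on its own semaphore, so a burst of such reads queues there
    // instead of starving regular user reads.
    reader_semaphore& semaphore_for(bool is_view_update_read) {
        return is_view_update_read ? view_update_reads : user_reads;
    }
};
```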
This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under a heavy view update workload.
The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with cl < ALL; the mentioned replica replies to write requests more slowly while new ones keep being sent to it.
* each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies
An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.
In the test above:
* without the fix, the latency of reads increases to over 50 s
* with the fix, the latency of reads stays below 20ms
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
The patch is not that small and it isn't fixing a regression, so no backports
Closes scylladb/scylladb#20887
* github.com:scylladb/scylladb:
test: add test for high view update concurrency causing bad_allocs
test: add test for high view update concurrency degrading read latency
mv: add a dedicated read concurrency semaphore for view update read before writes
Before Python 3.12, f-strings could not reuse, inside a replacement field, the
same quote character that encloses the string. Change the type of quotation mark
in get_cgroup so it can be used with earlier Python versions.
Closes scylladb/scylladb#21209
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
For a table with NullCompactionStrategy and
TimeWindowCompactionStrategy, the test
- inserts a bunch of data and flushes the table
- deletes/updates some data, deletes a range of data and flushes
the table
- triggers a major compaction and calls compactionhistory
to retrieve and validate the histogram
Incorporate rows merged statistics into the output of the compactionhistory
command. Depending on the requested format type, the output has
a different form.
For instance, compacting two sstables of a table consisting of 7
rows where two rows are part of both sstables, the output
would have the following format:
text: {1: 5, 2: 2}
json: [{"key":1,"value":5},{"key":2,"value":1}]
yaml: - key: 1
value: 5
- key: 2
value: 1
The maximum reader count makes it possible to predict the number of readers
that can be created with create_new_readers(). This helps to
correctly allocate the vector size in the rows_merged statistics
when a combined reader is created via make_combined_reader.
before this change, we built some tests as if they were Seastar tests.
but after 415c83fa, these tests failed to link, because
Seastar::seastar_testing does not expose `-DSEASTAR_TESTING_MAIN` in
its cflags. this behavior of Seastar::seastar_testing is expected,
because a test linking against this library is not necessarily driven
by the `main()` provided by `testing/seastar_test.hh`.
so, in this change, we correct the `KIND` parameter of these tests,
so that they use `KIND BOOST`, as these tests can be driven by the
`main()` provided by Boost.Test's driver. also, there are some tests
driven by Boost.Test's `main()` which in the meanwhile utilize
seastar_testing, so let's add `Seastar::seastar_testing` to their
`LIBRARIES`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21183
the log.hh under the root of the tree was created to keep backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to the `utils` directory, as it is based solely
on seastar, and can be used by all subsystems.
in this change, we move log.hh into utils/log.hh so that it is more
modularized. this also improves readability: when one sees
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` files in the tree.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This commit adds a test for checking whether a large view update workload
can cause Scylla to run out of memory.
In the test, we keep writing to a table with a materialized view
with a limited number of rows, causing overwrites which require reading
from the table to perform view updates.
Currently, due to the unlimited concurrency of view update reads, we may
use too much memory which can lead to bad_allocs, causing Scylla to fail.
To reach the failing state more consistently, we add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use high concurrency and
large row size to use up all Scylla's memory quickly.
The test fails if Scylla runs out of memory and aborts, and succeeds
otherwise.
This commit adds a test for checking whether a large view update workload
impacts the latency of other user reads.
In the test, we first create a table for reads and another table with
a materialized view. We then start writing to the table with the view
with a limited number of rows - when overwriting, we need to read the
previous value of the row to prepare a delete of the old row in the view.
This should not impact the latency of the read workload from the other
table that we start at the same time. The test fails if any of the reads
times out.
To reach the failing state more consistently, we add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use a lower threshold for
queueing reads on the semaphore, to see the impact of view update reads
earlier.
Because of the high load, the writes may time out, but that's expected
- we fail the test only if the user reads time out.
now that we are allowed to use C++23, we have the luxury of using
`std::views::keys`.
in this change, we:
- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`
this reduces the dependency on boost for better maintainability, and
leverages standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21198
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of the view updates that is
currently unlimited. Because of that, view updates may take up so much memory that
we may run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semaphore
and is currently used only for user writes - it's used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
std::ranges::to<>() has a little protocol with containers to
allow them to optimize their construction from ranges. Implement it
for small_vector. It optimizes ranges whose size can be determined
quickly, or that can be traversed twice to determine the size, by reserving
up front. Single-pass ranges (std::ranges::input_range) use the less
efficient push_back method.
A unit test (which fails without the new constructor) is added.
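A sketch of the protocol on a simplified vector-like container (illustrative only; `tiny_vector` is not the actual utils::small_vector code):
```
// Illustrative sketch of the optimization on a simplified vector-like
// container; tiny_vector is not the actual utils::small_vector code.
#include <cstddef>
#include <ranges>
#include <utility>
#include <vector>

template <typename T>
class tiny_vector {
    std::vector<T> _storage;   // stand-in for the real inline/heap storage
public:
    tiny_vector() = default;

    // std::ranges::to<tiny_vector<T>>(r) prefers this constructor when it
    // exists, letting the container size itself up front.
    template <std::ranges::input_range R>
    tiny_vector(std::from_range_t, R&& r) {
        if constexpr (std::ranges::sized_range<R> || std::ranges::forward_range<R>) {
            // size is known cheaply, or the range can be traversed twice
            _storage.reserve(static_cast<std::size_t>(std::ranges::distance(r)));
        }
        for (auto&& e : r) {
            // single-pass (input) ranges fall back to plain push_back
            _storage.push_back(std::forward<decltype(e)>(e));
        }
    }

    void push_back(const T& v) { _storage.push_back(v); }
    auto begin() const { return _storage.begin(); }
    auto end() const { return _storage.end(); }
    std::size_t size() const { return _storage.size(); }
};

// usage: auto v = std::views::iota(0, 100) | std::ranges::to<tiny_vector<int>>();
```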
Closes scylladb/scylladb#21094
The adaptors.hpp header includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace them with what is needed.
Closes scylladb/scylladb#21187
now that we are able to use the ranges library provided by the C++
standard library, there is no need to use the homebrew `ranges::to()`.
in this change, we switch from `ranges::to()` to
`std::ranges::to()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in 787ea4b1, we construct a new `storage_options` for each sstable
to be restored. the `location` of the new `storage_options` instances
is composed of the configured `prefix` and the dirname of each toc
component. but instead of separating them with "/", we just concatenate
them. this breaks the test if the specified keys representing toc
components include a dirname in them.
in this change
- data_directory: instead of using "{prefix}{dirname}", we use
"{prefix}/{dirname}".
- test/object_store: update the existing test to add a suffix
in the keys of the toc objects to mimic the typical use case.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21170
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.
After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.
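A schematic sketch of the resulting behaviour (the types below are stand-ins; only `release_base_resources()` mirrors a call mentioned above):
```
// Schematic sketch of the change; the types are stand-ins, and only
// release_base_resources() mirrors a call mentioned in the description.
#include <utility>

struct reader_permit {
    bool holds_base_resources = false;
    void release_base_resources() { holds_base_resources = false; }
};

class multishard_combining_reader {
    reader_permit _permit;
public:
    explicit multishard_combining_reader(reader_permit permit)
        : _permit(std::move(permit)) {
        // Previously every caller had to remember to do this before handing
        // the permit over, or risk a cross-semaphore deadlock once child
        // readers on other shards start acquiring their own resources.
        // Now the reader releases the admission resources itself, so both
        // admitted and not-yet-admitted permits are safe to pass in.
        _permit.release_base_resources();
    }
};
```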
Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
This fixes (indirectly) a regression introduced by d98708013c, so it has to be backported to 6.2
Closes scylladb/scylladb#21058
* github.com:scylladb/scylladb:
test/boost/mutation_test: add test for multishard permit safety
test/lib/reader_lifecycle_policy: add semaphore factory to constructor
test/lib/reader_lifecycle_policy: rename factory_function
repair/row_level: drop now unneeded release_base_resource() calls
readers/multishard: make multishard reader safe to create with admitted permits
Fixes #18903
Adds a "transitional" internode encryption mode, under which all _outgoing_ RPC connections will use TLS, but we will still accept any incoming non-tls connection.
This allows an operator to perform a move to TLS RPC without cluster downtime:
1. For each server, add certificate etc options to server_encryption_options + internode_encryption=none + set ssl_storage_port + restart (rolling)
2. For each server, set internode_encryption=transitional + RR
3. For each server, set internode_encryption=all + RR
Closes scylladb/scylladb#18939
* github.com:scylladb/scylladb:
test::topology: Add test for TLS upgrade and downgrade of internode encryption
docs: Add internode_encryption=transitional documentation
messaging_service: Add "transitional" internode encryption mode
messaging_service: Create TLS connector even if internode_enc=none when certs set
before this change, scylla's CMake-based build system consumes the Seastar
library by including it directly. but this fails to address the need
to link against Seastar shared libraries in Debug and Dev builds, while
linking against the static libraries in other builds, because Seastar
uses the `BUILD_SHARED_LIBS` CMake variable to determine if it builds
shared libraries, and we cannot assign different values to this
CMake variable based on the current configure type -- CMake does not
support that. see https://gitlab.kitware.com/cmake/cmake/-/issues/19467
in order to address this problem, we have a couple of possible
solutions:
- to enable Seastar to build both shared and static libraries in a
single pass. to do so without sacrificing performance, we have to build
all object files twice: once with -fPIC, once without. in order
to accomplish this goal, we need to develop machinery to
propagate the same settings to these two builds. this would
further complicate the design of Seastar's build system.
- to build the Seastar libraries twice in scylla. we could use
the ExternalProject module to implement this, but it'd be
complicated to extract the compile options and link options
previously populated by Seastar's targets with CMake --
we would have to replicate all of them in scylla. this is
out of the question.
- to build the Seastar libraries twice before building scylla,
and let scylla consume them using CMake config files or
.pc files. this is a compromise. it enables scylla to
drive the build of the Seastar libraries and to consume
the compile options and link options. the downside is:
* the generated compilation database (compile_commands.json)
does not include the commands building Seastar anymore.
* the build system of scylla does not have finer-grained
control over the build process of seastar. for instance,
we cannot specify a build dependency on a certain seastar
library and just build it, instead of building the whole
seastar project.
turns out the last approach is the best one we can have
at this moment. this is also the approach used by the existing
`configure.py`.
in this change, we
- add FindSeastar.cmake to
* detect the preconfigured Seastar builds, and
* extract the build options from .pc files
* expose library targets to be consumed by parent project
- add Seastar as an external project, so we can build it from
the parent project.
this is atypical compared to standard ExternalProject usage:
- Seastar's build system should already be configured at this point.
- We maintain separate project variants for each configuration type.
Benefits of this approach:
- Allows the parent project to consume the compile options exposed by
.pc file. as the compile options vary from one config to another.
- Allows application of config-specific settings
- Enables building Seastar within the parent project's build system
- Facilitates linking of artifacts with the external project target,
establishing proper dependencies between them
we will update `configure.py` to merge the compilation database
of scylla and seastar.
Refs scylladb/scylladb#2717
---
this is a CMake-related change, hence no need to backport.
Closes scylladb/scylladb#21131
* github.com:scylladb/scylladb:
build: cmake: use GENERATOR_IS_MULTI_CONFIG property to detect multi-config
build: cmake: consume Seastar using its .pc files
build: do not use `mode` as the index into `modes`
build: cmake: detect and link against GnuTLS library
build: cmake: detect and link against yaml-cpp
build: cmake: link Seastar with Seastar::<COMPONENT>
build: cmake: define CMake generate helper funcs in scylla
To reduce dependency load, change uses of boost ranges to std::ranges.
The first patch is preparation, replacing a construct that isn't easy to support with std ranges with something simpler.
No backport as this is a code cleanup.
Closes scylladb/scylladb#21122
* github.com:scylladb/scylladb:
schema: replace boost ranges with std ranges
schema: precompute all_columns_in_select_order()
The statistics_rewrite test case copies an sstable from resources two
times:
- first time -- explicitly, by listing resource components and copying
files to the test temp dir
- second time -- implicitly, by calling create_links(), which links the copied
files as a new set in the staging/ subdirectory
The 2nd step is not needed and the history of changes justifies that.
The test itself appeared with 70b793e4d3 and it only contained the 2nd
"copying" -- the test linked files from the resource directory and then worked
on the newly created set.
Later, commit 59c57861ae added the first step and copied the files
from resources into the test temp dir. At this point linking the copied files
became pointless, but it was preserved. Let's remove it now.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#21097
before this change, we link against the targets defined in Seastar's
source tree. but these targets are not part of Seastar's public
interface -- they are not exposed by Seastar's CMake config files.
so, let's link against the target names qualified by the library module
name. this also prepares for the transition to using Seastar without
including it directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Allow callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags as is done currently. That
method doesn't scale, so use a factory instead.
To reader_factory_function. We are about to add new factory function
parameters, so the current factory_function has to be renamed to
something more specific.