Commit Graph

3988 Commits

Author SHA1 Message Date
Kefu Chai
9f8adcd207 backup_task: track the first failure uploading sstables
before this change, we only record the exception returned
by `upload_file()`, and rethrow the exception. but the exception
thrown by `update_file()` not populated to its caller. instead, the
exceptional future is ignored on pupose -- we need to perform
the uploads in parallel.  this is why the task is not marked fail
even if some of the uploads performed by it fail.

in this change, we

- coroutinize `backup_task_impl::do_backup()`. strictly speaking,
  this is not necessary to populate the exception. but, in order
  to ensure that the possible exception is captured before the
  gate is closed, and to reduce the intentation, the teardown
  steps are performed explicitly.
- in addition to note down the exception in the logging message,
  we also store it in a local variable, which it rethrown
  before this function returns.

Fixes scylladb/scylladb#21248
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21254
2024-10-28 12:54:27 +03:00
Kefu Chai
24d14b601b treewide: s/boost::adaptors::map_values/std::views::values/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::values`.

in this change, we:

- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
  we cannot use `std::views::concat` yet. this helper is only
  available in C++26.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21265
2024-10-27 21:32:45 +02:00
Kefu Chai
e7d6ab576b backup_task: remove unused member variable
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21258
2024-10-25 11:49:06 +03:00
Nadav Har'El
5fd3177057 Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.

This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.

The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory.

This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.

The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction.

This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view udpate workload.

The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it.
* each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies

An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.

In the test above:
* without the fix, the latency of reads increases over 50s
* with the fix, the latency of reads stays below 20ms

Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805

The patch is not that small and it isn't fixing a regression, so no backports

Closes scylladb/scylladb#20887

* github.com:scylladb/scylladb:
  test: add test for high view update concurrency causing bad_allocs
  test: add test for high view update concurrency degrading read latency
  mv: add a dedicated read concurrency semaphore for view update read before writes
2024-10-22 22:17:23 +03:00
Avi Kivity
847c850034 schema: add accessors for primary key columns and non-primary-key columns
It's somewhat common to ask for the partition key and clustering key
columns, or for the static and regular columsn. Provide accessors for them
rather than requiring the user to glue them.

Some callers are converted.

Closes scylladb/scylladb#21191
2024-10-22 15:01:14 +02:00
Kefu Chai
6ead5a4696 treewide: move log.hh into utils/log.hh
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.

in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 06:54:46 +03:00
Wojciech Mitros
f2c740710c test: add test for high view update concurrency degrading read latency
This commit add a test for checking whether a large view update workload
impacts the latency of other user reads.
In the test, we first create a table for reads and another table with
a materialized view. We then start writing to the table with the view
with a limited number of rows - when overwriting, we need to read the
previous value of the row to prepare a delete of the old row in the view.
This should not impact the latency of the read workload from the other
table that we start at the same time. The test fails if any of the reads
times out.
To reach the failing state more consistantly, we use add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use a lower threshold for
queueing reads on the semaphore, to see the impact of view update reads
earlier.
Because of the high load, the writes may timeout, but that's expected
- we fail the test only if the user reads time out.
2024-10-21 12:34:55 +02:00
Kefu Chai
5cd619a60c treewide: s/boost::adaptors::map_keys/std::views::keys/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::keys`.

in this change, we:

- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21198
2024-10-21 12:47:52 +03:00
Wojciech Mitros
242079d70b mv: add a dedicated read concurrency semaphore for view update read before writes
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.

This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.

The second issue fixed by this patch is the concurrency of the view updates that is
currently unlimited. Because of that view updates may take up so much memory that
they we may run out of memory.

This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.

The new semaphore has half the capacity of the regular user read concurrency semahpore
and is currently used only for user writes - is't used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.

Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
2024-10-21 11:02:06 +02:00
Avi Kivity
c3be2489ce treewide: drop includes of <boost/range/adaptors.hpp>
This includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace by what is needed.

Closes scylladb/scylladb#21187
2024-10-20 17:17:11 +03:00
Botond Dénes
568b767ec3 Merge 'schema: convert from boost ranges to std ranges' from Avi Kivity
To reduce dependency load, change uses of boost ranges to std::ranges.

The first patch is preparation, replacing a construct that isn't easy to support with std ranges with something simpler.

No backport as this is a code cleanup.

Closes scylladb/scylladb#21122

* github.com:scylladb/scylladb:
  schema: replace boost ranges with std ranges
  schema: precompute all_columns_in_select_order()
2024-10-18 08:42:50 +03:00
Avi Kivity
d58cd262ca interval: change boost ranges to std ranges
Reduce dependency load.

size_estimates_virtual_reader is adjusted due to poor boost ranges
and std ranges interoperability.
2024-10-16 13:21:43 +03:00
Avi Kivity
3a75efd6d4 size_estimates_virtual_reader: make virtual_row_iterator more conforming
To work with std::ranges, an iterator has to have a default constructor,
and be assignable. Add the default constructor and convert references
to pointers to support this.
2024-10-16 13:21:25 +03:00
Avi Kivity
820509026f schema: replace boost ranges with std ranges
To reduce dependency load, use std ranges instead of boost ranges.

The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.

Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
2024-10-15 16:42:54 +03:00
Avi Kivity
db14a01901 Merge 'Use table id as system.sstables partition key' from Pavel Emelyanov
The system.sstables (a.k.a. sstables registry) primary key is "string location" as partition key and "uuid generation" as clustering one. The "location" part was taken from table.config.datadir value which, in turn, a string containing path to on-disk files if the table was located locally, e.g. /var/lib/scylla/data/ks/cf-abc123 one. Recently [1] the datadir was moved from table config onto storage options, but this string is still used as registry key.

Other than being owned by a table with ID, sstables are accessed by restore-from-object-storage code [2]. To make it work, both storage driver and sstable_directory helper class maintain two formats of object prefixes for sstables components. For S3-backed sstables having a record in registry, the path used is s3://bucket/generation/component. For restore code there are user-provided prefixes that do not match the aforementioned pattern. The selection between those two is now made by checking sstable state, which is not obvious and may cause troubles for tiered storage driver.

This patch changes  the registry schema so that partition key becomes "uuid owner" and is set to be table.id() value. This is to stop using the local path by S3 backed sstables. Also this change makes it possible for storage driver and sstable directory to rely on the storage options only to tell different bucket prefixes formats from each other.

As a side effect, the make_s3_object_name() helper, that generates the proper object name, becomes explicit for restore-from-S3 usage. Now it relies on the sstable::filename() calling this->prefix() behind the scenes and the latter to return the user-provided prefix, which is pretty fragile construction.

No need to backport (and it's not going to be easy to do it), storage options feature is still experimental

Refs #20675 [1]
Refs #20305 [2]

Closes scylladb/scylladb#20998

* github.com:scylladb/scylladb:
  sstables: Flatten S3 object name making
  sstable_directory: Flatten directory lister creation
  treewide: Rename sstable registry location field to be owner
  system_keyspace: Change sstables registry partition key type
  sstables: Keep location variant on s3 backend too
  storage_options: Use variant on S3 options
  sstables: Split sstable::filename() helper
  sstables: Add s3_storage::owner() helper
2024-10-13 20:08:43 +03:00
Kamil Braun
96070bb5b3 Merge 'storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from Sergey Zolotukhin
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in  filter_for_query(): the map is considered incorrect if the list  of replicas contains a node from a data center whose replication factor is 0.

 Please note: This PR does not fix the issue found in scylladb/scylladb#20282;   it only adds condition checks to prevent undefined behavior in cases of  inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

Closes scylladb/scylladb#20851

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-11 15:02:02 +02:00
Pavel Emelyanov
031893259a treewide: Rename sstable registry location field to be owner
This is sort of continuation of the previous patch. The partition key in
the registry is now table_id, not string, and is better called "owner",
not "location". This patch is s/location/owner/ over specific places
that include field name in the schema, argument names in registry
maintenance classes and tests accessing the selected row fields by name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 14:11:28 +03:00
Pavel Emelyanov
3315e3a2a9 system_keyspace: Change sstables registry partition key type
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.

This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:48:09 +03:00
Kamil Braun
4d99cd2055 Merge 'raft: fast tombstone GC for group0-managed tables' from Emil Maskovsky
Add the gossip state for broadcasting the nodes state_id.

Implemented the Group0 state broadcaster (based on the gossip) that will broadcast the state id of each node and check the minimal state id for the tombstone GC.

When there is a change in the tombstone GC minimal state id, the state broadcaster will update the tombstone GC time for the group0-managed tables.

The main component of the change is the newly added `group0_state_id_handler` that keeps track, broadcasts and receives the last group0 state_ids across all nodes and sets the tombstone GC deletion time accordingly:
* on each group0 change applied, the state_id handler broadcasts the state_id as a gossip state (only if the value has changed)
* the handler checks for the node state ids every refresh period (configurable, 1h by default)
* on every check, the handler figures out the lowest state_id (timeuuid), which is state_id that all of the nodes already have
* the timestamp of this minimum state_id is then used to set the tombstone GC deletion time
* the tombstone GC calculation then uses that deletion time to provide the GC time back to the callers, e.g. when doing the compaction
* (as the time for tombstone GC calculation has the 1s granularity we actually deduce 1s from the determined timestamp, because it can happen that there were some newer mutations received in the same second that were not distributed across the nodes yet)

This change introduces a new flag to the static schema descriptor (`is_group0_table`) that is being checked for this newly added mode in the tombstone GC. We also add a check (in non-release builds only) on every group0 modification that the table has this flag set.

The group0 tombstone GC handling is similar to the "repair" tombstone GC mode in a sense (that the tombstone GC time is determined according to a reconciliation action), however it is not explicitly visible to (nor editable by) the user. And also the tombstone GC calculation is much simpler than the "repair" mode calculation - for example, we always use the whole range (as opposed to the "repair" mode that can have specific repair times set for specific ranges).

We use the group0 configuration to determine the set of nodes (both current and previous in case of joint configuration) - we need to make sure that we account for all the group0 nodes (if any node didn't provide the state_id yet, the current check round will be skipped, i.e. no GC will be done until all known nodes provide their state_id timestamp value).

Also note that the group0 state_id handling works on all nodes independently, i.e. each node might have its own (possibly different) state depending on the gossip application state propagation. This is however not a problem, as some nodes might be behind, but they will catch up eventually, and this solution has the benefit of being distributed (as opposed to having a central point to handle the state, like for example the topology coordinator that has been considered in the early stages of the design).

Fixes: scylladb/scylla#15607

New feature, should not be backported.

Closes scylladb/scylladb#20394

* github.com:scylladb/scylladb:
  raft: add the check for the group0 tables
  raft: fast tombstone GC for group0-managed tables
  tombstone_gc: refactor the repair map
  raft: flag the group0-managed tables
  gossip: broadcast the group0 state id
  raft/test: add test for the group0 tombstone GC
  treewide: code cleanup and refactoring
2024-10-11 11:52:27 +02:00
Sergey Zolotukhin
8db6d6bd57 Avoid an extra call to block_for in db::filter_for_query. 2024-10-11 09:38:25 +02:00
Sergey Zolotukhin
ad93cf5753 Improve code readability in consistency_level.cc and storage_proxy.cc
Add const correctness and rename some variables to improve code readability.
2024-10-11 09:38:25 +02:00
Benny Halevy
d34878e96c view: check_needs_view_update_path: get token_metadata_ptr
check_needs_view_update_path is async and might yield
so the token_metadata reference passed to it must be kept
alive throughout the call.

Fixes scylladb/scylladb#20979

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20980
2024-10-09 20:56:21 +03:00
Emil Maskovsky
a03e98d6e8 raft: fast tombstone GC for group0-managed tables
Set the tombstone GC time for group0-managed tables to the minimal state
id of the group0 nodes.

The check is being done based on a timer, iterating through each node
(according to the group0 topology configuration) and taking the minimum
across all nodes.

This miminum timestamp is then be used to set the tombstone GC time
for the tombstone GC of all the group0-managed tables.

Fixes: scylladb/scylla#15607
2024-10-08 21:07:30 +02:00
Emil Maskovsky
22471410e7 raft: flag the group0-managed tables
Add the schema flag to indicate the group0-managed tables.

This is to be used to identify and list the group0-managed tables.
2024-10-08 20:53:54 +02:00
Kefu Chai
ee36358a60 db: remove unused includes
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.

please note, since we have `using seastar::shared_ptr` in
`seastarx.h`, this renders `#include <seastar/core/shared_ptr.hh>`
unnecessary if we don't need the full definition of `seastar::shared_ptr`.

so, in this change, all the unused includes are removed. but there are
some headers which are actually used, while still being identified by
this tool. these includes are marked with "IWYU pragma: keep".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-04 20:48:18 +08:00
Kamil Braun
e67016540c Merge 'Node replace and remove operations: Add deprecate IP addresses usage warning.' from Sergey Zolotukhin
- As part of deprecation of IP address usage, warning messages were added when IP addresses specified in the `ignore-dead-nodes` and `--ignore-dead-nodes-for-replace` options for scylla and nodetool.
- Slight optimizations for `utils::split_comma_separated_list`, ` host_id_or_endpoint lists` and `storage_service` remove node operations, replacing `std::list` usage with `std::vector`.

Fixes scylladb/scylladb#19218

Backport: 6.2 as it's not yet released.

Closes scylladb/scylladb#20756

* github.com:scylladb/scylladb:
  config: Add a warning about use of IP address for join topology and replace operations.
  nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
  tests: Fix incorrect UUIDs in test_nodeops
  utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
2024-10-03 11:08:28 +02:00
Kefu Chai
f9091066b7 treewide: replace boost::irange with std::views::iota where possible
when building scylla with the standard library from GCC-14.2, shipped by
fedora 41, we have following build failure:

```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/init.cc.o -MF CMakeFiles/scylla-main.dir/Debug/init.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/init.cc.o -c /home/kefu/dev/scylladb/init.cc
In file included from /home/kefu/dev/scylladb/init.cc:12:
In file included from /home/kefu/dev/scylladb/db/config.hh:20:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                              ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                                      ^
3 errors generated.
[16/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/keys.cc.o
[17/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/counters.cc.o
[18/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/partition_slice_builder.cc.o
[19/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
FAILED: CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -MF CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -c /home/kefu/dev/scylladb/mutation_query.cc
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:11:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                              ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                                      ^
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:37:
In file included from /home/kefu/dev/scylladb/db/snapshot-ctl.hh:20:
/home/kefu/dev/scylladb/tasks/task_manager.hh:403:54: error: no member named 'irange' in namespace 'boost'
  403 |         co_await coroutine::parallel_for_each(boost::irange(0u, smp::count), [&tm, id, &res, &func] (unsigned shard) -> future<> {
      |                                               ~~~~~~~^
4 errors generated.
```

so let's take the opportunity to switch from `boost::irange` to
`std::views::iota`.

in this change, we:

- switch from boost::irange to std::views::iota for better standard library compatibility
- retain boost::irange where step parameter is used, as std::views::iota doesn't support it
- this change partially modernizes our range usage while maintaining
- existing functionality

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20924
2024-10-03 10:33:33 +03:00
Sergey Zolotukhin
3b9033423d utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
- utils::split_comma_separated_list now accepts a reference to sstring instead
  of a copy to avoid extra memory allocations. Additionally, the results of
  trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
  parse_node_list and api/storage_service::set_storage_service were changed to use
  std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
  std::vector is a more cache-friendly structure,  resulting in better performance.
2024-10-02 11:56:59 +02:00
Benny Halevy
5a0f3889e0 treewide: use std::ranges sort functions rather than boost
Using the standard library is preffered over boost.

In cql3/expr/expression.cc to_sorted_vector got more of a
face-list and was modernized to use also std::unique
and while at it, to move its input range in the uniquely sorted
result vector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-01 14:19:05 +03:00
Pavel Emelyanov
24598848a9 Merge 'virtual_tables: snapshots: include all snapshots' from Benny Halevy
Use database::get_snapshot_details to get the details
of all snapshots on disk, in particular those of
deleted tables.

Add test_snapshots_dropped_table to test listing
of snapshots of a deleted table.
And harden the existing test cases to use a unique
snapshot tag and to delete it when the test ends.

Fixes #18313

* No backport required at this time since this is rather minor UX issue that weren't hit in the field AFAIK

Closes scylladb/scylladb#20869

* github.com:scylladb/scylladb:
  cql-pytest: test_virtual_tables: add test_snapshots_multiple_keyspaces
  virtual_tables: snapshots: include all snapshots
2024-10-01 13:56:13 +03:00
Avi Kivity
fb8743b2d6 Merge 'sstables: Fix use-after-free on page cache buffer when parsing promoted index entries across pages' from Tomasz Grabiec
This fixes a use-after-free bug when parsing clustering key across
pages.

Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).

Details of the first problem:

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.

A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

Details of the solution:

We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.

We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.

Fixes #20766

Closes scylladb/scylladb#20837

* github.com:scylladb/scylladb:
  sstables: bsearch_clustered_cursor: Add trace-level logging
  sstables: bsearch_clustered_cursor: Move definitions out of line
  test, sstables: Verify parsing stability when allocating section is retried
  test, sstables: Verify parsing stability when buffers cross page boundary
  sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
  cached_file: Adapt page_view to ContiguousSharedBuffer
  cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
  sstables, utils: Allow parsers to work with different buffer types
  sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
  sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
  sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
2024-10-01 00:02:55 +03:00
Calle Wilund
b5d167699c commitlog: Fix buffer_list_bytes not updated correctly
Fixes #20862

With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.

Test included.

Closes scylladb/scylladb#20886
2024-09-30 18:04:00 +03:00
Benny Halevy
906de3444b virtual_tables: snapshots: include all snapshots
Use database::get_snapshot_details to get the details
of all snapshots on disk, in particular those of
deleted tables.

Add test_snapshots_dropped_table to test listing
of snapshots of a deleted table.
And harden the existing test cases to use a unique
snapshot tag and to delete it when the test ends.

Fixes scylladb/scylladb#18313

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-29 14:16:11 +03:00
Avi Kivity
df3ee94467 db: system_keyspace: extract auth_version_t into its own header
Users of auth_version_t shouldn't need to include the heavyweight
system_keyspace.hh.
2024-09-28 16:31:50 +03:00
Tomasz Grabiec
c0fa49bab5 sstables, utils: Allow parsers to work with different buffer types
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.

In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.

It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
2024-09-27 01:24:54 +02:00
Kefu Chai
54858b8242 backup_task: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-25 09:11:26 +08:00
Kefu Chai
d663b6c13b treewide: add "table" parameter to "backup" API
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.

in this change:

* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
  `snapshot_ctl::start_backup`. since we have all the information
  for composing the snapshot directory, and what the `backup_task_impl`
  class is interested is but the snapshot directory, we just pass
  the path to it instead the individual components of the directory.
* backup_task_impl: instead of scan the whole keyspace recursively,
  only scan the specified snapshot directory.

Fixes scylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-25 09:11:26 +08:00
Marcin Maliszkiewicz
2df8eefd67 db: schema_applier: simplify merge_keyspaces function
- removes uneccesary temporary sets/vectors
- removes auto&&
- moves return value instead of copying
- instead adds diff references to keep readability
- create and alter logic is almost the same, now it's visible better
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
7225538845 db: schema_applier: remove unnecessary read in merge_keyspaces
read_schema_partition_for_keyspace() is already called for every
changing keyspace by get_schema_complete_view() and stored in _after
field so we can reuse this data.
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
f49822f78d db: schema_tables: move scylla specific code into create keyspace function
Since extract_scylla_specific_keyspace_info() was always coupled
with create_keyspace_from_schema_partition() there is no value
in separating them. By moving first into the latter we:
- reduce number of exported functions
- simplify arguments of create_keyspace_from_schema_partition
- simplify caller's code
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
9792d720c9 db: move schema merging code into a separate unit
It's mostly self containted and it's easier to
maintain reasonably sized files. Also splitting
better shows boundaries between schema and
schema merging code.
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
208050f190 db: schema_tables: export some schema management functions
In subseqent commits schema merging code will be separated from
db/schema_tables.cc but code which manages schema will remain intact.
So those two translation units will share some amount of code.

It's similar case as with replica/database.cc which creates schema
on startup, it calls functions from db/schema_tables.cc.

Struct qualified_name got moved to header as it's used as
read_table_mutations() argument.
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
4630864b58 db: remove unused flush arg from do_merge_schema func 2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
4cce9c8b5a db: remove unused read_arg_values function 2024-09-23 12:01:36 +02:00
Ernest Zaslavsky
924325fd25 treewide: add "prefix" parameter to backup API
Allow the caller to pass the prefix when performing backup and restore

Fixes scylladb/scylladb#20335

Closes scylladb/scylladb#20413
2024-09-18 08:25:00 +03:00
Calle Wilund
b789361091 commitlog: Fix assertion in oversized_alloc
Fixes #20633

Cannot assert on actual request_controller when releasing permit, as the
release, if we have waiters in queue, will subtract some units to hand to them.
Instead assert on permit size + waiter status (and if zero, also controller value)

* v2 - use SCYLLA_ASSERT

Closes scylladb/scylladb#20654
2024-09-18 08:22:28 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Aleksandra Martyniuk
31ea74b96e db: system_keyspace: change version of topology_requests schema
In 880058073b a new column (request_type)
was added to topology_requests table, but the table's schema version
wasn't changed. Due to that during cluster upgrade, the old and the new
versions occur but they are not distinguishable.

Add offset to schema version of topology_requests table if it contains
request_type column.

Fixes: #20299.

Closes scylladb/scylladb#20402
2024-09-11 16:36:35 +03:00
Piotr Dulikowski
d98708013c Merge 'view: move view_build_status to group0' from Michael Litvak
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.

The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.

New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.

The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.

When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.

Fixes scylladb/scylladb#15329

dtest https://github.com/scylladb/scylla-dtest/pull/4827

Closes scylladb/scylladb#19745

* github.com:scylladb/scylladb:
  view: test view_build_status table with node replace
  test/pylib: use view_build_status_v2 table in wait_for_view
  view_builder: common write view_build_status function
  view_builder: improve migration to v2 with intermediate phase
  view: delete node rows from view_build_status on node removal
  view: sanitize view_build_status during migration
  view: make old view_build_status table a virtual table
  replica: move streaming_reader_lifecycle_policy to header file
  view_builder: test view_build_status_v2
  storage_service: add view_build_status to raft snapshot
  view_builder: migration to v2
  db:system_keyspace: add view_builder_version to scylla_local
  view_builder: read view status from v2 table
  view_builder: introduce writing status mutations via raft
  view_builder: pass group0_client and qp to view_builder
  view_builder: extract sys_dist status operations to functions
  db:system_keyspace: add view_build_status_v2 table
2024-09-11 13:02:58 +02:00
Botond Dénes
de81388edb Merge 'commitlog: Handle oversized entries' from Calle Wilund
Refs #18161

Yet another approach to dealing with large commitlog submissions.

We handle oversize single mutation by adding yet another entry
typo: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.

Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
                 *
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.

On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.

Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.

This approach bumps the file format (docs to come).

To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.

Closes scylladb/scylladb#19472

* github.com:scylladb/scylladb:
  commitlog/database: Make some commitlog options updatable + add feature listener
  features/config: Add feature for fragmented commitlog entries
  docs: Add entry on commitlog file format v4
  commitlog_test: Add more oversized cases
  commitlog_replayer: Replay segments in order created
  commitlog_replayer: Use replay state to support fragmented entries
  commitlog_replayer: coroutinize partly
  commitlog: Handle oversized entries
2024-09-10 17:15:46 +03:00