45272 Commits

Author SHA1 Message Date
Botond Dénes
e2344e28b6 replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter
Expose the buffer hint functionality added by the previous commits, to
callers of make_multishard_streaming_reader(). All callers currently
disable it; it will be used in the next patch.
2024-11-07 02:47:46 -05:00
Botond Dénes
3c25e6fcb4 db/config: introduce enable_repair_multishard_reader_buffer_hint
Allows enabling/disabling the multishard reader buffer hint
optimization. Not wired yet.
2024-11-06 08:51:00 -05:00
Botond Dénes
b052c5df62 readers/multishard: multishard_reader: pass hint to shard_reader
Calculate a buffer fill hint and pass it to
shard_reader_v2::fill_buffer(), so the underlying buffer-fill can be
optimized to avoid multiple cross shard round-trips, as well as possible
evict-recreate cycles.
The buffer hint mechanism is opt-in, enabled via the new
multishard_reader_buffer_hint parameter.
2024-11-06 08:51:00 -05:00
Botond Dénes
912b4dfba3 readers/multishard: shard_reader_v2::fill_reader_buffer(): respect the hint
When the hint is provided, respect it: make sure the returned buffer is
of the requested size, stopping early if the stop_token is seen.
To reduce the amount of possible eviction-recreate cycles while the
buffer is filled, disable auto-pause for the duration of the
fill_reader_buffer() call. For this purpose, auto_pause_disable_guard is
added to evictable_reader_v2.
2024-11-06 08:51:00 -05:00
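
The auto_pause_disable_guard mentioned in 912b4dfba3 is, judging by its name, an RAII guard. A minimal sketch of that idea follows, assuming a hypothetical `pausable_reader` class with a nesting counter; neither the class nor its members are taken from evictable_reader_v2.

```cpp
#include <cstddef>

// Hypothetical stand-in for a reader that may pause itself automatically.
class pausable_reader {
    std::size_t _auto_pause_disabled = 0; // number of guards currently alive
public:
    bool auto_pause_enabled() const noexcept { return _auto_pause_disabled == 0; }

    // RAII guard: auto-pause stays off while at least one guard is alive.
    class auto_pause_disable_guard {
        pausable_reader& _reader;
    public:
        explicit auto_pause_disable_guard(pausable_reader& r) noexcept : _reader(r) {
            ++_reader._auto_pause_disabled;
        }
        ~auto_pause_disable_guard() { --_reader._auto_pause_disabled; }
        auto_pause_disable_guard(const auto_pause_disable_guard&) = delete;
        auto_pause_disable_guard& operator=(const auto_pause_disable_guard&) = delete;
    };
};
```

A buffer-fill routine would create the guard on entry, so auto-pause is restored on every exit path, including exceptions.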
Botond Dénes
8d5283f036 readers/multishard: propagate fill_buffer_hint to shard_reader::fill_reader_buffer()
The hint will tell the shard reader exactly how much data to produce, to
avoid multiple cross-shard round-trips and possible evict-recreate
cycles.

The hint is neither used nor calculated yet; this is coming in the
next patches.
2024-11-06 08:51:00 -05:00
Botond Dénes
ee7ecb9155 readers/multishard: shard_reader: extract buffer-fill into its own method
It is about to get a bit more complicated, so it is worth extracting
into a method that can be shared by the two call-sites.
2024-11-06 08:51:00 -05:00
Avi Kivity
ee92784098 serialization: replace boost::type with std::type_identity
Recently, seastar rpc started accepting std::type_identity in addition
to boost::type as a type marker (while labeling the latter with an
ominous deprecation warning). Reduce our dependency on boost
by switching to std::type_identity.
2024-11-05 00:43:27 +01:00
Avi Kivity
075b13597d serializer: drop dependency on boost ranges
The call to boost::range::for_each is easily replaced with ranged for.

Closes scylladb/scylladb#21422
2024-11-04 17:48:17 +02:00
Avi Kivity
b706e3e9e4 Merge 'sstables/index_reader: avoid unnecessary index page reads in single-partition reads' from Michał Chojnowski
Terminology note: in the context of this series, "index page" means a contiguous segment of the index file starting (inclusive) at a key corresponding to a summary entry and ending (exclusive) before the key corresponding to the next summary entry. "Index pages" are not related to filesystem pages.

---

In a single-partition read, if the searched partition key is the first key in its index page, we start scanning the index for that key starting at the previous index page (inclusive), even though we could start directly from the key's page. Similarly, if the searched partition key is absent from the sstable and lies after all other keys in its appropriate page, we additionally scan the next page, even though it's known from the summary that it can't possibly contain the key.

Those cases are wasteful. It's worse than it might seem at first glance. When partitions are small, only a small fraction of search keys fulfills those conditions (i.e. "first key in its page" or "an absent key greater than the last key in its page"), so the waste doesn't matter much. But when partitions are big enough, every index page contains only one partition key (and a promoted index for that partition), which directly means that *all* search keys fulfill the conditions, which means that total index reading work is two times bigger than what it should be.

In addition, there is a secondary performance bug which, when the aforementioned conditions are fulfilled, causes *additional* I/O to happen *past* the index reads which are actually parsed and used. In effect, the index I/O in single-partition reads might be not just doubled, but even tripled (that's for IOPS — throughput might be multiplied even more), all because of a slight inaccuracy in the edge cases.

This series fixes those inefficiencies by tightening the edge cases and ensuring that single-partition reads always read only a single index page.
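
The "exactly one index page" rule can be sketched with a binary search over the summary. This is a simplified model, not the index_reader code: keys are plain integers instead of tokens and partition keys, and `page_for_key` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Simplified model: summary[i] is the first key of index page i, ascending.
// Returns the only page that can contain `key`, or nullopt when the key
// sorts before the first page.
inline std::optional<std::size_t> page_for_key(const std::vector<int>& summary, int key) {
    // Find the first summary entry strictly greater than key;
    // the page just before it is the one that owns the key.
    auto it = std::upper_bound(summary.begin(), summary.end(), key);
    if (it == summary.begin()) {
        return std::nullopt;
    }
    return static_cast<std::size_t>(it - summary.begin()) - 1;
}
```

With a summary of `{10, 20, 30}`, key 20 (the first key in its page) maps to page 1 rather than page 0, and the absent key 29 also maps to page 1 only, because the summary already shows that page 2 starts at 30.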

Here's an example where we query the first row (i.e. `LIMIT 1`) of a certain partition key, in a table with large (1 MB) promoted indexes. Before the patch, the lookup of the lower bound involves 3 serialized disk reads (as described above) to subsequent index pages, and even the lookup of the upper bound involves 2 disk reads:

```
                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38359040 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38391808 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38359040, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38391808, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41390080 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41422848 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41390080, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41422848, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

After the patch, the lookup of each bound involves 1 read:
```

                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

Doesn't have to be backported, since the problem only affects performance, not correctness, and it has been present since forever.

Closes scylladb/scylladb#20897

* github.com:scylladb/scylladb:
  index_reader: remove a piece of misguided code involved in single-partition reads
  index_reader: in single-partition reads, don't read more than one page
  index_reader: fix unnecessary reads of preceding index pages
2024-11-04 14:28:27 +02:00
Avi Kivity
2531dc2d80 schema_registry: stop including replica/database.hh
database.hh is a hotspot that changes often (or its dependencies
do). Avoid including it to reduce recompilations.

Closes scylladb/scylladb#21407
2024-11-04 13:16:27 +01:00
Avi Kivity
7cb1ad8c87 Merge 'compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors' from Benny Halevy
stop() methods, like destructors, must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.

Fixes scylladb/scylladb#21159

* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shutdown when the exceptions are thrown in noexcept context in the deferred stop destructor body

Closes scylladb/scylladb#21299

* github.com:scylladb/scylladb:
  compaction_manager: stop: await _stop_future if engaged
  compaction_manager: really_do_stop:  assert that no tasks are left behind
  compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
  compaction/compaction_manager: stop_tasks(): unlink stopped tasks
  compaction/compaction_manager: make _tasks an intrusive list
2024-11-04 13:54:16 +02:00
Pavel Emelyanov
f3f956841f sstables: Remove unused mp_row_consumer_m::range_tombstone_start
It's only used by its operator<< so remove it as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21380
2024-11-03 16:40:02 +02:00
Avi Kivity
704ea9d3b4 Merge 'api: Remove foreach_column_family() helper' from Pavel Emelyanov
There's a whole lot of helpers and wrappers in api/ that help handlers manipulate keyspaces and tables. One of those is foreach_column_family which calls the provided callable on a table on each shard. There's exactly the same (but a bit more flexible) helper nearby. While at it, this helper gets a better name.

Closes scylladb/scylladb#21398

* github.com:scylladb/scylladb:
  api: Rename set_tables -> for_tables_on_all_shards
  api: Remove foreach_column_family() helper
2024-11-03 15:46:27 +02:00
Avi Kivity
856489ded1 cql3: remove unused request_validations methods
These methods are not used and therefore removed.

Closes scylladb/scylladb#21392
2024-11-03 13:17:32 +02:00
Benny Halevy
6cce67bec8 compaction_manager: stop: await _stop_future if engaged
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental.  What matters is that `_stop_future` is engaged.

While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-03 10:53:35 +02:00
Benny Halevy
a7a55298ea compaction_manager: really_do_stop: assert that no tasks are left behind
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-03 10:53:34 +02:00
Benny Halevy
c08ba8af68 compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
stop() methods, like destructors, must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.

Fixes scylladb/scylladb#21298

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-03 10:52:58 +02:00
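
The principle in c08ba8af68 (a stop path that never propagates errors) can be sketched in plain C++ as follows. It is a synchronous model under stated assumptions: `stop_all` and the callback vector are hypothetical and do not mirror the compaction_manager interface, which is future-based.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical model: every task exposes a stop action that may throw.
// stop_all() runs all of them, swallows failures and returns the number of
// failed stops so the caller can log it; no exception ever escapes.
inline std::size_t stop_all(const std::vector<std::function<void()>>& stop_actions) noexcept {
    std::size_t failures = 0;
    for (const auto& stop : stop_actions) {
        try {
            stop();
        } catch (...) {
            ++failures; // ignore the error and continue with shutdown
        }
    }
    return failures;
}
```

Since the function is `noexcept` and catches everything internally, it is safe to call from a destructor or a deferred action.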
Botond Dénes
d8500472b3 compaction/compaction_manager: stop_tasks(): unlink stopped tasks
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).
2024-11-03 10:17:11 +02:00
Botond Dénes
e942c074f2 compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _tasks has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
2024-11-03 10:17:11 +02:00
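
To show why an intrusive list gives O(1) removal without extra allocation, as argued in e942c074f2, here is a hand-rolled sketch. The `list_hook`, `task` and `task_list` types are hypothetical; the commit does not say which intrusive list implementation is used.

```cpp
#include <cstddef>

// Hook embedded in each element; the list itself allocates nothing.
struct list_hook {
    list_hook* prev = nullptr;
    list_hook* next = nullptr;
    bool is_linked() const noexcept { return next != nullptr; }
    // O(1): the element knows its neighbours, so no traversal is needed.
    void unlink() noexcept {
        if (!is_linked()) {
            return;
        }
        prev->next = next;
        next->prev = prev;
        prev = next = nullptr;
    }
};

// Hypothetical task type standing in for compaction_task_executor.
struct task : list_hook {
    int id;
    explicit task(int i) : id(i) {}
};

class task_list {
    list_hook _head; // sentinel node of a circular list
public:
    task_list() noexcept { _head.prev = _head.next = &_head; }
    task_list(const task_list&) = delete;
    task_list& operator=(const task_list&) = delete;

    bool empty() const noexcept { return _head.next == &_head; }

    // Links an element owned elsewhere; the list never owns its elements.
    void push_back(task& t) noexcept {
        t.prev = _head.prev;
        t.next = &_head;
        _head.prev->next = &t;
        _head.prev = &t;
    }

    std::size_t size() const noexcept {
        std::size_t n = 0;
        for (const list_hook* p = _head.next; p != &_head; p = p->next) {
            ++n;
        }
        return n;
    }
};
```

A stopped task can call `unlink()` on itself right away, which is what makes a later "is the list empty?" check reliable.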
Avi Kivity
39b55bd3a0 Update seastar submodule
* seastar f821bda19...fba36a3d1 (13):
  > build: do not include -DBoost_TEST_DYN_LINK in seastar_testing_cflags
  > doc: compatibility: update the notes on supported GCC versions
  > docker: bump up to clang {18,19} and gcc {13,14}
  > rpc: optimize small tuple deserialization
  > rpc: switch rpc::type from boost to std
  > thread: do not use fortify source
  > build: suppress CMake warning about CMP0057
  > core/units: remove space before literal identifier
  > signal.md: describe auto signal handling
  > build: persist Seastar options in SeastarConfig.cmake
  > sharded.hh: separate invoke_on decls from defs
  > test: Add perf test for http client
  > gate: check: mark as const

Closes scylladb/scylladb#21390
2024-11-02 13:58:45 +02:00
Botond Dénes
19a43b5859 Merge 'repair: Reduce hints and batchlog flush' from Asias He
The hints and batchlog flush requests are issued to all nodes for each repair request when tombstone_gc repair mode is used.

The number of such flush requests is high when all nodes in the cluster run repair. It has been observed that it can take a long time, up to 15s, for a repair request to finish such a flush request.

To reduce the overhead of the flush, each node caches the flush and only executes the real flush when some time has passed. It is safe to do so because the returned flush_time is never later than the real flush time. Repair uses the smallest flush_time from peers as the repair time.

The nice thing about the cache on the receiver side is that all senders can hit the cache. It is better than a cache on the sender side.

A slightly smaller flush_time compared to the real flush time will be used, with the benefit of significantly fewer hints and batchlog flushes. The tradeoff is reasonable.
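
A receiver-side cache of this kind might be sketched as below. The `flush_cache` class, its `ttl` parameter and the use of `std::chrono::steady_clock` are assumptions made for illustration; the series does not spell out the actual data structure here.

```cpp
#include <chrono>
#include <optional>

using flush_clock = std::chrono::steady_clock;

// Hypothetical receiver-side cache: a real flush runs at most once per ttl;
// requests arriving in between are answered with the last real flush time.
class flush_cache {
    flush_clock::duration _ttl;
    std::optional<flush_clock::time_point> _last_flush;
    unsigned _real_flushes = 0;
public:
    explicit flush_cache(flush_clock::duration ttl) : _ttl(ttl) {}

    // Returns the flush_time to report back to the requesting node.
    flush_clock::time_point flush(flush_clock::time_point now) {
        if (!_last_flush || now - *_last_flush >= _ttl) {
            ++_real_flushes;   // the expensive hints/batchlog flush would run here
            _last_flush = now;
        }
        return *_last_flush;   // never later than `now`, so it is a safe lower bound
    }

    unsigned real_flushes() const noexcept { return _real_flushes; }
};
```

Every sender that arrives inside the ttl window shares the same cached result, which is the advantage of caching on the receiver rather than on each sender.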

Fixes #20259

Performance improvement. No backports.

Closes scylladb/scylladb#20260

* github.com:scylladb/scylladb:
  test/test_repair.py: Add test_batchlog_flush_in_repair
  repair: Reduce hints and batchlog flush
  db/batchlog_manager: Add add_delay_to_batch_replay
  db/batchlog_manager: Add get_last_replay
  db/batchlog_manager: wire in batchlog_replay_cleanup_after_replays
  db/config: introduce batchlog_replay_cleanup_after_replays
  db/batchlog_manager: do_batch_log_replay(): add cleanup flag
2024-11-01 14:23:27 +02:00
Pavel Emelyanov
292fd52a60 Merge 'utils: chunked_vector: various constructor improvements' from Avi Kivity
Optimize the various constructors a little, and add an std::from_range_t
constructor.

Minor improvement, so no backports.

Closes scylladb/scylladb#21399

* github.com:scylladb/scylladb:
  utils: chunked_vector: add from_range_t constructor
  utils: chunked_vector: optimize initializer_list constructor
  utils: chunked_vector: iterator constructor: copy spanwise
  utils: chunked_vector: reserve for forward iterators, not just random access iterators, on construction
2024-11-01 15:02:56 +03:00
Botond Dénes
4bafaee523 Merge 'tasks: improve task_manager::lookup_virtual_task' from Aleksandra Martyniuk
Currently, to find the operation with given id, all operations tracked by a virtual task are listed. This isn't necessary, since we only need info regarding one particular operation.

Add a method to check whether a virtual task tracks the operation with the given id.
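
The difference between listing all operations and asking a direct membership question can be sketched like this. The `virtual_task` class below is a hypothetical simplification with string ids; it is not the tasks:: interface.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical virtual task that tracks operations by id.
class virtual_task {
    std::unordered_set<std::string> _ids;
public:
    explicit virtual_task(std::unordered_set<std::string> ids) : _ids(std::move(ids)) {}

    // Membership test: no list of all tracked ids has to be built.
    bool contains(const std::string& id) const { return _ids.count(id) != 0; }
};

// Returns the first virtual task tracking `id`, or nullptr when none does.
inline const virtual_task* lookup_virtual_task(const std::vector<virtual_task>& tasks,
                                               const std::string& id) {
    for (const auto& t : tasks) {
        if (t.contains(id)) {
            return &t;
        }
    }
    return nullptr;
}
```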

No backport needed

Closes scylladb/scylladb#20769

* github.com:scylladb/scylladb:
  tasks: delete virtual_task::get_ids method as it is unused
  tasks: improve task_manager::lookup_virtual_task
2024-11-01 13:44:04 +02:00
Kefu Chai
1b8446f92d compaction: fix the indent
in 38ce2c605d, we left a TODO to
reindent the code.

in this change, we reindent the code to address this TODO.

Refs 38ce2c605d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21383
2024-11-01 12:55:47 +03:00
Avi Kivity
b5e46077df sstables: generation_type: replace boost ranges with std ranges
Reduce dependency load.

Closes scylladb/scylladb#21402
2024-11-01 12:45:24 +03:00
Pavel Emelyanov
d6169630a4 api: Rename set_tables -> for_tables_on_all_shards
The former name is not extremely descriptive, hopefully the latter one
is better in this sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-11-01 12:15:01 +03:00
Pavel Emelyanov
822758dffd api: Remove foreach_column_family() helper
There's a whole lot of helpers and wrappers in api/ that help handlers
manipulate keyspaces and tables. One of those is foreach_column_family
which calls the provided callable on a table on each shard. There's
exactly the same (but a bit more flexible) set_table() helper nearby.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-11-01 12:13:35 +03:00
Botond Dénes
0ee0dd3ef4 Merge 'Collect and report backup progress' from Pavel Emelyanov
The task manager GET /status method returns two counters that reflect task progress -- total and completed. To let the caller reason about their meaning, there is additionally a progress_units field next to those counters.

This patch implements this progress report for the backup task. The units are bytes, the total counter is the total size of the files that are being uploaded, and the completed counter is the total number of bytes successfully sent with PUT requests. To get the counters, client::upload_file() is extended to calculate them.
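
A byte-based progress counter of this shape could look as follows. The series does introduce a type named upload_progress, but the field names, the `operator+=` and the `completed_fraction` helper below are assumptions made for this sketch.

```cpp
#include <cstdint>

// Hypothetical layout of the counters; the field names are assumptions.
struct upload_progress {
    std::uint64_t total = 0;     // bytes in all files scheduled for upload
    std::uint64_t uploaded = 0;  // bytes confirmed by successful PUT requests

    // Merge the progress of one file into an aggregate for the whole task.
    upload_progress& operator+=(const upload_progress& other) noexcept {
        total += other.total;
        uploaded += other.uploaded;
        return *this;
    }
};

// Completed fraction in [0, 1]; an empty upload counts as finished.
inline double completed_fraction(const upload_progress& p) noexcept {
    return p.total == 0
        ? 1.0
        : static_cast<double>(p.uploaded) / static_cast<double>(p.total);
}
```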

fixes #20653

Closes scylladb/scylladb#21144

* github.com:scylladb/scylladb:
  backup_task: Report uploading progress
  s3/client: Account upload progress for real
  s3/client: Introduce upload_progress
  s3: Extract client_fwd.hh
2024-11-01 10:57:12 +02:00
Kefu Chai
64122b3df3 treewide: s/boost::transform/std::ranges::transform/
now that we are allowed to use C++23, we have the luxury of using
`std::ranges::transform`.

in this change, we:

- replace `boost::transform` with `std::ranges::transform`
- update affected code to work with `std::ranges::transform`

to reduce the dependency on boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21318
2024-11-01 08:15:14 +02:00
Avi Kivity
8c67f9b42e cql3: util: remove unneeded boost/range includes from header files
The includes are redistributed to the source files that need them.

Closes scylladb/scylladb#21391
2024-10-31 23:49:44 +01:00
Nadav Har'El
ee2d75b088 Merge 'Generalize "breakpoint" type of error injection' from Pavel Emelyanov
The pattern is: if requested (by a test), suspend code execution until the requestor (the test) explicitly wakes it up. For that, the injection site passes a lambda that is called with a so-called "handler" and tries to read a message from the handler. In many cases the inner lambda additionally prints a message to the logs, which the test waits for to make sure the injection point was reached. In the end, this "breakpoint" is injected like

```
    co_await inject("foo", [] (auto& handler) -> future<> {
        log.info("foo waiting");
        co_await handler.wait_for_message(timeout);
    });
```

This PR makes breakpoints shorter and more unified, like this

```
    co_await inject("foo", wait_for_message(timeout));
```

where `wait_for_message` is a wrapper structure used to pick the new `inject()` overload.

Closes scylladb/scylladb#21342

* github.com:scylladb/scylladb:
  sstables: Use inject(wait_for_message_overload)
  treewide,error_injection: Use inject(wait_for_message) and fix tests
  treewide,error_injection: Use inject(wait_for_message) overload
  error_injection: Add inject() overload with wait_for_message wrapper
2024-10-31 21:56:27 +02:00
Avi Kivity
6a9852d47b utils: chunked_vector: add from_range_t constructor
std::ranges::to<> has a little protocol with containers. Implement it
to get optimized construction.

Similar to the iterator pair constructor, if the range's size can be
obtained (even with an O(N) algorithm), favor that to avoid reallocations.
Copy elements spanwise to promote optimization to memcpy when possible.
2024-10-31 19:32:16 +02:00
Avi Kivity
b2769403d2 utils: chunked_vector: optimize initializer_list constructor
Delegate to the previously optimized iterator-pair constructor.
2024-10-31 18:10:14 +02:00
Avi Kivity
0a81be4321 utils: chunked_vector: iterator constructor: copy spanwise
Instead of copying element-by-element, copy contiguous spans. This
is much faster if the input is a span and the constructor is trivial,
since the whole thing translates to a memcpy.

Make the two branches constexpr to reduce work for the compiler in
optimizing the other branch away.
2024-10-31 18:10:08 +02:00
Avi Kivity
4653430c8e utils: chunked_vector: reserve for forward iterators, not just random access iterators, on construction
For a forward iterator, prefer a two-pass algorithm that first counts
the number of elements, reserves, then copies the elements, over a single-pass
algorithm that involves reallocation and copying.
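
A minimal sketch of the two-pass approach, assuming a toy container rather than utils::chunked_vector (`counting_vector` and its growth counter are hypothetical and exist only to make the effect observable):

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <type_traits>
#include <vector>

// Toy container that counts how often it had to grow while inserting.
template <typename T>
class counting_vector {
    std::vector<T> _data;
    std::size_t _growths = 0;

    void push(const T& v) {
        if (_data.size() == _data.capacity()) {
            ++_growths; // push_back is about to reallocate
        }
        _data.push_back(v);
    }
public:
    template <typename Iterator>
    counting_vector(Iterator first, Iterator last) {
        using category = typename std::iterator_traits<Iterator>::iterator_category;
        if constexpr (std::is_base_of_v<std::forward_iterator_tag, category>) {
            // Multi-pass iterator: pass 1 counts (O(N) unless random access),
            // then a single reserve() avoids all later reallocations.
            const auto n = static_cast<std::size_t>(std::distance(first, last));
            _data.reserve(n);
            assert(_data.capacity() >= n);
        }
        // Pass 2 (or the only pass, for input iterators): copy the elements.
        for (; first != last; ++first) {
            push(*first);
        }
    }
    std::size_t size() const { return _data.size(); }
    std::size_t growths() const { return _growths; }
};
```

Constructing it from a `std::forward_list` should report zero growth events, whereas a single-pass input iterator would go through the usual grow-and-copy sequence.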
2024-10-31 17:55:42 +02:00
Kefu Chai
673b107ffa github: use GithubException when appropriate
`Exception` could be too general; what we really care about is
`GithubException`. So let's catch the latter instead for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21364
2024-10-31 18:21:29 +03:00
Kefu Chai
f8221b960f test: route S3 mock server messages through logger
The S3 mock server (introduced in 5a96549c) currently prints its status
messages directly to stdout, which can be distracting when reviewing test
results. For example:

```console
$ ./test.py --verbose --mode debug object_store/test_backup::test_simple_backup
Found 1 tests.
Starting S3 mock server on ('127.226.51.1', 2012)
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/1]      object_store  debug  [ PASS ] object_store.test_backup.1 5.99s
Stopping S3 mock server
-------------------------
CPU utilization: 6.5%
```

Move these messages to use proper logging to give developers more control
over their visibility:

- Make logger parameter mandatory in MockS3Server constructor
- Route "Stopping S3 mock server" message through the provided logger
- Add --log-level option to the standalone mock server launcher

The message is now hidden:

```console
$ ./test.py --verbose --mode debug --save-log-on-success object_store/test_backup::test_simple_backup
Found 1 tests.
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------

[1/1]      object_store  debug  [ PASS ] object_store.test_backup.1 6.25s
------------------------------------------------------------------------------
CPU utilization: 5.5%
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21384
2024-10-31 18:21:29 +03:00
Benny Halevy
78ceaeabca compaction_manager: compaction_disabled: return true if not in compaction_state
When a compaction_group is removed via `compaction_manager::remove`,
it is erased from `_compaction_state`, and therefore compaction
is definitely not enabled on it.

This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.

Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be considered to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.
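
The intended behaviour can be sketched like this. It is only an illustration under assumptions: `compaction_manager_sketch`, the string keys and the `disabled_counter` field are hypothetical and do not mirror the real compaction_manager types.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct compaction_state {
    int disabled_counter = 0; // hypothetical: >0 means compaction is disabled
};

class compaction_manager_sketch {
    std::unordered_map<std::string, compaction_state> _compaction_state;
public:
    void add(const std::string& group) { _compaction_state.emplace(group, compaction_state{}); }
    void remove(const std::string& group) { _compaction_state.erase(group); }

    void disable(const std::string& group) {
        auto it = _compaction_state.find(group);
        assert(it != _compaction_state.end());
        ++it->second.disabled_counter;
    }

    bool compaction_disabled(const std::string& group) const {
        auto it = _compaction_state.find(group);
        if (it == _compaction_state.end()) {
            // A removed (or never registered) group cannot be compacted,
            // so report it as disabled instead of raising an error.
            return true;
        }
        return it->second.disabled_counter > 0;
    }
};
```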

Fixes scylladb/scylladb#20060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21378
2024-10-31 18:21:29 +03:00
Wojciech Mitros
88ab8db944 mv: run view building in streaming scheduling group
View building is an expensive process that takes a long time to complete.
During the build, its impact on other work should be minimized, even at
the expense of slightly slowing it down.

Instead, view building is currently performed in the same scheduling
group (gossip) as other high-priority tasks, in particular raft processing,
which slows it down, making races more likely and increasing the number
of retries that need to be done.

While view building is still initiated in the gossip group (as it's the
result of adding a view, which is a schema change), in this patch the bulk
of the view building work is moved to a low-priority, maintenance scheduling
group (named "streaming" after its main use case).

Additionally, a test is added, where we make sure that the scheduling
group is the one most used when building a view.

Fixes https://github.com/scylladb/scylladb/issues/21232

Closes scylladb/scylladb#21326
2024-10-31 10:13:20 +01:00
Nadav Har'El
7572c483b1 test/topology_experimental_raft: fix flaky test
Today, each test function in test/topology_experimental_raft creates a
cluster in the beginning of the test and drops it at the end of the
function. This is very inefficient if you hope (like I do) to write many
small and pinpointed test functions instead of large test functions that
test 20 unrelated things.

Trying to propose a way to change this sad state of affairs, in
test_alternator.py I created a fixture "alternator3" which I hoped could
be used in multiple tests that need a 3-node Alternator cluster.
Currently only one test uses this fixture.

Unfortunately, it turns out the alternator3 fixture is broken, and
led to flaky test runs (sometimes the test using alternator3 picked
up an existing cluster instead of starting with an empty cluster,
and failed). These problems cannot be *completely* fixed at the current
state of the framework. The framework does not currently allow keeping
a 3-node cluster between test functions, while also allowing other test
functions to create different clusters. The specific flakiness we saw
could be fixed by adding a missing before_test() call, but in the
future we would need to ensure that all the test functions that
use it are contiguous in the test file, and I don't see how we can (or
want to) ensure this. So at this point I am giving up and withdrawing
this proposal until the developers of the topology test framework
make this one of their design goals.

Since there was only one test using this fixture, removing it should
make no performance or correctness difference - it should just fix
the flakiness.

Fixes scylladb/scylladb#21322.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21370
2024-10-31 10:12:26 +01:00
Calle Wilund
c4361037f7 cql_test_env/gossip: Prevent double shutdown call crash
Fixes scylladb/scylladb#21159

When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.

Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
    in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.

Closes scylladb/scylladb#21379
2024-10-31 10:11:20 +01:00
Nadav Har'El
d3f09638f0 Merge 'compound_compat: replace use of boost ranges with std ranges' from Avi Kivity
Replace use of boost::ranges::join() with another construct, as it
has no std replacement, and replace other uses with their std
equivalent, in order to reduce dependency load.

Code cleanup - no backport.

Closes scylladb/scylladb#21382

* github.com:scylladb/scylladb:
  compound_compat: replace use of boost ranges with std ranges
  compound_compat: simplify seriakization of ka/la sstables static cell names
2024-10-31 10:16:41 +02:00
Nadav Har'El
65e29f28bd Merge 'gms: remove unused #includes ' from Kefu Chai
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been
confirmed.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21374

* github.com:scylladb/scylladb:
  .github: add gms to iwyu's CLEANER_DIR
  gms: remove unused `#include`s
2024-10-31 09:06:37 +02:00
Kefu Chai
2498e37a2f mutation_writer,streaming: use reader_consumer_v2 type when appropriate
The `reader_consumer_v2` type
(`std::function<future<> (mutation_reader)>`) is defined alongside
`mutation_reader` in `mutation_reader.hh`.

before this change, we sometimes use
`std::function<future<> (mutation_reader)>` directly when defining a
consumer parameter or a consumer variable.

in this change, we improve maintainability by:

- Reducing duplicate function type declarations
- Centralizing the consumer type definition
- Making future signature updates easier to implement

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21369
2024-10-31 07:17:47 +02:00
Avi Kivity
907da210b6 compound_compat: replace use of boost ranges with std ranges
To reduce the dependency load, replace use of boost ranges
with the std equivalent.

Files that lost the indirect boost dependency have it added as a
direct dependency.
2024-10-30 19:58:07 +02:00
Avi Kivity
982cebc1f6 compound_compat: simplify seriakization of ka/la sstables static cell names
compound_compat is used for serializing ka/la sstables static cell names.
Since we can no longer write such sstables, the function is used only
in some tests.

Reduce the use of boost::range::join(): it has no direct equivalent
in std (std::views::concat is in C++26), and it is slow due to the
need to type-erase. Instead of using boost::range::join, extend the
vector used to hold the empty clustering key a bit more, and copy
the view representing the static cell name into it.
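
In generic terms, the replacement looks roughly like the sketch below. It does not reproduce the compound_compat code: `static_cell_components`, the `std::string_view` component type and the parameter names are assumptions used only to show "extend one vector and copy" in place of a joined view.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Builds [empty clustering components..., cell name components...] in one
// vector instead of joining two ranges through a type-erased view.
std::vector<std::string_view> static_cell_components(
        std::size_t clustering_size,
        const std::vector<std::string_view>& cell_name) {
    std::vector<std::string_view> out;
    out.reserve(clustering_size + cell_name.size()); // one allocation up front
    out.resize(clustering_size);                     // empty clustering key components
    out.insert(out.end(), cell_name.begin(), cell_name.end());
    assert(out.size() == clustering_size + cell_name.size());
    return out;
}
```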
2024-10-30 19:19:57 +02:00
Kefu Chai
d3a6931b14 .github: add gms to iwyu's CLEANER_DIR
to avoid future violations of include-what-you-use.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-30 23:01:34 +08:00
Kefu Chai
52ec315ffd gms: remove unused #includes
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-30 23:01:34 +08:00
Pavel Emelyanov
c16369323b sstables: Use inject(wait_for_message_overload)
This place could have been part of the patch before the previous one, as it
can simply use the overload, but it seemingly has a bug. It prints _two_ messages -- that
the injection handler was suspended and that it was woken up. The bug is
in the 2nd message -- it's printed without waiting for the message, so
it likely gets printed before wakeup itself. It seems that no tests care
about it though.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Pavel Emelyanov
39cb93be3c treewide,error_injection: Use inject(wait_for_message) and fix tests
This is a continuation of the previous patch; this time it also updates tests that
wait for a specific message in the logs (to make sure the injection handler was
called and paused the code execution).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00