Andrzej Jackowski ea2bdae45a mapreduce: add tablet-aware dispatching algorithm
The primary goal of this change is to reduce the time during which the
Effective Replication Map (ERM) is retained by the mapreduce service.
This ensures that long aggregate queries do not block topology
operations. As ScyllaDB transitions towards tablets, which simplify
work dispatching, the new algorithm is designed specifically for
tablets.

The algorithm divides work so that each `tablet_replica` (a <host,
shard> pair) processes two tablets at a time. After the processing for
each `tablet_replica` completes, the ERM is released and re-acquired.

The new algorithm can be summarized as follows:
1. Prepare a set of exclusive `partition_ranges`, where each range
   represents one tablet. This set is called `ranges_left`, because it
   contains ranges that still need processing.
2. Loop until `ranges_left` is empty:
   I.  Build a `tablet_replica` -> `ranges` mapping for the current ERM
       and `ranges_left`. Store this mapping, together with the current
       ERM version number, as `ranges_per_replica`.
   II. In parallel, for each `tablet_replica`, iterate over its ranges
       in `ranges_per_replica`. Independently select up to two ranges
       that still exist in `ranges_left`, removing each selected range
       from `ranges_left`. Before each iteration, verify that the ERM
       version has not changed; if it has, return to Step I.
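Step I can be sketched as follows. This is a minimal standalone sketch: the
type names, `mock_erm`, and the `group_by_replica` helper are illustrative
stand-ins, not ScyllaDB's actual code, which operates on real token ranges
and Seastar futures.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Illustrative stand-ins for the real types; in ScyllaDB a range would be
// a dht::partition_range and a replica a <host, shard> tablet replica.
using token_range = int;                     // one value per tablet range
using tablet_replica = std::pair<int, int>;  // <host, shard>

struct mock_erm {
    long version = 0;
    // Which replica is chosen to read each tablet's range under this ERM.
    std::map<token_range, tablet_replica> owner;
};

// Step I: group the not-yet-processed ranges by the replica that should
// read them, as dictated by the current ERM.
std::map<tablet_replica, std::vector<token_range>>
group_by_replica(const mock_erm& erm, const std::set<token_range>& ranges_left) {
    std::map<tablet_replica, std::vector<token_range>> ranges_per_replica;
    for (token_range r : ranges_left) {
        ranges_per_replica[erm.owner.at(r)].push_back(r);
    }
    return ranges_per_replica;
}
```

The real mapping is stored together with the ERM version number, so that
Step II can detect a topology change and fall back to Step I.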

Steps I and II are mutually exclusive, which simplifies maintaining
`ranges_left` and `ranges_per_replica`:
 - Step I iterates through `ranges_left` and creates
   `ranges_per_replica`.
 - Step II iterates through `ranges_per_replica` and removes processed
   ranges from `ranges_left`.

To maintain this exclusivity, the algorithm uses `parallel_for_each` in
Step II, which requires all ongoing `tablet_replica` processing to
finish before returning to Step I.
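The per-replica selection in Step II might look like the following
sequential sketch. `claim_batch` is a hypothetical name introduced here for
illustration; the real code runs these bodies concurrently under
`parallel_for_each` and holds the ERM while each batch is processed.

```cpp
#include <cassert>
#include <set>
#include <vector>

using token_range = int;  // illustrative stand-in for a tablet's range

// Step II body for one tablet_replica: claim up to two ranges that still
// exist in ranges_left, removing each claimed range so that no other
// replica (or a later round) processes it again. In the real service this
// runs under parallel_for_each, and the surrounding loop re-checks the
// ERM version before each iteration, restarting from Step I on a change.
std::vector<token_range>
claim_batch(const std::vector<token_range>& replica_ranges,
            std::set<token_range>& ranges_left) {
    std::vector<token_range> batch;
    for (token_range r : replica_ranges) {
        if (batch.size() == 2) {
            break;  // at most two tablets per replica at a time
        }
        if (ranges_left.erase(r)) {  // still unprocessed?
            batch.push_back(r);
        }
    }
    return batch;
}
```

Because each claimed range is erased from `ranges_left` before processing,
a range that appears under two replicas across rounds is still processed
exactly once.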

Currently, each node can handle any partition range, even if the
mapreduce supercoordinator does not retain the ERM and the range is
absent locally. This is because `execute_on_this_shard` creates a new
pager to coordinate the partition-range read, obtaining its own ERM in
the process. However, absent ranges are handled by shard 0, so proper
routing is necessary to avoid overloading it. For this reason, in
Step II the ERM is retained while each `tablet_replica` is processed.

The tablet split scenario is not well handled in this implementation.
After a split, the entire pre-split range is sent to the node hosting
the `tablet_replica` that contains the range's `end_token`. That node
will typically not host the other tablets in the range, and, as noted
above, absent ranges are handled by shard 0. As a result, in such a
scenario shard 0 handles a significant portion of the range. This
issue is addressed later in this patch series by introducing
`shard_id` in `mapreduce_request`.

Ref. scylladb#21831
2025-06-25 10:18:02 +02:00