Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
Closes#8250
* github.com:scylladb/scylla:
commitlog: Make pre-allocation drop O_DSYNC while pre-filling
commitlog: coroutinize allocate_segment_ex
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.
Fixes#8234.
Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)
Closes#8281
* github.com:scylladb/scylla:
test: logalloc_test: harden background reclain test against cpu overcommit
logalloc: background reclaim: use default scheduling group for adjusting shares
logalloc: background reclaim: log shares adjustment under trace level
logalloc: background reclaim: fix shares not updated by periodic timer
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
This is how PhD explain the need for prevoting stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
Prevoting stage is addressing that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state(in the same term).
In our implementation we have "stable leader" extension that prevents
spurious RequestVote to dispose an active leader, but AppendEntries with
higher term will still do that, so prevoting extension is also required.
* scylla-dev/raft-prevote-v5:
raft: store leader and candidate state in state variant
raft: add boost tests for prevoting
raft: implement prevoting stage in leader election
raft: reset the leader on entering candidate state
raft: use modern unordered_set::contains instead of find in become_candidate
We already have server state dependant state in fsm, so there is no need
to maintain "voters" and "tracker" optionals as well. The upside is that
optional and variant sates cannot drift apart now.
"
Currently the sstable reader code is scattered across several source
files as following (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
byte stream;
This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.
This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.
This patchset only moves code around, no logical changes are made.
Tests: unit(dev)
"
* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
sstables: get rid of mp_row_consumer.{hh,cc}
sstables: get rid of row.hh
sstables/mp_row_consumer.hh: remove unused struct new_mutation
sstables: move mx specific context and consumer to mx/reader.cc
sstables: move kl specific context and consumer to kl/reader.cc
sstables: mv partition.cc sstable_mutation_reader.hh
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by test is moved to
kl/reader_impl.hh, while code that can be hidden us moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause. This will allow us to eventually drop the `restrictions` hierarchy altogether.
Tests: unit (dev, debug)
Closes#8227
* github.com:scylladb/scylla:
cql3: Make get_clustering_bounds() use expressions
cql3/expr: Add is_multi_column()
cql3/expr: Add more operators to needs_filtering
cql3: Replace CK-bound mode with comparison_order
cql3/expr: Make to_range globally visible
cql3: Gather slice-defining WHERE expressions
cql3: Add statement_restrictions::_where
test: Add unit tests for get_clustering_bounds
Use expressions instead of _clustering_columns_restrictions. This is
a step towards replacing the entire restrictions class hierarchy with
expressions.
Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().
This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.
This change adjusted the expose max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that the allocated chunks would normally be allocated
in 128KB chunks in the write() path.
Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.
Refs #7950
Refs #8081
Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
sstable_set: move all() implementation into sstable_set_impl
sstable_set: preparatory work to change sstable_set::all() api
sstables: remove bag_sstable_set
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.
Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.
this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.
so even range-based loops on all() have to keep a ref to the returned
list, to avoid the list from being prematurely destroyed.
so the following code
for (auto& sst : *sstable_set.all()) { ...}
becomes
for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierachy.
So this patchset folds it into its only real user: the sstable reader.
"
* 'data_consume_context_bye' of https://github.com/denesb/scylla:
sstable: move data_consume_* factory methods to row.hh
sstables: fold data_consume_context: into its users
sstables: partition.cc: remove data_consume_* forward declarations
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more then mere forwarding can be folded into its single
real user: `sstable_reader`.
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.
However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that
evictable size < chunk size < evictable size + dummy size
In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.
fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
seed from the issue + 200 times more with random seeds)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
This patch adds a support for non-voting members. Non voting member is a
member which vote is not counted for leader election purposes and commit
index calculation purposes and it cannot become a leader. But otherwise
it is a normal raft node. The state is needed to let new nodes to catch
up their log without disturbing a cluster.
All kind of transitions are allowed. A node may be added as a voting member
directly or it may be added as non-voting and then changed to be voting
one through additional configuration change. A node can be demoted from
voting to non-voting member through a configuration change as well.
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
Test log consistency after apply_snapshot() is called.
Ensure log::last_term() log::last_conf_index() and log::size()
work as expected.
Misc cleanups.
* scylla-dev.git/raft-confchange-test-v4:
raft: fix spelling
raft: add a unit test for voting
raft: do not account for the same vote twice
raft: remove fsm::set_configuration()
raft: consistently use configuration from the log
raft: add ostream serialization for enum vote_result
raft: advance commit index right after leaving joint configuration
raft: add tracker test
raft: tidy up follower_progress API
raft: update raft::log::apply_snapshot() assert
raft: add a unit test for raft::log
raft: rename log::non_snapshoted_length() to log::in_memory_size()
raft: inline raft::log::truncate_tail()
raft: ignore AppendEntries RPC with a very old term
raft: remove log::start_idx()
raft: return a correct last term on an empty log
raft: do not use raft::log::start_idx() outside raft::log()
raft: rename progress.hh to tracker.hh
raft: extend single_node_is_quiet test
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.
Not the latest version of the series was merged. Rvert prior to
merging the latest one.
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries work as the user expects
them to work: unselected columns are not included in determining the
size of the result (or that of the page). This is the aspect this test
is checking for. While at it, also test single partition queries too.
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until its merged, given that it also depends on other unmerged pull requests.
The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).
Closes#8140
* github.com:scylladb/scylla:
treewide: remove timeout config from query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
service: add timeout config to client state
This will prevent accumulation of unnecessary dummy entries.
A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.
Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.
In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry since
it will not be needed, it will fall inside a broader continuous
range. This will be the case for time series worklodas which scan with
a decreasing (newest) lower bound.
Refs #8153.
_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.
The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.
perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:
$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]
Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.
Fixes#2622
Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.comCloses#8111
* github.com:scylladb/scylla:
sstables: add test for checking the latency of updating the sstable_set in a table
sstables: move column_family_test class from test/boost to test/lib
sstables: use fast copying of the sstable_set instead of rebuilding it
sstables: replace the sstable_set with a versioned structure
sstables: remove potential ub
sstables: make sstable_set constructor less error-prone
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.
As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
The optimal path of said method mistakenly captures `pos` (a local
variable) in its reader factory method and passes a temporary range
implicitly constructed from said `pos` as the range parameter to the
sstable reader. This will lead to the sstable reader using a dangling
range and will result in returning no result for queries. This patch
fixes this bug and adds a unit test to cover this code path.
Fixes#8138.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data then requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
Fixes: #8162
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
The multishard combining reader currently assumes that all shards have
data for the read range. This however is not always true and in extreme
cases (like reading a single token) it can lead to huge read
amplification. Avoid this by not pushing shards to
`_shard_selection_min_heap` if the first token they are expected to
produce falls outside of the read range. Also change the read ahead
algorithm to select the shards from `_shard_selection_min_heap`, instead
of walking them in shard order. This was wrong in two ways:
* Shards may be ordered differently with respect to the first partition
they will produce; reading ahead on the next shard in shard order
might not bring in data on the next shard the read will continue on.
Shard order is only correct when starting a new range and shards are
iterated over in the order they own tokens according to the sharding
algorithm.
* Shards that may not have data relevant to the read range are also
considered for read ahead.
After this patch, the multishard reader will only read from shards that
have data relevant to the read range, both in the case of normal reads
and also for read-ahead.
Fixes: #8161
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
This PR introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
We plug the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service call
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
Some parts of generation management still remain in storage_service:
the bootstrap procedure, which happens inside storage_service,
must also do some initialization regarding CDC generations,
for example: on restart it must retrieve the latest known generation
timestamp from disk; on bootstrap it must create a new generation
and announce it to other nodes. The order of these operations w.r.t
the rest of the startup procedure is important, hence the startup
procedure is the only right place for them. We may try decoupling
these services even more in follow-up PRs, but that requires a bit
of careful reasoning. What this PR does is a low-hanging fruit.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these handling functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
This PR is a prerequisite to fixing #7985. The fact that all the CDC generation
management code was in storage_service is technical debt. It will be easier
to modify the management algorithms when they sit in their own module.
Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java
Closes#8172
* github.com:scylladb/scylla:
cdc: move (most of) CDC generation management code to the new service
cdc: coroutinize make_new_cdc_generation
cdc: coroutinize update_streams_description
cdc: introduce cdc::generation_service
main: move cdc_service initialization just prior to storage_service initialization
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and and reclaim work can be done from a preemptible
context.
If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.
A unit test is added to verify it works.
Fixes#1634.
Closes#8044
* github.com:scylladb/scylla:
test: logalloc_test: test background reclaim
logalloc: reduce gap between std min_free and logalloc min_free
logalloc: background reclaim
logalloc: preemptible reclaim
Test that the background reclaimer is able to compete with a
fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU"
is fully randomized.
If the background reclaimer is disabled, the test fails as soon as the
20MB "gap" is exhausted. With the reclaimer enabled, it is able to
free memory ahead of the allocations.
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
The implementation is a stub for now, the service reacts to generation
changes by simply logging the event.
The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage and then it's switched to std::set. Once switched, the whole
union becomes the waste of space, as it's size is
sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes
and only 3 pointers from it are used (std::set header). Also the
overhead to keep cell_and_hash as a set entry is more then the size
of the structure itself.
Column ids are 32-bit integers that most likely come sequentialy.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.
This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.
The most notable result is the memory footprint decrease, for wide
rows down to 2x times. The performance of micro-benchmarks is a bit
lower for small rows and (!) higer for longer (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2)
v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
- changed functions not to return values via refs-arguments
- fixed nested classes to properly use language constructors
- renamed index_to to key_t to distinguish from node_index_t
- improved recurring variadic templates not to use sentinel argument
- use standard concepts
v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)
A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as cells container most of this growth drowned in lsa alignments.
next todo:
- aarch64 version of 16-keys node search
tests: unit(dev), unit(debug for radix*), pref(dev)
"
* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
test/memory_footpring: Print radix tree node sizes
row: Remove old storages
row: Prepare row::equal for switch
row: Prepare row::difference for switch
row: Introduce radix tree storage type
row-equal: Re-declare the cells_equal lambda
test: Add tests for radix tree
utils: Compact radix tree
array-search: Add helpers to search for a byte in array
test/perf_collection: Add callback to check the speed of clone
test/perf_mutation: Add option to run with more than 1 columns
test/perf_mutation: Prepare to have several regular columns
test/perf_mutation: Use builder to build schema
raft::log::start_idx() is currently not meaningful
in case the log is empty.
Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().
As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.
This change happens to reduce the overall line count.
While at it, improve the comments in raft::replicate_to().
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
---
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
Closes#8116
* github.com:scylladb/scylla:
tests: add a simple CDC cql pytest
cdc: add config option to disable streams rewriting
cdc: rewrite streams to the new description table
cql3: query_processor: improve internal paged query API
cdc: introduce no_generation_data_exception exception type
docs: cdc: mention system.cdc_local table
cdc: coroutinize do_update_streams_description
sys_dist_ks: split CDC streams table partitions into clustered rows
cdc: use chunked_vector for streams in streams_version
cdc: remove `streams_version::expired` field
system_distributed_keyspace: use mutation API to insert CDC streams
storage_service: don't use `sys_dist_ks` before it is started