After switching cells storage onto compact radix tree it
becomes useful to know the tree nodes' sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the 3rd storage type (radix tree) is all in, old
storage can be safely removed. The result is:
1. memory footprint
sizeof(class row): 112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes
the "in cache" value depends on the number of cells:
num of cells master patch
1 752 656
2 808 712
3 864 768
4 920 824
5 968 936
6 1136 992
...
16 1840 1672
17 1904 1992 (+88)
18 1976 2048 (+72)
19 2048 2104 (+56)
20 2120 2160 (+40)
21 2184 2208 (+24)
22 2256 2264 ( +8)
23 2328 2320
...
32 2960 2808
After 32 cells the storage switches into rbtree with
24-bytes per-cell overhead and the radix tree improvement
rocketlaunches
64 7872 6056
128 15040 9512
256 29376 18568
2. perf_mutation test is enhanced by this series and the
results differ depending on the number of columns used
tps value
--column-count master patch
1 59.9k 57.6k (-3.8%)
2 59.9k 57.5k
4 59.8k 57.6k
8 57.6k 57.7k <- eq
16 56.3k 57.6k
32 53.2k 57.4k (+7.9%)
A note on this. Last time 1-column test was ~5% worse which
was explained by inline storage of 5 cells that's present on
current implementation and was absent in radix tree.
An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.
The 1.2% improvement from v2 surprisingly came from the
tree::clone_from() which in v2 was work-around-ed by slow
walk+emplace sequence while this version has the optimized
API call for cloning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some places scylla clones collections of objects, so it's
sometimes needed to measure the speed of this operation.
This patch adds a placeholder for it, but no implementations
for any supported collections. It will be added soon for radix
tree.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The --column-count makes the test generate schema with
the given numbers of columns and make mutation maker
fill random column with the value on each iteration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach the schema builder and test itself to work on more
than one regular column, but for now only use 1, as before.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.
However, in case there was an error in register_inactive_read
the notification function isn't called leaving behind
state that needs to be cleaned up.
This series separates the register_inactive_reader interface
into 2 parts:
1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and return an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fance at this point
that will require cleanup.
This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrecy_semaphore
runs out of resources.
2. After register_inactive_reader succeeded to return a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the o_r_h.
After this state, the notify_handler will be called when
the inactive_reader is evicted, for any reason.
querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.
inactive_read_handle and inactive_read keeping track of each other
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object. The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.
Test: unit(release), querier_cache_test(debug)
"
* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
querier_cache: insert_querier: ignore errors to register inactive reader
querier_cache: insert_querier: handle errors
querier_utils: mark functions noexcept
reader_concurrency_semaphore: register_inactive_read: make noexcept
reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
reader_concurrency_semaphore: inactive_read: use intrusive list
reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
reader_concurrency_semaphore: inactive_read_handle: swap definition order
reader_lifecycle_policy: retire low level try_resume method
reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
Fixes#8062Closes#8063
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)
The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.
Fixes#7595Closes#8056
* github.com:scylladb/scylla:
tests: Adjusted tests for DC checking in NTS
locator: Check DC names in NTS
CQL test relied on quietly acceptiong non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.
Register the inactive reader first with no
evict_notify_handler and ttl.
Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to to place the querier in the index and erase it
on eviction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.
The fix is to lookup the object one more time after maybe_commit().
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache information needed to created them
(temporary_buffer<>). This solution takes this approach.
This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index
entry is parsed, by it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.
Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.
Load driver command line:
./scylla-bench \
-workload uniform \
-mode read \
--partition-count=10 \
-clustering-row-count=1 \
-concurrency 100
Scylla command line:
scylla --developer-mode=1 -c1 -m1G --enable-cache=0
The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.
Before:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 4706 4706 0 35ms 30ms 27ms 25ms 24ms 21ms 21ms
2s 4646 4646 0 42ms 31ms 31ms 27ms 25ms 21ms 22ms
3.1s 4670 4670 0 40ms 27ms 26ms 25ms 25ms 21ms 21ms
4.1s 4581 4581 0 39ms 33ms 33ms 27ms 26ms 21ms 22ms
5.1s 4345 4345 0 40ms 37ms 35ms 32ms 31ms 21ms 23ms
6.1s 4328 4328 0 49ms 40ms 34ms 32ms 31ms 22ms 23ms
7.1s 4198 4198 0 45ms 36ms 35ms 31ms 30ms 22ms 24ms
8.2s 3913 3913 0 51ms 50ms 50ms 39ms 35ms 24ms 26ms
9.2s 4524 4524 0 34ms 31ms 30ms 28ms 27ms 21ms 22ms
After:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 7913 7913 0 25ms 25ms 20ms 15ms 14ms 12ms 13ms
2s 7913 7913 0 18ms 18ms 18ms 16ms 14ms 12ms 13ms
3s 8125 8125 0 20ms 20ms 17ms 15ms 14ms 12ms 12ms
4s 5609 5609 0 41ms 35ms 29ms 28ms 27ms 13ms 18ms
5.1s 8020 8020 0 18ms 17ms 17ms 15ms 14ms 12ms 13ms
6.1s 7102 7102 0 27ms 27ms 24ms 19ms 18ms 13ms 14ms
7.1s 5780 5780 0 26ms 26ms 26ms 23ms 22ms 17ms 18ms
8.1s 6530 6530 0 37ms 34ms 26ms 22ms 20ms 15ms 15ms
9.1s 7937 7937 0 19ms 19ms 17ms 17ms 16ms 12ms 13ms
Tests:
- unit [release]
- scylla-bench
"
* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
sstables: Share partition index pages between readers
sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
sstables: index_reader: Do not store cluster index cursor inside partition indexes
Our test for tracing Alternator requests can't be sure when tracing a request
finished, because tracing is asynchronous and has no official ending signal.
So before we can conclude that tracing failed, we need to wait until a
timeout, which in the current code was roughly 6.4 seconds (the timeout
logic is unnecessarily convoluted, but to make a long story short it has
exponential sleeps starting with 0.1 second and ending with 3.2 seconds,
totaling 6.4 seconds).
It turns out that sporadically, in test runs on overcommitted test machines
with the very slow debug build, we fail this test with this timeout.
So this patch increases the timeout to 51.2 seconds. It should be more
than enough for everyone. Famous last words :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210204151554.582260-1-nyh@scylladb.com>
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).
In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.
We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).
Fixes#8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
if storing of inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.
Tests: unit(release, debug:v1)
"
* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: store inactive readers directly
querier_cache: store readers in the reader concurrency semaphore directly
querier_cache: retire memory based cache eviction
querier_cache: delegate expiry to the reader_concurrency_semaphore
reader_concurrency_semaphore: introduce ttl for inactive reads
querier_cache: use new eviction notify mechanism to maintain stats
reader_concurrency_semaphore: add eviction notification facility
reader_concurrency_semaphore: extract evict code into method evict()
This is the continuaiton of the row-cache performance
improvements, this time -- the rework of clustering keys part.
The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree
Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:
There's no copyable key at hands. The clustering key is the
managed_bytes() that is not nothrow-copy-constructibe, neither
it's hash-able for lookup due to prefix lookup.
Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.
B-trees are like B+, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger then B+ ones, that
have data-s only in leaf trees. Not to make the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.
Not to invalidate iterators on insert/remove the tree nodes keep
pointers on keys, not the keys themselves.
The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% less comparisons on random
insert/lookup test.
Numbers:
- memory_footprint: B-tree master
rows_entry size: 216 232
1 row
in-cache: 968 960 (because of dummy entry)
in-memtable: 1006 1022
100 rows
in-cache: 50774 50856
in-memtable: 50620 50918
- mutation_test: B-tree master
tps.average: 891177 833896
- simple_query: B-tree master
tps.median: 71807 71656
tps.maximum: 71847 71708
* xemul/clustering-cache-over-btree-4:
mutation_partition: Save one keys comparison
partition_snapshot_row_cursor: Remove rows pointer
mutation_partition: Use B-tree insertion sugar
perf-test : Print B-tree sizes
mutation_partition: Switch cache of rows onto B-tree
partition_snapshot_reader: Rename cmp to less for explicity
mutation_partition: Make insertion bullet-proof
mutation_partition: Use tri-compare in non-set places
flat_mutation_reader: Use clear() in destroy_current_mutation()
rows_entry: Generalize compare
utils: Intrusive B-tree (with tests)
tests: Generalize bptree compaction test
tests: Generalize bptree stress test
Fixes#7615
Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.
Small test included.
Closes#7616
* github.com:scylladb/scylla:
commitlog_test: Add multi-entry write test
commitlog: Add "add_entries" call to allow inputting N mutations
commitlog: Make commitlog entries optionally multi-entry
commitlog: Move entry_writer definition to cc file
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
is the partition has no rows.
Unlike the previous attempt, push_ready_fragments() is not touched.
The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
Test: unit (dev, debug, release)
Fixes#7883.
Closes#7928
* github.com:scylladb/scylla:
sstable: reader: preempt after every fragment
clustering_range_walker: fix false discontiguity detected after a static row
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.
Add unit test that reproduces this scenario
and verifies the internal error exception.
Fixes#7977
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
"
This series extends the scylla_metadata sstable component
to hold an optional testual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)
The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().
A get_origin() method was added to class sstable to retrieve
its origin. It returns an empty string by default if the origin
is missing.
Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
in generates. Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.
A unit test was added to test the sstable_origin by writing either
an empty origin and a random string, and then comparing
the origin retrieved by sstable::load to the one written.
Test: unit(release)
Fixes#7880
"
* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
compaction: log sstable origin
sstables: scylla_metadata: add support for sstable_origin
sstables: sstable_writer_config: add origin member
After the switch from BST to B-tree the memory foorprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The switch is pretty straightforward, and consists of
- change less-compare into tri-compare
- rename insert/insert_check into insert_before_hint
- use tree::key_grabber in mutation_partition::apply_monotonically to
exception-safely transfer a row from one tree to another
- explicitly erase the row from tree in rows_entry::on_evicted, there's
a O(1) tree::iterator method for this
- rewrite rows_entry -> cache_entry transofrmation in the on_evicted to
fit the B-tree API
- include the B-tree's external memory usage into stats
That's it. The number of keys per node was is set to 12 with linear search
and linear extention of 20 because
- experimenting with tree shows that numbers 8 through 10 keys with linear
search show the best performance on stress tests for insert/find-s of
keys that are memcmp-able arrays of bytes (which is an approximation of
current clustring key compare). More keys work slower, but still better
than any bigger value with any type of search up to 64 keys per node
- having 12 keys per nodes is the threshold at which the memory footprint
for B-tree becomes smaller than for boost::intrusive::set for partitions
with 32+ keys
- 20 keys for linear root eats the first-split peak and still performs
well in linear search
As a result the footpring for B tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The design of the tree goes from the row-cache needs, which are
1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible
With the above the design is
Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes
and N pointers on keys (not keys themselves). Two differences: inner nodes have
array of pointers on kids, leaf nodes keep pointer on the tree (to update left-
and rightmost tree pointers on node move).
Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.
In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().
Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.
The code was copied from the B+ tree one, then heavily reworked, the internal
algorythms turned out to differ quite significantly.
For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.
As measured by the perf_collections test the B-tree with 8 keys is faster, than
the std::set, but slower than the B+tree:
vs set vs b+tree
fill: +13% -6%
find: +23% -35%
Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
less pointers to set-up and less instructions to execute when linking
node with root).
v4:
- equip insertion methods with on_alloc_point() calls to catch
potential exception guarantees violations eariler
- add unlink_leftmost_without_rebalance. The method is borrowed from
boost intrusive set, and is added to kill two birds -- provide it,
as it turns out to be popular, and use a bit faster step-by-step
tree destruction than plain begin+erase loop
v3:
- introduce "inline" root node that is embedded into tree object and in
which the 1st key is inserted. This greatly improves the 1-key-tree
performance, which is pretty common case for rows cache
v2:
- introduce "linear" root leaf that grows on demand
This improves the memory consumption for small trees. This linear node may
and should over-grow the NodeSize parameter. This comes from the fact that
there are two big per-key memory spikes on small trees -- 1-key root leaf
and the first split, when the tree becomes 1-key root with two half-filled
leaves. If the linear extention goes above NodeSize it can flatten even the
2nd peak
- mitigate the keys indirection a bit
Prefetching the keys while doing the intra-node linear scan and the nodes
while descending the tree gives ~+5% of fill and find
- generalize stress tests for B and B+ trees
- cosmetic changes
TODO:
- fix few inefficincies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes, that are not lef-/righmost not to carry
unused tree pointer on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.
This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.
A unit test is added.
Ref #7883.
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
Misc cleanups.
* scylla-dev/raft-config-changes-v2:
raft: update README.md
raft: add a simple test for configuration changes
raft: joint consensus, wire up configuration changes in the API
raft: joint consensus, count votes using joint config
raft: joint consensus, wire up configuration changes in FSM
raft: joint consensus, update progress tracker with joint configuration
raft: joint consensus, don't store configuration in FSM
raft: joint consensus, keep track of the last confchange index in the log
raft: joint consensus, implement helpers in class configuration
raft: joint consensus, use unordered_set for server_address list
raft: joint consensus, switch configuration to joint
raft: rename check_committed() to maybe_commit()
raft: fix spelling and add comments
Add new scylla_metadata_type::SSTableOrigin.
Store and retrive a sstring to the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.
Add unit test to verify the sstable_origin extension
using both empty and a random string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parmeters that calls the base-class'
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.
The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.
Refs: #8001.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
In order to work correctly in transitional configuration,
participants must enter it after crashes, restarts and
state changes.
This means it must be stored in Raft log and snapshot
on the leader and followers.
This is most easily done if transitional configuration
is just a flavour of standard configuration.
In FSM, rename _current_config to _configuration,
it now contains both current and future configuration
at all times.
The idea of the monotonicity checking test is: try to apply
one one random partition to another random one sequentually
failing allocations. Each time allocation fails (with the
bad_alloc exception) -- check the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we make sure, that an
exception may pop up at any point of application and it
will be safe.
This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.
Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.
"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.
The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.
The raft table stores the only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.
IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for persistency module and for future RPC implmementation.
The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking `deserialize` function, which is
added in the patch.
The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.
There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.
The patch-set also contains the test suite covering basic
functionality of the persistency module.
* manmanson/raft-api-impl-v11:
raft/sys_table_storage: add basic tests for raft_sys_table_storage
raft: introduce `raft_sys_table_storage` class
utils: add `fragmented_temporary_buffer::allocate_to_fit`
raft: add IDL definitions for raft types
raft: create `system.raft` and `system.raft_snapshots` tables
serializer: add `serializer<lw_shared_ptr<T>>` specialization
serializer: add `deserialize` function overload for `bytes_ostream`
The test suite covers the most basic use cases for the system table
backed raft persistency module:
* store/load vote and term
* store/load snapshot
* store snapshot with log tail truncation
* store/load log entries
* log truncation
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.
To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.
A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).
Closes#7947
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses
This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.
This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.
One very sad discovery I made while working on this patch-set is that
we
still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in the
`mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.
Tests: unit(release, debug:v3)
"
* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
cql_query_test: add unit test covering the non-optimal TWCS sstable read path
sstable_mutation_reader: consolidate constructors
tests: don't pass temporary ranges to readers
sstables: sstable_mutation_reader: remove now unused whole sstable constructor
sstables: stats: remove now unused sstable_partition_reads counter
sstable: remove read_.*row.*_flat() methods
tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
sstables: pass partition_range to create_single_key_sstable_reader()
sstables: sstable: add make_reader()
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.