Commit Graph

787 Commits

Author SHA1 Message Date
Konstantin Osipov
132db931da raft: add tracker test 2021-02-18 16:04:44 +03:00
Konstantin Osipov
6e3932bbc7 raft: tidy up follower_progress API
Make the API more explicit so it's available for testing.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e58a3e42ca raft: add a unit test for raft::log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
cb035a7c8d raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-18 16:04:43 +03:00
Konstantin Osipov
97a16c0f77 raft: extend single_node_is_quiet test 2021-02-18 16:04:43 +03:00
Botond Dénes
c3b4c3f451 evictable_reader: reset _range_override after fast-forwarding
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
2021-02-17 19:11:00 +02:00
Benny Halevy
35256d1b92 treewide: explicitly use flat_mutation_reader_opt
Unlike flat_mutation_reader_opt that is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.

Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.
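
The difference described above can be illustrated with a small standalone sketch. The `reader` and `reader_opt` types below are hypothetical stand-ins written for this example; they are not Scylla's `flat_mutation_reader` or `optimized_optional`, and only mimic the property the message relies on (no separate "engaged" flag, so a moved-from object reads as empty).

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-in for a reader: a move-only handle to an implementation.
struct reader {
    std::unique_ptr<int> impl;
    explicit operator bool() const noexcept { return bool(impl); }
};

// Hypothetical stand-in for an "optimized optional": it keeps no separate
// engaged flag and simply reports whether the wrapped object is non-empty.
class reader_opt {
    reader _r;
public:
    reader_opt() = default;
    reader_opt(reader r) : _r(std::move(r)) {}
    reader_opt(reader_opt&&) = default;
    reader_opt& operator=(reader_opt&&) = default;
    explicit operator bool() const noexcept { return bool(_r); }
};

// Returns whether a std::optional<reader> still reports a value after its
// contents were moved out (the use after move is intentional here).
inline bool std_optional_engaged_after_move() {
    std::optional<reader> o{reader{std::make_unique<int>(1)}};
    std::optional<reader> taken = std::move(o);
    (void)taken;
    return o.has_value();
}

// Same experiment with the flag-less wrapper.
inline bool reader_opt_engaged_after_move() {
    reader_opt o{reader{std::make_unique<int>(1)}};
    reader_opt taken = std::move(o);
    (void)taken;
    return bool(o);
}
```

With these definitions the moved-from `std::optional` still holds a (moved-from) value, while the wrapper reads as empty, which is what makes a "was it closed before destruction" check straightforward.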

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
2021-02-17 17:57:34 +02:00
Avi Kivity
c63e26e26f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator
2021-02-17 15:43:53 +02:00
Piotr Jastrzebski
649f254863 cdc: Limit size of topology description
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-17 13:24:40 +01:00
Botond Dénes
ba7a9d2ac3 imr: switch back to open-coded description of structures
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578
2021-02-16 23:43:07 +01:00
Michał Chojnowski
6b8a69e01f test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
5b79d6ca4c test: mutation_test: remove an obsolete assertion
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
aa60f28a09 test: mutation_test: initialize an uninitialized variable
It was assumed to be zero-initialized, but C++ does not guarantee that.
It has to be initialized explicitly.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
52bd190bb3 test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
However, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.

Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.

The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
2021-02-16 21:35:14 +01:00
Tomasz Grabiec
508f928220 tests: sstables: Test sstable write fails on missing partition_end mid-stream
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>
2021-02-15 15:45:49 +02:00
Avi Kivity
9cbbf40710 Merge "register_inactive_read: error handling" from Benny
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.

However, in case there was an error in register_inactive_read
the notification function isn't called leaving behind
state that needs to be cleaned up.

This series separates the register_inactive_reader interface
into 2 parts:

1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and returns an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fancy at this point
that will require cleanup.

This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrency_semaphore
runs out of resources.

2. After register_inactive_reader succeeds in returning a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the i_r_h.

After this stage, the notify_handler will be called when
the inactive_reader is evicted, for any reason.

querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.

inactive_read_handle and inactive_read keeping track of each other
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object.  The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.

Test: unit(release), querier_cache_test(debug)
"

* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
  querier_cache: insert_querier: ignore errors to register inactive reader
  querier_cache: insert_querier: handle errors
  querier_utils: mark functions noexcept
  reader_concurrency_semaphore: register_inactive_read: make noexcept
  reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
  reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
  reader_concurrency_semaphore: inactive_read: use intrusive list
  reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
  reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
  reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
  reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
  reader_concurrency_semaphore: inactive_read_handle: swap definition order
  reader_lifecycle_policy: retire low level try_resume method
  reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
2021-02-10 19:09:21 +02:00
Piotr Sarna
4acc6fecf0 Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.

Fixes #7595

Closes #8056

* github.com:scylladb/scylla:
  tests: Adjusted tests for DC checking in NTS
  locator: Check DC names in NTS
2021-02-09 14:45:20 +02:00
Juliusz Stasiewicz
97bb15b2f2 tests: Adjusted tests for DC checking in NTS
CQL test relied on quietly accepting non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexistent
`datacenter2` and had to be removed.
2021-02-09 08:29:35 +01:00
Benny Halevy
46c2229b78 reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
Register the inactive reader first with no
evict_notify_handler and ttl.

Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to place the querier in the index and erase it
on eviction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
e072199b8d reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Konstantin Osipov
adc87aa278 raft: re-lookup progress object after a configuration change
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.

The fix is to lookup the object one more time after maybe_commit().
2021-02-05 12:40:19 +01:00
Benny Halevy
ca6f5cb0bc test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
2021-02-03 12:21:20 +02:00
Avi Kivity
913d970c64 Merge "Unify inactive readers" from Botond
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.

Tests: unit(release, debug:v1)
"

* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: store inactive readers directly
  querier_cache: store readers in the reader concurrency semaphore directly
  querier_cache: retire memory based cache eviction
  querier_cache: delegate expiry to the reader_concurrency_semaphore
  reader_concurrency_semaphore: introduce ttl for inactive reads
  querier_cache: use new eviction notify mechanism to maintain stats
  reader_concurrency_semaphore: add eviction notification facility
  reader_concurrency_semaphore: extract evict code into method evict()
2021-02-03 10:59:04 +02:00
Tomasz Grabiec
873e732042 Merge "Switch partition rows onto B-tree" from Pavel Emelyanov
This is the continuation of the row-cache performance
improvements, this time -- the rework of clustering keys part.

The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree

Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:

  There's no copyable key at hand. The clustering key is the
  managed_bytes() that is not nothrow-copy-constructible, nor is
  it hashable for lookup due to prefix lookup.

Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.

B-trees are like B+, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger than B+ ones, that
have data only in leaf nodes. Not to make the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.

Not to invalidate iterators on insert/remove the tree nodes keep
pointers on keys, not the keys themselves.

The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% fewer comparisons on random
insert/lookup test.
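
Where the saving comes from can be seen in a minimal sketch of a lookup inside one sorted node. The functions and the comparison counter below are illustrative assumptions, not the tree's actual code: with a less-only predicate, a lookup needs one extra call to distinguish "equal" from "greater", while a tri-compare answers both in a single call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct find_result {
    bool found;
    std::size_t comparisons; // number of comparator invocations
};

// Linear search in a sorted node using only a "less" predicate.
inline find_result find_with_less(const std::vector<int>& keys, int key) {
    std::size_t n = 0;
    auto less = [&n](int a, int b) { ++n; return a < b; };
    std::size_t i = 0;
    while (i < keys.size() && less(keys[i], key)) {
        ++i;
    }
    // Here keys[i] >= key; a second call is needed to tell equal from greater.
    bool found = i < keys.size() && !less(key, keys[i]);
    return {found, n};
}

// The same search using a three-way comparator (<0, 0, >0).
inline find_result find_with_tri(const std::vector<int>& keys, int key) {
    std::size_t n = 0;
    auto tri = [&n](int a, int b) { ++n; return (a > b) - (a < b); };
    for (int k : keys) {
        int c = tri(k, key);
        if (c == 0) {
            return {true, n};
        }
        if (c > 0) {
            break; // passed the position where the key would be
        }
    }
    return {false, n};
}
```

The sketch only shows the mechanism; the ~10% figure quoted above comes from the author's measurements and is not reproduced by it.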

Numbers:

- memory_footprint: B-tree       master
  rows_entry size:  216          232

  1 row
   in-cache:        968          960     (because of dummy entry)
   in-memtable:     1006         1022

  100 rows
   in-cache:        50774        50856
   in-memtable:     50620        50918

- mutation_test:    B-tree       master
   tps.average:     891177       833896

- simple_query:     B-tree       master
   tps.median:      71807        71656
   tps.maximum:     71847        71708

* xemul/clustering-cache-over-btree-4:
  mutation_partition: Save one keys comparison
  partition_snapshot_row_cursor: Remove rows pointer
  mutation_partition: Use B-tree insertion sugar
  perf-test : Print B-tree sizes
  mutation_partition: Switch cache of rows onto B-tree
  partition_snapshot_reader: Rename cmp to less for explicity
  mutation_partition: Make insertion bullet-proof
  mutation_partition: Use tri-compare in non-set places
  flat_mutation_reader: Use clear() in destroy_current_mutation()
  rows_entry: Generalize compare
  utils: Intrusive B-tree (with tests)
  tests: Generalize bptree compaction test
  tests: Generalize bptree stress test
2021-02-02 12:26:02 +01:00
Tomasz Grabiec
75eb97b12c Merge 'Commitlog multi-entry write' from Calle Wilund
Fixes #7615

Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.

Small test included.

Closes #7616

* github.com:scylladb/scylla:
  commitlog_test: Add multi-entry write test
  commitlog: Add "add_entries" call to allow inputting N mutations
  commitlog: Make commitlog entries optionally multi-entry
  commitlog: Move entry_writer definition to cc file
2021-02-02 12:23:19 +01:00
Tomasz Grabiec
7b17969a6e Merge 'sstable: reader: preempt after every fragment' from Avi Kivity
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
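
The added condition can be sketched roughly as follows. `fragment_consumer`, `push_fragment()` and the injected `need_preempt` callback are hypothetical names used only for this illustration; the callback stands in for the reactor's preemption check, and the real consumer is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

enum class proceed { no, yes };

// Hypothetical, simplified consumer that buffers fragments.
class fragment_consumer {
    std::size_t _buffered = 0;
    std::size_t _max_buffered;
    std::function<bool()> _need_preempt; // stand-in for the preemption check
public:
    fragment_consumer(std::size_t max_buffered, std::function<bool()> need_preempt)
        : _max_buffered(max_buffered), _need_preempt(std::move(need_preempt)) {}

    bool is_buffer_full() const { return _buffered >= _max_buffered; }

    // Called after every pushed fragment: pause the state machine if the
    // buffer is full OR if the task should yield.
    proceed push_fragment() {
        ++_buffered;
        return (is_buffer_full() || _need_preempt()) ? proceed::no : proceed::yes;
    }
};
```

Returning `proceed::no` in the second case is what hands control back to the outer loop, which is then free to yield.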

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev, debug, release)

Fixes #7883.

Closes #7928

* github.com:scylladb/scylla:
  sstable: reader: preempt after every fragment
  clustering_range_walker: fix false discontiguity detected after a static row
2021-02-02 12:21:58 +01:00
Benny Halevy
0fecc78d88 user_function: throw on_internal_error if executed outside a seastar thread
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.

Add unit test that reproduces this scenario
and verifies the internal error exception.

Fixes #7977

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
2021-02-02 13:03:39 +02:00
Calle Wilund
720a47fe8a commitlog_test: Add multi-entry write test 2021-02-02 10:41:08 +00:00
Avi Kivity
da4fa0629a Merge "sstables: add sstable_origin to scylla_metadata" from Benny
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)

The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().

A get_origin() method was added to class sstable to retrieve
its origin.  It returns an empty string by default if the origin
is missing.

Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
it generates.  Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.

A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.

Test: unit(release)

Fixes #7880
"

* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
  compaction: log sstable origin
  sstables: scylla_metadata: add support for sstable_origin
  sstables: sstable_writer_config: add origin member
2021-02-02 10:35:11 +02:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from tree in rows_entry::on_evicted, there's
  a O(1) tree::iterator method for this

- rewrite rows_entry -> cache_entry transformation in the on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node is set to 12 with linear search
and linear extension of 20 because

- experimenting with tree shows that numbers 8 through 10 keys with linear
  search show the best performance on stress tests for insert/find-s of
  keys that are memcmp-able arrays of bytes (which is an approximation of
  current clustering key compare). More keys work slower, but still better
  than any bigger value with any type of search up to 64 keys per node

- having 12 keys per node is the threshold at which the memory footprint
  for B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result the footprint for B tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
2f7c03d84c utils: Intrusive B-tree (with tests)
The design of the tree goes from the row-cache needs, which are

1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible

With the above the design is

Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes
and N pointers on keys (not keys themselves). Two differences: inner nodes have
array of pointers on kids, leaf nodes keep pointer on the tree (to update left-
and rightmost tree pointers on node move).

Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.

In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().

Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.

The code was copied from the B+ tree one, then heavily reworked, the internal
algorithms turned out to differ quite significantly.

For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.

As measured by the perf_collections test the B-tree with 8 keys is faster than
the std::set, but slower than the B+tree:

            vs set        vs b+tree
   fill:     +13%           -6%
   find:     +23%          -35%

Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
less pointers to set-up and less instructions to execute when linking
node with root).

v4:
- equip insertion methods with on_alloc_point() calls to catch
  potential exception guarantees violations earlier

- add unlink_leftmost_without_rebalance. The method is borrowed from
  boost intrusive set, and is added to kill two birds -- provide it,
  as it turns out to be popular, and use a bit faster step-by-step
  tree destruction than plain begin+erase loop

v3:
- introduce "inline" root node that is embedded into tree object and in
  which the 1st key is inserted. This greatly improves the 1-key-tree
  performance, which is pretty common case for rows cache

v2:
- introduce "linear" root leaf that grows on demand

  This improves the memory consumption for small trees. This linear node may
  and should over-grow the NodeSize parameter. This comes from the fact that
  there are two big per-key memory spikes on small trees -- 1-key root leaf
  and the first split, when the tree becomes 1-key root with two half-filled
  leaves. If the linear extension goes above NodeSize it can flatten even the
  2nd peak

- mitigate the keys indirection a bit

  Prefetching the keys while doing the intra-node linear scan and the nodes
  while descending the tree gives ~+5% of fill and find

- generalize stress tests for B and B+ trees

- cosmetic changes

TODO:

- fix a few inefficiencies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes, that are not left-/rightmost not to carry
  unused tree pointer on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:29 +03:00
Avi Kivity
7634a90dd2 clustering_range_walker: fix false discontiguity detected after a static row
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.

This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.

A unit test is added.

Ref #7883.
2021-02-01 19:32:07 +02:00
Pavel Solodovnikov
9d17a654a6 raft: use null_sharder for raft tables
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>
2021-02-01 18:52:04 +02:00
Tomasz Grabiec
eac9c1d80a Merge "raft: configuration changes with joint consensus" from Kostja
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
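
The acceptance rule for the joint entry can be sketched as a pair of majority checks. The helper names and the use of plain integers as server ids are assumptions made for this illustration; the series keeps this logic in its configuration and progress-tracker classes.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

using server_id = int; // hypothetical id type for the sketch

// True if `acks` contains a strict majority of `members`.
inline bool has_majority(const std::set<server_id>& members,
                         const std::set<server_id>& acks) {
    std::size_t n = 0;
    for (server_id id : members) {
        n += acks.count(id);
    }
    return n > members.size() / 2;
}

// A joint entry is accepted only when BOTH the old and the new
// configuration have a majority of acknowledgements.
inline bool joint_accepted(const std::set<server_id>& c_old,
                           const std::set<server_id>& c_new,
                           const std::set<server_id>& acks) {
    return has_majority(c_old, acks) && has_majority(c_new, acks);
}
```

For example, with C_old = {1, 2, 3} and C_new = {3, 4, 5}, acknowledgements from {1, 2, 3} are not enough (C_new lacks a majority), while {1, 3, 4} satisfies both configurations.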

Misc cleanups.

* scylla-dev/raft-config-changes-v2:
  raft: update README.md
  raft: add a simple test for configuration changes
  raft: joint consensus, wire up configuration changes in the API
  raft: joint consensus, count votes using joint config
  raft: joint consensus, wire up configuration changes in FSM
  raft: joint consensus, update progress tracker with joint configuration
  raft: joint consensus, don't store configuration in FSM
  raft: joint consensus, keep track of the last confchange index in the log
  raft: joint consensus, implement helpers in class configuration
  raft: joint consensus, use unordered_set for server_address list
  raft: joint consensus, switch configuration to joint
  raft: rename check_committed() to maybe_commit()
  raft: fix spelling and add comments
2021-02-01 18:52:04 +02:00
Benny Halevy
77328a936a sstables: scylla_metadata: add support for sstable_origin
Add new scylla_metadata_type::SSTableOrigin.
Store and retrieve a sstring to the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.

Add unit test to verify the sstable_origin extension
using both empty and a random string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
22f6023ac3 sstables: sstable_writer_config: add origin member
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)

If configure_writer is called with a nullptr, the origin
will be equal to an empty string.

Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin.  This was to reduce the
code churn in this patch and to keep the tests simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Konstantin Osipov
b7692af8bc raft: add a simple test for configuration changes
Test adding, removing, and replacing a node.

With fix-ups by Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-29 22:07:08 +03:00
Konstantin Osipov
1ca738d9a2 raft: joint consensus, use unordered_set for server_address list 2021-01-29 22:07:07 +03:00
Pavel Emelyanov
575c992a35 test: Bring test_apply_monotonically_is_monotonic back to work
The idea of the monotonicity checking test is: try to apply
one random partition to another random one while sequentially
failing allocations. Each time allocation fails (with the
bad_alloc exception) -- check the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we make sure, that an
exception may pop up at any point of application and it
will be safe.

This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.

Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
2021-01-29 18:47:15 +01:00
Tomasz Grabiec
16eb4c6ce2 Merge "raft: system table backed persistency module" from Pavel Solodovnikov
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.

"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for persistency module and for future RPC implementation.

The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking a `deserialize` function, which is
added in the patch.

The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.
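
A rough sketch of what a pointer serializer can look like, under loud assumptions: `toy_serializer` and `toy_buffer` are hypothetical, `std::shared_ptr` stands in for `lw_shared_ptr`, and the byte layout is made up. It is not the serializer added by the patch; it only illustrates delegating to the pointee's serializer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using toy_buffer = std::vector<std::uint8_t>;

// Primary template, specialized per serializable type.
template <typename T>
struct toy_serializer;

// Fixed-width little-endian encoding for a 32-bit integer.
template <>
struct toy_serializer<std::uint32_t> {
    static void write(toy_buffer& out, std::uint32_t v) {
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<std::uint8_t>(v >> shift));
        }
    }
    static std::uint32_t read(const toy_buffer& in, std::size_t& pos) {
        std::uint32_t v = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            v |= static_cast<std::uint32_t>(in.at(pos++)) << shift;
        }
        return v;
    }
};

// Pointer specialization: writes the pointee, reads into a fresh object.
template <typename T>
struct toy_serializer<std::shared_ptr<T>> {
    static void write(toy_buffer& out, const std::shared_ptr<T>& p) {
        assert(p); // this sketch has no encoding for a null pointer
        toy_serializer<T>::write(out, *p);
    }
    static std::shared_ptr<T> read(const toy_buffer& in, std::size_t& pos) {
        return std::make_shared<T>(toy_serializer<T>::read(in, pos));
    }
};
```

Round-tripping a value through `toy_serializer<std::shared_ptr<std::uint32_t>>` yields a new pointer to an equal value, not the original pointer.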

There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.
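
To illustrate the size limit, here is a sketch of how a requested size could be split into fragments. `fragment_sizes_to_fit` is a hypothetical helper, "128kb" is assumed to mean 128 * 1024 bytes, and the greedy split (full fragments first, remainder last) is an assumption rather than a description of `allocate_to_fit` itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed interpretation of the 128kb per-fragment limit.
constexpr std::size_t max_fragment_size = 128 * 1024;

// Returns the sizes of the fragments needed to hold exactly `total` bytes.
std::vector<std::size_t> fragment_sizes_to_fit(std::size_t total) {
    std::vector<std::size_t> sizes;
    while (total > 0) {
        const std::size_t current = std::min(total, max_fragment_size);
        assert(current <= max_fragment_size); // no fragment exceeds the limit
        sizes.push_back(current);
        total -= current;
    }
    return sizes;
}
```

Under these assumptions a 300 KiB request yields two full fragments and one 44 KiB tail, and a zero-sized request yields no fragments.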

The patch-set also contains the test suite covering basic
functionality of the persistency module.

* manmanson/raft-api-impl-v11:
  raft/sys_table_storage: add basic tests for raft_sys_table_storage
  raft: introduce `raft_sys_table_storage` class
  utils: add `fragmented_temporary_buffer::allocate_to_fit`
  raft: add IDL definitions for raft types
  raft: create `system.raft` and `system.raft_snapshots` tables
  serializer: add `serializer<lw_shared_ptr<T>>` specialization
  serializer: add `deserialize` function overload for `bytes_ostream`
2021-01-29 11:40:39 +02:00
Pavel Solodovnikov
e309502c42 raft/sys_table_storage: add basic tests for raft_sys_table_storage
The test suite covers the most basic use cases for the system table
backed raft persistency module:
 * store/load vote and term
 * store/load snapshot
 * store snapshot with log tail truncation
 * store/load log entries
 * log truncation

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:27 +03:00
Kamil Braun
bf115e7d69 schema_tables: put schema tables on shard 0
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.

To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.

A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).
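
The name-based selection described above can be sketched roughly as follows. `uses_shard_zero` and `shard_for` are hypothetical helpers, and the `token % shard_count` fallback is a placeholder for illustration, not the real sharding algorithm.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// True for the tables this commit pins to shard 0: everything under
// `system_schema`, plus `system.scylla_table_schema_history`.
bool uses_shard_zero(std::string_view keyspace, std::string_view table) {
    return keyspace == "system_schema"
        || (keyspace == "system" && table == "scylla_table_schema_history");
}

// Picks a shard for a token. The modulo fallback is only a placeholder.
unsigned shard_for(std::string_view keyspace, std::string_view table,
                   std::uint64_t token, unsigned shard_count) {
    assert(shard_count > 0);
    if (uses_shard_zero(keyspace, table)) {
        return 0;
    }
    return static_cast<unsigned>(token % shard_count);
}
```

A schema table added outside `system_schema` would not match the predicate in this sketch, which is the same blind spot the regression test has.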

Closes #7947
2021-01-28 13:28:22 +02:00
Avi Kivity
32cdcc0c8b Merge "sstables: consolidate reader factory methods" from Botond
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses

This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.

This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.

One very sad discovery I made while working on this patch-set is that we
still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in the `mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.

Tests: unit(release, debug:v3)
"

* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
  cql_query_test: add unit test covering the non-optimal TWCS sstable read path
  sstable_mutation_reader: consolidate constructors
  tests: don't pass temporary ranges to readers
  sstables: sstable_mutation_reader: remove now unused whole sstable constructor
  sstables: stats: remove now unused sstable_partition_reads counter
  sstable: remove read_.*row.*_flat() methods
  tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
  sstables: pass partition_range to create_single_key_sstable_reader()
  sstables: sstable: add make_reader()
2021-01-28 12:05:06 +02:00
Botond Dénes
1e9ce62ee6 cql_query_test: add unit test covering the non-optimal TWCS sstable read path
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.
2021-01-28 11:34:03 +02:00
Botond Dénes
dd26a96e63 tests: don't pass temporary ranges to readers
The sstable_mutation_reader, like all other mutation readers expects
that the partition-range passed to it is kept alive by its creator
for the duration of its lifetime. However, the single-key constructor
of the sstable reader was more tolerant, as it only extracted the key
from the range, essentially requiring only the key to be kept alive (but
not the containing range). Naturally in time some code came to rely on
it and ended up passing temporary ranges to the reader. This behaviour
will no longer be acceptable as we are about to consolidate the various
sstable reader constructors, uniformly requiring that the range is kept
alive. So this patch fixes up the tests so they work with this stricter
requirement. Only two occurrences were found.
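
A sketch of the lifetime contract, with hypothetical `toy_range` and `toy_reader` types. The reader keeps only a reference, so the caller must keep the range alive. Deleting the rvalue overload is one way to reject temporaries at compile time; it is shown for illustration and is not something this patch does.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

struct toy_range {
    std::string start;
    std::string end;
};

class toy_reader {
    const toy_range& _range; // not owned, must outlive the reader
public:
    explicit toy_reader(const toy_range& range) : _range(range) {
        assert(range.start <= range.end); // expects a well-formed range
    }
    // Passing a temporary would leave _range dangling, so reject it.
    explicit toy_reader(toy_range&&) = delete;

    const std::string& start() const { return _range.start; }
};

// A named range is accepted, a temporary is rejected at compile time.
static_assert(std::is_constructible_v<toy_reader, const toy_range&>);
static_assert(!std::is_constructible_v<toy_reader, toy_range&&>);
```

The fixed-up tests follow the first form: the range is a named variable that lives at least as long as the reader.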
2021-01-27 17:38:17 +02:00
Botond Dénes
c3b4e990a2 tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Avi Kivity
aec231ba2e Merge "Unify query paths" from Botond
"
Currently we have two parallel query paths:
* database::query() -> table::query() -> data_query()
* mutation::query()

The former is used by single partition queries, the latter by range
scans, as mutation::query() is used to convert reconcilable_result to
query::result (which means it is also used in single partition queries
if it triggers read repair). This is a rather unfortunate situation as
we have two parallel implementations of the query code, which means they
are prone to diverge, and in fact they already have -- more on that
later.

This patchset aims to remedy this situation by retiring
`mutation::query()` and migrating users to an implementation based on
the "standard" query path, in other words one using the same building
blocks as the `database::query()` path. This means using
`compact_mutation` for compacting and `query_result_builder` for result
building. These components, however, were created to work with
`flat_mutation_reader`, and introducing a reader into this pipeline
would mean that we'd have to make all the related APIs asynchronous,
which would cause an insane amount of churn. To avoid this, this
patchset adds an API compatible `consume()` method to `mutation`, which
can accept a `compact_mutation` instance as-is. This allows an elegant
and succinct reimplementation. So far so good.

As mentioned above, the two implementations have diverged in time, or
have been different from the start. The difference manifests when
calculating digests, more precisely in which tombstones are included in
the digest. The retired `mutation::query()` path incorporates only
non-purgeable tombstones in the digest. The standard query path however
incorporates all tombstones, even those that can be purged. After some
scrutiny however this difference proved to be completely theoretical,
as the code path where this would matter -- converting reconcilable result
to query result -- passes min timestamp as the query time to the
compaction, so nothing is compacted and hence the difference has no
chance to manifest.

This patch-set was motivated by the desire to provide a single solution
to #7434, instead of two, one for each path.

Tests: unit(release:v2, debug:v2, dev:v3)
"

* 'unified-query-path/v3' of https://github.com/denesb/scylla:
  mutation: remove now unused query() and query_compacted()
  treewide: use query_mutations() instead of mutation::query()
  mutation_test: test_query_digest: ensure digest is produced consistently
  mutation_query: introduce query_mutation()
  mutation_query: to_data_query_result(): migrate to standard query code
  mutation_query: move to_data_query_result() to mutation_partition.cc
  mutation: add consume()
  flat_mutation_reader: move mutation consumer concepts to separate header
  mutation compactor: query compaction: ignore purgeable tombstones
2021-01-27 15:58:47 +02:00
Avi Kivity
f58151d191 test: mutation_test: fix initialization order bug with thread local storage
test_cell_external_memory_usage uses with_allocator() to observe how some
types allocate memory. However, compiler reordering (observed with clang 11
on aarch64) can move the various thread-local CQL type object initialization
into the with_allocator() scope; so any managed object allocated as part of
this initialization also gets measured, and the test fails. The code movement
is legal, as far as I can tell.

Fix this by initializing the type object early; use an atomic_thread_fence
as an optimization barrier so the compiler doesn't eliminate or move
the early initialization.
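
The shape of the fix can be sketched as below. All names (`counting_scope`, `note_allocation`, `toy_type`, `the_type`, `measure`) are hypothetical and this is not the test's code. A function-local thread-local is used here, whose initialization point is well defined, so the sketch cannot reproduce the reordering itself; it only shows the pattern of touching the object first and placing a fence before the measured scope.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for with_allocator(): counts "allocations" made
// while a scope object is alive on the current thread.
struct counting_scope {
    static inline thread_local bool active = false;
    static inline thread_local std::size_t count = 0;
    counting_scope() {
        assert(!active); // scopes do not nest in this sketch
        active = true;
        count = 0;
    }
    ~counting_scope() { active = false; }
};

void note_allocation() {
    if (counting_scope::active) {
        ++counting_scope::count;
    }
}

// A type object whose construction allocates.
struct toy_type {
    toy_type() { note_allocation(); }
    int id = 42;
};

const toy_type& the_type() {
    static thread_local toy_type instance; // lazily initialized per thread
    return instance;
}

std::size_t measure() {
    (void)the_type(); // early initialization, outside the measured scope
    // Used as an optimization barrier, as the commit message describes.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    counting_scope scope;
    note_allocation(); // the one allocation the test wants to observe
    (void)the_type();  // already initialized, so nothing extra is counted
    return counting_scope::count;
}
```

With the early touch in place, `measure()` counts only the allocation made inside the scope.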

Closes #7951
2021-01-26 11:14:42 +02:00
Benny Halevy
1847d49971 test: test_env: pick the highest sstable version by default
If possible, test the highest sstable format version,
as it's the most used.

If there are pre-written sstables we need to load from the
test directory from an older version, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.
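
The lookup can be pictured with the sketch below. `toy_version` and `highest_version` are hypothetical, the enumerator names and their ordering are assumptions for illustration, and the directory scan that would produce the input list is left out. This is not the implementation of `test_env::reusable_sst`.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Assumed version names, declared from oldest to newest.
enum class toy_version { ka, la, mc, md };

// Returns the highest version among those found for a generation, or
// nothing if no sstable with that generation exists.
std::optional<toy_version> highest_version(const std::vector<toy_version>& found) {
    if (found.empty()) {
        return std::nullopt;
    }
    const auto it = std::max_element(found.begin(), found.end());
    assert(it != found.end()); // guaranteed by the empty check above
    return *it;
}
```

Because the enumerators are declared oldest first, comparing them directly orders versions by age in this sketch.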

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
2021-01-24 10:38:55 +02:00
Botond Dénes
1a3ee71b39 treewide: use query_mutations() instead of mutation::query()
We want to retire the latter.
2021-01-22 15:36:37 +02:00