Commit Graph

25205 Commits

Author SHA1 Message Date
Konstantin Osipov
e49d5f89a5 raft: do not account for the same vote twice
While a duplicate vote from the same server is not possible by a
conforming Raft implementation, Raft assumptions on network permit
duplicates.
So, in theory, it is possible that a vote message is delivered
multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has voted yet, and reject duplicate votes.

A unit test follows.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
7ea064ac04 raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
4083026b65 raft: consistently use configuration from the log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
c4552ffb9a raft: add ostream serialization for enum vote_result 2021-02-18 16:04:44 +03:00
Konstantin Osipov
2ae04d8a47 raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
The leader's view of stable indexes is:

Server  Match Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
132db931da raft: add tracker test 2021-02-18 16:04:44 +03:00
Konstantin Osipov
6e3932bbc7 raft: tidy up follower_progress API
Make the API More explicit so it's available for testing.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
ed65a8635e raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e58a3e42ca raft: add a unit test for raft::log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
51c968bcb4 raft: rename log::non_snapshoted_length() to log::in_memory_size()
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
cfe407b402 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, rename truncate_head()
to truncate_uncommitted().
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e0011c6e4d raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
805d52eb16 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
af8770da63 raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
cb035a7c8d raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-18 16:04:43 +03:00
Konstantin Osipov
04b4d97d6a raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
97a16c0f77 raft: extend single_node_is_quiet test 2021-02-18 16:04:43 +03:00
Takuya ASADA
d7f202f900 dist/debian: fix renaming debian/scylla-* files rule
Current renaming rule of debian/scylla-* files is buggy, it fails to
install some .service files when custom product name specified.

Introduce regex based rewriting instead of adhoc renaming, and fixed
wrong renaming rule.

Fixes #8113

Closes #8114
2021-02-18 10:35:19 +02:00
Pekka Enberg
843bf57c3c Update tools/jmx submodule
* tools/jmx 949cefc...bf8bb16 (1):
  > Merge 'dist/debian: fix renaming debian/scylla-* files rule' from Takuya ASADA
2021-02-18 10:35:00 +02:00
Botond Dénes
c3b4c3f451 evictable_reader: reset _range_override after fast-forwarding
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader it's read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
2021-02-17 19:11:00 +02:00
Benny Halevy
4b46793c19 row_cache: scanning_and_populating_reader: add _read_next_partition flag
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler, _read_next_partition flag
on which `read_next_partition` will be called in the next iteration.

Then, read_next_partition can close _reader only before overwriting it
with a new reader.  Otherwise, if _reader is always closed in the
``reader_finished` case, we end up hitting premature end_of_stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
2021-02-17 19:06:21 +02:00
Benny Halevy
57540dae42 mutation_query: mark reconcilable_result_builder constructor noexcept
With result_memory_accounter begin nothrow move constructible
reconcilable_result_builder does not throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-67-bhalevy@scylladb.com>
2021-02-17 18:56:12 +02:00
Benny Halevy
92e0e84ee5 database: futurize remove
In preparation for futurizing the querier_cache api.

Coroutinize drop_column_family while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>
2021-02-17 18:52:53 +02:00
Benny Halevy
5263ab0e9d row_cache: read_context: use query-request is_single_partition helper
Rather than hand-coding the same logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-32-bhalevy@scylladb.com>
2021-02-17 18:29:39 +02:00
Benny Halevy
35256d1b92 treewide: explicitly use flat_mutation_reader_opt
Unlike flat_mutation_reader_opt that is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.

Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
2021-02-17 17:57:34 +02:00
Avi Kivity
c63e26e26f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator
2021-02-17 15:43:53 +02:00
Piotr Jastrzebski
649f254863 cdc: Limit size of topology description
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-17 13:24:40 +01:00
Avi Kivity
001652815c Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578

Closes #8106

* github.com:scylladb/scylla:
  imr: switch back to open-coded description of structures
  utils: managed_bytes: add a few trivial helper methods
  utils: fragment_range: move FragmentedView helpers to fragment_range.hh
  utils: fragment_range: add single_fragmented_mutable_view
  utils: fragment_range: implement FragmentRange for fragment_range
  utils: mutable_view: add front()
  types: remove an unused helper function
  test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
  test: mutation_test: remove an obsolete assertion
  test: mutation_test: initialize an uninitialized variable
  test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
2021-02-17 13:40:16 +02:00
Botond Dénes
ba7a9d2ac3 imr: switch back to open-coded description of structures
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578
2021-02-16 23:43:07 +01:00
Michał Chojnowski
25a9569cc4 utils: managed_bytes: add a few trivial helper methods
We will use them in the upcoming IMR removal patch.
2021-02-16 23:43:07 +01:00
Michał Chojnowski
3f248ca7cc utils: fragment_range: move FragmentedView helpers to fragment_range.hh
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
8a06a576aa utils: fragment_range: add single_fragmented_mutable_view
We will use it later in the upcoming IMR removal patch.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
7b662b9315 utils: fragment_range: implement FragmentRange for fragment_range
This will allow us to pass FragmentedView instances to places where
FragmentRange is expected.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
f972f90193 utils: mutable_view: add front()
We will use it in the upcoming patches.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
9e591c6634 types: remove an unused helper function 2021-02-16 21:35:14 +01:00
Michał Chojnowski
6b8a69e01f test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
5b79d6ca4c test: mutation_test: remove an obsolete assertion
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
aa60f28a09 test: mutation_test: initialize an uninitialized variable
It was assumed to be zero-initialized, but C++ does not guarantee that.
It has to be initialized explicitly.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
52bd190bb3 test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
Hovewer, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.

Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.

The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
2021-02-16 21:35:14 +01:00
Nadav Har'El
946e63ee6e cql-pytest: remove "xfail" tag from two passing tests
Issue #7595 was already fixed last week, in commit
b6fb5ee912, so the two tests which failed
because of this issue no longer fail and their "xfail" tag can be removed.

Refs #7595.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>
2021-02-16 19:17:22 +02:00
Nadav Har'El
737c1c6cc7 cql-pytest: Additional JSON tests
This patch adds several additional tests o test/cql-pytest/test_json.py
to reproduce additional bugs or clarify some non-bugs.

First, it adds a reproducer for issue #8087, where SELECT JSON may create
invalid JSON - because it doesn't quote a string which is part of a map's
key. As usual for these reproducers, the test passes on Cassandra, and fails
on Scylla (so marked xfail).

We have a bigger test translated from Cassandra's unit tests,
cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys
which demonstrates the same problem, but the test added in this patch is much
shorter and focuses on demonstrating exactly where the problem is.

Second, this patch adds a test test verifies that SELECT JSON works correctly
for UDTs or tuples where one of their components was never set - in such a
case the SELECT JSON should also output this component, with a "null" value.
And this test works (i.e., produces the same result in Cassandra and Scylla).
This test is interesting because it shows that issue #8092 is specific to the
case of an altered UDT, and doesn't happen for every case of null
component in a UDT.

Refs #8087
Refs #8092

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>
2021-02-16 16:05:31 +01:00
Avi Kivity
2f3b265dac Update seastar submodule
* seastar 76cff58964...e53a1059f9 (18):
  > rpc: streaming sink: order outgoing messages
Fixes #7552.
  > http: fix compilation issues when using clang++
  > http/file_handler: normalize file-type for mime detection
  > http/mime_types: add support for svg+xml
  > reactor: simplify get_sched_stats()
  > Merge "output_stream: make api noexcept" from Benny
  > Merge " input_stream: make api noexcept" from Benny
  > rpc: mark 'protocol' class as final
  > tls: reloadable_certificate inotify flag is wrong
Fixes #8082.
  > cli: Ignore the --num-io-queues option
  > io_queue: Do not carry start time in lambda capture
  > fstream: Cancel all IO-s on file_data_source_impl close
  > http: add "Transfer-Encoding: chunked" handling
  > http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked
  > http: add request content streaming
  > http: add reading/skipping all bytes in an input_stream
  > Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov
  > seastar-addr2line: split multiple addresses on the same line
2021-02-16 16:19:26 +02:00
Avi Kivity
789233228b messaging: don't inherit from seastar::rpc::protocol
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server as a way to avoid a
is unfortunate, as protocol.hh wasn't designed for inheritance, and
is not marked final.

Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.

Closes #8084
2021-02-16 16:04:44 +02:00
Gleb Natapov
c9392095ce cql3: store cf_prop_defs as optional instead of shared_ptr
It been a shard_ptr is a remnant of translation from Java.
Message-Id: <20210216123931.80280-3-gleb@scylladb.com>
2021-02-16 15:58:38 +02:00
Gleb Natapov
805da054e7 cql3: store cf_name as optional in cf_statement instead of shared_ptr
It been a shard_ptr is a remnant of translation from Java.
Message-Id: <20210216123931.80280-2-gleb@scylladb.com>
2021-02-16 15:58:37 +02:00
Gleb Natapov
6335af625e cql3: assert that unengaged optional is not accessed in keyspace_element_name::get_keyspace()
Message-Id: <20210216085545.54753-2-gleb@scylladb.com>
2021-02-16 15:36:00 +02:00
Gleb Natapov
200ca974c3 Do not access potentially unengaged optional in keyspace_element_name
Currently there are places that call
keyspace_element_name::get_keyspace() without checking that _ks_name is
engaged. Fix those places.
Message-Id: <20210216085545.54753-1-gleb@scylladb.com>
2021-02-16 15:35:59 +02:00
Botond Dénes
4d309fc34a repair: row_level: invoke on_internal_error() on out-of-order partitions
repair_writer::do_write(): already has a partition compare for each
mutation fragment written, do determine whether the fragment belongs to
another partition or not. This equal compare can be converted to a
tri_compare at no extra cost allowing for detecting out-of-order
partitions, in which case `on_internal_error()` is called.

Refs: #7623
Refs: #7552

Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>
2021-02-16 15:31:40 +02:00
Benny Halevy
50ca693a02 main: disable stall detector during startup
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode yet the stall detector's purpose is to detect
reactor stalls during normal operation where they can increase
the latency of other queries running in parallel.

Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just refrence
the respective github issues rather than auto-close them.

Refs #7150
Refs #5192
Refs #5960

Restore blocked_reactor_notify_ms right before
starting storage_proxy.  Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.

Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
2021-02-16 13:28:31 +02:00
Tomasz Grabiec
446ea07ac6 Merge "raft: server instance init and raft RPC handlers" from Pavel Solodovnikov
This series provides a `raft_services` class to create and store
a raft schema changes server instances, and also wires up the RPC handlers
for Raft RPC verbs.

* manmanson/raft-api-server-handlers-v10:
  raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances
  raft: move server address handling from `raft_rpc` to `raft_services` class
  raft: wire up schema Raft RPC handlers
  raft: raft_rpc: provide `update_address_mapping` and dispatcher functions
  raft: pass `group_id` as an argument to raft rpc messages
  raft: use a named constant for pre-defined schema raft group
2021-02-16 11:14:50 +01:00