Commit Graph

29508 Commits

Author SHA1 Message Date
Nadav Har'El
e4b2dfb54d alternator ttl: when node is down, secondary node continues to expire
The current implementation of the Alternator expiration (TTL) feature
has each node scan for expired partitions in its own primary ranges.
This means that while a node is down, items in its primary ranges will
not get expired.

But we note that it doesn't have to be this way: If only a single node is
down, and RF=3, the items that node owns are still readable with QUORUM -
so these items can still be safely read and checked for expiration - and
also deleted.

This patch implements a fairly simple solution: When a node completes
scanning its own primary ranges, it also checks whether any of its *secondary*
ranges (ranges where it is the *second* replica) has its primary owner
down. For such ranges, this node will scan them as well. This secondary
scan stops if the remote node comes back up, but in that case it may
happen that both nodes will work on the same range at the same time.
The risks in that are minimal, though, and amount to wasted work and
duplicate deletion records in CDC. In the future we could avoid this by
using LWT to claim ownership on a range being scanned.
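
A minimal sketch of the range selection described above, written in Python for brevity (the patch itself is C++). The names `TokenRange`, `ranges_to_scan` and `live_nodes` are illustrative assumptions, not identifiers from the patch, and the sketch ignores the case where the primary owner comes back up mid-scan:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    name: str
    replicas: tuple  # replicas[0] is the primary owner, replicas[1] the second replica


def ranges_to_scan(node, ranges, live_nodes):
    """Return (primary, secondary) lists of ranges that `node` should scan."""
    # Every node always scans the ranges it owns as primary replica.
    primary = [r for r in ranges if r.replicas[0] == node]
    # It additionally scans ranges where it is the second replica,
    # but only while the primary owner of that range is down.
    secondary = [
        r for r in ranges
        if len(r.replicas) > 1
        and r.replicas[1] == node
        and r.replicas[0] not in live_nodes
    ]
    return primary, secondary


ranges = [
    TokenRange("r1", ("n1", "n2", "n3")),
    TokenRange("r2", ("n2", "n3", "n1")),
    TokenRange("r3", ("n3", "n1", "n2")),
]
# n1 is down, so n2 picks up r1 in addition to its own r2.
primary, secondary = ranges_to_scan("n2", ranges, live_nodes={"n2", "n3"})
print([r.name for r in primary], [r.name for r in secondary])
```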

We have a new dtest (see a separate patch), alternator_ttl_tests.py::
TestAlternatorTTL::test_expiration_with_down_node, which reproduces this
and verifies this fix. The test starts a 5-node cluster, with 1000 items
with random tokens which are due to be expired immediately. The test
expects to see all items expiring ASAP, but when one of the five nodes
is brought down, this doesn't happen: Some of the items are not expired,
until this patch is used.

Fixes #9787

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211222131933.406148-1-nyh@scylladb.com>
2021-12-26 14:10:52 +02:00
Benny Halevy
f7b8b809d0 sstables: parse chunked_vector<std::integral Members>: maximize chunk size
Currently this parse function reads only 100KB worth
of members in each iteration.

Since the default max_chunk_capacity is 128KB,
100KB underutilizes the chunk capacity, and it could
be safely increased to the max to reduce the number of
allocations and corresponding calls to read_exactly
for large arrays.

Expose utils::chunked_vector::max_chunk_capacity
so that the caller wouldn't have to guess this number
and use it in parse().
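
As a rough illustration of the saving, the number of reads can be estimated as below. This is a back-of-the-envelope sketch in Python, not the parse code: it assumes that KB means 1024 bytes, that each read fetches a whole number of members, and that the array holds one million 8-byte integers (all hypothetical figures).

```python
KB = 1024


def read_calls(total_members, member_size, bytes_per_read):
    """Number of reads needed when each read fetches at most bytes_per_read."""
    members_per_read = bytes_per_read // member_size
    # Ceiling division without going through floating point.
    return -(-total_members // members_per_read)


n = 1_000_000  # hypothetical array of 8-byte integers
print(read_calls(n, 8, 100 * KB))  # 100KB per iteration
print(read_calls(n, 8, 128 * KB))  # full chunk capacity per iteration
```

Under these assumptions the larger read size needs 62 reads instead of 79.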

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222103126.1819289-2-bhalevy@scylladb.com>
2021-12-22 15:47:37 +02:00
Benny Halevy
d95f6602a7 sstables: coroutinize parse functions
Simplify the implementation using coroutines.
This also has the potential to coalesce multiple
allocations into one.

test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222103126.1819289-1-bhalevy@scylladb.com>
2021-12-22 15:47:37 +02:00
Benny Halevy
2f2e3b2e84 test: lib: index_reader_assertions: close reader before it is destroyed
Otherwise, it may trip an assertion when the underlying
file is closed, as seen in e.g.:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4318/artifact/testlog/x86_64_release/sstable_3_x_test.test_read_rows_only_index.4174.log
```
test/boost/sstable_3_x_test.cc(0): Entering test case "test_read_rows_only_index"
sstable_3_x_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
Aborting on shard 0.
Backtrace:
  0x22557e8
  0x2286842
  0x7f2799e99a1f
  /lib64/libc.so.6+0x3d2a1
  /lib64/libc.so.6+0x268a3
  /lib64/libc.so.6+0x26788
  /lib64/libc.so.6+0x35a15
  0x222c53d
  0x222c548
  0xb929cc
  0xc0b23b
  0xa84bbf
  0x24d0111
```

Decoded:
```
__GI___assert_fail at :?
~file_data_source_impl at ./build/release/seastar/./seastar/src/core/fstream.cc:205
~file_data_source_impl at ./build/release/seastar/./seastar/src/core/fstream.cc:202
std::default_delete<seastar::data_source_impl>::operator()(seastar::data_source_impl*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~data_source at ././seastar/include/seastar/core/iostream.hh:55
 (inlined by) ~input_stream at ././seastar/include/seastar/core/iostream.hh:254
 (inlined by) ~continuous_data_consumer at ././sstables/consumer.hh:484
 (inlined by) ~index_consume_entry_context at ././sstables/index_reader.hh:116
 (inlined by) std::default_delete<sstables::index_consume_entry_context<sstables::index_consumer> >::operator()(sstables::index_consume_entry_context<sstables::index_consumer>*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~index_bound at ././sstables/index_reader.hh:395
 (inlined by) ~index_reader at ././sstables/index_reader.hh:435
std::default_delete<sstables::index_reader>::operator()(sstables::index_reader*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~index_reader_assertions at ././test/lib/index_reader_assertions.hh:31
 (inlined by) operator() at ./test/boost/sstable_3_x_test.cc:4630
```

Test: unit(dev), sstable_3_x_test.test_read_rows_only_index(release X 10000)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222132858.2155227-1-bhalevy@scylladb.com>
2021-12-22 15:33:22 +02:00
Raphael S. Carvalho
e80cb51b6a distributed_loader: make shutdown clean by properly handling compaction_stopped exception
Today, when resharding is interrupted, shutdown will not be clean
because stopped exception interrupts the shutdown process.
Let's handle stopped exception properly, to allow shutdown process
to run to completion.

Refs #9759

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211221175717.62293-1-raphaelsc@scylladb.com>
2021-12-22 15:08:31 +02:00
Botond Dénes
def6d48307 Merge 'gdb: Introduce "scylla lsa-check"' from Tomasz Grabiec
Catches inconsistencies in LSA state.

Currently:

  - discrepancy between segment set in _closed_segments and shard's
    segment descriptors

  - cross-shard segment references in _closed_segments

  - discrepancy in _closed_occupancy stats and what's in segment
    descriptors

  - segments not present in _closed_segments but present in
    segment descriptors

Refs https://github.com/scylladb/scylla/issues/9544

Closes #9834

* github.com:scylladb/scylla:
  gdb: Introduce "scylla lsa-check"
  gdb: Make get_base_class_offset() also see indirect base classes
2021-12-22 15:08:31 +02:00
Pavel Emelyanov
7286374dba migration_manager: Remove last occurrence of get_local_storage_proxy()
The migration manager got local storage proxy reference recently, but one
method still uses the global call. Fix it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211221120034.21824-1-xemul@scylladb.com>
2021-12-22 15:08:31 +02:00
Botond Dénes
aba68c8f83 Merge "reader_concurrency_semaphore: convert to flat_mutation_reader_v2" from Michael
"
The second patch in this series is a mechanical conversion of
reader_concurrency_semaphore to flat_mutation_reader_v2, and caller
updates.

The first patch is needed to pass the test suite, since without it a
real reader version conversion would happen on every entry to and exit
from reader_concurrency_semaphore, which is stressful (for example:
mutation_reader_test.test_multishard_streaming_reader reaches 8191
conversions for a couple of readers, which somehow causes it to catch
SIGSEGV in diverse and seemingly-random places).

Note that in a real workload it is unreasonable to expect readers being
parked in a reader_concurrency_semaphore to be pristine, so
short-circuiting their version conversions will be impossible and this
workaround will not really help.
"

* tag 'rcs-v2-v4' of https://github.com/cmm/scylla:
  reader_concurrency_semaphore: convert to flat_mutation_reader_v2
  short-circuit flat mutation reader upgrades and downgrades
2021-12-22 15:08:31 +02:00
Tomasz Grabiec
3e81318587 gdb: Introduce "scylla lsa-check"
Catches inconsistencies in LSA state.

Currently:

  - discrepancy between segment set in _closed_segments and shard's
    segment descriptors

  - cross-shard segment references in _closed_segments

  - discrepancy in _closed_occupancy stats and what's in segment
    descriptors

  - segments not present in _closed_segments but present in
    segment descriptors
2021-12-21 21:18:52 +01:00
Tomasz Grabiec
d754504fa2 gdb: Make get_base_class_offset() also see indirect base classes
I need it so that segment_descriptor is seen as inheriting from
list_base_hook<>, which it does via log_heap_hook.
2021-12-21 21:18:52 +01:00
Michael Livshin
a1b8ba23d2 reader_concurrency_semaphore: convert to flat_mutation_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-12-21 11:26:17 +02:00
Michael Livshin
9f656b96ac short-circuit flat mutation reader upgrades and downgrades
When asked to upgrade a reader that itself is a downgrade, try to
return the original v2 reader instead, and likewise when downgrading
upgraded v1 readers.

This is desirable because version transformations can result from,
say, entering/leaving a reader concurrency semaphore, and the amount
of such transformations is practically unbounded.

Such short-circuiting is only done if it is safe, that is: the
transforming reader's buffer is empty and its internal range tombstone
tracking state is discardable.
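
The idea can be sketched as follows, in Python for brevity. The classes `ReaderV2`, `DowngradedReader` and `UpgradedReader` and their fields are hypothetical stand-ins, not the actual reader types; only the shape of the check (empty buffer, discardable tombstone state) follows the description above:

```python
class ReaderV2:
    """Stand-in for a native v2 reader."""


class DowngradedReader:
    """v1 view over a v2 reader (hypothetical adapter)."""

    def __init__(self, underlying_v2):
        self.underlying = underlying_v2
        self.buffer = []  # fragments converted but not yet consumed
        self.tombstone_state_discardable = True

    def is_pristine(self):
        # Unwrapping is only safe if no converted state would be lost.
        return not self.buffer and self.tombstone_state_discardable


class UpgradedReader:
    """v2 view over a v1 reader (hypothetical adapter)."""

    def __init__(self, underlying_v1):
        self.underlying = underlying_v1


def upgrade_to_v2(reader):
    # Short-circuit: upgrading a pristine downgrade returns the original.
    if isinstance(reader, DowngradedReader) and reader.is_pristine():
        return reader.underlying
    return UpgradedReader(reader)


original = ReaderV2()
assert upgrade_to_v2(DowngradedReader(original)) is original
```

Without the short-circuit, each round trip would add another pair of wrapper layers around the original reader.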

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-12-21 11:26:17 +02:00
Raphael S. Carvalho
64ec1c6ec6 table: Make sure major compaction doesn't miss data in memtable
Make sure that major will compact data in all sstables and memtable,
as tombstones sitting in memtable could shadow data in sstables.
For example, a tombstone in memtable deleting a large partition could
be missed in major, so space wouldn't be saved as expected.
Additionally, write amplification is reduced as data in memtable
won't have to travel through tiers once flushed.
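
A toy model of the problem, in Python for brevity. It is a simplification under stated assumptions: each source maps a partition key to a `(timestamp, value)` pair, `None` stands for a tombstone, and tombstone purging (gc grace) is ignored. None of the names come from the patch:

```python
TOMBSTONE = None


def major_compact(sources):
    """Merge all sources, keeping only the newest entry per key."""
    merged = {}
    for source in sources:
        for key, (ts, value) in source.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged


sstable = {"big": (1, "x" * 1000), "small": (1, "y")}
memtable = {"big": (2, TOMBSTONE)}  # newer deletion, not flushed yet

# Compacting sstables only: the large partition survives.
without_memtable = major_compact([sstable])
# Including the memtable contents: the tombstone shadows the data.
with_memtable = major_compact([sstable, memtable])
print(len(without_memtable["big"][1]), with_memtable["big"])
```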

Fixes #9514.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217160055.96693-2-raphaelsc@scylladb.com>
2021-12-21 07:21:34 +02:00
Raphael S. Carvalho
e1e8e020fe tests: Allow memtable to be flushed through column_family_for_tests
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217160055.96693-1-raphaelsc@scylladb.com>
2021-12-21 07:21:26 +02:00
Raphael S. Carvalho
e05859c3f9 compaction: kill unused code for resharding_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-2-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Raphael S. Carvalho
d1f2fd7f03 compaction: rename compacting_sstable_writer to compacted_fragments_writer
the name compacting_sstable_writer is misleading as it doesn't perform
any compaction. let's rename it to a name that reflects more what it
does.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-1-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Avi Kivity
f190434beb Merge "table,sstable_set: use v2 readers below the cache" from Botond
"
Convert sstable_set and table::make_sstable_reader() to v2. With this
all readers below cache use the v2 format.

Tests: unit(dev)
"

* 'table-make-sstable-reader-v2/v1' of https://github.com/denesb/scylla:
  table: upgrade make_sstable_reader() to v2
  sstables/sstable_set: create_single_key_sstable_reader() upgrade to v2
  sstables/sstable_set: remove unused and undefined make_reader() member
2021-12-20 17:53:44 +02:00
Botond Dénes
18cddd3279 table: upgrade make_sstable_reader() to v2
With this all readers below cache use the v2 format (except kl/la
readers).
2021-12-20 17:40:46 +02:00
Botond Dénes
9027c6f936 sstables/sstable_set: create_single_key_sstable_reader() upgrade to v2
With this all methods of the sstable set create v2 readers.
2021-12-20 17:17:33 +02:00
Botond Dénes
847eddf19a sstables/sstable_set: remove unused and undefined make_reader() member
2021-12-20 17:17:31 +02:00
Botond Dénes
55bb70a878 Merge "Make sure TWCS per-window major includes all files" from Raphael
"
TWCS performs STCS on a window as long as it's the most recent one.
From there on, TWCS will compact all files in the past window into
a single file. With some moderate write load, it could happen that
there's still some compaction activity in that past window, meaning
that per-window major may miss some files being currently compacted.
As a result, a past window may contain more than 1 file after all
compaction activity is done on its behalf, which may increase read
amplification. To avoid that, TWCS will now make sure that per-window
major is serialized, to make sure no files are missed.

Fixes #9553.

tests: unit(dev).
"

* 'fix_twcs_per_window_major_v3' of https://github.com/raphaelsc/scylla:
  TWCS: Make sure major on past window is done on all its sstables
  TWCS: remove needless param for STCS options
  TWCS: kill unused param in newest_bucket()
  compaction: Implement strategy control and wire it
  compaction: Add interface to control strategy behavior.
2021-12-20 17:12:50 +02:00
Avi Kivity
e772fcbd57 Merge "Convert combined reader to v2" from Botond
"
Users are adjusted by sprinkling `upgrade_to_v2()` and
`downgrade_to_v1()` where necessary (or removing any of these where
possible). No attempt was made to optimize and reduce the amount of
v1<->v2 conversions. This is left for follow-up patches to keep this set
small.

The combined reader is composed of 3 layers:
1. fragment producer - pop fragments from readers, return them in batches
  (each fragment in a batch having the same type and pos).
2. fragment merger - merge fragment batches into single fragments
3. reader implementation glue-code

Converting layers (1) and (3) was mostly mechanical. The logic of
merging range tombstone changes is implemented at layer (2), so the two
different producer (layer 1) implementations we have share this logic.
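
The three layers can be pictured with the following Python sketch. It is only a schematic: fragments are modelled as `(pos, kind, value)` tuples, and the merge rule (keep the largest value) is a placeholder assumption standing in for the real merging logic, including range tombstone changes:

```python
import heapq
from itertools import groupby


def produce_batches(readers):
    """Layer 1: pop fragments and return them in batches of equal (pos, kind)."""
    same = lambda f: (f[0], f[1])
    # Each reader must already be sorted by (pos, kind).
    merged = heapq.merge(*readers, key=same)
    for _, batch in groupby(merged, key=same):
        yield list(batch)


def merge_batch(batch):
    """Layer 2: collapse one batch into a single fragment."""
    pos, kind, _ = batch[0]
    return (pos, kind, max(value for _, _, value in batch))


def combined_reader(readers):
    """Layer 3: glue the producer and the merger together."""
    for batch in produce_batches(readers):
        yield merge_batch(batch)


r1 = [(1, "row", 10), (3, "row", 5)]
r2 = [(1, "row", 20), (2, "row", 7)]
print(list(combined_reader([r1, r2])))
```

Because merging lives entirely in `merge_batch`, a different producer can be swapped in without touching it, which mirrors the sharing described above.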

Tests: unit(dev)
"

* 'combined-reader-v2/v4' of https://github.com/denesb/scylla:
  test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging
  mutation_reader: convert make_clustering_combined_reader() to v2
  mutation_reader: convert position_reader_queue to v2
  mutation_reader: convert make_combined_reader() overloads to v2
  mutation_reader: combined_reader: convert reader_selector to v2
  mutation_reader: convert combined reader to v2
  mutation_reader: combined_reader: attach stream_id to mutation_fragments
  flat_mutation_reader_v2: add v2 version of empty reader
  test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation
2021-12-20 14:01:03 +02:00
Botond Dénes
7f331cee01 test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging
Stressing the range tombstone change merging logic.
2021-12-20 09:29:05 +02:00
Botond Dénes
e1bbc4a480 mutation_reader: convert make_clustering_combined_reader() to v2
Just sprinkle the right amount of downgrade_to_v1() and upgrade_to_v2() to
call sites, no attempt at optimization was made.
2021-12-20 09:29:05 +02:00
Botond Dénes
2364144b19 mutation_reader: convert position_reader_queue to v2
By removing the converting (v1->v2) constructor of
`reader_and_upper_bound` and adjusting its users.
2021-12-20 09:29:05 +02:00
Botond Dénes
aeddcf50a1 mutation_reader: convert make_combined_reader() overloads to v2
Just sprinkle the right amount of downgrade_to_v1() and upgrade_to_v2() to
call sites, no attempt at optimization was made.
2021-12-20 09:29:05 +02:00
Botond Dénes
1554b94b78 mutation_reader: combined_reader: convert reader_selector to v2
2021-12-20 09:29:05 +02:00
Botond Dénes
71835bdee1 mutation_reader: convert combined reader to v2
The meat of the change is on the fragment merger level, which is now
also responsible for merging range tombstone changes. The fragment
producers are just mechanically converted to v2 by appending `_v2` to
the appropriate type names.
The beauty of this approach is that range tombstone merging happens in a
single place, shared by all fragment producers (there are 2 of them).

Selectors and factory functions are left as v1 for now, they will be
converted incrementally by the next patches.
2021-12-20 09:29:05 +02:00
Asias He
eba4a4fba4 repair: Allow ignoring dead nodes for replace operation
Consider

1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3

We want to replace the dead nodes n2 and n3 to fix the cluster to have 5
running nodes.

Replace operation in step 3 will fail because n3 is down.
We would see errors like below:

replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.

In the above example, currently, there is no way to replace any of the
dead nodes.

Users can either fix one of the dead nodes and run replace or run
removenode operation to remove one of the dead nodes then run replace
and run bootstrap to add another node.

Fixing dead nodes is always the best solution but it might not be
possible. Running removenode operation is not better than running
replace operation (with best effort by ignoring the other dead node) in
terms of data consistency. In addition, users have to run bootstrap
operation to add back the removed node. So, allowing replacing in such
case is a clear win.

This patch adds the --ignore-dead-nodes-for-replace option to allow running
the replace operation in best-effort mode. Please note, use this option
only if the dead nodes are completely broken and down, and there is no
way to fix the node and bring it back. This also means the user has to
make sure the ignored dead nodes specified are really down to avoid any
data consistency issue.
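
The check this option relaxes can be sketched as follows (Python for brevity; `check_replace_allowed` and its parameters are hypothetical names, and the error text only paraphrases the message quoted above):

```python
def check_replace_allowed(nodes_needed, live_nodes, ignored_dead_nodes=frozenset()):
    """Return the down nodes being ignored, or raise if any other node is down."""
    down = set(nodes_needed) - set(live_nodes)
    blocking = down - set(ignored_dead_nodes)
    if blocking:
        raise RuntimeError(
            f"Nodes={sorted(blocking)} needed for replace operation are down"
        )
    return sorted(down)


# n6 replaces n2 while n3 is also down.
needed = {"n1", "n3", "n4", "n5"}
live = {"n1", "n4", "n5"}

try:
    check_replace_allowed(needed, live)
except RuntimeError as e:
    print("refused:", e)

# Explicitly ignoring n3 lets the replace proceed with best effort.
print(check_replace_allowed(needed, live, ignored_dead_nodes={"n3"}))
```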

Fixes #9757

Closes #9758
2021-12-20 00:49:03 +02:00
Avi Kivity
7bdc999bba service: paxos_state: wean off get_local_storage_proxy()
Instead of calling get_local_storage_proxy in paxos_state, get it from the
caller (who is, in fact, storage_proxy or one of its components).

Some of the callers, although they are storage_proxy components, don't
have a storage_proxy reference handy and so they ignominiously call
get_local_storage_proxy() themselves. This will be adjusted later.

The other callers who are, in fact, storage_proxy, have to take special
care not to cross a shard boundary. When they do, smp::submit_to()
is converted to sharded::invoke_on() in order to get the correct local instance.

Test: unit (dev)

Closes #9824
2021-12-20 00:31:13 +02:00
Nadav Har'El
252ce8afd4 Merge 'Extend stop compaction api' from Benny Halevy
Allow stopping compaction by type on a given keyspace and list of tables.

Also add api unit test suite that tests the existing `stop_compaction` api
and the new `stop_keyspace_compaction` api.

Fixes #9700

Closes #9746

* github.com:scylladb/scylla:
  api: storage_service: validate_keyspace: improve exception error message
  api: compaction_manager: add stop_keyspace_compaction
  api: storage_service: expose validate_keyspace and parse_tables
  api: compaction_manager: stop_compaction: fix type description
  compaction_manager: stop_compaction: expose optional table*
  test: api: add basic compaction_manager test
2021-12-20 00:18:46 +02:00
Pavel Emelyanov
d88ae7edae Merge 'migration_manager: retire global storage proxy refs' from Avi Kivity
Replace get_storage_proxy() and get_local_storage_proxy() with
constructor-provided references. Some unneeded cases were removed.

Test: unit (dev)

Closes #9816

* github.com:scylladb/scylla:
  migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference
  migration_manager: don't keep storage_proxy alive during schema_check verb
  mm: don't capture storage proxy shared_ptr during background schema merge
  mm: remove stats on schema version get
2021-12-17 17:53:08 +03:00
Raphael S. Carvalho
f508f54f3e table: move min_compaction_threshold() and compaction_enforce_min_threshold() into table_state
Compaction specific methods can be implemented in table_state only,
as they aren't needed elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211214191822.164223-1-raphaelsc@scylladb.com>
2021-12-17 10:00:31 +02:00
Piotr Sarna
f49c20aa24 thrift: drop obtaining incorrect permits
The thrift layer started partially having admission control
after commit ef1de114f0,
but code inspection suggests that it might cause use-after-free
in a few cases, when a permit is obtained more than once per
handling - due to the fact that some functions tail-called other
functions, which also obtain a permit.
These extraneous permits are not taken anymore.

Tests: "please trust me" + cassandra-stress in thrift mode
Message-Id: <ac5d711288b22c5fed566937722cceeabc234e16.1639394937.git.sarna@scylladb.com>
2021-12-17 09:35:24 +02:00
Avi Kivity
7c23ed888d Update tools/jmx submodule (dropping unneeded dependencies)
* tools/jmx 2c43d99...53f7f55 (1):
  > pom.xml: drop unneeded logging dependencies
2021-12-16 21:54:36 +02:00
Avi Kivity
a97731a7e5 migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference
A static helper also gained a storage_proxy parameter.
2021-12-16 21:05:47 +02:00
Avi Kivity
aca9029c24 migration_manager: don't keep storage_proxy alive during schema_check verb
The schema_check verb doesn't leak tasks, so when the verb is
unregistered it will be drained. So protection for storage_proxy lifetime
can be removed.
2021-12-16 21:04:27 +02:00
Avi Kivity
26c656f6ed mm: don't capture storage proxy shared_ptr during background schema merge
The definitions_update() verb captures a shared_ptr to storage_proxy
to keep it alive while the background task executes.

This was introduced in (2016!):

commit 1429213b4c
Author: Pekka Enberg <penberg@scylladb.com>
Date:   Mon Mar 14 17:57:08 2016 +0200

    main: Defer migration manager RPC verb registration after commitlog replay

    Defer registering migration manager RPC verbs after commitlog has
    been replayed so that our own schema is fully loaded before other
    nodes start querying it or sending schema updates.
    Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>

when moving this code from storage_proxy.cc.

Later, better protection with a gate was added:

commit 14de126ff8
Author: Pavel Emelyanov <xemul@scylladb.com>
Date:   Mon Mar 16 18:03:48 2020 +0300

    migration_manager: Run background schema merge in gate

    The call for merge_schema_from in some cases is run in the
    background and thus is not aborted/waited on shutdown. This
    may result in use-after-free one of which is

    merge_schema_from
     -> read_schema_for_keyspace
         -> db::system_keyspace::query
             -> storage_proxy::query
                 -> query_partition_key_range_concurrent

    in the latter function the proxy._token_metadata is accessed,
    while the respective object can be already free (unlike the
    storage_proxy itself that's still leaked on shutdown).

    Related bug: #5903, #5999 (cannot reproduce though)
    Tests: unit(dev), manual start-stop
           dtest(consistency.TestConsistency, dev)
           dtest(schema_management, dev)

    Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
    Reviewed-by: Pekka Enberg <penberg@scylladb.com>
    Message-Id: <20200316150348.31118-1-xemul@scylladb.com>

Since now the task execution is protected by the gate and
therefore migration_manager lifetime (which is contained within
that of storage_proxy, as it is constructed afterwards), capturing
the shared_ptr is not needed, and we therefore remove it, as
it uses the deprecated global storage_proxy accessors.
2021-12-16 21:01:06 +02:00
Botond Dénes
7db31e1bdb mutation_reader: combined_reader: attach stream_id to mutation_fragments
The fragment producer component of the combined reader returns a batch
of fragments on each call to operator()(). These fragments are merged
into a single one by the fragment merger. This patch adds a stream id to
each fragment in the batch which identifies the stream (reader) it
originates from. This will be used in the next patches to associate
range-tombstone-changes originating from the same stream with each other.
2021-12-16 14:57:49 +02:00
Botond Dénes
c193bbed82 flat_mutation_reader_v2: add v2 version of empty reader
Convert the v1 implementation to v2, downgrade to v1 in the existing
`make_empty_flat_reader()`.
2021-12-16 14:57:49 +02:00
Botond Dénes
f15f4952be test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation
Currently the test assumes that fragments represent weakly monotonic
upper bounds and therefore unconditionally overwrites the upper-bound on
receiving each fragment. Range tombstones however violate this as a
range tombstone with a smaller position (lower bound) may have a higher
upper bound than some or all fragments that follow it in the stream.
This causes test failures after converting the combined reader to
v2, but not before, no idea why.
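
The difference between overwriting and accumulating the upper bound can be shown with a small Python sketch. Fragments are modelled as `(position, upper_bound)` pairs; taking a running maximum is one plausible way to make the calculation robust and is an assumption here, not necessarily what the test does:

```python
def upper_bound_overwrite(fragments):
    """Old assumption: every fragment's upper bound is >= the previous one."""
    bound = None
    for _, upper in fragments:
        bound = upper
    return bound


def upper_bound_max(fragments):
    """Robust variant: keep the largest upper bound seen so far."""
    bound = None
    for _, upper in fragments:
        bound = upper if bound is None else max(bound, upper)
    return bound


# The second fragment is a range tombstone starting at 2 and ending at 9,
# so it outlives the row at position 3 that follows it in the stream.
fragments = [(1, 1), (2, 9), (3, 3)]
print(upper_bound_overwrite(fragments), upper_bound_max(fragments))
```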
2021-12-16 14:57:49 +02:00
Nadav Har'El
9ae98dbe92 Merge 'Reduce boot time for dtest setup' from Asias He
This patch helps to speed up node boot up for test setups like dtest.

Nadav reported

```
With Asias's two patches to Scylla, and my patch to enable it in dtest:

Boot time of 5 nodes is now down to 9 seconds!

Remember we started this exercise with 214 seconds? :-)
```

Closes #9808

* github.com:scylladb/scylla:
  storage_service: Recheck tokens before throw in storage_service::bootstrap
  gossip: Do not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero
2021-12-16 13:44:42 +02:00
Pavel Emelyanov
b2a62d2b59 Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec
Appending an empty range adjacent to an existing range tombstone would
not deoverlap (by dropping the empty range tombstone), resulting in a
different (non-canonical) result depending on the order of appending.

Suppose that range tombstone [a, b] covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b] split around position x.

Appending [a, x) then [x, b) then [x, x) would give [a, b)
Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b)

The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical.
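
A toy model of the fix, in Python for brevity. It assumes half-open integer ranges that all carry the same tombstone (so adjacent ranges may be merged); `append_range` is an illustrative name and not the range_tombstone_list API:

```python
def append_range(ranges, start, end):
    """Append half-open [start, end) to the list, keeping it canonical."""
    if start >= end:
        # Empty range such as [x, x): drop it instead of storing it.
        return ranges
    if ranges and ranges[-1][1] == start:
        # Adjacent to the last range: extend it rather than adding an entry.
        ranges[-1] = (ranges[-1][0], end)
    else:
        ranges.append((start, end))
    return ranges


a, x, b = 0, 5, 10

order1 = []
for r in [(a, x), (x, b), (x, x)]:
    append_range(order1, *r)

order2 = []
for r in [(a, x), (x, x), (x, b)]:
    append_range(order2, *r)

print(order1, order2)
```

Both orders end up with the single range covering a to b, which is the canonical result the fix is after.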

Fixes #9661

Closes #9764

* github.com:scylladb/scylla:
  range_tombstone_list: Deoverlap adjacent empty ranges
  range_tombstone_list: Convert to work in terms of position_in_partition
2021-12-16 10:00:40 +03:00
Avi Kivity
c40043b142 mm: remove stats on schema version get
2021-12-15 18:56:18 +02:00
Nadav Har'El
d323b82cf6 Merge 'Introduce data_dictionary module' from Avi Kivity
The full user-defined structure of the database (keyspaces,
tables, user-defined types, and similar metadata, often known
as the schema in other databases) is needed by much of the
front-end code. But in Scylla it is deeply intertwined with the
replica data management code - ::database, ::keyspace, and
::table. Not only does the front-end not need data access, it
cannot get correct data via these objects since they represent
just one replica out of many.

This dual-role is a frequent cause of recompilations. It was solved
to some degree by forward declarations, but there is still a lot
of incidental dependencies.

To solve this, we introduce a data_dictionary module (and
namespace) to exclusively deal with greater schema metadata.
It is an interface, with a backing implementation by the existing code,
so it doesn't add a new source of truth. The plan is to allow mock
implementations for testing as well.
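
The shape of this design can be sketched as below, in Python for brevity. All class and method names (`DataDictionary`, `has_table`, `ReplicaBackedDictionary`, `MockDictionary`, `validate_select`) are hypothetical and chosen for illustration; they do not reflect the actual data_dictionary API:

```python
from abc import ABC, abstractmethod


class DataDictionary(ABC):
    """Schema-only interface that front-end code depends on."""

    @abstractmethod
    def has_table(self, keyspace, table):
        ...


class ReplicaDatabase:
    """Stand-in for the replica-side object holding both schema and data."""

    def __init__(self):
        self.tables = {("ks", "t"): {"rows": []}}


class ReplicaBackedDictionary(DataDictionary):
    """Backed by the existing object, so there is no second source of truth."""

    def __init__(self, db):
        self._db = db

    def has_table(self, keyspace, table):
        return (keyspace, table) in self._db.tables


class MockDictionary(DataDictionary):
    """Test double that needs no replica at all."""

    def __init__(self, names):
        self._names = set(names)

    def has_table(self, keyspace, table):
        return (keyspace, table) in self._names


def validate_select(dictionary, keyspace, table):
    """Front-end code: only sees the interface, never the data."""
    if not dictionary.has_table(keyspace, table):
        raise LookupError(f"unknown table {keyspace}.{table}")
    return True


print(validate_select(ReplicaBackedDictionary(ReplicaDatabase()), "ks", "t"))
print(validate_select(MockDictionary([("ks", "t")]), "ks", "t"))
```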

Test: unit (dev, release, debug).

Closes #9783

* github.com:scylladb/scylla:
  cql3, related: switch to data_dictionary
  test: cql_test_env: provide access to data_dictionary
  storage_proxy: provide access to data_dictionary
  database: implement data_dictionary interface
  data_dictionary: add database/keyspace/table objects
  data_dictionary: move keyspace_metadata to data_dictionary
  data_dictionary: move user_types_metadata to new module data_dictionary
2021-12-15 18:29:28 +02:00
Avi Kivity
87917d2536 Merge "gms: gossiper: coroutinize a few small functions" from Pavel S
"
Start converting small functions in gossiper code
from using `seastar::thread` context to coroutines.

For now, the changes are quite trivial.
Later, larger code fragments will be converted
to eliminate uses of `seastar::async` function calls.

Moving the code to coroutines makes the code a bit
more readable and also makes it immediately evident that a given
function is async just by looking at the signature (for
example, for void-returning functions, a coroutine
will return `future<>` instead of `void` in case of
a seastar::thread-using function).

Tests: unit(dev)
"

* 'coro_gossip_v1' of https://github.com/ManManson/scylla:
  gms: gossiper: coroutinize `maybe_enable_features`
  gms: gossiper: coroutinize `wait_alive`
  gms: gossiper: coroutinize `add_saved_endpoint`
  gms: gossiper: coroutinize `evict_from_membership`
2021-12-15 16:02:18 +02:00
Avi Kivity
d768e9fac5 cql3, related: switch to data_dictionary
Stop using database (and including database.hh) for schema related
purposes and use data_dictionary instead.

data_dictionary::database::real_database() is called from several
places, for these reasons:

 - calling yet-to-be-converted code
 - callers with a legitimate need to access data (e.g. system_keyspace)
   but with the ::database accessor removed from query_processor.
   We'll need to find another way to supply system_keyspace with
   data access.
 - to gain access to the wasm engine for testing whether
   user-defined functions compile. We'll have to find another way to
   do this as well.

The change is a straightforward replacement. One case in
modification_statement had to change a capture, but everything else
was just a search-and-replace.

Some files that lost "database.hh" gained "mutation.hh", which they
previously had access to through "database.hh".
2021-12-15 13:54:23 +02:00
Avi Kivity
399e2895f1 test: cql_test_env: provide access to data_dictionary
Allow tests to have access to the data_dictionary.
2021-12-15 13:54:18 +02:00
Avi Kivity
c2da20484d storage_proxy: provide access to data_dictionary
Probably storage_proxy is not the correct place to supply
data_dictionary, but it is available to practically all of
the coordinator code, so it is convenient.
2021-12-15 13:54:08 +02:00
Avi Kivity
1de0a4b823 database: implement data_dictionary interface
Implement the new data_dictionary interface using the existing
::database, ::keyspace, and ::table classes. The implementation
is straightforward. This will allow the coordinator code to access
the full schema without depending on the gnarly bits that compose
::database, like reader_concurrency_semaphore or the backlog
controller.
2021-12-15 13:53:46 +02:00