Commit Graph

2632 Commits

Author SHA1 Message Date
Tomasz Grabiec
f2ed9fcd7e schema_mutations, migration_manager: Ignore empty partitions in per-table digest
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d, it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.
2023-07-03 23:06:55 +02:00
Alejo Sanchez
520bd90008 test/boost/memtable_test: split test plain/reverse
Split long running test
test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests
for _plain and _reverse.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14447
2023-07-03 15:20:12 +03:00
Piotr Dulikowski
ee9bfb583c combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452
2023-06-30 12:07:13 +03:00
Nadav Har'El
dd63169077 Merge 'test/boost/index_with_paging_test: reduce running time' from Alecco
Reduce test string value size, parallelize inserts, and use a prepared statement,

The debug running time for this tests is reduced from 13:18 to 7:52.

Refs #13905

Closes #14380

* github.com:scylladb/scylladb:
  test/boost/index_with_paging_test: parallel insert
  test/boost/index_with_paging_test: prepared statement
  test/boost/index_with_paging_test: reduce running time
2023-06-29 10:45:01 +03:00
Kamil Braun
b912eeade5 Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.

Broadcast table commands are not mutations, so they are never combined.

* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
  test: add test for group0 raft command merging
  service: raft: respect max mutation size limit when persisting raft entries
  group0_state_machine: merge commands before applying them whenever possible
2023-06-28 17:21:07 +02:00
Alejo Sanchez
d4697ed21e test/boost/index_with_paging_test: parallel insert
Parallelize inserts for long-running test_index_with_paging.

Run time in debug mode reduced by 1 minute 48 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 16:11:58 +02:00
Alejo Sanchez
70a3179888 test/boost/index_with_paging_test: prepared statement
Prepare statement for insert.

Run time in debug mode reduced by 9 seconds.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 14:49:21 +02:00
Alejo Sanchez
48d24269f1 test/boost/index_with_paging_test: reduce running time
Reduce test string value size for test_index_with_paging from 4096 to
100. With 100 bytes it should make the base row significantly larger
than the key so the test will exercise both types of paging in the
scanning code.

The debug running time for this tests is reduced from 9 minutes to 6
minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 13:55:52 +02:00
Kamil Braun
96bc78905d readers: evictable_reader: don't accidentally consume the entire partition
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.

The code guranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.

There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.

Fixes #13491
2023-06-27 14:37:29 +02:00
Kamil Braun
5800ce8ddd test: flat_mutation_reader_assertions: squash r_t_cs with the same position
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).

Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.

To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.

Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
2023-06-27 14:37:25 +02:00
Gleb Natapov
945f476363 test: add test for group0 raft command merging
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
2023-06-27 14:59:55 +03:00
Botond Dénes
f5e3b8df6d Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
```
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()() const::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
`INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`

to
`INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`

Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.

Closes #14364

* github.com:scylladb/scylladb:
  table: Optimize creation of reader excluding staging for view building
  view_update_generator: Dump throughput and duration for view update from staging
  utils: Extract pretty printers into a header
2023-06-27 07:25:30 +03:00
Raphael S. Carvalho
1d8cb32a5d table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 22:30:39 -03:00
Raphael S. Carvalho
83c70ac04f utils: Extract pretty printers into a header
Can be easily reused elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 21:58:20 -03:00
Alexey Novikov
ca4e7f91c6 compact and remove expired rows from cache on read
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch

Refs #2252, #6033

Closes #12917
2023-06-26 15:29:01 +02:00
Alejo Sanchez
4999cbc1cf test/boost/cql_functions_test: split long running tests
Split long running test_aggregate_functions to one case per type.

This allows test.py to run them in parallel.

Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14368
2023-06-26 11:29:36 +03:00
Alejo Sanchez
8b1968cfbb test/boost/schema_changes_test: split long-running test
Split long running test test_schema_changes in 3 parts, one for each
writable_sstable_versions so it can be run in parallel by test.py.

Add static checks to alert if the array of types changed.

Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14367
2023-06-26 11:24:07 +03:00
Alejo Sanchez
633f026d63 test/boost/memtable_test: allow parallel run
Remove previous configuration blocking parallel run.

Test cases run fine in local debug.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14369
2023-06-26 11:23:43 +03:00
Alejo Sanchez
3cbfd863eb test/boost/database_test: split long running tests
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.

They run with x_log2_compaction_groups of 0 and 1, each one taking from
10 to 15 minutes each in debug mode, for a total of 28 and 22 minutes.

Split the test cases to run with 0 and 1, so test.py can run them in
parallel.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14356
2023-06-26 11:20:27 +03:00
Avi Kivity
b858a4669d cql3: expr: break up expression.hh header
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:

 - moving some restrictions-related definitions to
   the existing expr/restrictions.hh
 - moving evaluation related names to a new header
   expr/evaluate.hh
 - move utilities to a new header
   expr/expr-utilities.hh

expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
2023-06-22 14:21:03 +03:00
Kefu Chai
f014ccf369 Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai""
This reverts commit 562087beff.

The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.

Fixes #10459
2023-06-21 13:02:40 +03:00
Avi Kivity
e233f471b8 Merge 'Respect tablet shard assignment' from Tomasz Grabiec
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet

After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.

To make the life easier, support was added to obtain table pointer from the schema pointer:

```
schema_ptr s;
s->table().shard_of(...)
```

Closes #13939

* github.com:scylladb/scylladb:
  locator: network_topology_startegy: Allocate shards to tablets
  locator: Store node shard count in topology
  service: topology: Extract topology updating to a lambda
  test: Move test_tablets under topology_experimental
  sstables: Add trace-level logging related to shard calculation
  schema: Catch incorrect uses of schema::get_sharder()
  dht: Rename dht::shard_of() to dht::static_shard_of()
  treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
  storage_proxy: Avoid multishard reader for tablets
  storage_proxy: Obtain shard from erm in the read path
  db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
  forward_service: Use table sharder
  alternator: Use table sharder
  db: multishard: Obtain sharder from erm
  sstable_directory: Improve trace-level logging
  db: table: Introduce shard_of() helper
  db: Use table sharder in compaction
  sstables: Compute sstable shards using sharder from erm when loading
  sstables: Generate sharding metadata using sharder from erm when writing
  test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
  dht: Make split_range_to_single_shard() prepared for tablet sharder
  sstables: Move compute_shards_for_this_sstable() to load()
  dht: Take sharder externally in splitting functions
  locator: Make sharder accessible through effective_replication_map
  dht: sharder: Document guarantees about mapping stability
  tablets: Implement tablet sharder
  tablets: Include pending replica in get_shard()
  dht: sharder: Introduce next_shard()
  db: token_ring_table: Filter out tablet-based keyspaces
  db: schema: Attach table pointer to schema
  schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
  schema_registry: Make learn(schema_ptr) attach entry to the target schema
  test: lib: cql_test_env: Expose feature_service
  test: Extract throttle object to separate header
2023-06-21 10:20:41 +03:00
Calle Wilund
f18e967939 storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14294
2023-06-21 10:08:27 +03:00
Tomasz Grabiec
ebdebb982b locator: network_topology_startegy: Allocate shards to tablets
Uses a simple algorihtm for allocating shards which chooses
least-loaded shard on a given node, encapsulated in load_sketch.

Takes load due to current tablet allocation into account.

Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
29cbdb812b dht: Rename dht::shard_of() to dht::static_shard_of()
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
21198e8470 treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e48ec6fed3 db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
d92287f997 db: multishard: Obtain sharder from erm
This is not strictly necessary, as the multishard reader will be later
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use the erm->get_sharder() instead of
schema::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
36da062bcb db: Use table sharder in compaction 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ad983ac23d sstables: Compute sstable shards using sharder from erm when loading
schema::get_sharder() does not use the correct sharder for
tablet-based tables.  Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
36e12020b9 test: partitioner: Test split_range_to_single_shard() on tablet-like sharder 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
28b972a588 dht: Make split_range_to_single_shard() prepared for tablet sharder
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.

Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:

  start_token = sharder.token_for_next_shard(end_token, shard);

If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
390bcf3fae dht: Take sharder externally in splitting functions
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.

Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
22ab100b41 tablets: Implement tablet sharder 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e44e6033d8 tablets: Include pending replica in get_shard()
We need to move get_shard() from tablet_info to tablet_map in order to
have access to transition_info.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
2303466375 db: schema: Attach table pointer to schema
This will make it easier to access table proprties in places which
only have schema_ptr. This is in particular useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.

Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.

It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ad6d2b42f2 test: Extract throttle object to separate header 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
87b4606cd6 Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability
2023-06-20 12:11:48 +02:00
Benny Halevy
761d62cd82 compare_atomic_cell_for_merge: compare value last for live cells
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was changed in CASSANDRA-14592
for consistency with the preference for dead cells over live cells,
as expiring cells will become tombstones at a future time
and then they'd win over live cells with the same timestamp,
hence they should win also before expiration.

In addition, comparing the cell value before expiration
can lead to unintuitive corner cases where rewriting
a cell using the same timestamp but different TTL
may cause scylla to return the cell with null value
if it expired in the meanwhile.

Also, when multiple columns are written using two upserts
using the same write timestamp but with different expiration,
selecting cells by their value may return a mixed result
where each cell is selected individually from either upsert,
by picking the cells with the largest values for each column,
while using the expiration time to break tie will lead
to a more consistent results where a set of cell from
only one of the upserts will be selected.

Fixes scylladb/scylladb#14182

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Benny Halevy
ec034b92c0 mutation_test: test_cell_ordering: improve debuggability
Currently, it is hard to tell which of the many sub-cases
fail in this unit test, in case any of them fails.

This change uses logging in debug and trace level
to help with that by reproducing the error
with --logger-log-level testlog=trace
(The cases are deterministic so reproducing should not
be a problem)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Tomasz Grabiec
5fa08adc88 Merge 'cache_flat_mutation_reader: use the correct schema in prepare_hash' from Michał Chojnowski
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.

However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.

This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.

This series fixes the bug and adds a unit test which reproduces the issue.

Fixes #14110

Closes #14305

* github.com:scylladb/scylladb:
  test: boost/row_cache_test: add a reproducer for #14110
  cache_flat_mutation_reader: use the correct schema in prepare_hash
  mutation: mutation_cleaner: add pause()
2023-06-20 01:30:11 +02:00
Michał Chojnowski
02bcb5d539 test: boost/row_cache_test: add a reproducer for #14110 2023-06-19 22:50:46 +02:00
Botond Dénes
bd7a3e5871 Merge 'Sanitize sstables-making utils in tests' from Pavel Emelyanov
There are tons of wrappers that help test cases make sstables for their needs. And lots of code duplication in test cases that do parts of those helpers' work on their own. This set cleans some bits of those

Closes #14280

* github.com:scylladb/scylladb:
  test/utils: Generalize making memtable from vector<mutation>
  test/util: Generalize make_sstable_easy()-s
  test/sstable_mutation: Remove useless helper
  test/sstable_mutation: Make writer config in make_sstable_mutation_source()
  test/utils: De-duplicate make_sstable_containing-s
  test/sstable_compaction: Remove useless one-line local lambda
  test/sstable_compaction: Simplify sstable making
  test/sstables*: Make sstable from vector of mutations
  test/mutation_reader: Remove create_sstable() helper from test
2023-06-19 14:05:29 +03:00
Pavel Emelyanov
6bec03f96f test: Remove sstable_utils' storage_prefix() helper
It's excessive, test case that needs it can get storage prefix without
this fancy wrapper-helper

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14273
2023-06-19 13:51:04 +03:00
Pavel Emelyanov
1a332ef5e2 test: Check sstable bytes correctness on S3 too
Commit 4e205650 (test: Verify correctness of sstable::bytes_on_disk())
added a test to verify that sstable::bytes_on_disk() is equal to the
real size of real files. The same test case makes sense for S3-backed
sstables as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14272
2023-06-19 13:47:31 +03:00
Nadav Har'El
ac3d0d4460 Merge 'cql3: expr: support evaluate(column_mutation_attribute)' from Avi Kivity
In preparation for converting selectors to evaluate expressions,
add support for evaluating column_mutation_attribute (representing
the WRITETIME/TTL pseudo-functions).

A unit test is added.

Fixes #12906

Closes #14287

* github.com:scylladb/scylladb:
  test: expr: test evaluation of column_mutation_attribute
  test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps
  cql3: expr: evaluate() column_mutation_attribute
2023-06-19 11:11:49 +03:00
Botond Dénes
562087beff Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.

Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.

Refs: #14283
2023-06-19 08:49:27 +03:00
Avi Kivity
0f98e9f8c8 test: expr: test evaluation of column_mutation_attribute
There's no way to evaluate a column_mutation_attribute via CQL
yet (the only user uses old-style cql3::selection::selector), so
we only supply a unit test.
2023-06-18 22:47:46 +03:00
Nadav Har'El
97d444bbf7 Merge 'cql3/expression: implement evaluate(field_selection) ' from Jan Ciołek
Implement `expr:valuate()` for `expr::field_selection`.

`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as a `field_selection`, which selects the field `a`.

Evaluating such an expression consists of finding the right element's value in a serialized UDT value and returning it.

Note that it's still not possible to use `field_selection` inside the `WHERE` clause. Enabling it would require changes to the grammar, as well as query planning, Current `statement_restrictions` just reacts with `on_internal_error` when it encounters a `field_selection`.
Nonetheless it's a step towards relaxing the grammar, and now it's finally possible to evaluate all kinds of prepared expressions (#12906)

Fixes: https://github.com/scylladb/scylladb/issues/12906

Closes #14235

* github.com:scylladb/scylladb:
  boost/expr_test: test evaluate(field_selection)
  cql3/expr: fix printing of field_selection
  cql3/expression: implement evaluate(field_selection)
  types/user: modify idx_of_field to use bytes_view
  column_identifer: add column_identifier_raw::text()
  types: add read_nth_user_type_field()
  types: add read_nth_tuple_element()
2023-06-18 11:08:25 +03:00
Pavel Emelyanov
85310bc043 test/sstable_mutation: Remove useless helper
There are two make_sstable_mutation_source() helpers that call one
another and test cases only need one of them, so leave just one that's
in use.

Also don't pass env's tempdir to make_sstable() util call, it can get
env's tempdir on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:21:40 +03:00