Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d, it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Split long running test
test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests
for _plain and _reverse.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14447
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes#14452
Reduce test string value size, parallelize inserts, and use a prepared statement,
The debug running time for this tests is reduced from 13:18 to 7:52.
Refs #13905Closes#14380
* github.com:scylladb/scylladb:
test/boost/index_with_paging_test: parallel insert
test/boost/index_with_paging_test: prepared statement
test/boost/index_with_paging_test: reduce running time
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.
The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.
Broadcast table commands are not mutations, so they are never combined.
* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
test: add test for group0 raft command merging
service: raft: respect max mutation size limit when persisting raft entries
group0_state_machine: merge commands before applying them whenever possible
Parallelize inserts for long-running test_index_with_paging.
Run time in debug mode reduced by 1 minute 48 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Reduce test string value size for test_index_with_paging from 4096 to
100. With 100 bytes it should make the base row significantly larger
than the key so the test will exercise both types of paging in the
scanning code.
The debug running time for this tests is reduced from 9 minutes to 6
minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.
The code guranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.
There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.
Fixes#13491
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).
Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.
To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.
Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
```
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()() const::{lambda()https://github.com/scylladb/scylladb/issues/1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
`INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`
to
`INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`
Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.
Closes#14364
* github.com:scylladb/scylladb:
table: Optimize creation of reader excluding staging for view building
view_update_generator: Dump throughput and duration for view update from staging
utils: Extract pretty printers into a header
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s
to
INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s
Refs #14089.
Fixes#14244.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch
Refs #2252, #6033Closes#12917
Split long running test_aggregate_functions to one case per type.
This allows test.py to run them in parallel.
Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14368
Split long running test test_schema_changes in 3 parts, one for each
writable_sstable_versions so it can be run in parallel by test.py.
Add static checks to alert if the array of types changed.
Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14367
Remove previous configuration blocking parallel run.
Test cases run fine in local debug.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14369
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.
They run with x_log2_compaction_groups of 0 and 1, each one taking from
10 to 15 minutes each in debug mode, for a total of 28 and 22 minutes.
Split the test cases to run with 0 and 1, so test.py can run them in
parallel.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#14356
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:
- moving some restrictions-related definitions to
the existing expr/restrictions.hh
- moving evaluation related names to a new header
expr/evaluate.hh
- move utilities to a new header
expr/expr-utilities.hh
expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
This reverts commit 562087beff.
The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes#10459
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make the life easier, support was added to obtain table pointer from the schema pointer:
```
schema_ptr s;
s->table().shard_of(...)
```
Closes#13939
* github.com:scylladb/scylladb:
locator: network_topology_startegy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14294
Uses a simple algorihtm for allocating shards which chooses
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
This is not strictly necessary, as the multishard reader will be later
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use the erm->get_sharder() instead of
schema::get_sharder().
schema::get_sharder() does not use the correct sharder for
tablet-based tables. Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.
Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:
start_token = sharder.token_for_next_shard(end_token, shard);
If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.
Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
This will make it easier to access table proprties in places which
only have schema_ptr. This is in particular useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.
Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.
It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.
Fixes#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182Closes#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was changed in CASSANDRA-14592
for consistency with the preference for dead cells over live cells,
as expiring cells will become tombstones at a future time
and then they'd win over live cells with the same timestamp,
hence they should win also before expiration.
In addition, comparing the cell value before expiration
can lead to unintuitive corner cases where rewriting
a cell using the same timestamp but different TTL
may cause scylla to return the cell with null value
if it expired in the meanwhile.
Also, when multiple columns are written using two upserts
using the same write timestamp but with different expiration,
selecting cells by their value may return a mixed result
where each cell is selected individually from either upsert,
by picking the cells with the largest values for each column,
while using the expiration time to break tie will lead
to a more consistent results where a set of cell from
only one of the upserts will be selected.
Fixesscylladb/scylladb#14182
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, it is hard to tell which of the many sub-cases
fail in this unit test, in case any of them fails.
This change uses logging in debug and trace level
to help with that by reproducing the error
with --logger-log-level testlog=trace
(The cases are deterministic so reproducing should not
be a problem)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.
However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.
This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.
This series fixes the bug and adds a unit test which reproduces the issue.
Fixes#14110Closes#14305
* github.com:scylladb/scylladb:
test: boost/row_cache_test: add a reproducer for #14110
cache_flat_mutation_reader: use the correct schema in prepare_hash
mutation: mutation_cleaner: add pause()
There are tons of wrappers that help test cases make sstables for their needs. And lots of code duplication in test cases that do parts of those helpers' work on their own. This set cleans some bits of those
Closes#14280
* github.com:scylladb/scylladb:
test/utils: Generalize making memtable from vector<mutation>
test/util: Generalize make_sstable_easy()-s
test/sstable_mutation: Remove useless helper
test/sstable_mutation: Make writer config in make_sstable_mutation_source()
test/utils: De-duplicate make_sstable_containing-s
test/sstable_compaction: Remove useless one-line local lambda
test/sstable_compaction: Simplify sstable making
test/sstables*: Make sstable from vector of mutations
test/mutation_reader: Remove create_sstable() helper from test
It's excessive, test case that needs it can get storage prefix without
this fancy wrapper-helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14273
Commit 4e205650 (test: Verify correctness of sstable::bytes_on_disk())
added a test to verify that sstable::bytes_on_disk() is equal to the
real size of real files. The same test case makes sense for S3-backed
sstables as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14272
In preparation for converting selectors to evaluate expressions,
add support for evaluating column_mutation_attribute (representing
the WRITETIME/TTL pseudo-functions).
A unit test is added.
Fixes#12906Closes#14287
* github.com:scylladb/scylladb:
test: expr: test evaluation of column_mutation_attribute
test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps
cql3: expr: evaluate() column_mutation_attribute
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.
Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.
Refs: #14283
There's no way to evaluate a column_mutation_attribute via CQL
yet (the only user uses old-style cql3::selection::selector), so
we only supply a unit test.
Implement `expr:valuate()` for `expr::field_selection`.
`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as a `field_selection`, which selects the field `a`.
Evaluating such an expression consists of finding the right element's value in a serialized UDT value and returning it.
Note that it's still not possible to use `field_selection` inside the `WHERE` clause. Enabling it would require changes to the grammar, as well as query planning, Current `statement_restrictions` just reacts with `on_internal_error` when it encounters a `field_selection`.
Nonetheless it's a step towards relaxing the grammar, and now it's finally possible to evaluate all kinds of prepared expressions (#12906)
Fixes: https://github.com/scylladb/scylladb/issues/12906Closes#14235
* github.com:scylladb/scylladb:
boost/expr_test: test evaluate(field_selection)
cql3/expr: fix printing of field_selection
cql3/expression: implement evaluate(field_selection)
types/user: modify idx_of_field to use bytes_view
column_identifer: add column_identifier_raw::text()
types: add read_nth_user_type_field()
types: add read_nth_tuple_element()
There are two make_sstable_mutation_source() helpers that call one
another and test cases only need one of them, so leave just one that's
in use.
Also don't pass env's tempdir to make_sstable() util call, it can get
env's tempdir on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>