On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closesscylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
(cherry picked from commit 519e167611)
Backported from #20985 to 6.0
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#21249
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes https://github.com/scylladb/scylladb/issues/20626.
(cherry picked from commit 999f1f1318)
(cherry picked from commit 38ce2c605d)
Refs https://github.com/scylladb/scylladb/pull/20737Closesscylladb/scylladb#21201
* github.com:scylladb/scylladb:
tablet: Fix single-sstable split when attaching new unsplit sstables
replica: Fix tablet split execute after restart
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
fixes#20043
(cherry picked from commit f09fe4f351)
(cherry picked from commit e5bf376cbc)
(cherry picked from commit 1863ccd900)
Refs #21020Closesscylladb/scylladb#21112
* github.com:scylladb/scylladb:
tablets: Validate system.tablets update
group0_client: Introduce change validation
group0_client: Add shared_token_metadata dependency
replica/tablets: Add to_tablet_metadata_(row_)?key helpers
replica/tablets: extract tablet_replica_set_from_cell()
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.
The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.
However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.
Refs scylladb/scylladb#17575
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)
Closesscylladb/scylladb#20724
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes https://github.com/scylladb/scylladb/issues/20044.
Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.
(cherry picked from commit bcd358595f)
(cherry picked from commit 93815e0649)
Refs https://github.com/scylladb/scylladb/pull/20939Closesscylladb/scylladb#21204
* github.com:scylladb/scylladb:
replica: Fix tombstone GC during tablet split preparation
service: Improve error handling for split
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.
If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extraceted from larger patch f5976aa87b (replica/tablets: add
get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint())
by Botond. The helpers are needed to decode mutations with tablets
update to validate them later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).
The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups. Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.
Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.
Fixes scylladb/scylladb#18862
Needs backport to 6.0 since tablets require this capability
(cherry picked from commit a7b92d7b6f)
(cherry picked from commit 027e64876a)
(cherry picked from commit 44bd183187)
(cherry picked from commit f47b5e60bc)
Refs #19555Closesscylladb/scylladb#20645
* github.com:scylladb/scylladb:
sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
sstables: storage: keep base directory in base class
sstables: storage: define opened_directory in header file
sstable_directory: use only dirlog
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes#20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
let's assume there are 2 nodes, n1, n2. n1 is the coordinator.
1) n1 emits split
2) n1 and n2 complete split work
3) n1 becomes aware all replicas are ready for split
4) n2 restarts, but places split sstable into main group[1]
5) n1 executes split
6) n2 handles split completion, but see the main group is not empty
[1]: During split, main group should only contain unsplit sstables.
If all sstables are split, main must be empty.
This is a result of replica not setting storage group to split mode on restart
(using tablet map) and therefore sstables are incorrectly placed on main group.
The fix is about looking at tablet map and setting group to split mode before
sstables are populated into it.
Refs #20626.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 999f1f1318)
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.
Note that this requires loading and processing
the base directory first.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f47b5e60bc)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
# Conflicts:
# sstables/sstable_directory.hh
Refs #20686
Refs #15607
In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.
(cherry picked from commit 60f8a9f39d)
Closesscylladb/scylladb#20706
Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.
This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.
This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.
The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.
The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.
Fixesscylladb/scylladb#19152Closesscylladb/scylladb#19488Fixesscylladb/scylladb#20787
(cherry picked from commit 08b29460fc)
Closesscylladb/scylladb#20934
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.
Ignore cleanup of deallocated storage groups.
Fixes: https://github.com/scylladb/scylladb/issues/19752.
Needs to be backported to all branches with tablets (6.0 and later)
(cherry picked from commit 20d6cf55f2)
(cherry picked from commit 2c4b1d6b45)
Refs https://github.com/scylladb/scylladb/pull/20584Closesscylladb/scylladb#20628
* github.com:scylladb/scylladb:
test: check if cleanup of deallocated sg is ignored
replica: ignore cleanup of deallocated storage group
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301Closesscylladb/scylladb#20471
* github.com:scylladb/scylladb:
cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
database::get_all_tables_flushed_at: fix return value
(cherry picked from commit 0e5b444777)
Backported from #20471 to 6.0.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20580
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.
Fixes https://github.com/scylladb/scylladb/issues/19378.
Fixes https://github.com/scylladb/scylladb/issues/19416.
Please replace this line with justification for the backport/* labels added to this PR
(cherry picked from commit 239344ab55)
(cherry picked from commit 74612ad358)
Refs https://github.com/scylladb/scylladb/pull/19427Closesscylladb/scylladb#20593
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.
Ignore cleanup of deallocated storage group.
(cherry picked from commit 20d6cf55f2)
Consider the following:
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 74612ad358)
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.
(cherry picked from commit 46563d719f)
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.
(cherry picked from commit de5329157c)
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.
Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.
Analogically for remove_table.
Fixes: #19833.
(cherry picked from commit 483d89ed6d)
Closesscylladb/scylladb#20245
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.
Fixes#19519
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 44583eed9e)
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.
Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.
The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.
The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
--------
seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::(anonymous namespace)::thread_wake_task
--------
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...
From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.
Fixes: #19873.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 5af1f41ecd)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
rwlock was added to protect iterations against concurrent updates to the map.
the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).
the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).
to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.
Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.
Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.
Fixes#18821.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c539b7c861)
with tablets, we're expected to have a worst of ~100 tablets in a given
table and shard, so let's avoid linear search when looking for the
memtable_list in a range scan. we're bounded by ~100 elements, so
shouldn't be a big problem, but it's an inefficiency we can easily
get rid of.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19286
(cherry picked from commit f143f5b90d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes#18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ad5c5bca5f)
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.
Fixesscylladb/scylla-enterprise#4475Closesscylladb/scylladb#20017
In the case of the user semaphore, this receives the new
reader_concurrency_semaphore_cpu_limit config item.
Not used yet.
(cherry picked from commit 59faa6d4ff)
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b8bd4c51c2)
Add more logging that provide more visibility into what happens during
topology loading.
Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>
(cherry picked from commit fb764720d3)
This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series.
Fixes: https://github.com/scylladb/scylladb/issues/18674
- [x] This patch has to be backported because it fixes broken functionality
(cherry picked from commit dbccb61636)
(cherry picked from commit 4590026b38)
(cherry picked from commit feea609e37)
(cherry picked from commit 0c61b1822c)
(cherry picked from commit 8ef4fbdb87)
Refs #18705Closesscylladb/scylladb#19240
* github.com:scylladb/scylladb:
test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
test/pylib: rest_client: add get_injection()
api/error_injection: add getter for error_injection
utils/error_injection: add set_parameter()
replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
This config item is propagated to the table object via table::config.
Although the field in table::config, used to propagate the value, was
utils::updateable_value<T>, it was assigned a constant and so the
live-update chain was broken.
This patch fixes this.
(cherry picked from commit dbccb61636)
Consider the following:
1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in)
7) so read on step 5 is not being able to find sstable set for tablet migrating in
Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"
which means loss of write on pending replica.
The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.
Fixes#19052.
Fixes#19033.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 7b41630299)
Closesscylladb/scylladb#19229
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.
Fixes https://github.com/scylladb/scylladb/issues/18085.
(cherry picked from commit abcc68dbe7)
(cherry picked from commit 551bf9dd58)
(cherry picked from commit e7246751b6)
Refs https://github.com/scylladb/scylladb/pull/18287Closesscylladb/scylladb#19095
* github.com:scylladb/scylladb:
topology_experimental_raft/test_tablets: restore usage of check_with_down
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
storage_proxy is responsible for intersecting the range of the read
with tablets, and calling replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.
Note this is a performance issue, not correctness one, as memtable
readers that don't intersect with current range won't produce any
data, but cpu is wasted until that's realized (they're added to list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc).
That's also important for streaming e.g. after decommission, that
will consume one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.
Refs #18904.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 832fb43fb4)
Closesscylladb/scylladb#18983
tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.
Fixes#18977.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b396b05e20)
Closesscylladb/scylladb#19008
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769
(cherry picked from commit 2ab143fb40)
[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.
It's worked by feeding local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).
Fixes#18085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit abcc68dbe7)
File-based tablet streaming calls every shard to return data of every
group that intersects with a given range.
After dynamic group allocation, that breaks as the tablet range will
only be present in a single shard, so an exception is thrown causing
migration to halt during streaming phase.
Ideally, only one shard is invoked, but that's out of the scope of this
fix and compaction_groups_for_token_range() should return empty result
if none of the local groups intersect with the range.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit eb8ef38543)
Closesscylladb/scylladb#18859
Currently, we do not explicitly set a scheduling group for the schema
commitlog which causes it to run in the default scheduling group (called
"main"). However:
- It is important and significant enough that it should run in a
scheduling group that is separate from the main one,
- It should not run in the existing "commitlog" group as user writes may
sometimes need to wait for schema commitlog writes (e.g. read barrier
done to learn the schema necessary to interpret the user write) and we
want to avoid priority inversion issues.
Therefore, introduce a new scheduling group dedicated to the schema
commitlog.
Fixes: scylladb/scylladb#15566Closesscylladb/scylladb#18715
There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under api/ directory and duplicate code of each other. This PR generalizes those handlers, places them next to each other, fixes leak on stop and, as a nice side effect, enlightens database.hh header.
Closesscylladb/scylladb#18703
* github.com:scylladb/scylladb:
api,database: Move auto-compaction toggle guard
api: Move some table manipulation helpers from storage_service
api: Move table-related calls from storage_service domain
api: Reimplement some endpoints using existing helpers
api: Lost unset of tombstone-gc endpoints
User-defined types can depend on each other, creating directed acyclic graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema.
Fixes#18539Closesscylladb/scylladb#18302
* github.com:scylladb/scylladb:
test/cql-pytest/test_describe: add test for UDTs ordering
cql3/statements/describe_statement: UDTs topological sorting
cql3/statements/describe_statement: allow to skip alphabetical sorting
types: add a method to get all referenced user types
db/cql_type_parser: use generic topological sorting
db/cql_type_parses: futurize raw_builder::build()
test/boost: add test for topological sorting
utils: introduce generic topological sorting algorithm
PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group.
Fixes#18675Closesscylladb/scylladb#18721
* github.com:scylladb/scylladb:
sstables_manager: use maintenance scheduling group to run components reload fiber
sstables_manager: add member to store maintenance scheduling group
Store that maintenance scheduling group inside the sstables_manager. The
next patch will use this to run the components reloader fiber.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The DIGEST_FOR_NULL_VALUES feature was added in 21a77612b3 (2020; 4.4)
and can now be assumed to be always present. The hasher which it invoked
is removed.
Toggling per-table auto-compaction enabling bit is guarded with
on-database boolean and raii guard. It's only used by a single
api/column_family.cc file, so it can live there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>