The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which was not handled at all until now.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the fix is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live releases.
Closes scylladb/scylladb#23255
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
In 57683c1a50 we fixed the `token` error,
but removed the checkout part, which now causes the following error:
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Add the repo checkout stage back to avoid this error.
Fixes: https://github.com/scylladb/scylladb/issues/22765
Closes scylladb/scylladb#23641
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.
Status quo (around the assert in truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, the current time is saved in truncated_at, which is supposedly the time
of truncate, used to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.
truncated_at and low_mark_at use millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during the same millisecond window), failing the assert. This is fragile.
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP refers to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).
So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.
Proposed solution:
The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we're saving low_mark in step 1, before the preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now set only right
before issuing the flush, and thus also accounts for all potentially flushed data.
Fixes #18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#23560
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
The injection point fires after the memtable was flushed to disk, but before
it is merged into the cache. It will only be active for the table specified
in the "table_name" injection parameter.
With this, it is now possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit before proceeding.
Such that a given index in the returned hosts list refers to the same
underlying Scylla instance as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable that still has active reads against it.
To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.
The cache should not garbage-collect tombstones which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to the
additional copy.
This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start. This makes the key reference passed to the compactor
dangle (refer to a moved-from value), which in turn means that subsequent
GC checks done by the compactor operate on a corrupt key and can
therefore result in tombstones being garbage-collected while they
still cover data elsewhere (data resurrection).
The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.
"
The series contains fixes for the gossiper conversion to host ids. There are
two fixes for cases where we could erroneously send an outdated entry in a
gossiper message, and a fix for force_remove_endpoint, which was not converted
to work on host ids; this caused it to not delete the entry in some cases
(replace with the same ip).
"
* 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev:
gossiper: send newest entry in a digest message
gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter
gossiper: move force_remove_endpoint to work on host id
gossiper: do not send outdated endpoint in gossiper round
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.
Fixes #23348
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#23579
There are two snapshot-on-all-shards methods on the database -- the one
that snapshots a keyspace and the one that snapshots a vector of tables.
The latter snapshots a single table with a neat helper, while the former
has the helper open-coded.
Re-using the helper in the keyspace snapshot is worth it, but requires
patching the helper to work on a uuid rather than a ks:cf pair of strings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23532
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentation says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.
The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.
Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).
Fixes #23534
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23547
When streaming files using multipart upload, switch from using
`output_stream::write(const char*, size_t)` to passing buffer objects
directly to `output_stream::write()`. This eliminates unnecessary memory
copying that occurred when the original implementation had to
defensively copy data before sending.
The buffer objects can now be safely reused by the output stream instead
of creating deep copies, which should improve performance by reducing
memory operations during S3 file uploads.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23567
The incremental reader selector maintains an unordered_set of
sstables that are already engaged, and uses std::views::filter
to filter those out. It adds the sstable under consideration to the
set, and if addition failed (because it's already in) then it
filters it out.
This breaks if the filter view is executed twice - the first pass
will add every sstable to the set, and the second will consider
every sstable already filtered. This is what happens with
libstdc++ 15 (due to the addition of the vector(from_range_t) constructor),
which uses the first pass to calculate the vector size
and the second pass to insert the elements into a correctly-sized
vector.
Fix by open-coding the loop.
Closes scylladb/scylladb#23597
In cases where two entries have the same ip address, send information
only for the newest one. Currently we send both, which makes the receiver
use one of them at random, possibly the outdated one (though this should
only cause more data than needed to be requested).
Since the gossiper works on host ids now, it is incorrect to leave this
function working on ips. That makes it impossible to delete an outdated
entry, since the "gossiper.get_host_id(endpoint) != id" check will always be
false for such entries (get_host_id() always returns the most up-to-date
mapping).
This kind of benchmark was superseded by perf-alternator,
which has more options and workflows, and most importantly
measures the overhead of the http server layer (including json parsing).
There is no need to maintain additional code in perf-simple-query.
Closes scylladb/scylladb#23474
All tablets configuration was moved into its own "with tablets" section,
so this option name can no longer appear among the replication factors.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23555
Now that the gossiper map is id-based, there can be a situation where two
entries have the same ip. The shadow round should send the newest one in
this case. The patch makes it so.
Fixes: #23553
A user complained that he couldn't read or write an item with more than
16 attributes (!) in Alternator. This isn't true, but I realized that we
don't have a simple test for this case - all tests use just a few attributes.
So let's add such a test, doing PutItem, UpdateItem and GetItem with 400
attributes. Unsurprisingly, the test passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23568
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.
We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.
So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.
Fixes #23569.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23578
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.
Refs #23378
Closes scylladb/scylladb#23478
* github.com:scylladb/scylladb:
tablets: Make tablet allocation equalize per-shard load
tablets: load_balancer: Fix reporting of total load per node
This series adds a new config option: `tablets_mode_for_new_keyspaces`, which replaces the existing
`enable_tablets` option. It can be set to the following values:
disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
enabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option
`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either case, tablets can be opted in or out using the `tablets={'enabled':...}`
keyspace option when the keyspace is created.
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.
Refs scylladb/scylla-enterprise#4355
* Requires backport to 2025.1
Closes scylladb/scylladb#22273
* github.com:scylladb/scylladb:
boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
db/config: add tablets_mode_for_new_keyspaces option
Remove 'virtual' specifiers from member functions in final classes where
they can never be overridden. This addresses Clang errors like:
```
/home/kefu/dev/scylladb/cql3/column_identifier.hh:85:21: error: virtual method 'to_string' is inside a 'final' class and can never be overridden [-Werror,-Wunnecessary-virtual-specifier]
85 | virtual sstring to_string() const;
| ^
1 error generated.
```
This change improves code clarity and maintainability by eliminating
redundant modifiers that could cause confusion.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23570
Currently, repair_writer_impl::create_writer keeps the erm to ensure that a sharder stays valid. If we repair a tablet, the erm blocks the state machine and no operation on any tablet of this table can be performed.
Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.
Fixes: #23453.
Needs backport to 2025.1, which introduces the tablet repair scheduler.
Closes scylladb/scylladb#23455
* github.com:scylladb/scylladb:
test: add test to check concurrent migration and repair of two different tablets
repair: release erm in repair_writer_impl::create_writer when possible
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode, and the test read times out.
Increase the timeout to 1 minute to avoid this.
Fixes: #23513
Closes scylladb/scylladb#23558
Instead of raising std::runtime_error("Dangling queue_reader_handle_v2")
unconditionally: push() already raises _ex if set, so it's best to be
consistent.
Unconditionally raising std::runtime_error can cause an error to be
logged, when aborting an operation involving a queue reader.
Although the original exception passed to
queue_reader_handle_v2::abort() is most likely handled by higher level
code (not logged), the generic std::runtime_error raised is not and
therefore is logged.
Fixes: #23550
Closes scylladb/scylladb#23554
Fixes #22925
Refs #22885
Some providers in EAR were written before seastar got its own native http connector, so hand-made connectivity is used there.
This PR unifies the code paths, and also extracts some abstractions shared between providers where possible.
One big reason for this is the handling of abrupt disconnects and retries; seastar has some handling of things like EPIPE and ECONNRESET situations, which can be safely ignored in a REST call iff data was in fact transferred etc.
This PR mainly takes the usage of the seastar http client from the gcp connector, makes a wrapper matching most of the usage of the local client in the kms connector, ensures common functionality, and then replaces the code in the individual connectors.
Closes scylladb/scylladb#22926
* github.com:scylladb/scylladb:
encryption::gcp: Use seastar http client wrapper
encryption::kms: Drop local http client and use seastar wrapper
encryption: Break out a "httpclient" wrapper for seastar httpclient
After switching to subfolders, the filter `run_in_debug` for the
random failures test was just copied as-is, but it actually needs
to include the subfolder.
Also, `test_old_ip_notification_repro` was deleted, so we
don't need it in the `skip_in_debug` list.
Closes scylladb/scylladb#23492
Improve the GitHub workflow to prevent premature email notifications
about missing labels. Previously, contributors without write permissions
to the scylladb repo would receive immediate notification emails about
missing required backport labels, even if they were in the process of
adding them.
This change introduces a 1-minute grace period before checking for
required labels, giving contributors sufficient time to add necessary
labels (like backport labels) to their pull requests before any warning
notifications are sent.
The delay makes the experience more user-friendly for non-maintainer
contributors while maintaining the labeling requirements.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23539
Add a size check for the BatchWriteItem command - if the item count is
bigger than the configuration value `alternator_maximum_batch_write_size`,
an error will be raised and no modification will happen.
This is done to synchronize with DynamoDB, where the maximum size of
BatchWriteItem is 25. To avoid complaints from clients who rely on
Alternator's BatchWriteItem being limitless, we set the default value
to 100.
Fixes #5057
Closes scylladb/scylladb#23232
"
The series makes the endpoint state map in the gossiper addressable by host
id instead of ip. The transition has implications outside of the
gossiper as well: gossiper-based topology operations are affected by
this change, since they assume that the mapping is ip-based.
The on-wire protocol is not affected by the change, as maps sent by
the gossiper protocol remain ip-based. If an old node sends two different
entries for the same host id, the one with the newer generation is applied.
If a new node has two ids mapped to the same ip, the newer one is
added to the outgoing map.
Interoperability was verified manually by running a mixed cluster.
The series concludes the conversion of the system to be host id based.
"
* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
gossiper: make examine_gossiper private
gossiper: rename get_nodes_with_host_id to get_node_ip
treewide: drop id parameter from gossiper::for_each_endpoint_state
treewide: move gossiper to index nodes by host id
gossiper: drop ip from replicate function parameters
gossiper: drop ip from apply_new_states parameters
gossiper: drop address from handle_major_state_change parameter list
gossiper: pass rpc::client_info to gossiper_shutdown verb handler
gossiper: add try_get_host_id function
gossiper: add ip to endpoint_state
serialization: fix std::map de-serializer to not invoke value's default constructor
gossiper: drop template from wait_alive_helper function
gossiper: move get_supported_features and its users to host id
storage_service: make candidates_for_removal host id based
gossiper: use peers table to detect address change
storage_service: use std::views::keys instead of std::views::transform that returns a key
gossiper: move _pending_mark_alive_endpoints to host id
gossiper: do not allow to assassinate endpoint in raft topology mode
gossiper: fix indentation after previous patch
gossiper: do not allow to assassinate non existing endpoint
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.
Fixes: #23231
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23483
The `uuid_sstable_identifier_enabled` option was introduced in
f014ccf3. The first version which had this change was 5.4, and
6.1 has been branched. During the discussion of backup and restore,
we realized that we've been taking efforts to address problems which
could have been addressed with sstables using the UUID-based identifier.
See also #10459, the issue which proposed implementing the UUID-v1
based sstable identifier.
Now that two major releases have passed, we should have the luxury of
marking this option "unused". This option was previously introduced to
keep backward compatibility, and to allow users to opt out of the
feature for some reason.
So in this change, mark the option unused, so that if any user still
sets this option on the command line, they will get a clear error. But
we still parse and handle this setting in `scylla.yaml`, so that this
option is still respected for existing settings, and for existing tests,
which are not yet prepared for the uuid-based sstable identifiers.
Refs #10459
Fixes #20337
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#20341