During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes https://github.com/scylladb/scylladb/issues/20044.
Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.
(cherry picked from commit bcd358595f)
(cherry picked from commit 93815e0649)
Refs https://github.com/scylladb/scylladb/pull/20939Closesscylladb/scylladb#21204
* github.com:scylladb/scylladb:
replica: Fix tombstone GC during tablet split preparation
service: Improve error handling for split
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes#20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.
Note that this requires loading and processing
the base directory first.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f47b5e60bc)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
# Conflicts:
# sstables/sstable_directory.hh
Fixes#20862
With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.
Test included.
(cherry picked from commit ee5e71172f)
Closesscylladb/scylladb#20913
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.
Ignore cleanup of deallocated storage groups.
Fixes: https://github.com/scylladb/scylladb/issues/19752.
Needs to be backported to all branches with tablets (6.0 and later)
(cherry picked from commit 20d6cf55f2)
(cherry picked from commit 2c4b1d6b45)
Refs https://github.com/scylladb/scylladb/pull/20584Closesscylladb/scylladb#20628
* github.com:scylladb/scylladb:
test: check if cleanup of deallocated sg is ignored
replica: ignore cleanup of deallocated storage group
This is a manual backport of #20212 to 6.0, superseding #20346 (which had run into conflicts).
Please see the individual commit messages for backport notes.
Fixes#10305Closesscylladb/scylladb#20349
* github.com:scylladb/scylladb:
generic_server: make server::stop() idempotent
generic_server: coroutinize server::shutdown()
generic_server: make server::shutdown() idempotent
test/generic_server: add test case
configure, cmake: sort the lists of boost unit tests
generic_server: convert connection tracking to seastar::gate
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.
Fixes https://github.com/scylladb/scylladb/issues/19378.
Fixes https://github.com/scylladb/scylladb/issues/19416.
Please replace this line with justification for the backport/* labels added to this PR
(cherry picked from commit 239344ab55)
(cherry picked from commit 74612ad358)
Refs https://github.com/scylladb/scylladb/pull/19427Closesscylladb/scylladb#20593
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.
In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
(cherry picked from commit ea8441dfa3)
(cherry picked from commit edb3068ecf)
A dialect is a different way to interpret the same CQL statement.
Examples:
- how duplicate bind variable names are handled (later in this series)
- whether `column = NULL` in LWT can return true (as is now) or
whether it always returns NULL (as in SQL)
Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.
The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
(cherry picked from commit d69bf4f010)
Check whether we can stop a generic server without first asking it to
listen.
The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:
> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us
Backport notes for 6.0:
- Replace
#include "utils/assert.hh"
SCYLLA_ASSERT(false);
with
#include <cassert>
assert(false);
due to 6.0 lacking commit aa1270a00c ("treewide: change assert() to
SCYLLA_ASSERT()", 2024-08-05). The header file "utils/assert.hh"
wouldn't be difficult to backport, but separating it from the treewide
changes in commit aa1270a00c might not be the best idea.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit dbc0ca6354)
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.
Backport notes for 6.0:
- Conflicts in "configure.py" and "test/boost/CMakeLists.txt",
unsurprisingly. For the backport, I sorted the boost unit test list in
each file manually, from scratch.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 931f2f8d73)
This series fixes an issue where histogram Summaries return an infinite value.
It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.
The series adds a test for summaries with a specific test case for this scenario.
Fixes#20255
Need backport to 6.0, 6.1 and 2023.1 and above
(cherry picked from commit 011aa91a8c)
(cherry picked from commit 644e6f0121)
Refs #20257Closesscylladb/scylladb#20304
* github.com:scylladb/scylladb:
test/estimated_histogram_test Add summary tests
utils/histogram.hh: Make summary support inifinite bucket.
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.
Fixes#19879
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#19883
(cherry picked from commit 27b305b9d1)
Closesscylladb/scylladb#19963
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
single tablet replica
In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.
Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes. This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.
Refs scylladb/scylladb#18862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 26abad23d9)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19920
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.
The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.
Relates to #20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 644e6f0121)
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.
It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.
Refs #19378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 239344ab55)
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes#18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ad5c5bca5f)
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.
This also violates invariants about replica placement and this state
cannot be fixed by topology operations.
One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:
load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at: ...
Fixes#20032
(cherry picked from commit f5c74a5df2)
Closesscylladb/scylladb#20067
this change was created in the same spirit of 505900f18f. because
we are deprecating the operator<< for vector and unorderd_map in
Seastar, some tests do not compile anymore if we disable these
operators. so to be prepared for the change disabling them, let's
include test/lib/test_utils.hh for accessing the printer dedicated
for Boost.test. and also '#include <fmt/ranges.h>' when necessary,
because, in order to format the ranges using {fmt}, we need to
use fmt/ranges.h.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Before these changes, compilation was failing with the following
error:
In file included from test/boost/hint_test.cc:12:
/usr/include/fmt/ranges.h:298:7: error: no member named 'parse' in 'fmt::formatter<db::hints::sync_point::host_id_or_addr>'
298 | f.parse(ctx);
| ~ ^
We add the missing callback.
Closesscylladb/scylladb#19375
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
contains now a variant of either types, and in the case of the new
format the translation from IP to host ID is avoided.
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.
Fixesscylladb/scylladb#16065
The bug is present in all supported versions, so the patch has to be backported to all of them.
(cherry picked from commit a1834efd82)
(cherry picked from commit d10b38ba5b)
(cherry picked from commit 1a8ee69a43)
Refs scylladb/scylladb#19695Closesscylladb/scylladb#19877
* github.com:scylladb/scylladb:
sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
based on the sstable's schema the sstable's schema.
This patch forces the writer to use the sstable's schema for both.
(cherry picked from commit 1a8ee69a43)
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.
This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.
(cherry picked from commit d10b38ba5b)
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722
Closes scylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
(cherry picked from commit 591876b44e)
Backported from #19717 to 6.0
Closesscylladb/scylladb#19830
Allows loading the schema from an sstable's serialization header. This
schema is incomplete, but it is enough to parse and display the content
of the sstable.
(cherry picked from commit 8f2ba03465)
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.
This patch adds a mechanism which allows to reserve some LSA memory (by
asking the allocator to keep it unused) and make it available for allocation
right when we need to guarantee allocation success.
(cherry picked from commit 7b3f55a65f)
This is currently a lambda in a test, hoist it into the global scope and
make it into a function, so other tests can use it too (in the next
patch).
(cherry picked from commit 9cbdd8ef92)
This patch adds a check if aggregation query is doing single-partition read and if so, makes the query to not use forward_service and do not parallelize the request.
Fixesscylladb/scylladb#19349
(cherry picked from commit e9ace7c203)
(cherry picked from commit 8eb5ca8202)
Refs scylladb/scylladb#19350Closesscylladb/scylladb#19499
* github.com:scylladb/scylladb:
test/boost/cql_query_test: add test for single-partition aggregation
cql3/select_statement: do not parallelize single-partition aggregations
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.
To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.
Fixes#16514.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ace4e5111e)
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0ce8ee03f1)
That will be used in turn to restrict reshape to 10% of available space
in underlying storage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 51c7ee889e)
Update the maximum size tested by the testcase. The test always created
only one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (12.8 KB). So, use twice the
max_chunk_capacity as the test size distribution upper limit to verify
that partial_reserve can reserve multiple chunks.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 310c5da4bb)
Fix the method comment and return types of chunked_managed_vector's
reserve_partial() similar to chunked_vector's reserve_partial() as it
has the same issues mentioned in #19254. Also update the usage in the
chunked_managed_vector_test.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d4f8b91bd6)
Fix the usage of reserve_partial in the testcase. Also update the
maximum chunk size used by the testcase. The test always created only
one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (128 KB). So, use smaller chunk size,
512 bytes, to verify that partial_reserve can reserve multiple chunks.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 29f036a777)
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.
after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.
Fixes#19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 223fba3243)
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.
after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.
Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 2df4e9cfc2)
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closesscylladb/scylladb#19014
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)
Closesscylladb/scylladb#19012
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769
(cherry picked from commit 2ab143fb40)
[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.
User-defined types can depend on each other, creating directed acyclic graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema.
Fixes#18539Closesscylladb/scylladb#18302
* github.com:scylladb/scylladb:
test/cql-pytest/test_describe: add test for UDTs ordering
cql3/statements/describe_statement: UDTs topological sorting
cql3/statements/describe_statement: allow to skip alphabetical sorting
types: add a method to get all referenced user types
db/cql_type_parser: use generic topological sorting
db/cql_type_parses: futurize raw_builder::build()
test/boost: add test for topological sorting
utils: introduce generic topological sorting algorithm
in this change, we trade the `boost_test_print_type()` overloads
for the generic template of `boost_test_print_type()`, except for
those in the very small tests, which presumably want to keep
themselves relative self-contained.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18727