Commit Graph

172 Commits

Author SHA1 Message Date
Calle Wilund
7c6b4bed97 tests::boost: Add GS object storage cases to mirror S3 ones
I.e. run same remote storage backend unit tests for GS backend
2025-10-13 08:53:27 +00:00
Calle Wilund
956d26aa34 test_services: Allow testing with GS object storage parameters 2025-10-13 08:53:26 +00:00
Calle Wilund
5d4558df3b sstables: Use object_storage_client for remote storage
Replaces direct s3 interfaces with the abstraction layer, and open
for having multiple implentations/backends
2025-10-13 08:53:25 +00:00
Calle Wilund
78d9dda060 config: break out object_storage_endpoint_param preparing for multi storage
Moves the config wrapper to own file (to reduce recompilation for modifying)
and refactors to handle extending this parameter to non-s3 endpoint configs.
2025-10-13 08:53:24 +00:00
Botond Dénes
1999d8e3d3 compaction: remove using namespace {compaction,sstables}
Some files in compaction/ have using namespace {compaction,sstables}
clauses, some even in headers. This is considered bad practice and
muddies the namespace use. Remove them.
2025-09-25 15:03:57 +03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Aleksandra Martyniuk
17707d0e6b compaction: stop compaction module in really_do_stop
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently to really_do_stop. We can't predict the order of the two.

Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with memory leak.

Stop compaction module in really_do_stop, after ongoing compactions
are stopped.

It's a preparation for further patches.
2025-09-23 14:21:15 +02:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch addes incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired set.

- Repaired and unrepaired set compact sperately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compactions tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Avi Kivity
66173c06a3 Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.

Fixes #24248

* Enhancement, no backport needed

Closes scylladb/scylladb#24512

* github.com:scylladb/scylladb:
  streaming: stream_blob: use the table sstable_generation_generator
  replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
  sstables: sstable_generation_generator: stop tracking highest generation
  replica: table: get rid of update_sstables_known_generation
  sstables: sstable_directory: stop tracking highest_generation
  replica: distributed_loader: stop tracking highest_generation
  sstables: sstable_generation: get rid of uuid_identifiers bool class
  sstables_manager: drop uuid_sstable_identifiers
  feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
  test: cql_query_test: add test_sstable_load_mixed_generation_type
  test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
  test: database_test: move table_dir helper to test/lib/test_utils
2025-08-14 11:54:33 +03:00
Benny Halevy
6cc964ef16 sstables: sstable_generation: get rid of uuid_identifiers bool class
Now that all call sites enable uuid_identifiers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
0ad1898f0a feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
The feature is supported by all live versions since
version 5.4 / 2024.1.

(Although up to 6da758d74c
it could be disabled using the config option)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:15 +03:00
Raphael S. Carvalho
beaaf00fac test: Add test that compaction doesn't cross logical group boundary
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:01 +03:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Pavel Emelyanov
8220974e76 code: Update callers generating feature service config
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:19:09 +03:00
Avi Kivity
b33dd2bd7d Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

Closes scylladb/scylladb#24492

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-06-29 18:18:36 +03:00
Botond Dénes
ebd9420687 sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
2025-06-25 08:41:26 +03:00
Benny Halevy
15bee9f232 sstables: sstable_generation_generator: set last_generation=0 by default
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
f0f7c83705 test: lib: test_env: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Raphael S. Carvalho
2d716f3ffe replica: Fix truncate assert failure
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.

1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.

So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.

Fixes #23771.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24426
2025-06-08 15:59:15 +03:00
Michał Chojnowski
8649adafa8 test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.
2025-05-07 14:43:04 +02:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
59dad2121f compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Michał Chojnowski
30a9d471fa sstables: plug an sstable_compressor_factory into sstables_manager
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.

In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
2025-04-01 00:07:28 +02:00
Robert Bindar
b647196121 Remove db::config::object_storage_config
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 17:15:12 +03:00
Ernest Zaslavsky
d534051bea aws creds: add env. and file credentials providers
This commit entirely removes credentials from the endpoint configuration. It also eliminates all instances of manually retrieving environment credentials. Instead, the construction of file and environment credentials has been moved to their respective providers. Additionally, a new aws_credentials_provider_chain class has been introduced to support chaining of multiple credential providers.
2025-02-05 14:57:19 +02:00
Ernest Zaslavsky
c911fc4f34 s3 creds: move credentials out of endpoint config
This commit refactors the way AWS credentials are managed in Scylla. Previously, credentials were included in the endpoint configuration. However, since credentials and endpoint configurations serve different purposes and may have different lifetimes, it’s more logical to manage them separately. Moving forward, credentials will be completely removed from the endpoint_config to ensure clear separation of concerns.
2025-02-04 16:45:23 +02:00
Kefu Chai
353b522ca0 treewide: migrate from boost::adaptors::reversed to std::views::reverse
now that we are allowed to use C++23. we now have the luxury of using
`std::views::reverse`.

- replace `boost::adaptors::transformed` with `std::views::transform`
- remove unused `#include <boost/range/adaptor/reversed.hpp>`

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-07 13:22:00 +02:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Lakshmi Narayanan Sreethar
67dad99ab5 compaction::table_state: implement get_token_range_after_split() wrapper
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:24:00 +05:30
Botond Dénes
e942c074f2 compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
2024-11-03 10:17:11 +02:00
Avi Kivity
c3be2489ce treewide: drop includes of <boost/range/adaptors.hpp>
This includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace by what is needed.

Closes scylladb/scylladb#21187
2024-10-20 17:17:11 +03:00
Pavel Emelyanov
031893259a treewide: Rename sstable registry location field to be owner
This is sort of continuation of the previous patch. The partition key in
the registry is now table_id, not string, and is better called "owner",
not "location". This patch is s/location/owner/ over specific places
that include field name in the schema, argument names in registry
maintenance classes and tests accessing the selected row fields by name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 14:11:28 +03:00
Pavel Emelyanov
3315e3a2a9 system_keyspace: Change sstables registry partition key type
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.

This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:48:09 +03:00
Pavel Emelyanov
1181b6b082 storage_options: Use variant on S3 options
Describing S3 storage for an sstables nowadays has two options -- via
sstables registry entry and by using the direct prefix string. The
former is used when putting a keyspace on S3. In this case each sstable
has the corresponding entry in the system.sstables table. The latter is
used by "restore from object storage" code. In that case, sstables don't
have entries in the registry, but are accessed by a specific S3 object
path.

This patch reflects this difference by making s3_options::location be
variant of string prefix and table_id owner. The owner needs more
explanation, here it is.

Today, the system.sstables schema defines partition key to be "string
location" and clustering key to be "UUID generation". The partition key
is table's datadir string, but it's wrong to use it this way. Next
patches will change the partition key to be table's ID (there's table_id
type for it), and before doing it storage options must be prepared to
carry it onboard. This patch does it, but the table_id alternative of
the location is still unused, the rest of the code keeps using the
string location to reference a row in the registry table. Next patches
will eventually make use of the table_id value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:04:52 +03:00
Avi Kivity
7dad248ac7 Merge 'Fix sstables registry mock' from Pavel Emelyanov
There are two issues in it. First, listing the registry with a consumer callback passes wrong argument to the consumer. Second, the primary key of the registry is wrong. Both issues don't show up, because existing tests that use mock don't read from it, only write. Tests that read from registry are python tests that start scylla and thus use real registry.

Closes scylladb/scylladb#20946

* github.com:scylladb/scylladb:
  test: Use corrcet key in sstables registry mock
  test: Pass entry status to mock registry consumer
2024-10-07 13:56:26 +03:00
Pavel Emelyanov
1da1d131b2 test: Use corrcet key in sstables registry mock
The "real" registry defines its primary key as (location, generation)
pair, where location is the partition key and generation is clustering
key. The registry mock uses only location part as primary key, while it
must use both.

The buggy mock works simply because the listing API is in fact not used
by unit tests. Those tests that do need it are python tests that start
scylla and thus implicitly use real registry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 10:07:11 +03:00
Pavel Emelyanov
a503e2ab10 test: Pass entry status to mock registry consumer
When sstables registry is listed, the passed consumer accepts entry
status as its first argument, not its location (location is passed as a
search key)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 10:06:28 +03:00
Raphael S. Carvalho
93815e0649 replica: Fix tombstone GC during tablet split preparation
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes #20044.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-10-02 11:26:13 -03:00
Pavel Emelyanov
8487f2fd93 treewide: Remove table::config::datadir
It's write-only now, all the places than wanted to know where table's
storage is, already use storage_options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
5046cfab4b test: Construct table_for_tests with table storage options
The only place that constructs table_for_tests is make_table_for_tests
helper. It can and should prepare the correct storage options, because
that's the last place where the target directory is still known.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
eaad4f348b test: Generalize pair of make_table_for_tests helpers
They only differ in a way they get target directory from -- one via
argument, andother from test_env. Respectively, the latter can call the
former.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
36863d4ad0 sstables_manager: Remove table_dir from make_sstable()
It used to be passed to sstable constructor, but now it doesn't need
this argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
4425cf54c6 tests: Properly initialize storage options with "dir"
Most of the tests work with local storage options. Some support S3
options as well. Whatever it is, when creating an sstable, tests need to
put proper "dir" on the options, this patch does so.

In fact, storage options for tests are created together with the
test-env, and ideally this is the place where dir should be assigned on
it. However, there are still places that explicitly specify path they
want to see sstables at, for those the new temporary options should be
constructed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Botond Dénes
c7c5817808 Merge 'Improve timestamp heuristics for tombstone garbage collection' from Benny Halevy
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.

For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.

If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.

In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.

If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.

Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424

> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads

Closes scylladb/scylladb#20446

* github.com:scylladb/scylladb:
  cql-pytest: add test_compaction_tombstone_gc
  sstable_compaction_test: add mv_tombstone_purge_test
  sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
  sstable_compaction_test: tombstone_purge_test: add testlog debugging
  sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
  sstable, compaction: add debug logging for extended min timestamp stats
  compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
  compaction: define max_purgeable_fn
  tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
  sstables: scylla_metadata: add ext_timestamp_stats
  compaction_group, storage_group, table_state: add extended timestamp stats getters
  sstables, memtable: track live timestamps
  memtable_encoding_stats_collector: update row_marker: do nothing if missing
2024-09-13 08:56:51 +03:00
Benny Halevy
6f202cf48b compaction_group, storage_group, table_state: add extended timestamp stats getters
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Pavel Emelyanov
cfa59ab73d test: Use single temp dir for sharded<sstables::test_env>
The test-env in question is mostly started in one-shard mode. Also there
are several boost tests that start sharded<> environment. In that case
instances on different shards live in different temp dirs. That's not
critical yet, but better to have single directory for the whole test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20412
2024-09-10 11:25:04 +03:00
Pavel Emelyanov
e9980bd6dd test: Move sstable_open_config to reusable_sst()'s argument
So that callers are able to provide custom config in the future

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-03 14:00:59 +03:00