Otherwise it is accessed right when exiting the if block.
Add a unit test reproducing the issue and validating the fix.
Fixes #25325
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#25326
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
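A minimal sketch of how a client might build a request for this endpoint. The endpoint path and the `keyspace` and `tables` parameters come from the description above; the base URL, the helper name and the example keyspace and table names are assumptions made for illustration. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_drop_quarantined_request(base_url, keyspace=None, tables=None):
    """Build (but do not send) a POST request for the endpoint.

    `base_url` and this helper's name are hypothetical; only the path and
    the two query parameters are taken from the commit message.
    """
    params = {}
    if keyspace is not None:
        params["keyspace"] = keyspace
    if tables:
        # The endpoint expects comma-separated table names.
        params["tables"] = ",".join(tables)
    url = f"{base_url}/storage_service/drop_quarantined_sstables"
    if params:
        # safe="," keeps the separator readable instead of encoding it as %2C.
        url += "?" + urlencode(params, safe=",")
    return Request(url, method="POST")


# Example values only: the address and the names are placeholders.
req = build_drop_quarantined_request("http://localhost:10000", "ks1", ["t1", "t2"])
global_req = build_drop_quarantined_request("http://localhost:10000")
```

Calling the helper without arguments other than the base URL yields the global form of the request, with no query string.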
Fixes scylladb/scylladb#19061
Backport is not required, it is new functionality
Closes scylladb/scylladb#25063
* github.com:scylladb/scylladb:
docs: Add documentation for the nodetool dropquarantinedsstables command
nodetool: add command for dropping quarantine sstables
rest_api: add endpoint which drops all quarantined sstables
The initial support for nested containers (2d2a2ef277) worked on
my machine (tm) and even laptop, but does not work on fresh installs.
This is likely due to changes in where persistent configuration is
stored on the host between various podman versions; even though my
podman is fully updated, it uses configuration created long ago.
Make nested containers work on fresh installs by also configuring
/etc/containers/storage.conf. The important piece is to set graphroot
to the same location as the host.
Verified both on my machine and on a fresh install.
Closes scylladb/scylladb#25156
Nowadays the way to configure an internal service is
1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value
The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.
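The configuration pattern described above can be sketched as follows. This is an illustration in Python rather than the project's C++, and every name except `feature_config` and `get_disabled_features_from_db_config` (mirrored here in Python style) is hypothetical, including the `disabled_features` key.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureConfig:
    # Hypothetical stand-in for the small feature_config struct.
    disabled_features: set = field(default_factory=set)


def get_disabled_features_from_db_config(db_config: dict) -> set:
    """Helper that lives outside the service and reads the system-wide config.

    The "disabled_features" key is an assumption for this sketch.
    """
    return set(db_config.get("disabled_features", []))


class FeatureService:
    def __init__(self, config: FeatureConfig):
        # Copy the config to mimic "passed by value": later changes made by
        # the caller do not leak into the running service.
        self._config = FeatureConfig(set(config.disabled_features))

    def is_enabled(self, name: str) -> bool:
        return name not in self._config.disabled_features


# The caller (main, a test, or a tool) fills the config and starts the service.
db_config = {"disabled_features": ["SOME_FEATURE"], "unrelated_option": 1}
cfg = FeatureConfig(get_disabled_features_from_db_config(db_config))
service = FeatureService(cfg)
```

The service never sees `db_config` itself, which is the point of moving the helper out of the service code.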
For reference, similar changes to other services: #23705, #20174, #19166
Closes scylladb/scylladb#25118
* github.com:scylladb/scylladb:
gms,init: Move get_disabled_features_from_db_config() from gms
code: Update callers generating feature service config
gms: Make feature_config a simple struct
gms: Split feature_config_from_db_config() into two
- Add dropquarantinedsstables command to remove quarantined SSTables
- Support both flag-based (--keyspace, --table) and positional arguments
- Allow targeting all keyspaces, specific keyspace, or keyspace with specified tables
Fixes scylladb/scylladb#19061
Whilst the coredump script checks for prerequisites, the user
experience is not ideal because you either have to go in the
script and get the list of deps and install them or wait for
the script to complain about lacking dependencies one by one.
This commit completes the list of dependencies in the
install script (some of them were already there for Fedora),
so you already have them installed by the time you
get to run the coredump script.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
[avi:
- remove trailing whitespace
- regenerate frozen toolchain
Optimized clang binaries generated and stored in
https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]
Closes #22369
Closes scylladb/scylladb#25203
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.
However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.
For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)
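The `a`, `b`, `c` example can be sketched in a few lines. This is a simplified model, assuming plain strings compared lexicographically instead of the encoded decorated keys a real BTI index uses; the function names are hypothetical.

```python
from bisect import bisect_left


def conservative_lower_bound(prefixes, key):
    """Index of the first entry whose full key MIGHT be >= key.

    An entry can only be ruled out if its prefix sorts before `key` and is
    not a prefix of `key`; otherwise its full key could fall on either side.
    """
    for i, prefix in enumerate(prefixes):
        if prefix >= key or key.startswith(prefix):
            return i
    return len(prefixes)


def exact_lower_bound(full_keys, key):
    """What an index storing whole keys would answer."""
    return bisect_left(full_keys, key)


prefixes = ["a", "b", "c"]
guess = conservative_lower_bound(prefixes, "bb")

# The stored prefix `b` is consistent with both of these sstables:
exact_if_ba = exact_lower_bound(["a", "ba", "c"], "bb")  # `ba` sorts before `bb`
exact_if_bc = exact_lower_bound(["a", "bc", "c"], "bb")  # `bc` sorts after `bb`
```

The conservative answer is never later than the exact one, so the Data reader may see at most one extra partition at this end of the range and has to skip it.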
Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.
Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome.
Preparation for new functionality, no backporting needed.
Closes scylladb/scylladb#25093
* github.com:scylladb/scylladb:
sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
test/boost: add a test for inexact index lookups
sstables/mx/reader: allow passing a custom index reader to the constructor
sstables/index_reader: remove advance_to
sstables/mx/reader: handle inexact lookups in `advance_context()`
sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
sstables/index_reader: make the return value of `get_partition_key` optional
sstables/mx/reader: handle "backward jumps" in forward_to
sstables/mx/reader: filter out partitions outside the queried range
sstables/mx/reader: update _pr after `fast_forward_to`
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
As requested in #22102, #22103 and #22105, moved the files and fixed the other includes and the build system.
Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh
Fixes: #22102
Fixes: #22103
Fixes: #22105
Closes scylladb/scylladb#25082
Fixes: #25045
Added the ability to supply the list of files to
restore from a given file.
Mainly required for local testing.
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
Closes scylladb/scylladb#25077
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).
In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_reader` interface.
Later, we will add BTI indexes which will also implement this interface.
This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.
Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.
No backports needed, this is a preparation for new functionality.
Closes scylladb/scylladb#25000
* github.com:scylladb/scylladb:
sstables: add sstable::make_index_reader() and use where appropriate
sstables/mx: in readers, use abstract_index_reader instead of index_reader
sstables: in validate(), use abstract_index_reader instead of index_reader where possible
test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
sstables/index_reader: introduce abstract_index_reader
sstables/index_reader: extract a prefetch_lower_bound() method
If we add multiple index implementations, users of index readers won't
easily know which concrete index reader type is the right one to construct.
We also don't want pieces of code to depend on functionality specific to
certain concrete types, if that's not necessary.
So instead of constructing the readers by themselves, they can use a helper
function, which will return an abstract (virtual) index reader.
This patch adds such a function, as a method of `sstable`.
Add `make_data_or_index_source` to the storages to utilize the new S3-based data source, which should improve restore performance.
* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface. For `filesystem_storage` it just creates a `data_source` from a file; for `s3_storage` it creates a (possibly decrypting) source from the S3 make_download_source. This change should improve performance when reading large objects from S3 and should not affect `filesystem_storage`.
No backport needed since it enhances functionality which has not been released yet
fixes: https://github.com/scylladb/scylladb/issues/22458
Closes scylladb/scylladb#23695
* github.com:scylladb/scylladb:
sstables: Start using `make_data_or_index_source` in `sstable`
sstables: refactor readers and sources to use coroutines
sstables: coroutinize futurized readers
sstables: add `make_data_or_index_source` to the `storage`
encryption: refactor key retrieval
encryption: add `encrypted_data_source` class
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
Previously, for views and indexes, the base schema (and a couple of
other properties) was fetched from the db. This becomes a problem once
we introduce atomic deletion of tables and views (in the following commit):
once a table is deleted, it can no longer be fetched from the db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.
Now the whole relevant schema is fetched via the global_table_ptr
(table_shards) object.
Once we create types atomically, code that runs before the commit
may depend on newly added types, so it has to access both old and
new types. A new storage called in_progress_types_storage was added for this.
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
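The idea of keeping a large statement in scattered fragments can be illustrated with a small sketch. This is not `bytes_ostream` or `managed_bytes`; it is a hypothetical Python class that only shows the technique of bounding the size of each individual allocation, with an arbitrary default fragment size.

```python
class FragmentedBuffer:
    """Append-only byte buffer stored as a list of bounded-size fragments."""

    def __init__(self, fragment_size=128 * 1024):
        if fragment_size <= 0:
            raise ValueError("fragment_size must be positive")
        self._fragment_size = fragment_size
        self._fragments = []

    def write(self, data: bytes) -> None:
        view = memoryview(data)
        while len(view) > 0:
            # Start a new fragment when there is none or the last one is full.
            if not self._fragments or len(self._fragments[-1]) == self._fragment_size:
                self._fragments.append(bytearray())
            last = self._fragments[-1]
            room = self._fragment_size - len(last)
            last.extend(view[:room])
            view = view[room:]

    @property
    def fragments(self):
        return self._fragments

    def __len__(self):
        return sum(len(f) for f in self._fragments)


# A tiny fragment size makes the splitting visible.
buf = FragmentedBuffer(fragment_size=4)
buf.write(b"hello ")
buf.write(b"world")
```

No single allocation grows with the total size of the statement; a consumer that serializes the data can walk `fragments` without ever joining them.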
A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.
Fixes scylladb/scylladb#24018
Backport: all of the supported versions are affected, so we want to backport the changes there.
Closes scylladb/scylladb#24151
* github.com:scylladb/scylladb:
cql3/description: Serialize only rvalues of description
cql3: Represent create_statement using managed_string
cql3/statements/describe_statement.cc: Don't copy descriptions
cql3: Use managed_bytes instead of bytes in DESCRIBE
utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.
Fixes scylladb/scylladb#24018
We use patchelf to rewrite the dynamic loader (known as the interpreter)
of the binaries we ship, so we can point to our shipped dynamic loader,
which is compatible with our binaries, rather than rely on the distribution's
dynamic loader, which is likely to be incompatible.
Upstream patchelf lost compatibility [1] with Linux 5.17 and below.
This change was also picked up by Fedora 42, so we cannot update the
toolchain to that distribution until we have an alternative.
Here we add a minimal patchelf alternative. It was mostly written by
Claude. It is minimal in that it only supports --set-interpreter and
--print-interpreter, and works well enough for our needs. We still use
the original patchelf for --remove-rpath; this reduces our maintenance
needs.
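To illustrate what reading the interpreter involves, here is a rough sketch that extracts the `PT_INTERP` path from an ELF image. It is not the tool added by this commit: it is a Python illustration that assumes a 64-bit little-endian ELF file and omits bounds checks, and the function name is hypothetical. The header offsets follow the standard ELF64 layout.

```python
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def read_interpreter(elf: bytes):
    """Return the interpreter path of an ELF image, or None if it has none."""
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if elf[4] != 2 or elf[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # ELF64 header: e_phoff at offset 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", elf, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 54)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # ELF64 program header: p_type at 0, p_offset at 8, p_filesz at 32.
        (p_type,) = struct.unpack_from("<I", elf, base)
        if p_type != PT_INTERP:
            continue
        (p_offset,) = struct.unpack_from("<Q", elf, base + 8)
        (p_filesz,) = struct.unpack_from("<Q", elf, base + 32)
        raw = elf[p_offset:p_offset + p_filesz]
        return raw.rstrip(b"\0").decode()
    return None
```

Setting the interpreter is harder than printing it, because a longer path may not fit in the existing segment; this sketch does not attempt that.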
[1] 43b75fbc9f
[2] 4b015255d1
Closes scylladb/scylladb#24695
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mismatch results in parsing failures: the parser reads part of the row content as a key, which produces a garbage key and subsequent mis-parsing of the row content, and possibly even of subsequent partitions.
Introduce a new system table, `system.corrupt_data`, and infrastructure similar to `large_data_handler`: `corrupt_data_handler`, which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing, and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
Closes scylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
Before we can eradicate the numerical sstable generations,
This series completes https://github.com/scylladb/scylladb/issues/20337
by disabling the use of numerical sstable generations where we can
and making sure the feature is never disabled.
Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables.
Refs #24248
* Enhancement. No backport required
Closes scylladb/scylladb#24554
* github.com:scylladb/scylladb:
feature_service: never disable UUID_SSTABLE_IDENTIFIERS
test: sstable_move_test: always use uuid sstable generation
test: sstable_directory_test: always use uuid sstable generation
sstables: sstable_generation_generator: set last_generation=0 by default
test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
test: lib: test_env: always use uuid sstable generation
test: sstable_test: always use uuid sstable generation
test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
test: sstable_compaction_test: always use uuid sstable generation
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
nodetool repair command repairs only vnode keyspaces. If a user tries
to repair a tablet keyspace, an exception is thrown.
Closes scylladb/scylladb#23660
optimized_clang.sh trains the compiler using profile-guided optimization
(pgo). However, while doing that, it builds scylladb using its own profile
stored in pgo/profiles and decompressed into build/profile.profdata. Due
to the funky directory structure used for training the compiler, that
path is invalid during the training and the build fails.
The workaround was to build on a cloud machine instead of a workstation -
this worked because the cloud machine didn't have git-lfs installed, and
therefore did not see the stored profile, and the whole mess was averted.
To make this work on a machine that does have access to stored profiles,
disable use of the stored profile even if it exists.
Fixes #22713
Closes scylladb/scylladb#24571
This PR adds an upgrade test for SSTable compression with shared dictionaries, and adds some bits to pylib and test.py to support that.
In the series, we:
1. Mount `$XDG_CACHE_HOME` into dbuild.
2. Add a pylib function which downloads and installs a released ScyllaDB package into a subdirectory of `$XDG_CACHE_HOME/scylladb/test.py`, and returns the path to `bin/scylla`.
3. Add new methods and params to the cluster manager, which let the test start nodes with historical Scylla executables, and switch executables during the test.
4. Add a test which uses the above to run an upgrade test between the released package and the current build.
5. Add `--run-internet-dependent-tests` to `test.py` which lets the user of `test.py` skip this test (and potentially other internet-dependent tests in the future).
(The patch modifying `wait_for_cql_and_get_hosts` is a part of the new test — the new test needs it to test how particular nodes in a mixed-version cluster react to some CQL queries.)
This is a follow-up to #23025, split into a separate PR because the potential addition of upgrade tests to `test.py` deserved a separate thread.
Needs backport to 2025.2, because that's where the tested feature is introduced.
Fixes #24110
Closes scylladb/scylladb#23538
* github.com:scylladb/scylladb:
test: add test_sstable_compression_dictionaries_upgrade.py
test.py: add --run-internet-dependent-tests
pylib/manager_client: add server_switch_executable
test/pylib: in add_server, give a way to specify the executable and version-specific config
pylib: pass scylla_env environment variables to the topology suite
test/pylib: add get_scylla_2025_1_executable()
pylib/scylla_cluster: give a way to pass executable-specific options to nodes
dbuild: mount "$XDG_CACHE_HOME/scylladb"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).
Fixes#24513
This change prepares the ground for unifying state updates across Raft-bound subsystems. It introduces schema_applier, which in the future will become a generic interface for applying mutations in Raft.
Pulling `database::apply()` out of the schema merging code will allow batching changes to subsystems. Future generic code will first call `prepare()` on all implementations, then a single `database::apply()`, and then `update()` on all implementations. Then, on each shard, it will call `commit()` for all implementations without preemption, so that the change is observed as atomic across all subsystems, and then `post_commit()`.
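The intended call order can be sketched as below. The phase names come from the description above; the classes, the driver function and the choice to run `post_commit()` once per implementation after all shards have committed are assumptions of this sketch (the text does not say whether `post_commit()` is per shard).

```python
class RecordingSubsystem:
    """Hypothetical subsystem that records each phase it is asked to run."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def prepare(self):
        self.log.append(f"prepare:{self.name}")

    def update(self):
        self.log.append(f"update:{self.name}")

    def commit(self, shard):
        self.log.append(f"commit:{self.name}@{shard}")

    def post_commit(self):
        self.log.append(f"post_commit:{self.name}")


def apply_change(subsystems, database_apply, shards):
    for s in subsystems:
        s.prepare()
    database_apply()  # one batched write covering all subsystems
    for s in subsystems:
        s.update()
    for shard in shards:
        # In the real system this inner loop must not be preempted, so that
        # a shard observes all subsystems change together.
        for s in subsystems:
            s.commit(shard)
    for s in subsystems:
        s.post_commit()


log = []
subsystems = [RecordingSubsystem("types", log), RecordingSubsystem("tables", log)]
apply_change(subsystems, lambda: log.append("database.apply"), shards=[0, 1])
```

After the call, `log` lists the phases in the order they ran, with exactly one `database.apply` entry between the prepare and update phases.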
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Closes scylladb/scylladb#20853
* github.com:scylladb/scylladb:
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closes scylladb/scylladb#24409
Fixes: #24365
This change adds the --scope option to nodetool refresh.
Like in the case of nodetool restore, you can pass either of:
* node - On the local node.
* rack - On the local rack.
* dc - In the datacenter (DC) where the local node lives.
* all (default) - Everywhere across the cluster.
as scope.
The feature is based on the existing load_and_stream paths, so it
requires passing --load-and-stream to the refresh command.
Also, it is not compatible with the --primary-replica-only option.
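The option rules above can be summarized in a short sketch. The scope names and the two constraints are taken from this message; the function, its parameter names and the error texts are hypothetical and do not reflect nodetool's actual implementation.

```python
VALID_SCOPES = ("node", "rack", "dc", "all")


def validate_refresh_options(scope=None, load_and_stream=False, primary_replica_only=False):
    """Return the effective scope, or raise ValueError for a bad combination."""
    if scope is None:
        # --scope was not passed: nothing to check, the default applies.
        return "all"
    if scope not in VALID_SCOPES:
        raise ValueError(f"invalid scope: {scope}")
    if not load_and_stream:
        raise ValueError("--scope requires --load-and-stream")
    if primary_replica_only:
        raise ValueError("--scope cannot be combined with --primary-replica-only")
    return scope


def is_rejected(**options):
    """Small helper for trying out combinations."""
    try:
        validate_refresh_options(**options)
    except ValueError:
        return True
    return False
```

For example, `validate_refresh_options(scope="rack", load_and_stream=True)` is accepted, while passing `scope="rack"` alone is rejected.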
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closes scylladb/scylladb#23861
Currently, the `system.compaction_history` table lacks information such as the type of compaction (cleanup, major, resharding, etc.), the sstable generations involved (in and out), the ID of the shard the compaction was triggered on, and statistics on tombstones purged during compaction.
The series extends the table with the following columns:
- "compaction_type" (text)
- "shard_id" (int)
- "sstables_in" (list<sstableinfo_type>)
- "sstables_out" (list<sstableinfo_type>)
- "total_tombstone_purge_attempt" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)
with a user defined type `sstableinfo_type` that holds the information about sstable file
- generation (uuid)
- origin (text)
- size (long)
Additional statistics stored in the compaction_history have been incorporated in the API `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command.
No backport is required. It extends the existing compaction history output.
Fixes https://github.com/scylladb/scylladb/issues/3791
Closes scylladb/scylladb#21288
* github.com:scylladb/scylladb:
nodetool: Refactor of compactionhistory_operation
nodetool: Add more stats into compactionhistory output
api/compaction_manager: Extend compaction_history api
compaction: Collect tombstone purge stats during compaction
compacting_reader: Extend to accept tombstone purge statistics
mutation_compactor: Collect tombstone purge attempts
compaction_garbage_collector: Extend return type of max_purgeable_fn
compaction: Extend compaction_result to collect more information
system_keyspace: Upgrade compaction_history table
system_keyspace: Create UDT: sstableinfo_type
system_keyspace: Extract compaction_history struct
system_keyspace: Squeeze update_compaction_history parameters
compaction/compaction_manager: update_history accepts compaction_result as rvalue
Simplify code by using std::apply, which unpacks a std::array into
separate items and passes them to a callable. This simplifies
code that looks like:
    fmt::print(std::cout, fmt::runtime(header_row_format.c_str()),
               header_row[0], header_row[1], header_row[2], header_row[3],
               header_row[4], header_row[5], header_row[6], header_row[7],
               header_row[8], header_row[9], header_row[10], header_row[11],
               header_row[12], header_row[13]);
into something like:
    std::apply(fh, header_row);
Incorporate additional statistics stored in the compaction_history
system table. Depending on the requested format type, the output has
different form.
Remove unnecessary duplicated history_entry struct and instead use
extracted db::compaction_history_entry structure.
Running the cql command: select * from system.compaction_history;
prints sstable's generation type as UUID (e.g. 5a5cf800-b617-11ef-a97d-8438c36f0e31),
see generation_type::data_value() which is different than its fmt
format (e.g. 3glx_0srx_1pasg2ksepk902v8dt). Therefore, to unify
the outputs, generation_type is converted to data_value before
it is printed.
Negative load sizes don't make sense, but we've seen a case in
production, where a negative number was returned by ScyllaDB REST API,
so be prepared to handle these too.
Fixes: scylladb/scylladb#24134
Closes scylladb/scylladb#24135
Any empty object of the json::json_list type has its internal
_set variable assigned to false which results in such objects
being skipped by the json::json_builder.
Hence, the json returned by the api GET /compaction_manager/compaction_history
does not contain the field `rows_merged` if a cell in the
system.compaction_history table is null or an empty list.
In such cases, executing the command `nodetool compactionhistory`
will result in a crash with the following error message:
`error running operation: rjson::error (JSON assert failed on condition 'false'`
The patch fixes it by checking if the json object contains the
`rows_merged` element before processing. If the element does
not exist, the nodetool will now produce an empty list.
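The defensive check can be sketched as follows. The actual fix is in nodetool's C++ code; this Python illustration only shows the idea, and the shape of the example entries is an assumption.

```python
def rows_merged_of(entry: dict) -> list:
    """Return `rows_merged`, or an empty list when the server omitted it."""
    if "rows_merged" not in entry:
        return []
    return entry["rows_merged"]


# Hypothetical history entries: the second one lacks the field, as happens
# when the cell is null or an empty list.
with_field = {"id": "example-1", "rows_merged": [{"key": 1, "value": 10}]}
without_field = {"id": "example-2"}
```

Both entries can then be printed through the same code path, with the second one producing an empty list.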
Fixes https://github.com/scylladb/scylladb/issues/23540
Closes scylladb/scylladb#23514
Pass through the local containers directory (it cannot
be bind-mounted to /var/lib/containers since podman checks
the path hasn't changed) with overrides to the paths. This
allows containers to be created inside the dbuild container,
so we can enlist pre-packaged software (such as opensearch)
in test.py. If the container images are already downloaded
in the host, they won't be downloaded again.
It turns out that the container ecosystem doesn't support
nested network namespaces well, so we configure the outer
container to use host networking for the inner containers.
It's useful anyway.
The frozen toolchain now installs podman and buildah so
there's something to actually drive those nested containers.
We disable weak dnf dependencies to avoid installing qemu.
The frozen toolchain is regenerated with optimized clang from
https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz
Closes scylladb/scylladb#24020
compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would put unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.
Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.
There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.
Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.
There is another issue with putting all dicts on shard 0: it eats
an asymmetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).
It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.
If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.
So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
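A small sketch of the two points above: picking an owner shard inside a NUMA group from a hash of the dictionary contents, and the 1% versus 100% arithmetic. The hash function, the function names and the shard numbering are assumptions of this sketch, not the ones used by the patch.

```python
import hashlib


def dict_owner_shard(dict_blob: bytes, numa_group_shards: list) -> int:
    """Pick the owning shard within one NUMA group from the dict contents.

    SHA-256 is used only to get a stable hash for the illustration.
    """
    digest = hashlib.sha256(dict_blob).digest()
    index = int.from_bytes(digest[:8], "little") % len(numa_group_shards)
    return numa_group_shards[index]


def dict_memory_percent(total_dict_bytes, shard_memory_bytes, n_shards):
    """Percent of each shard's memory taken by dicts spread evenly over shards."""
    return 100 * total_dict_bytes // (n_shards * shard_memory_bytes)


shard_memory = 1 << 30         # example value: 1 GiB per shard
total_dicts = shard_memory     # 1% of memory on each of 100 shards
big_node = dict_memory_percent(total_dicts, shard_memory, n_shards=100)
small_node = dict_memory_percent(total_dicts, shard_memory, n_shards=1)

owner = dict_owner_shard(b"example dictionary contents", [4, 5, 6, 7])
```

Because the owner depends only on the dictionary contents and the group's shard list, every shard in the group computes the same owner without coordination.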
New functionality, added to a feature which isn't in any stable branch yet. No backporting.
Closes scylladb/scylladb#23590
* github.com:scylladb/scylladb:
test: add test/boost/sstable_compressor_factory_test
compress: add some test-only APIs
compress: rename sstable_compressor_factory_impl to dictionary_holder
compress: fix indentation
compress: remove sstable_compressor_factory_impl::_owner_shard
compress: distribute compression dictionaries over shards
test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
test: remove sstables::test_env::do_with()
There are only two callers of the method, and the one that wants
validation (sstable::load()) can do it on its own. This lets the
other caller (the schema loader) be simpler and shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#24038
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.