This patch adds separate group for vector search parameters in the
documentation and fixes small typos and formatting.
Fixes: SCYLLADB-77.
Closesscylladb/scylladb#27385
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.
To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)
No backport is necessary.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918
Closes scylladb/scylladb#26121
* github.com:scylladb/scylladb:
test/alternator: Enable the tests failing because of #6918
alternator, cdc: Don't emit events for no-op removes
alternator, cdc: Don't emit an event for equal items
alternator/streams, cdc: Differentiate item replace and item update in CDC
alternator: Change the return type of rmw_operation_return
config: Add alternator_streams_strict_compatibility flag
cdc: Don't split a row marker away from row cells
This PR adds support for limiting the maximum shares allocated to a
compaction scheduling class by the compaction controller. It introduces
a new configuration parameter, compaction_max_shares, which, when set
to a non zero value, will cap the shares allocated to compaction jobs.
This PR also exposes the shares computed by the compaction controller
via metrics, for observability purposes.
Fixes https://github.com/scylladb/scylladb/issues/9431
Enhancement. No need to backport.
NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696
Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50)
Closesscylladb/scylladb#27024
* github.com:scylladb/scylladb:
db/config: introduce new config parameter `compaction_max_shares`
compaction_manager:config: introduce max_shares
compaction_controller: add configurable maximum shares
compaction_controller: introduce `set_max_shares()`
Add support for the new configuration parameter `compaction_max_shares`,
and update the compaction manager to pass it down to the compaction
controller when it changes. The shares allocated to compaction jobs will
be limited by this new parameter.
Fixes#9431
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file
To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.
Fixes: VECTOR-327
This change adds support for secondary vector store clients, typically
located in different availability zones. Secondary clients serve as
fallback targets when all primary clients are unavailable.
New configuration option allows specifying secondary client addresses
and ports.
Fixes: VECTOR-187
Closesscylladb/scylladb#26484
This reverts commit 43738298be.
This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.
Closesscylladb/scylladb#27067Fixes#27047
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: #24664Closesscylladb/scylladb#26330
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.
For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.
In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": ְְְAfter the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys.
This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.
The first patch is the implementation of the the feature - the new configuration option, the metrics and the log messages, the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternaor_enforce_authorization` from false to true.
Fixes#25308
This is a feature that users need, so it should probably be backported to live branches.
Closesscylladb/scylladb#25457
* github.com:scylladb/scylladb:
docs/alternator: explain alternator_warn_authorization
test/alternator: tests for new auth failure metrics and log messages
alternator: add alternator_warn_authorization config
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).
Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using with a listener callback.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
With this flag enabled, Alternator Streams produces more accurate event
types:
- nop operations (i.e. replacing an item with an identical one, deleting
a nonexistent item) don't produce an event,
- updates of an existing item produce a MODIFY event, instead of INSERT,
- etc.
This flag affects the internal behaviour of some operations, i.e.
Alternator may select a preimage and propagate it to CDC (in contrary to
CDC making the request), or do extra item comparisons (i.e. compare the
existing item with the new one). These operations may be costly, and
users that don't use Streams won't need them.
This flag is live-updatable. An operation reads this flag once, and uses
its value for the entire operation.
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435
As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.
Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.
For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Before this patch, the configuration alternator_enforce_authorization
is a boolean: true means enforce authentication checks (i.e., each
request is signed by a valid user) and authorization checks (the user
who signed the request is allowed by RBAC to perform this request).
This patch adds a second boolean configuration option,
alternator_warn_authorization. When alternator_enforce_authorization
is false but alternator_warn_authorization is true, authentication and
authorization checks are performed as in enforce mode, but failures
are ignored and counted in two new metrics:
scylla_alternator_authentication_failures
scylla_alternator_authorization_failures
additionally,also each authentication or authorization error is logged as
a WARN-level log message. Some users prefer those log messages over
metrics, as the log messages contain additional information about the
failure that can be useful - such as the address of the misconfigured
client, or the username attempted in the request.
All combinations of the two configuration options are allowed:
* If just "enforce" is true, auth failures cause a request failure.
The failures are counted, but not logged.
* If both "enforce" and "warn" are true, auth failures cause a request
failure. The failures are both counted and logged.
* If just "warn" is true, auth failures are ignored (the request
is allowed to compelete) but are counted and logged.
* If neither "enforce" nor "warn" are true, no authentication or
authorization check are done at all. So we don't know about failures,
so naturally we don't count them and don't log them.
This patch is fairly straightforward, doing mainly the following
things:
1. Add an alternator_warn_authorization config parameter.
2. Make sure alternator_enforce_authorization is live-updatable (we'll
use this in a test in the next patch). It "almost" was, but a typo
prevented the live update from working properly.
3. Add the two new metrics, and increment them in every type of
authentication or authorization error.
Some code that needs to increment these new metrics didn't have
access to the "stats" object, so we had to pass it around more.
4. Add log messages when alternator_warn_authorization is true.
5. If alternator_enforce_authorization is false, allow the auth check
to allow the request to proceed (after having counted and/or logged
the auth error).
A separate patch will follow and add documentation suggesting to users
how to use the new "warn" options to safely switch between non-enforcing
to enforcing mode. Another patch will add tests for the new configuration
options, new metrics and new log messages.
Fixes#25308.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In #24031 users complained, that trace message is truncated, namely it's
no longer json parsable and table name might not be part of the output.
This path enables users to configure maximum size of trace message.
In case user wanted `table` name, but didn't care about message size,
#26634 will help.
- add configuration varable `alternator_max_users_query_size_in_trace_output`
with default value of 4096 (4 times old default value).
- modify `truncated_content_view` function to use new configuration
variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
size, previously trunctation would also happen when data arrived in
more than one chunk
- update `truncated_content_view` to better handle truncated value
(limit number of copies)
- fix `scylla_config_read` call - call to `query` for a configuration
name that is not existing will return `Items` array empty
(but present) - this would raise array access exception few lines
below.
- add test
Refs #26634
Refs #24031Closesscylladb/scylladb#26618
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.
Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.
This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.
Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).
Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.
Fixes#25359Fixes#26453Closesscylladb/scylladb#26186
* github.com:scylladb/scylladb:
docs::dev::object_storage: Add some initial info on GS storage
docs/dev: Add mention of (nested) docker usage in testing.md
sstables::object_storage_client: Forward memory limit semaphore to GS instance
utils::gcp::object_storage: Add optional memory limits to up/download
sstables::object_storage_client: Add multi-upload support for GS
utils::gcp::storage: Add merge objects operation
test_backup/test_basic: Make tests multiplex both s3 and gs backends
test::cluster::conftest: Add support for multiple object storage backends
boost::gcs_storage_test: reindent
boost::gcs_storage_test: Convert to use fixture
tests::boost: Add GS object storage cases to mirror S3 ones
tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
sstables::object_storage_client: Add google storage implementation
test_services: Allow testing with GS object storage parameters
utils::gcp::gcp_credentials: Add option to create uninitialized credentials
utils::gcp::object_storage: Make create_download_source return seekable_data_source
utils::gcp::object_storage: Add defensive copies of string_view params
utils::gcp::object_storage: Add missing retry backoff increate
utils::gcp::object_storage: Add timestamp to object listing
utils::gcp::object_storage: Add paging support to list_objects
object_storage_client: Add object_name wrapper type
utils::gcp::object_storage: Add optional abort_source
utils::rest::client: Add abort_source support
sstables: Use object_storage_client for remote storage
sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
s3::upload_progress: Promote to general util type
storage_options: Abstract s3 to "object_storage" and add gs as option
sstables::file_io_extension: Change "creator" callback to just data_source
utils::io-wrappers: Add ranged data_source
utils::io-wrappers: Add file wrapper type for seekable_source
utils::seekable_source: Add a seekable IO source type
object_storage_endpoint_param: Add gs storage as option
config: break out object_storage_endpoint_param preparing for multi storage
The series adds an experimental flag for strongly consistent tables and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.
Closesscylladb/scylladb#26116
* github.com:scylladb/scylladb:
schema: Allow configuring consistency setting for a keyspace
db: experimental consistent-tablets option
In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or
secondary index, it needs to perform internal scans. It uses an "internal
page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000.
There was an ad-hoc and undocumented way to override this default in C++
tests, using functions in test/lib/select_statement_utils.hh, but it
was so non-obvious that the test that most needed to override this
default - the very slow test test_indexing_paging_and_aggregation which
would have been must faster with a lower setting - never used it.
So in this patch we replace the ad-hoc configuration functions by a
bona-fide Scylla configuration option named "select_internal_page_size".
The few C++ tests that used the old configuration functions were
modified to use the new configuration parameters. The slow test
test_indexing_paging_and_aggregation still doesn't use the new
configuration to become faster - we'll do this in the next patch.
Another benefit of having this "internal page size" as a configuration
option is that one day a user might realize that the default choice
10,000 is bad for some reason (which I can't envision right now), so
having it configurable might come it handy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Moves the config wrapper to own file (to reduce recompilation for modifying)
and refactors to handle extending this parameter to non-s3 endpoint configs.
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.
After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.
In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.
After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.
High-level implementation strategy:
1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
and add a new one.
5. Update the user documentation.
Fixesscylladb/scylladb#23030
Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.
Closesscylladb/scylladb#25802
* github.com:scylladb/scylladb:
view: Stop requiring experimental feature
db/view: Verify valid configuration for tablet-based views
db/view: Require rf_rack_valid_keyspaces when creating view
test/cluster/random_failures: Skip creating secondary indexes
test/cluster/mv: Mark test_mv_rf_change as skipped
test/cluster: Adjust MV tests to RF-rack-validity
test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
db/view: Name requirement for views with tablets
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS`
enabled, and using the experimental feature `views-with-tablets`.
We drop the last requirement.
We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.
Fixesscylladb/scylladb#23030
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.
This patch moves 8 files out of utils/:
utils/advanced_rpc_compressor.hh
utils/advanced_rpc_compressor.cc
utils/advanced_rpc_compressor_protocol.hh
utils/stream_compressor.hh
utils/stream_compressor.cc
utils/dict_trainer.cc
utils/dict_trainer.hh
utils/shared_dict.hh
These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.
Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.
Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
"netw::" namespace used by RPC. Also needed to change a bunch
of callers for the new namespace. Also, had to add "utils::"
explicitly in several places which previously assumed the
current namespace is "utils::".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25149
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.
This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.
(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).
The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.
This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.
Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 70 18.0 18 128
500001 1 1 647 19.0 19 132
0 1000000 1 748 15.0 15 116
0 500000 2 372 29.0 29 284
0 250000 4 227 56.0 56 504
0 125000 8 116 106.0 106 928
0 62500 16 67 195.0 195 1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 51 5.1 5 20
500001 1 1 64 5.3 5 20
0 1000000 1 679 4.0 4 16
0 500000 2 492 8.0 8 88
0 250000 4 804 16.0 16 232
0 125000 8 409 31.0 31 516
0 62500 16 97 54.0 54 1056
```
Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
External tests:
I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.
New functionality, no backport needed.
Closesscylladb/scylladb#26215
* github.com:scylladb/scylladb:
test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
test/cluster: add test_bti_index.py
test: prepare bypass_cache_test.py for `ms` sstables
sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
db/config: expose "ms" format to the users via database config
test: in Python tests, prepare some sstable filename regexes for `ms`
sstables: add `ms` to `all_sstable_versions`
test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
test/lib/index_reader_assertions: skip some row index checks for BTI indexes
test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
test/resource: add `ms` sample sstable files for relevant tests
test/boost/sstable_compaction_test: prepare for `ms` sstables.
test/boost/index_reader_test: prepare for `ms` sstables
test/boost/bloom_filter_tests: prepare for `ms` sstables
test/boost/sstable_datafile_test: prepare for `ms` sstables
test/boost/sstable_test: prepare for `ms` sstables.
sstables: introduce `ms` sstable format version
tools/scylla-sstable: default to "preferred" sstable version, not "highest"
sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
sstables/mx: make Index and Summary components optional
sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
sstables: implement an alternative way to rebuild bloom filters for sstables without Index
utils/bloom_filter: add `add(const hashed_key&)`
sstables: adapt estimated_keys_for_range to sstables without Summary
sstables: make `sstable::estimated_keys_for_range` asynchronous
sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
replica/database: add table::estimated_partitions_in_range()
sstables/mx: implement sstable::has_partition_key using a regular read
sstables: use BTI index for queries, when present and enabled
sstables/mx/writer: populate BTI index files
sstables: create and open BTI index files, when enabled
sstables: introduce Partition and Rows component types
sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
Extend the `sstable_format` config enum with a "ms" value,
and, if it's enabled (in the config and in cluster features),
use it for new sstables on the node.
(Before this commit, writing `ms` sstables should only be possible
in unit tests, via internal APIs. After this commit, the format
can be enabled in the config and the database will write it during
normal operation).
As of this commit, the new format is not the default yet.
(But it will become the default in a later commit in the same series).
Before this patch, every expression in Alternator's requests was parsed from string to adequate structure.
This patch enables caching, where input expression strings are mapped to parsed template structures.
Every new valid (parsable) expression is added to the cache. The cache has limited (configurable) size - when it is reached, the least recently used entry is removed.
When requested expression is in the cache, the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
Caching is implemented for all expression types. The cache is per shard - shared for all operations, expression types, tables, users.
Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled).
Basic metrics (total count of hits and misses for each expression type and number of evicted enries) are implemented.
Cache features are tested in boost unit tests and overall expression caching is tested with Python tests - both mostly rely on metrics.
refs #5023
`perf-alternator` test shows improvement (median):
| test | throughput | instructions_per_op | cpu_cycles_per_op | allocs_per_op |
| ------ | ---------------- | ----------------------------- | --------------------------- | ------------------- |
| read | +6.0% | -8.5% | -7.0% | -4.9% |
| write | +13.4% | -17.6% | -14.7% | -7.4% |
| write(lwt) | +12.7% | -7.9% | -6.9% | -2.8% |
| write_rwm | +5.4% | -10.5% | -7.3% | -4.1% |
"read" had a ProjectionExpression with 10 column names, "write" had a UpdateExpression with 10 column names and "write_rmw" had both ConditionExpression and UpdateExpression.
This patch also includes minor refactoring of other expressions related tests (https://github.com/scylladb/scylladb/issues/22494) - use `test_table_ss` instead of `test_table`.
Fixes#25855.
This is new feature - no backporting.
Closesscylladb/scylladb#25176
* github.com:scylladb/scylladb:
alternator: use expression caching
alternator: adds expression cache implementation
utils: extend lru_string_map
utils: add lru_string_map
alternator/expressions: error on parsing empty update expression
alternator/expressions: fix single value condition expression parsing
test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests.
Before this patch, every expression in Alternator's requests was parsed from string to adequate structure.
This patch enables caching - all calls to parse an expression (all types) are proxied through the cache.
New expression is added to the cache, the least recently used entry (above cache size) is removed.
For existing entries the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
The cache is per shard - shared for all operations, expression types, tables, users.
Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled).
Added Python tests are based on metrics.
ScyllaDB offers the `compression` DDL property for configuring
compression per user table (compression algorithm and chunk size). If
not specified, the default compression algorithm is the LZ4Compressor
with a 4KiB chunk size (refer to the default constructor for
`compression_parameters`). The same default applies to system tables as
well.
Add a new configuration option to allow customizing the default for user
tables. Use the previously hardcoded default as the new option's default
value.
Note that the option has no effect on ALTER TABLE statements. An altered
table either inherits explicit compression options from the CQL
statement, or maintains its existing options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
SSTable compression is currently configurable only per table, via the
`compression` property in CREATE/ALTER TABLE statements. This is
represented internally via the `compression_parameters` class. We plan
to offer the same options via the configuration as well, to make the
default compression method for user tables configurable.
This patch prepares the ground by making the `compression_parameters`
usable as a `config_file::named_value`, namely:
* Define an extraction operator (required by `boost::program_options`
for parsing the options from command line).
* Define a formatter (required by `named_value::operator()`).
* Define a template specialization for `config_type_for` (required by
`named_value` constructor).
* Define a yaml converter (required for parsing the options from
scylla.yaml).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.
This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.
This change must be included before the next release.
Otherwise, users will be affected by a breaking change.
References: VECTOR-187
Closesscylladb/scylladb#26033
Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
The option defines the threshold at which the defensive mechanisms
preventing nodes from running out of space, e.g. rejecting user
writes shall be activated.
Its default value is 98% of the disk capacity.
This series adds support for a DynamoDB-compatible Write Capacity Unit (WCU) calculation in Alternator by introducing an optional forced read-before-write mechanism.
Alternator's model differs from DynamoDB, and as a result, some write operations may report lower WCU usage compared to what DynamoDB would report. While this is acceptable in many cases, there are scenarios where users may require accurate WCU reporting that aligns more closely with DynamoDB's behavior.
To address this, a new configuration option, alternator_force_read_before_write, is introduced. When enabled, Alternator will perform a read before executing PutItem, UpdateItem, and DeleteItem operations. This allows it to take the existing item size into account when computing the WCU. BatchWriteItem support is also extended to use this mechanism. Because BatchWriteItem does not support returning old items directly, several internal changes were made to support reading previous item sizes with minimal overhead. Reads are performed at consistency level LOCAL_ONE for efficiency, and the WCU calculation is now done in multiple stages to accurately account for item size differences.
In addition to the implementation changes, test coverage was added to validate the new behavior. These tests confirm that WCU is calculated based on the larger of the old and new items when read-before-write is active, including for BatchWriteItem.
This feature comes with performance overhead and is therefore disabled by default. It can be enabled at runtime via the system.config table and should be used only when precise WCU tracking is necessary.
**New feature, no need to backport**
Closesscylladb/scylladb#24436
* github.com:scylladb/scylladb:
alternator/test_returnconsumedcapacity.py: Test forced read before write
alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write
executor.cc: get_previous_item with consistency level
executor: Extend API of put_or_delete_item
alternator/executor.cc: Accurate WCU for put, update, delete
config: add alternator_force_read_before_write
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows to dynamically enable, disable, or switch the vector search node endpoint on the fly.
In commit 44a1daf we added the ability to read system tables through
the DynamoDB API (actually, the Scan and Query requests only).
This ability is useful for tests, and can also be useful to users who
want to read information that is only available through system tables.
This patch adds support also for *writing* into system tables. This will
be useful for Alternator tests, were we want to temporarily change
some live-updatable configuration option - and so far haven't been
able to do that like we did do in some cql-pytest tests.
For reasons explained in issue #23218, only superuser roles are allowed to
write to system tables - it is not enough for the role to be granted
MODIFY permissions on the system table or on ALL KEYSPACES. Moreover,
the ability to modify system tables carries special risks, so this
patch only allows writes to the system tables if a new configuration
option "alternator_allow_system_table_write" turned on. This option is
turned off by default.
This patch also includes a test for this new configuration-writing
capability. The test scripts test/alternator/run and test.py now
run Scylla with alternator_allow_system_table_write turned on, but
the new test can also run without this option, and will be skipped
in that case (to allow running the test suite against some manually-
run instance of Scylla).
Fixes: #12348
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch introduces a new configuration parameter,
`alternator_force_read_before_write`, which forces Alternator to perform
a read-before-write on all write operations (`PutItem`, `UpdateItem`, and
`DeleteItem`), even when not strictly required.
Enabling this option ensures abetter DynamoDB compatibility in WCU
calculation. by accounting for the size of the existing item, as done
in DynamoDB.
This option introduces performance overhead and should be used with care.
The parameter is runtime-configurable and can be toggled via CQL:
UPDATE system.config SET value = 'true' WHERE name = 'alternator_force_read_before_write';
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they could be.
In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixesscylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
Closesscylladb/scylladb#25032
* github.com:scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
This commits introduces an config option 'tablet_load_stats_refresh_interval_in_seconds'
that allows overriding the default value without using error injection.
Fixesscylladb/scylladb#24641Closesscylladb/scylladb#24746
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed, or a timeout happened the connections are fully closed.
Fixesscylladb/scylladb#24481
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.
After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:
```
throw std::runtime_error(
"Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
"If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```
CPU scheduling has been with us since 641aaba12c
(2017), and no one ever disables it. Likely nothing really works without
it.
Make it mandatory and mark the option unused.
Closesscylladb/scylladb#24894
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.
It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.
For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.
This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.
Reference: VS-47 VS-45
For reasons, we want to be able to disallow dictionary-aware compressors
in chosen deployments.
This patch adds a knob for that. When the knob is disabled,
dictionary-aware compressors will be rejected in the validation
stage of CREATE and ALTER statements.
Closesscylladb/scylladb#24355
Back in 2017 (5a2439e702), we introduced a check for large
allocations as they can stall the memory allocator. The warning
threshold was set at 1 MB. Since then many fixes for large allocations
went in and it is now time to reduce the threshold further.
We reduce it here to 128 kB, the natural allocation size for the
system. A quick run showed no warnings.
Closesscylladb/scylladb#23975