The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. First, legacy vnode tables are less
important to support. Second, incremental repair for vnodes is much
harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
SSTable is either fully repaired or not repaired at all, which is a huge
simplification.
This patch uses the repaired_at field from the sstables::statistics component to
mark an SSTable as repaired. It uses a virtual clock as the repair
timestamp, i.e., a monotonically increasing number for the
repaired_at field of an SSTable and the sstables_repaired_at column in the
system.tablets table. Note that when an SSTable is not repaired, its
repaired_at field is set to the default value 0. The in-memory
being_repaired field of an SSTable is used to explicitly mark
that the SSTable has been selected for repair. The following variables are used for
incremental repair:
The on-disk repaired_at field of an SSTable is used.
- A 64-bit number that increases sequentially
The sstables_repaired_at column is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the SSTable is repaired
The in-memory being_repaired field of an SSTable is added.
- A repair UUID that tells which SSTables have participated in the repair
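The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the patch's code: the `SSTable` class and the `is_repaired` and `select_for_repair` helpers are hypothetical names, and excluding `repaired_at == 0` explicitly is an assumption derived from the note that 0 is the default for unrepaired SSTables.

```python
from dataclasses import dataclass
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class SSTable:
    """Hypothetical stand-in for an sstable; field names follow the text."""
    name: str
    repaired_at: int = 0                   # on-disk; 0 means "not repaired"
    being_repaired: Optional[UUID] = None  # in-memory; set while selected


def is_repaired(sst: SSTable, sstables_repaired_at: int) -> bool:
    # Assumption: 0 is never a valid repair timestamp, so it is excluded
    # before applying the repaired_at <= sstables_repaired_at rule.
    return sst.repaired_at != 0 and sst.repaired_at <= sstables_repaired_at


def select_for_repair(ssts: List[SSTable], sstables_repaired_at: int,
                      repair_id: UUID) -> List[SSTable]:
    """Pick the unrepaired sstables and tag them with the repair UUID."""
    selected = [s for s in ssts if not is_repaired(s, sstables_repaired_at)]
    for s in selected:
        s.being_repaired = repair_id
    return selected


tablet_repaired_at = 2  # sstables_repaired_at value stored for the tablet
sstables = [SSTable("a", repaired_at=1), SSTable("b", repaired_at=2),
            SSTable("c"), SSTable("d", repaired_at=3)]
picked = select_for_repair(sstables, tablet_repaired_at, uuid4())
print([s.name for s in picked])
```

Because whole SSTables are either selected or skipped, the repaired ones (`a` and `b` here) never need to be read, which is where the file-based method gets its efficiency.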
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting the repair job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repair job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
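The two speedup figures above follow from dividing the first-run time by the time of the later runs. A small sketch of that arithmetic, using only the timings quoted in the text (the `speedup` helper is a hypothetical name):

```python
def speedup(first_run_seconds: float, later_run_seconds: float) -> float:
    """Ratio between the first incremental run and the subsequent runs."""
    return first_run_seconds / later_run_seconds


medium = speedup(183 * 60, 48)  # 183 minutes vs. 48 seconds
small = speedup(110, 1.5)       # 110 seconds vs. 1.5 seconds
# Truncating to whole numbers gives the 228X and 73X quoted above.
print(int(medium), int(small))
```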
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset    Non-incremental repair (minutes)
1.3 TiB    31:07
3.5 TiB    25:10
5.0 TiB    19:03
6.3 TiB    31:42
Dataset    Incremental repair (minutes)
1.3 TiB    24:32
3.0 TiB    13:06
4.0 TiB    5:23
4.8 TiB    7:14
5.6 TiB    3:58
6.3 TiB    7:33
7.0 TiB    6:55
Fixes #22472
Closes scylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
This file provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.
This translation is performed whenever a new decorated key or clustering block
is added to a BTI index, and whenever a BTI index is queried for a range of positions.
For a description of the encoding, see
fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)
The translation logic, with all the fragment awareness, lazy
evaluation and avoidable copies, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.
`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.
This commit adds encoding-agnostic functions which drive
the `trie::node_reader` in a loop to traverse the whole branch.
Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.
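To make the "free functions plus a trail" idea concrete, here is a minimal in-memory sketch in Python. It is not the Scylla implementation: the real functions drive an on-disk `trie::node_reader`, while this sketch walks plain Python objects, and the `Node` class, the tuple layout of the trail and all signatures are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    payload: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


# Each trail entry is (node, index of the child edge taken); the last entry
# uses -1, meaning "the cursor is positioned on this node itself".
Trail = List[Tuple[Node, int]]


def insert(root: Node, key: str) -> None:
    node = root
    for ch in key:
        node = node.children.setdefault(ch, Node())
    node.payload = key


def traverse(root: Node, key: str) -> Trail:
    """Build the ancestor trail leading to an existing key."""
    trail: Trail = [(root, -1)]
    for ch in key:
        node = trail[-1][0]
        if ch not in node.children:
            raise KeyError(key)
        trail[-1] = (node, sorted(node.children).index(ch))
        trail.append((node.children[ch], -1))
    return trail


def step(trail: Trail) -> bool:
    """Advance to the next key in order; False when past the end."""
    while trail:
        node, idx = trail[-1]
        edges = sorted(node.children)
        if idx + 1 < len(edges):
            trail[-1] = (node, idx + 1)
            trail.append((node.children[edges[idx + 1]], -1))
            if trail[-1][0].payload is not None:
                return True
        else:
            trail.pop()
    return False


def step_back(trail: Trail) -> bool:
    """Move to the previous key in order; False when before the start."""
    while len(trail) > 1:
        trail.pop()
        node, idx = trail[-1]
        if idx > 0:
            # Enter the previous sibling and descend to its last leaf.
            trail[-1] = (node, idx - 1)
            trail.append((node.children[sorted(node.children)[idx - 1]], -1))
            while trail[-1][0].children:
                n = trail[-1][0]
                edges = sorted(n.children)
                trail[-1] = (n, len(edges) - 1)
                trail.append((n.children[edges[-1]], -1))
            return True  # leaves built by insert() always carry a payload
        trail[-1] = (node, -1)
        if node.payload is not None:
            return True
    return False


def current_key(trail: Trail) -> str:
    return "".join(sorted(n.children)[i] for n, i in trail[:-1])


root = Node()
for k in ["a", "ab", "b", "ba", "bc"]:
    insert(root, k)
cursor = traverse(root, "ab")
step(cursor)
print(current_key(cursor))
```

Keeping the trail as a plain list that the functions mutate mirrors the point made above: each function can be exercised in a test on its own, without constructing a cursor object.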
Closes scylladb/scylladb#25396
This is the next part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25154
Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here.
The new code added here is not used for anything yet, but it's posted as a separate PR
to keep things reviewably small.
This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes.
It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR)
into the on-disk format, and the logic for traversing the on-disk nodes during a read.
New functionality, no backporting needed.
Closes scylladb/scylladb#25317
* github.com:scylladb/scylladb:
sstables/trie: add tests for BTI node serialization and traversal
sstables/trie: implement BTI node traversal
sstables/trie: implement BTI serialization
utils/cached_file: add get_shared_page()
utils/cached_file: replace a std::pair with a named struct
Adds tests which check that nodes serialized by `bti_node_sink`
are readable by `bti_node_reader` with the right result.
(Note: there are no tests which check compatibility of the encoded nodes
with Cassandra or with handwritten hexdumps. There are only tests
for mutual compatibility between Scylla's writers and readers.
This can be considered a gap in testing.)
Before this series, the "system.clients" virtual table listed active connections (and their various properties, like client address, logged-in username and client version) only for CQL requests. This series also adds Alternator clients to system.clients. One interesting use case of this new feature is understanding exactly which SDK a user is using, without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field.
Unlike CQL, where the logged-in username, driver name, etc. apply to a complete connection, in the Alternator API different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator *connections*, we list the currently active *requests*.
The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch is a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing test of CQL's system.clients.
Fixes #24993
This patch adds a new feature, so it doesn't require a backport. Nevertheless, if we want it to reach existing customers more quickly, so that we can better understand their use cases by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that runs on every Alternator request, so a bug in it can cause problems for every Alternator request.
Closes scylladb/scylladb#25178
* github.com:scylladb/scylladb:
test/cqlpy: slightly strengthen test for system.clients
generic_server: use utils::scoped_item_list
docs/alternator: document the system.clients system table in Alternator
alternator: add test for Alternator clients in system.clients
alternator: list active Alternator requests in system.clients
utils: unit test for utils::scoped_item_list
utils: add a scoped_item_list utility class
utils: add "fatal" version of utils::on_internal_error()
This commit implements routines for traversal of BTI nodes in their
on-disk format.
The `node_reader` concept is currently unused (i.e. not asserted by any
template).
It will only be used in the next PR, which will implement trie cursor
routines parametrized by `node_reader`.
But I'm including it in this PR to make it clear which functions
will be needed by the higher layer.
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).
As of this patch, the new code isn't used for anything yet,
but it is introduced separately from its users to keep PRs small enough
for reviewability.
This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).
It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:
1. The writer operates on chains of characters, rather than single characters.
In Cassandra's implementation, the writer creates one node per character.
A single long key can be translated to thousands of nodes.
We create only one node per key. (Actually we split very long keys into
a few nodes, but that's arbitrary and beside the point).
For BTI's partition key index this doesn't matter.
Since it only stores a minimal unique prefix of each key,
and the trie is very balanced (due to token randomness),
the average number of new characters added per key is very close to 1 anyway.
(And the string-based logic might actually be a small pessimization, since
manipulating a 1-byte string might be costlier than manipulating a single byte).
But the row index might store arbitrarily long entries, and in that case the
character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost
of a single node in the trie_writer is on the order of 800 instructions.
Total processing cost of a single tiny partition during a `upgradesstables`
operation is on the order of 10000 instructions. A small INSERT is on the
order of 40000 instructions.
So processing a single 1000-character clustering key in the trie_writer
could cost as much as 20 INSERTs, which is scary. Even 100-character keys
can be very expensive. With extremely long keys like that, the string-based
logic is more than ~100x cheaper than character-based logic.
(Note that only *new* characters matter here. If two index entries share a
prefix, that prefix is only processed once. And the index is only populated
with the minimal prefix needed to distinguish neighbours. So in practice,
long chains might not happen often. But still, they are possible).
I don't know if it makes sense to care about this case, but I figured the
potential for problems is too big to ignore, so I switched to chain-based logic.
2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
than a full page after revising the estimate, Cassandra splits it in a
different way than us.
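The cost estimate in item 1 above can be checked with a few lines of arithmetic. The two constants are the order-of-magnitude figures quoted in the text; the helper name is hypothetical, and the model (one node per new character, as in the character-based design) is a simplification.

```python
NODE_COST = 800        # instructions per trie_writer node (from the text)
INSERT_COST = 40_000   # instructions per small INSERT (from the text)


def per_character_cost_in_inserts(new_chars: int) -> float:
    """Cost of a key when every new character becomes its own node."""
    return new_chars * NODE_COST / INSERT_COST


print(per_character_cost_in_inserts(1000))  # 20.0
print(per_character_cost_in_inserts(100))   # 2.0
```

With chain-based logic the same key produces one node (or a few, for very long keys), so the per-key cost stays near a single node's cost regardless of key length.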
For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.
The serialization logic is passed to trie_writer via a template parameter.
There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
Refs scylladb/scylladb#19191
New functionality, no backporting needed.
Closes scylladb/scylladb#25154
* github.com:scylladb/scylladb:
sstables: introduce trie_writer
utils/bit_cast: add object_representation()
The previous patch introduced a new utility class, utils::scoped_item_list.
This patch adds a comprehensive unit test for the new class.
We test basic usage of scoped_item_list, its size() and empty() methods,
how items are removed from the list when their handle goes out of scope,
how a handle's move constructor works, how items can be read and written
through their handles, and finally that removing an item during a
for_each_gently() iteration doesn't break the iteration.
One thing I still didn't figure out how to properly test is how removing
an item during *multiple* iterations that run concurrently fixes
multiple iterators. I believe the code is correct there (we just have a
list of ongoing iterations - instead of just one), but haven't found
yet a way to reproduce this situation in a test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fixes #22106
Moves the shared compress components to sstables, and renames them to
match the class type.
Adjusts includes, removing redundant/unneeded ones where possible.
Closes scylladb/scylladb#25103
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.
However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.
For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)
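The `a`, `b`, `c` example above can be played out in a short Python sketch. It models the index as a sorted list of prefixes and the Data file as a sorted list of full keys; both helpers are hypothetical and much simpler than the real index reader.

```python
from typing import List


def conservative_lower_bound(prefixes: List[str], key: str) -> int:
    """Index of the first entry that might hold a key >= `key`.

    An entry whose stored prefix is itself a prefix of `key` is ambiguous
    (its full key may sort before or after `key`), so it must be included.
    """
    for i, prefix in enumerate(prefixes):
        if prefix >= key or key.startswith(prefix):
            return i
    return len(prefixes)


def read_from(full_keys: List[str], prefixes: List[str], key: str) -> List[str]:
    """Data-reader side: start at the wider range, drop superfluous keys."""
    start = conservative_lower_bound(prefixes, key)
    return [k for k in full_keys[start:] if k >= key]


prefixes = ["a", "b", "c"]
print(conservative_lower_bound(prefixes, "bb"))      # 1
print(read_from(["ax", "ba", "cz"], prefixes, "bb"))  # ['cz']
print(read_from(["ax", "bc", "cz"], prefixes, "bb"))  # ['bc', 'cz']
```

In both Data layouts the index returns the same position, and only the reader, which sees the full keys, can tell whether the entry at that position belongs to the result.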
Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.
Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome.
Preparation for new functionality, no backporting needed.
Closes scylladb/scylladb#25093
* github.com:scylladb/scylladb:
sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
test/boost: add a test for inexact index lookups
sstables/mx/reader: allow passing a custom index reader to the constructor
sstables/index_reader: remove advance_to
sstables/mx/reader: handle inexact lookups in `advance_context()`
sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
sstables/index_reader: make the return value of `get_partition_key` optional
sstables/mx/reader: handle "backward jumps" in forward_to
sstables/mx/reader: filter out partitions outside the queried range
sstables/mx/reader: update _pr after `fast_forward_to`
This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping").
In more detail, this patch series consists of:
* Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI).
* The Azure host - the Key Vault endpoint bridge.
* The Azure Key Provider - the interface for the Azure host.
* Unit tests using real Azure resources (credentials and Vault keys).
* Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens).
This is part of the overall effort to support Azure deployments.
Testing done:
* Unit tests.
* Manual test on an Azure VM with a Managed Identity.
* Manual test with credentials from Azure CLI.
* Manual test of `--azure-hosts` cmdline option.
* Manual test of log filtering.
Remaining items:
- [x] Create necessary Azure resources for CI.
- [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201).
Closes https://github.com/scylladb/scylla-enterprise/issues/1077.
New feature. No backport is needed.
Closes scylladb/scylladb#23920
* github.com:scylladb/scylladb:
docs: Document the Azure Key Provider
test: Add tests for Azure Key Provider
pylib: Add mock server for Azure Key Vault
encryption: Define and enable Azure Key Provider
encryption: azure: Delegate hosts to shard 0
encryption: Add Azure host cache
encryption: Add config options for Azure hosts
encryption: azure: Add override options
encryption: azure: Add retries for transient errors
encryption: azure: Implement init()
encryption: azure: Implement get_key_by_id()
encryption: azure: Add id-based key cache
encryption: azure: Implement get_or_create_key()
encryption: azure: Add credentials in Azure host
encryption: azure: Add attribute-based key cache
encryption: azure: Add skeleton for Azure host
encryption: Templatize get_{kmip,kms,gcp}_host()
encryption: gcp: Fix typo in docstring
utils: azure: Get access token with default credentials
utils: azure: Get access token from Azure CLI
utils: azure: Get access token from IMDS
utils: azure: Get access token with SP certificate
utils: azure: Get access token with SP secret
utils: rest: Add interface for request/response redaction logic
utils: azure: Declare all Azure credential types
utils: azure: Define interface for Azure credentials
utils: Introduce base64url_{encode,decode}
As requested in #22102, #22103 and #22105, moved the files and fixed the affected includes and the build system.
Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh
Fixes: #22102
Fixes: #22103
Fixes: #22105
Closes scylladb/scylladb#25082
Define the Azure Key Provider to connect the core EaR business logic
with the Azure-based Key Management implementation (Azure host).
Introduce "AzureKeyProviderFactory" as a new `key_provider` value in the
configuration.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The Azure host manages cryptographic keys using Azure Key Vault.
This patch only defines the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The goal is to mimic the Azure C++ SDK, which offers a variety of
credentials, depending on their type and source.
Declare the following credentials:
* Service Principal credentials
* Managed Identity credentials
* Azure CLI credentials
* Default credentials
Also, define a common exception for SP and MI credentials which are
network-based.
This patch only defines the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Azure authentication is token based - the client obtains an access token
with their credentials, and uses it as a bearer token to authorize
requests to Azure services.
Define a common API for all credential types. The API consists of a
single `get_access_token()` function that returns a new or a
cached access token for some resource URI (which defines the token scope).
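The "new or cached token per resource URI" behaviour can be sketched as follows. This is a Python illustration only, not the C++ interface from the patch: the `AccessToken` and `CachingCredentials` classes, the injected `fetch` and `now` callables, and the example resource URI are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AccessToken:
    value: str
    expires_at: float  # seconds, on the same clock as `now`


class CachingCredentials:
    """Hypothetical credentials object with a per-resource token cache."""

    def __init__(self, fetch: Callable[[str], AccessToken],
                 now: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # credential-specific way to obtain a token
        self._now = now
        self._cache: Dict[str, AccessToken] = {}

    def get_access_token(self, resource_uri: str) -> AccessToken:
        token = self._cache.get(resource_uri)
        if token is None or token.expires_at <= self._now():
            token = self._fetch(resource_uri)  # new token
            self._cache[resource_uri] = token
        return token                           # new or cached


clock = [0.0]
calls = []


def fake_fetch(uri: str) -> AccessToken:
    calls.append(uri)
    return AccessToken(f"token-{len(calls)}", clock[0] + 60)


creds = CachingCredentials(fake_fetch, now=lambda: clock[0])
first = creds.get_access_token("https://vault.azure.net")
second = creds.get_access_token("https://vault.azure.net")  # served from cache
clock[0] = 61.0
third = creds.get_access_token("https://vault.azure.net")   # expired, refetched
print(first.value, second.value, third.value)
```

Only `fetch` differs between credential types in this sketch, which is one way to read the goal of a single common API.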
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Fixes #24873
In the KMIP host, when a connection (socket) is released because our connection pool for the host is full, we currently don't close the connection properly and only rely on destructors.
This just makes sure `release` closes the connection if it neither retains nor caches it.
Also, when running with the PyKMIP fixture, we tested that the port is reachable using a plain socket. This makes Python's SSL layer generate errors, i.e. log noise that looks like actual errors.
Change the test setup to use a proper TLS connection and a proper shutdown to avoid the noisy logs.
This also adds a fixture helper for processes, and moves the EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency.
Closes scylladb/scylladb#24874
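A minimal sketch of the intended `release` behaviour, in Python rather than the C++ of kmip_host. The `Connection` and `Pool` classes are hypothetical, and the sketch models only the pool-full case described above.

```python
from typing import List


class Connection:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Pool:
    """Hypothetical bounded pool of idle connections."""

    def __init__(self, max_size: int) -> None:
        self._max_size = max_size
        self._idle: List[Connection] = []

    def release(self, conn: Connection) -> None:
        if len(self._idle) < self._max_size:
            self._idle.append(conn)  # cached for reuse, stays open
            return
        # Pool is full: the connection is dropped, so close it explicitly
        # instead of leaving the shutdown to whenever it gets destroyed.
        conn.close()


pool = Pool(max_size=1)
kept, dropped = Connection(), Connection()
pool.release(kept)
pool.release(dropped)
print(kept.closed, dropped.closed)
```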
* github.com:scylladb/scylladb:
encryption_test: Make PyKMIP run under seastar::experimental::process
test/lib: Add wrapper helper for test process fixtures
kmip_host: Close connections properly if dropped by pool being full
encryption_at_rest_test: Do port check using TLS
Adds a wrapper for seastar::experimental::process, to help
use external process fixtures in unit tests. Mainly to share
concepts such as line reading of stdout/stderr, and to sync
the shutdown of these. Also adds a small path searcher to
find what you want to run.
This change prepares the ground for unifying state updates across raft-bound subsystems. It introduces schema_applier, which in the future will become a generic interface for applying mutations in raft.
Pulling database::apply() out of the schema merging code will allow batching changes to subsystems. Future generic code will first call prepare() on all implementations, then a single database::apply(), and then update() on all implementations. Then, on each shard, it will call commit() for all implementations without preemption, so that the change is observed as atomic across all subsystems, and then post_commit().
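The phase ordering described above can be sketched as a small driver. This is a Python illustration under assumptions: the `Subsystem` class and `apply_change` function are hypothetical, and running post_commit() per shard right after that shard's commits is one possible reading of the description.

```python
from typing import Callable, List


class Subsystem:
    """Hypothetical subsystem that records which phase ran, and where."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name, self.log = name, log

    def prepare(self) -> None:
        self.log.append(f"{self.name}.prepare")

    def update(self) -> None:
        self.log.append(f"{self.name}.update")

    def commit(self, shard: int) -> None:
        self.log.append(f"{self.name}.commit@{shard}")

    def post_commit(self, shard: int) -> None:
        self.log.append(f"{self.name}.post_commit@{shard}")


def apply_change(subsystems: List[Subsystem], shard_count: int,
                 database_apply: Callable[[], None]) -> None:
    for s in subsystems:
        s.prepare()
    database_apply()  # a single batched write covering all subsystems
    for s in subsystems:
        s.update()
    for shard in range(shard_count):
        # In the described design the commit loop must not be preempted,
        # so that the shard observes all subsystems changing together.
        for s in subsystems:
            s.commit(shard)
        for s in subsystems:
            s.post_commit(shard)


log: List[str] = []
apply_change([Subsystem("schema", log), Subsystem("tablets", log)], 2,
             lambda: log.append("database.apply"))
print(log)
```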
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531
Closes scylladb/scylladb#24886
[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]
* github.com:scylladb/scylladb:
test: add type creation to test_snapshot
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
Previously, for views and indexes, it fetched the base schema (and a
couple of other properties) from the db. This becomes a problem once we
introduce atomic deletion of tables and views (in the following commit),
because once a table is deleted it can no longer be fetched from the db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.
Now the whole relevant schema is fetched via the global_table_ptr
(table_shards) object.
This patch is part of the implementation of the vector_store_client
sharded service, which communicates with the vector-store service.
It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If parsing that parameter
fails, an exception is thrown during
construction.
For future unit testing purposes, the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.
This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added the vector_store_client
service to the query processor.
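The "parse at construction, throw on error" behaviour can be sketched with Python's standard URL parser. The function name, the exact validation rules and the port number used in the example are assumptions; the patch itself only states the `http://vector-store.dns.name:port` format.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_vector_store_uri(uri: str) -> Tuple[str, int]:
    """Return (host, port) or raise ValueError for a malformed URI."""
    parts = urlsplit(uri)
    if parts.scheme != "http":
        raise ValueError(f"unsupported scheme in {uri!r}")
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        raise ValueError(f"invalid port in {uri!r}") from None
    if not parts.hostname or port is None:
        raise ValueError(f"expected http://host:port, got {uri!r}")
    return parts.hostname, port


# 6080 is an arbitrary example port, not a documented default.
print(parse_vector_store_uri("http://vector-store.dns.name:6080"))
```

Failing early, while the service is being constructed, means a misconfigured URI is reported at startup rather than on the first query.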
Reference: VS-47 VS-45
This patch implements a new class, `comparable_bytes`, designed to
implement methods for converting data values to and from byte-comparable
formats. The class stores the comparable bytes as `managed_bytes` and
currently provides the structure for all required methods. The actual
logic for converting various data types will be implemented in subsequent
patches.
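As an illustration of what "byte-comparable" means, the sketch below encodes 32-bit signed integers so that plain bytewise comparison of the encodings matches numeric order. It is written in Python, uses hypothetical function names, and shows one well-known technique (big-endian with the sign bit flipped); it does not claim to be the conversion logic that `comparable_bytes` will implement.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Big-endian two's complement with the sign bit flipped."""
    raw = struct.pack(">i", value)
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def decode_int32(encoded: bytes) -> int:
    raw = bytes([encoded[0] ^ 0x80]) + encoded[1:]
    return struct.unpack(">i", raw)[0]


values = [-2**31, -5, -1, 0, 1, 7, 2**31 - 1]
encoded = [encode_int32(v) for v in values]
# Bytewise order of the encodings equals numeric order of the values.
print(encoded == sorted(encoded))
print([decode_int32(e) for e in encoded] == values)
```

Flipping the sign bit moves negative numbers below positive ones in unsigned byte order, which is what lets a trie keyed on raw bytes preserve the ordering of the original values.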
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
We use patchelf to rewrite the dynamic loader (known as the interpreter)
of the binaries we ship, so we can point to our shipped dynamic loader,
which is compatible with our binaries, rather than rely on the distribution's
dynamic loader, which is likely to be incompatible.
Upstream patchelf lost compatibility [1] with Linux 5.17 and below.
This change was also picked up by Fedora 42, so we cannot update the
toolchain to that distribution until we have an alternative.
Here we add a minimal patchelf alternative. It was mostly written by
Claude. It is minimal in that it only supports --set-interpreter and
--print-interpreter, and works well enough for our needs. We still use
the original patchelf for --remove-rpath; this reduces our maintenance
needs.
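For context, the interpreter path lives in the PT_INTERP program header of an
ELF file. The following is a sketch of what a --print-interpreter style lookup
could look like for a 64-bit ELF image held in memory; it is not the code
added by this change, the function name is hypothetical, and it assumes the
file's byte order matches the host:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <elf.h>
#include <optional>
#include <string>
#include <vector>

// Returns the PT_INTERP path of a 64-bit ELF image, or nullopt if the image
// is malformed or has no interpreter segment.
inline std::optional<std::string> print_interpreter(const std::vector<unsigned char>& image) {
    Elf64_Ehdr eh;
    if (image.size() < sizeof(eh)) {
        return std::nullopt;
    }
    std::memcpy(&eh, image.data(), sizeof(eh));
    if (std::memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 || eh.e_ident[EI_CLASS] != ELFCLASS64) {
        return std::nullopt;
    }
    for (unsigned i = 0; i < eh.e_phnum; ++i) {
        const std::size_t off = eh.e_phoff + static_cast<std::size_t>(i) * eh.e_phentsize;
        Elf64_Phdr ph;
        if (off > image.size() || image.size() - off < sizeof(ph)) {
            return std::nullopt;
        }
        std::memcpy(&ph, image.data() + off, sizeof(ph));
        if (ph.p_type != PT_INTERP) {
            continue;
        }
        if (ph.p_offset > image.size() || image.size() - ph.p_offset < ph.p_filesz) {
            return std::nullopt;
        }
        const std::string raw(reinterpret_cast<const char*>(image.data() + ph.p_offset), ph.p_filesz);
        // The segment holds a NUL-terminated string; drop the terminator.
        return raw.substr(0, raw.find('\0'));
    }
    return std::nullopt;
}
```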
[1] 43b75fbc9f
[2] 4b015255d1
Closes scylladb/scylladb#24695
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions and has to be backported to all supported versions.
Closes scylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
Fix an issue introduced in commit 083f7353 where p11-kit's compiler flags were
incorrectly added to `args.user_cflags` instead of `user_cflags`. This created
the following problem:
When using CMake generation mode, these flags were added to `CMAKE_CXX_FLAGS`,
causing them to be passed to all compiler invocations including linking stages
where they were irrelevant.
This change moves p11-kit's cflags to `user_cflags`, which ensures the flags are
correctly included in compilation commands but not in linking commands. This
maintains the proper behavior in the ninja build system while fixing the issue in
the CMake build system.
`args.user_cflags` is preserved for its intended purpose of storing user-specified
compiler flags passed via command line options.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23988
Similar to large_data_handler, this interface allows sstable writers to
delegate the handling of corrupt data.
Two implementations are provided:
* system_table_corrupt_data_handler - saves corrupt data in
system.corrupt_data, with a TTL=10days (non-configurable for now)
* nop_corrupt_data_handler - drops corrupt data
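A rough sketch of the shape of such an interface with the two implementations
listed above. The class names follow the text; the method name, its signature
and the in-memory storage are assumptions made for illustration, not the
actual API:

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

// TTL taken from the description above (10 days, non-configurable).
constexpr std::chrono::hours corrupt_data_ttl{24 * 10};

// Hypothetical interface: the writer hands over rows it cannot write safely.
struct corrupt_data_handler {
    virtual ~corrupt_data_handler() = default;
    virtual void handle_corrupt_row(const std::string& table, const std::string& row) = 0;
};

// Drops corrupt data.
struct nop_corrupt_data_handler final : corrupt_data_handler {
    void handle_corrupt_row(const std::string&, const std::string&) override {}
};

// Keeps corrupt data for later inspection. A vector stands in for
// system.corrupt_data in this sketch.
struct system_table_corrupt_data_handler final : corrupt_data_handler {
    struct entry {
        std::string table;
        std::string row;
        std::chrono::hours ttl;
    };
    std::vector<entry> saved;

    void handle_corrupt_row(const std::string& table, const std::string& row) override {
        saved.push_back(entry{table, row, corrupt_data_ttl});
    }
};
```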
A future user of position_in_partition.idl doesn't need full_position
and so doesn't want to include full_position.hh to fix compile errors
when including position_in_partition.idl.hh.
Extract it to a separate idl file: it has a single user in a
storage_proxy VERB.
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).
Fixes #24513
This change prepares the ground for unifying state updates for raft-bound subsystems. It introduces schema_applier, which in the future will become a generic interface for applying mutations in raft.
Pulling `database::apply()` out of the schema merging code will allow batching changes to subsystems. Future generic code will first call `prepare()` on all implementations, then a single `database::apply()`, then `update()` on all implementations; then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and finally `post_commit()`.
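The phase ordering described above can be sketched as follows. Only the phase
names come from the description; the `subsystem_applier` interface, the
`apply_change` driver and the sequential loop over shards are hypothetical
simplifications (the real code runs shards in parallel):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-subsystem interface with the phases named above.
struct subsystem_applier {
    virtual ~subsystem_applier() = default;
    virtual void prepare() = 0;
    virtual void update() = 0;
    virtual void commit(unsigned shard) = 0; // must not preempt
    virtual void post_commit() = 0;
};

// Test double that records the order in which phases run.
struct logging_applier final : subsystem_applier {
    std::string name;
    std::vector<std::string>& log;
    logging_applier(std::string n, std::vector<std::string>& l) : name(std::move(n)), log(l) {}
    void prepare() override { log.push_back(name + ".prepare"); }
    void update() override { log.push_back(name + ".update"); }
    void commit(unsigned shard) override { log.push_back(name + ".commit@" + std::to_string(shard)); }
    void post_commit() override { log.push_back(name + ".post_commit"); }
};

inline void apply_change(const std::vector<subsystem_applier*>& impls, unsigned shard_count,
                         std::vector<std::string>& log) {
    for (auto* a : impls) { a->prepare(); }
    log.push_back("database::apply"); // stand-in for the single batched apply
    for (auto* a : impls) { a->update(); }
    for (unsigned shard = 0; shard < shard_count; ++shard) {
        for (auto* a : impls) { a->commit(shard); }
    }
    for (auto* a : impls) { a->post_commit(); }
}
```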
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Closes scylladb/scylladb#20853
* github.com:scylladb/scylladb:
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
Fixes #24447
This factory type, which is really more of a data holder/connection producer
per connection instance, creates a new certificate_credentials on every
instance when using https. When used by the S3 client, that means one per
client and scheduling group.
This eventually means that we do a set_system_trust + "cold" handshake
for every tls connection created this way.
This causes both IO and cold/expensive certificate checking -> possible
stalls/wasted CPU. Since the credentials object in question is literally a
"just trust system", it could very well be shared across the shard.
This PR adds a thread local static cached credentials object and uses this
instead. Could consider moving this to seastar, but maybe this is too much.
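A minimal sketch of the caching pattern, assuming one thread per shard so that
thread-local storage is effectively per-shard. The `system_trust_credentials`
type is a stand-in for the real credentials object and the function name is
hypothetical:

```cpp
#include <cassert>
#include <memory>

// Stand-in for an expensive-to-build "just trust system" credentials object.
struct system_trust_credentials {
    system_trust_credentials() { ++constructions(); }
    // Counts how many times the expensive setup ran on this thread.
    static int& constructions() {
        static thread_local int n = 0;
        return n;
    }
};

// Builds the credentials once per thread and hands out the shared instance
// on every later call.
inline std::shared_ptr<system_trust_credentials> cached_system_trust_credentials() {
    static thread_local std::shared_ptr<system_trust_credentials> cached;
    if (!cached) {
        cached = std::make_shared<system_trust_credentials>();
    }
    return cached;
}
```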
Closes scylladb/scylladb#24448
Previously, for views and indexes it fetched the base schema (and a
couple of other properties) from db. This becomes a problem once we
introduce atomic tables and views deletion (in the following commit):
once we delete a table it can no longer be fetched from the db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.
Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
Fixed #24296
While the HTTP client used for REST calls in AWS/GCP KMS integration (EAR)
is not general enough to be called an HTTP client as such, it is general
enough to be called a REST client (limited to stateless, single-op REST
calls).
Other code, like general auth integrations (hello Azure) and similar,
could reuse this to reduce code duplication.
This patch simply moves the httpclient class from the encryption to the "rest"
namespace, and explicitly "limits" it to such usage. An alias is kept
in encryption to avoid touching more files than needed.
Closes scylladb/scylladb#24297
In this patch we add an abstract class, "custom_index", with a validate() method.
Each CUSTOM INDEX class needs to implement a concrete subclass of custom_index
which is used to validate if this type of custom index class may be used,
and whether the optional parameters passed to it are valid.
We change the existing CUSTOM INDEX validation code to use this new mechanism.
Finally this patch implements one concrete subclass for vector index.
Before this patch, the custom index type "vector_index" was allowed,
but after this patch it gains more validation of its optional parameters
(we support 4 specific parameters, with some rules on their values).
Of course, the vector index isn't actually implemented in this patch,
we are just improving the validation of the index creation statement.
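The structure described above could look roughly like the sketch below. The
abstract class name and `validate()` come from the text; the option map type,
the factory function and the single option `example_dimensions` are
hypothetical placeholders, since the four real vector index parameters are not
listed here:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

using index_options = std::map<std::string, std::string>;

// Abstract base: each CUSTOM INDEX class provides a concrete subclass.
struct custom_index {
    virtual ~custom_index() = default;
    // Throws std::invalid_argument if the options are not acceptable.
    virtual void validate(const index_options& options) const = 0;
};

// Illustrative subclass accepting one made-up option that must be a
// positive integer.
struct example_vector_index final : custom_index {
    void validate(const index_options& options) const override {
        for (const auto& [key, value] : options) {
            if (key != "example_dimensions") {
                throw std::invalid_argument("unsupported option: " + key);
            }
            const bool digits_only = !value.empty() &&
                value.find_first_not_of("0123456789") == std::string::npos;
            if (!digits_only || value.find_first_not_of('0') == std::string::npos) {
                throw std::invalid_argument("example_dimensions must be a positive integer");
            }
        }
    }
};

// Maps a custom index class name to its validator; nullptr if unknown.
inline std::unique_ptr<custom_index> make_custom_index(const std::string& class_name) {
    if (class_name == "vector_index") {
        return std::make_unique<example_vector_index>();
    }
    return nullptr;
}
```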
compile_commands.json is used by LSPs (e.g. `clangd` in VS Code) for
code navigation. `merge-compdb.py`, called by `configure.py`, merges
these files from Scylla, Seastar, and Abseil. The script filters
entries by checking the output attribute against a given prefix. This
is needed because Scylla’s compile_commands.json is generated by Ninja
and includes all build modes, in case the user specified multiple
ones in the call to configure.py. Seastar and Abseil databases,
generated by CMake, used to omit the output attribute, so filtering
did not apply. Starting with CMake 3.20, output attributes are
included and do not match the expected prefix. For example, they
could be of the form
`absl/synchronization/CMakeFiles/synchronization.dir/internal/futex_waiter.cc.o`.
This causes relevant entries from Seastar and Abseil to be filtered out.
This patch refactors `merge-compdb.py` to allow specifying an
optional prefix per input file, preserving the intent of applying
the output filtering logic only to the Ninja-generated
Scylla compdb file.
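The per-input filtering can be sketched as below. The real script is Python
and works on JSON files; this C++ model with in-memory entries only
illustrates the "optional prefix per input" idea, and all names in it are
hypothetical:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One compile_commands.json entry, reduced to the fields that matter here.
struct compdb_entry {
    std::string file;
    std::string output;
};
using compdb = std::vector<compdb_entry>;

// Merges several databases. An input with a prefix keeps only entries whose
// output starts with that prefix; an input without a prefix is kept whole.
inline compdb merge_compdbs(
        const std::vector<std::pair<compdb, std::optional<std::string>>>& inputs) {
    compdb merged;
    for (const auto& [entries, prefix] : inputs) {
        for (const auto& e : entries) {
            const bool keep = !prefix || e.output.compare(0, prefix->size(), *prefix) == 0;
            if (keep) {
                merged.push_back(e);
            }
        }
    }
    return merged;
}
```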
Closes scylladb/scylladb#24211
If Scylla is compiled without "-g" (which is, for example, the default
in dev build mode), any static library that we link into it and that
contains debugging information will cause the resulting executable to
incorrectly look (e.g., to file(1) or to gdb) like it has debugging
information.
For more than three years now (see #10863 for historical context),
the wasmtime.a library, which has debugging symbols, has caused this
to happen.
In this patch, if a certain build is compiled WITHOUT "-g", we add the
"--strip-debug" option to the linker to remove the partial debugging
information from the executable. Note that --strip-debug is not added
in build modes which do use "-g", or if the user explicitly asked to
add -g (e.g., "configure.py --cflags=-g").
Before this patch:
$ file build/dev/scylla
build/dev/scylla: ELF 64-bit LSB executable ... , with debug_info, not stripped
After this patch:
$ file build/dev/scylla
build/dev/scylla: ELF 64-bit LSB executable ... , not stripped
Fixes #23832.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23840
Currently if raft is enabled all nodes are voters in group0. However it is not necessary for all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes alongside 1 DC with 10 nodes will lose the majority if the large DC is isolated).
The topology coordinator will now maintain a state where there is only a limited number of voters, evenly distributed across the DCs and racks.
After each node addition or removal the voters are recalculated and rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked).
* When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution)
Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both had the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be such that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are preferred in the voter distribution.
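The majority condition behind these rules can be checked with a few lines of
arithmetic. This is an illustrative helper only, not the voter registry code;
the function name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// voters_per_dc[i] is the number of voters placed in DC i.
// Returns true when the voters outside DC `lost` still form a strict majority
// of all voters, i.e. the group keeps its quorum if that DC is lost.
inline bool keeps_majority_without(const std::vector<int>& voters_per_dc, std::size_t lost) {
    const int total = std::accumulate(voters_per_dc.begin(), voters_per_dc.end(), 0);
    const int remaining = total - voters_per_dc.at(lost);
    return 2 * remaining > total;
}
```

For the 2 DC example above, a {2, 1} split survives the loss of the DC with 1
voter but not the loss of the DC with 2 voters, while a symmetric {1, 1} or
{2, 2} split survives neither. With 3 DCs and a {2, 2, 1} split, every DC
holds less than half of the 5 voters, so any single DC can be lost.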
At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and it will work out of the box. If a preference for zero-token nodes is eventually needed/requested, it will be added separately from this PR.
The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different DCs and racks) die independently in a short time span. With fewer than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures.
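The numbers in this argument follow directly from the definition of a
majority. A small sketch of the arithmetic (helper names are hypothetical):

```cpp
#include <cassert>

// Smallest number of voters that forms a strict majority.
constexpr int majority(int voters) {
    return voters / 2 + 1;
}

// Smallest number of voter failures after which a majority is no longer available.
constexpr int failures_to_lose_majority(int voters) {
    return voters - majority(voters) + 1;
}

static_assert(failures_to_lose_majority(3) == 2, "3 voters: 2 failures lose the majority");
static_assert(failures_to_lose_majority(4) == 2, "4 voters: 2 failures lose the majority");
static_assert(failures_to_lose_majority(5) == 3, "5 voters: 3 failures lose the majority");
```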
Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested).
Tests added:
* boost/group0_voter_registry_test.cc: run time on CI: ~3.5s
* topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total
Fixes: scylladb/scylladb#18793
No backport: This is a new feature that will not be backported.
Closes scylladb/scylladb#21969
* https://github.com/scylladb/scylladb:
raft: distribute voters by rack inside DC
raft/test: fix lint warnings in `test_raft_no_quorum`
raft/test: add the upgrade test for limited voters feature
raft topology: handle on_up/on_down to add/remove node from voters
raft: fix the indentation after the limited voters changes
raft: implement the limited voters feature
raft: drop the voter removal from the decommission
raft/test: disable the `stop_before_becoming_raft_voter` test
raft/test: stop the server less gracefully in the voters test
Adds a new live-updatable config: uninitialized_connections_semaphore_cpu_concurrency.
It should help reduce CPU usage by limiting CPU concurrency for new connections. As a last resort, when those connections wait too long for initial processing (over 1m), they are shed.
New connections_shed and connections_blocked metrics are added for tracking.
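The throttle-then-shed behaviour can be modelled with a small single-threaded
sketch. It is a simplification for illustration: the real code uses a
semaphore inside generic_server, whereas the `connection_gate` class, its
methods and the explicit clock parameter here are all hypothetical:

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <optional>

// Admits up to `limit` connections into initial processing at a time.
// Others wait in FIFO order and are shed if they waited longer than `max_wait`.
class connection_gate {
    struct waiter {
        int id;
        std::chrono::seconds enqueued;
    };
    unsigned _limit;
    std::chrono::seconds _max_wait;
    unsigned _active = 0;
    std::deque<waiter> _waiters;
public:
    unsigned blocked = 0; // connections that had to wait
    unsigned shed = 0;    // connections dropped after waiting too long

    connection_gate(unsigned limit, std::chrono::seconds max_wait)
        : _limit(limit), _max_wait(max_wait) {}

    // Returns true if the connection may start initial processing right away,
    // false if it was queued behind the concurrency limit.
    bool try_admit(int id, std::chrono::seconds now) {
        if (_active < _limit) {
            ++_active;
            return true;
        }
        _waiters.push_back(waiter{id, now});
        ++blocked;
        return false;
    }

    // Called when one connection finishes initial processing. Sheds waiters
    // that exceeded the maximum wait and admits the first one that did not.
    std::optional<int> release(std::chrono::seconds now) {
        --_active;
        while (!_waiters.empty()) {
            const waiter w = _waiters.front();
            _waiters.pop_front();
            if (now - w.enqueued > _max_wait) {
                ++shed;
                continue;
            }
            ++_active;
            return w.id;
        }
        return std::nullopt;
    }
};
```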
Testing:
- manually via a simple program creating a high number of connections and constantly re-connecting
- added benchmark
Following are benchmark results:
Before:
```
> build/release/test/perf/perf_generic_server --smp=1
170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors)
[...]
throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54
instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96
cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15
```
After:
```
> build/release/test/perf/perf_generic_server --smp=1
167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors)
[...]
throughput:
mean= 171199.55 standard-deviation=2484.58
median= 171667.06 median-absolute-deviation=2087.63
maximum=173689.11 minimum=167904.76
instructions_per_op:
mean= 4801.90 standard-deviation=16.54
median= 4796.78 median-absolute-deviation=9.32
maximum=4830.71 minimum=4789.81
cpu_cycles_per_op:
mean= 3245.26 standard-deviation=32.28
median= 3230.44 median-absolute-deviation=16.52
maximum=3297.39 minimum=3215.62
```
The patch adds around 67 insns/op so its effect on performance should be negligible.
Fixes: https://github.com/scylladb/scylladb/issues/22844
Closes scylladb/scylladb#22828
* github.com:scylladb/scylladb:
transport: move on_connection_close into connection destructor
test: perf: make aggregated_perf_results formatting more human readable
transport: add blocked and shed connection metrics
generic_server: throttle and shed incoming connections according to semaphore limit
generic_server: add data source and sink wrappers bookkeeping network IO
generic_server: coroutinize part of server::do_accepts
test: add benchmark for generic_server
test: perf: add option to count multiple ops per time_parallel iteration
generic_server: add semaphore for limiting new connections concurrency
generic_server: add config to the constructor
generic_server: add on_connection_ready handler
Changes in configure.py are needed because we don't want to embed
this benchmark in the scylla binary like perf_simple_query or
perf_alternator: it doesn't directly translate to Scylla performance.
But we want to use aggregated_perf_results for precise cpu
measurements, so we need different dependencies.
Currently if raft is enabled all nodes are voters in group0. However it
is not necessary for all nodes to be voters - it only slows down
the raft group operation (since the quorum is large) and makes
deployments with asymmetrical DCs problematic (2 DCs with 5 nodes alongside
1 DC with 10 nodes will lose the majority if the large DC is isolated).
The topology coordinator will now maintain a state where there is only
a limited number of voters, evenly distributed across the DCs and racks.
After each node addition or removal the voters are recalculated and
rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the
current distribution of voters - either if there are still some voter
"slots" available, or if the new node is a better candidate than some
existing voter (in which case the existing node voter status might be
revoked).
* When a voter node is removed or stopped (shut down), its voter status
is revoked and another node might become a voter instead (this can also
depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of datacenters
(DCs) or racks, the rebalance action might become wider (as there are
some special rules applying to 1 vs 2 vs more DCs, also changing the
number of racks might cause similar effects in the voters distribution)
Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible),
meaning that we can tolerate a loss of the DC with the smaller number
of voters (if both would have the same number of voters we'd lose the
majority if any of the DCs is lost).
For example, if we have 2 DCs with 2 nodes each, one of them will only
have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has
more racks than the other and the node count allows it, the DC with
more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be such that every
DC has strictly less than half of the total voters (so a loss of any
of the DCs cannot lead to the majority loss). Again, DCs with more
racks are preferred in the voter distribution.
At the moment we will be handling the zero-token nodes in the same way
as the regular nodes (i.e. the zero-token nodes will not take any
priority in the voter distribution). Technically it doesn't make much
sense to have a zero-token node that is not a voter (when there are
regular nodes in the same DC being voters), but currently the intended
purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs,
creating a third DC with zero-token nodes only), so for that intended
purpose no special handling is needed and will work out of the box.
If a preference for zero-token nodes is eventually needed/requested,
it will be added separately from this PR.
Currently the voter limits will not be configurable (we might introduce
configurable limits later if that is needed/requested).
The feature is enabled by the `group0_limited_voters` feature flag
to avoid issues with cluster upgrade (the feature will be only enabled
once all nodes in the cluster are upgraded to the version supporting
the feature).
Fixes: scylladb/scylladb#18793