As in the AWS and GCP hosts, make all Azure hosts delegate their traffic
to shard 0 to avoid creating too many data encryption keys and API
calls to Key Vault.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The encryption context maintains a cache per host type per thread.
Add a cache for the Azure host as well. Initialize the cache with Azure
hosts from the configuration, while registering the extensions for
encryption.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend `get_or_create_key()` to accept host options that override the
config options. This will be used to pass encryption options from the
table schema. Currently, only the master key can be overridden.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Inject a few fast retries to quickly recover from short-lived transient
errors. If a request is unauthorized, retry with no delay, since it may
be caused by expired tokens.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement the `azure_host::init()` API that performs the async
initialization of the host.
Since the Azure host has no state that needs to be initialized, just
verify that we have access to the Vault key. This will cause the system
to fail earlier if not properly configured (e.g., the key does not
exist, the credentials have insufficient permissions, etc.).
Do not run any verification steps if no master key is configured in
`scylla.yaml`. The master key can be specified later or overridden
through the encryption options in table schema.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement the `azure_host::get_key_by_id()` API, which retrieves a data
encryption key from a key ID.
Use a loading cache to reduce the API calls to Key Vault. When the cache
needs to refresh or reload a key, extract the ciphertext from the key ID
and unwrap it with the Vault key that is also encoded in the key ID.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a cache to store data encryption keys based on their IDs. This will
be plugged into `get_key_by_id()` in a later patch to avoid unwrapping
keys that have been encountered recently, thereby reducing the API calls
to Key Vault.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement the `azure_host::get_or_create_key()` API, which returns a
data encryption key for a given algorithm descriptor (cipher algorithm
and key length).
Use a loading cache to reduce the API calls to Key Vault. When the cache
needs to refresh or reload a key, always create a new one and wrap it
with the Vault key.
For the REST API calls to Key Vault, use an ephemeral HTTP client and
configure it to not wait for the server's response when terminating a
TLS connection. Although the TLS protocol requires clients to wait on
the server's response to a close_notify alert, the Key Vault service
ignores this, causing the client to block for 10 seconds (hardcoded)
before timing out.
Use the following identifier for each key:
<vault name>/<key name>/<key version>:<base64 encoded ciphertext of data encryption key>
The key version is required to support Vault key rotations.
Finally, define an exception for Vault errors.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The Azure host needs credentials to communicate with Key Vault.
First search for credentials in the host options, and then fall back to
default credentials if the former are non-existent or incomplete.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a cache to store data encryption keys based on their attributes
(cipher algorithm + key length). This will be plugged into
`get_or_create_key()` in a later patch to reuse the same keys in
multiple requests, thereby reducing the API calls to Key Vault.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The Azure host manages cryptographic keys using Azure Key Vault.
This patch only defines the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Attempt to detect credentials from the system.
Inspired from the `DefaultAzureCredential` in the Azure C++ SDK, this
credential type detects credentials from the following sources (in this
order):
* environment variables (SP credentials - same variables as in Azure C++ SDK)
* Azure CLI
* IMDS
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement token request with Azure CLI.
Inspired from the Azure C++ SDK's `AzureCliCredential`, this credential
type attempts to run the Azure CLI in a shell and parse the token from
its output. This is meant for development purposes, where a user has
already installed the Azure CLI and logged in with their user account.
Pass the following environment to the process:
* PATH
* HOME
* AZURE_CONFIG_DIR
Add a token factory to construct a token from the process output. Unlike
in Azure Entra and IMDS, the CLI's JSON output does not contain
'expires_in', and the token key is in camel case.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement token request from IMDS.
No credentials are required for that - just a plain HTTP request on the
IMDS token endpoint.
Since the IMDS endpoint is a raw IP, it's not possible to reliably
determine whether IMDS is accessible or not (i.e., whether the node is
an Azure VM). Azure provides no node-local indication either. In lack of
a better choice, attempt to connect and declare failure if the
connection is not established within 3 seconds. Use a raw TCP socket for
this check, as the HTTP client currently lacks timeout or cancellation
support. Perform the check only once, during the first token refresh.
For the time being, do not support nodes with multiple user-assigned
managed identities. Expect the token request to fail in this case (IMDS
requires the identifier of the desired Managed Identity).
Add a token factory to correctly parse the HTTP response. This addresses
a discrepancy between token requests on IMDS and Azure Entra - the
'expires_in' field is a string in the former and an integer in the
latter.
Finally, implement a fail-fast retry policy for short-lived transient
errors.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement token request for Service Principals with a certificate.
The request is the same as with a secret, except that the secret is
replaced with an assertion. The assertion is a JWT that is signed with
the certificate.
To be consistent with the Azure C++ SDK, expect the certificate and the
associated private key to be encoded in PEM format and be provided in a
single file.
The docs suggest using 'PS256' for the JWT's 'alg' claim. Since this is
not supported by our current JWT library (jwt-cpp), use 'RS256' instead.
The JWT also requires a unique identifier for the 'jti' claim. Use a
random UUID for that (it should suffice for our use cases).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Implement token request for Service Principals with a secret.
The token request requires a TLS connection. When closing the
connection, do not wait for a response to the TLS `close_notify` alert.
Azure's OAuth server would ignore it and the Seastar `connected_socket`
would hang for 10 seconds.
Add log redaction logic to not expose sensitive data from the request
and response payloads.
Add a token factory to parse the HTTP response. This cannot be shared
with other credential types because the JSON format is not consistent.
Finally, implement a fail-fast retry policy for short-lived transient
errors.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The rest http client, currently used by the AWS and GCP key providers,
logs the HTTP requests and responses unaltered. This causes some
sensitive data to be exposed (plaintext data encryption keys,
credentials, access tokens).
Add an interface to optionally redact any sensitive data from HTTP
headers and payloads.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The goal is to mimic the Azure C++ SDK, which offers a variety of
credentials, depending on their type and source.
Declare the following credentials:
* Service Principal credentials
* Managed Identity credentials
* Azure CLI credentials
* Default credentials
Also, define a common exception for SP and MI credentials which are
network-based.
This patch only defines the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Azure authentication is token based - the client obtains an access token
with their credentials, and uses it as a bearer token to authorize
requests to Azure services.
Define a common API for all credential types. The API will consist of a
single `get_access_token()` function that will be returning a new or a
cached access token for some resource URI (defines token scope).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add helpers for base64url encoding.
base64url is a variant of base64 that uses a URL-safe alphabet. It can
be constructed from base64 by replacing the '+' and '/' characters with
'-' and '_' respectively. Many implementations also strip the padding,
although this is not required by the spec [1].
This will be used in upcoming patches for Azure Key Vault requests that
require base64url-encoded payloads.
[1] https://datatracker.ietf.org/doc/html/rfc4648#section-5
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).
The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the superuser is attempted to be
created, and if it succeeds then the promise is _unconditionally_ set.
While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.
Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.
Fixes: scylladb/scylladb#24975Closesscylladb/scylladb#24976
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance
* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`
No backport needed since it enhances functionality which has not been released yet
fixes: https://github.com/scylladb/scylladb/issues/22458Closesscylladb/scylladb#23695
* github.com:scylladb/scylladb:
sstables: Start using `make_data_or_index_source` in `sstable`
sstables: refactor readers and sources to use coroutines
sstables: coroutinize futurized readers
sstables: add `make_data_or_index_source` to the `storage`
encryption: refactor key retrieval
encryption: add `encrypted_data_source` class
To discover what tests are included into combined_tests, pytest check this at
the very beginning. In the case if combined_tests binary is missing, it will
fail discovery and will not run test, even when it was not included into
combined_tests. This PR changes behavior, so it will not fail when
combined_tests is missing and only fail in case someone tries to run test from
it.
Closesscylladb/scylladb#24761
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.
For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24835
In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (`only_rmw_uses_lwt`) while the manual test/alternator/run used a different write isolation mode (`always_use_lwt`). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year.
But then, discussion on #24442 started: Instead of picking one mode or the other, don't we need test both modes? In fact, all four modes?
The honest answer is that running **all tests** with **all combinations of options** is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had **some** tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters.
Fixes#24442
No need to backport this patch - it's just adding more tests and changing developer-only test behavior.
Closesscylladb/scylladb#24493
* github.com:scylladb/scylladb:
test/alternator: make "run" script use only_rmw_uses_lwt
test/alternator: improve tests for write isolation modes
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
responding before it applied mutations from any of the writes.
We fix the test by not expecting reads with CL=ONE to return a row.
We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.
Fixesscylladb/scylladb#22967
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closesscylladb/scylladb#23518
Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways:
1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will effect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash.
2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable streams any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters).
No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names.
Fixes#24598Closesscylladb/scylladb#24717
* github.com:scylladb/scylladb:
alternator: lower maximum table name length to 192
alternator: don't crash when adding Streams to long table name
alternator: split length limit for regular and auxiliary tables
alternator: avoid needlessly validating table name
Fixes#24873
In KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors.
This just makes sure `release` closes the connection if it neither retains or caches it.
Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that look like actual errors.
Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs.
This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency.
Closesscylladb/scylladb#24874
* github.com:scylladb/scylladb:
encryption_test: Make PyKMIP run under seastar::experimental::process
test/lib: Add wrapper helper for test process fixtures
kmip_host: Close connections properly if dropped by pool being full
encryption_at_rest_test: Do port check using TLS
This PR extends the KMS host to support temporary AWS security credentials provided externally via the Scylla configuration file, environment variables, or the AWS credentials file.
The KMS host already supports:
* Temporary credentials obtained automatically from the EC2 instance metadata service or via IAM role assumption.
* Long-term credentials provided externally via configuration, environment, or the AWS credentials file.
This PR is about temporary credentials that are external, i.e., not generated by Scylla. Such credentials may be issued, for example, through identity federation (e.g., Okta + gimme-aws-creds).
External temporary credentials are useful for short-lived tasks like local development, debugging corrupted SSTables with `scylla-sstable`, or other local testing scenarios. These credentials are temporary and cannot be refreshed automatically, so this method is not intended for production use.
Documentation has been updated to mention these additional credential sources.
Fixes#22470.
New feature, no backport is needed.
Closesscylladb/scylladb#22465
* github.com:scylladb/scylladb:
doc: Expose new `aws_session_token` option for KMS hosts
kms_host: Support authn with temporary security credentials
encryption_config: Mention environment in credential sources for KMS
Previously, nodes would become voters immediately after joining, ensuring voter status was established before bootstrap completion. With the limited voters feature, voter assignment became deferred, creating a timing gap where nodes could finish bootstrapping without becoming voters.
This timing issue could lead to quorum loss scenarios, particularly observed in tests but theoretically possible in production environments.
This commit reorders voter assignment to occur before the `update_topology_state()` call, ensuring nodes achieve voter status before bootstrap operations are marked complete. This prevents the problematic timing gap while maintaining compatibility with limited voters functionality.
If voter assignment succeeds but topology state update fails, the operation will raise an exception and be retried by the topology coordinator, maintaining system consistency.
This commit also fixes issue where the `update_nodes` ignored leaving voters potentially exceeding the voter limit and having voters unaccounted for.
Fixes: scylladb/scylladb#24420
No backport: Fix of a theoretical bug + CI stability improvement (we can backport eventually later if we see hits in branches)
Closesscylladb/scylladb#24843
* https://github.com/scylladb/scylladb:
raft: fix voter assignment of transitioning nodes
raft: improve comments in group0 voter handler
Adds a wrapper for seastar::experimental::process, to help
use external process fixtures in unit test. Mainly to share
concepts such as line reading of stdout/err etc, and sync
the shutdown of these. Also adds a small path searcher to
find what you want to run.
Print the keyspace.table names, issue trace log messages also
when returning early if tombstone_gc is disabled or
when gc_check_only_compacting_sstables is set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#24914
Fixes#24873
Note: this happens like never. But if we, in KMIP host, do release
of a connection (socket) due to our connection pool for the host being
full, we currently don't close the connection properly, only rely on
destructors.
While not very serious, this would lead to possible TLS errors in the
KMIP host used, which should be avoided if possible.
Fix is simple, just make release close the connection if it neither retains
nor caches it.
If we connect using just a socket, and don't terminate connection
nicely, we will get annoying errors in PyKMIP log. These distract
from real errors. So avoid them.
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.
Fixes: https://github.com/scylladb/scylladb/issues/24953Closesscylladb/scylladb#24954
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.
Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531Closesscylladb/scylladb#24886
[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]
* github.com:scylladb/scylladb:
test: add type creation to test_snapshot
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator.
Fixes: #24827Closesscylladb/scylladb#24830
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.
Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.
Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.
Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.
Fixes: #23911.
Closesscylladb/scylladb#24582
Since set and unordered_set do not allow modifying
their stored object in place, we need to first extract
each object, clear it gently, and only then destroy it.
To achieve that, introduce a new Extractable concept,
that extracts all items in a loop and calls clear_gently
on each extracted item, until the container is empty.
Add respective unit tests for set and unordered_set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#24608
Previously, nodes would become voters immediately after joining, ensuring
voter status was established before bootstrap completion. With the limited
voters feature, voter assignment became deferred, creating a timing gap
where nodes could finish bootstrapping without becoming voters.
This timing issue could lead to quorum loss scenarios, particularly
observed in tests but theoretically possible in production environments.
This commit reorders voter assignment to occur before the
`update_topology_state()` call, ensuring nodes achieve voter status
before bootstrap operations are marked complete. This prevents the
problematic timing gap while maintaining compatibility with limited
voters functionality.
If voter assignment succeeds but topology state update fails, the
operation will raise an exception and be retried by the topology
coordinator, maintaining system consistency.
This commit also fixes issue where the `update_nodes` ignored leaving
voters potentially exceeding the voter limit and having voters
unaccounted for.
Fixes: scylladb/scylladb#24420