During development of #22428 we decided that we have
no need for `object-storage.yaml`, and we'd rather store
the endpoints in `scylla.yaml` and get a REST API to expose
the endpoints for free.
This patch removes the credentials provider used to read the
AWS keys from this YAML file.
Followup work will remove the `object-storage.yaml` file
altogether and move the endpoints to `scylla.yaml`.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closes scylladb/scylladb#22951
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if the permit timed out). If the permit also happens to be waiting for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.
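The failure mode and the fix can be modelled in a few lines of Python. This is only an analogy, not the Seastar code: `concurrent.futures.Future` stands in for the permit's promise (it also rejects a second `set_exception()`), and the `Permit` class, its fields and the simplified `register_inactive_read()` below are hypothetical names.

```python
from concurrent.futures import Future, InvalidStateError


class Permit:
    """Toy stand-in for a reader permit (hypothetical, for illustration only)."""

    def __init__(self):
        self.memory_waiters = Future()  # plays the role of the permit's promise
        self.aborted = False

    def abort(self, exc):
        # A timeout aborts the permit and fails its waiters exactly once.
        self.aborted = True
        self.memory_waiters.set_exception(exc)


def register_inactive_read(permit, close_reader):
    # The added case: an already-aborted permit is treated as immediate
    # eviction, so the promise is never failed a second time.
    if permit.aborted:
        close_reader()
        return "evicted"
    return "registered"


permit = Permit()
permit.abort(TimeoutError("permit timed out"))

# Failing the same future again is rejected; this is the Python counterpart
# of the assert described above.
try:
    permit.memory_waiters.set_exception(RuntimeError("evicted"))
    second_abort_rejected = False
except InvalidStateError:
    second_abort_rejected = True

closed = []
outcome = register_inactive_read(permit, lambda: closed.append(True))
print(second_abort_rejected, outcome, closed)
```

With the early check in place, the reader is closed and the permit's promise is left untouched.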
Fixes: scylladb/scylladb#22919
Bug is present in all live versions, backports are required.
Closes scylladb/scylladb#23044
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
Previously, variables were marked as const, causing std::move() calls to
be redundant as reported by GCC warnings. This change either removes
const qualifiers or marks related lambdas as mutable, allowing the
compiler to properly utilize move constructors for better performance.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23066
As part of the move to bare pytest we need to extract the required test
environment preparation steps into pytest's hooks/fixtures.
Do this for S3 mock stuff (MinioServer, MockS3Server, and S3ProxyServer)
and for directories with test artifacts.
For compatibility reasons, add a --test-py-init CLI option for the bare
pytest test runner: add it to the pytest command if you need test.py
stuff in your tests (boost, topology, etc.)
Also, postpone initialization of TestSuite.artifacts and TestSuite.hosts
from import-time to runtime.
Closes scylladb/scylladb#23087
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if the permit timed out).
If the permit also happens to be waiting for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.
Fixes: scylladb/scylladb#22919
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.
While generally better to reduce inline code, here we get
rid of the clustering_interval_set.hh dependency, which in turn
depends on boost interval_set, a large dependency.
incremental_compaction_test.cc is adjusted for a missing header.
Closes scylladb/scylladb#22957
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22997
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. It's inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default; this can be disabled for tests which expect
otherwise.
shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
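The scaling rule above can be sketched as follows. This is a simplified model under stated assumptions, not the scheduler's code: the function names, the input layout (`tables`, `shards_per_dc`) and the example numbers are all made up for illustration.

```python
import math


def round_up_pow2(x: float) -> int:
    """Smallest power of two that is >= x (and at least 1)."""
    n = max(1, math.ceil(x))
    return 1 << (n - 1).bit_length()


def scale_tablet_counts(tables, shards_per_dc, limit):
    """tables: {name: (tablet_count, {dc: replication_factor})}."""
    # Scale factor per DC, derived from the average replicas per shard.
    dc_scale = {}
    for dc, shards in shards_per_dc.items():
        replicas = sum(count * rf.get(dc, 0) for count, rf in tables.values())
        avg = replicas / shards
        dc_scale[dc] = limit / avg if avg > limit else 1.0

    scaled = {}
    for name, (count, rf) in tables.items():
        # A table takes the lowest factor among the DCs it has replicas in.
        factor = min((dc_scale[dc] for dc, r in rf.items() if r > 0), default=1.0)
        # Tablet counts stay powers of two, hence the rounding up.
        scaled[name] = count if factor >= 1.0 else round_up_pow2(count * factor)
    return scaled


tables = {
    "a": (1024, {"dc1": 3}),
    "b": (512, {"dc1": 3}),
}
scaled = scale_tablet_counts(tables, {"dc1": 16}, limit=100)
print(scaled)  # {'a': 512, 'b': 256}
```

In this example the average drops from 288 to 144 replicas per shard: still above the goal of 100 because of the rounding, but within the factor-of-2 bound.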
The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.
We want to avoid over-allocation of tablets when a table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and initial tablet allocator. So
invoke the scheduler to get the tablet count when a table is created.
Refs #22628
Adds an exception handler + cleanup for the case where we have a bad config/env vars (hint: minio) or similar, such that we fail with an exception while setting up the EAR context. In a normal startup, this is OK: we will report the exception and then do an exit(1).
In tests, however, we don't exit, and the active context will instead be destroyed properly, in which case we need to call stop to ensure we don't crash on shared pointer destruction on the wrong shard. Such a crash would hide the real issue from whoever runs the test.
Adds some verbosity to track issues with the network proxy used to test EAR connector difficulties. Also adds an earlier close in input stream to help network usage.
Note: This is a diagnostic helper. Still cannot repro the issue above.
Closes scylladb/scylladb#22810
* github.com:scylladb/scylladb:
gcp/aws kms: Promote service_error to recoverable + use malformed_response_error
encryption_at_rest_test: Add verbosity + earlier stream close to proxy
encryption: Add exception handler to context init (for tests)
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This PR first refactors the reclaim and reload logic into a single
background fiber. It then updates the sstable manager to subscribe to
changes in the `components_memory_reclaim_threshold` configuration value
and immediately triggers the reclaim/reload fiber when a change is
detected.
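The subscribe-and-trigger idea can be illustrated with a small model. Everything here is hypothetical: `LiveValue` stands in for a live-updatable config value, `SstablesManager` is a toy, only the reclaim half of the fiber is modelled, and picking the largest filter first is an assumption made for the example.

```python
class LiveValue:
    """Hypothetical stand-in for a live-updatable config value."""

    def __init__(self, value):
        self._value = value
        self._observers = []

    def get(self):
        return self._value

    def observe(self, callback):
        self._observers.append(callback)

    def set(self, value):
        self._value = value
        for callback in self._observers:
            callback(value)  # notify subscribers right away


class SstablesManager:
    def __init__(self, threshold, filter_sizes):
        self.threshold = threshold
        self.in_memory = dict(filter_sizes)  # sstable name -> bloom filter bytes
        self.reclaimed = []
        # Subscribe, so a config change triggers reclaim immediately instead
        # of waiting for the next sstable creation or deletion.
        threshold.observe(lambda _new: self.maybe_reclaim())

    def maybe_reclaim(self):
        # Reclaim until total usage no longer exceeds the threshold.
        while self.in_memory and sum(self.in_memory.values()) > self.threshold.get():
            victim = max(self.in_memory, key=self.in_memory.get)
            self.reclaimed.append(victim)
            del self.in_memory[victim]


threshold = LiveValue(100)
manager = SstablesManager(threshold, {"sst-a": 40, "sst-b": 30, "sst-c": 20})
threshold.set(50)  # live update: usage is 90, so reclaim starts right away
print(manager.reclaimed, manager.in_memory)
```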
Fixes #21947
This is an improvement and does not need to be backported.
Closes scylladb/scylladb#22725
* github.com:scylladb/scylladb:
sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update
sstables_manager: maybe_reclaim_components: yield between iterations
sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()`
sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()`
sstables_manager: rename `_sstable_deleted_event` condition variable
sstables_manager: rename `components_reloader_fiber()`
sstables_manager: fix `maybe_reclaim_components()` indentation
sstables_manager: reclaim components memory until usage falls below threshold
sstables_manager: introduce `get_components_memory_reclaim_threshold()`
sstables_manager: extract `maybe_reclaim_components()`
sstables_manager: fix `maybe_reload_components()` indentation
sstables_manager: extract out `maybe_reload_components()`
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This patch updates the sstable manager to subscribe to any changes in
the above mentioned config value and immediately trigger the
reclaim/reload fiber when a change occurs. Also, adds a testcase to
verify the fix.
Fixes #21947
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Refs #22628
Adds some verbosity to track issues with the network proxy used to test
EAR connector difficulties. Also adds an earlier close in input stream
to help network usage.
Note: This is a diagnostic helper. Still cannot repro the issue above.
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support host or DC selection. This should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work for the config options for the selection. This patch adds support for host and DC selection.
Fixes https://github.com/scylladb/scylladb/issues/22417
New feature. No backport is needed.
Closes scylladb/scylladb#22621
* github.com:scylladb/scylladb:
test: add test to check dcs and hosts repair filter
test: add repair dc selection to test_tablet_metadata_persistence
repair: Introduce Host and DC filter support
docs: locator: update the docs and formatter of tablet_task_info
result_set_row is a heavyweight object containing multiple cell types:
regular columns, partition keys, and static values. To prevent expensive
accidental copies, delete the copy constructor and replace it with:
1. A move constructor for efficient vector reallocation
2. An explicit copy() method when copies are actually needed
This change reduces overhead in some non-hot paths by eliminating implicit
deep copies. Please note, previously, in `create_view_from_mutation()`,
we kept a copy of `result_set_row`, and then reused `table_rs` for
holding the mutation for `scylla_tables`. Because we don't copy
the `result_set_row` in this change, in order to avoid invalidating
the `row` after reusing `table_rs` in the outer scope, we define a
new `table_rs` shadowing the one in the outer scope.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22741
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22857
In a rolling upgrade, nodes that weren't upgraded yet will not recognize
the new tablet_resize_finalization state, which serves both splits and
merges, leading to a crash. To fix that, the coordinator will pick the
old tablet_split_finalization state for serving split finalization
until the cluster agrees on merge, at which point it can start using the
new generic state for resize finalization introduced in the merge series.
Regression was introduced in e00798f.
Fixes #22840.
Reported-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#22845
Move the sstable reclaim logic into `components_reclaim_reload_fiber()`
in preparation for the fix for #21947. This also simplifies the overall
reclaim/reload logic by preventing multiple fibers from attempting to
reclaim/reload component memory concurrently.
Also, update the existing test cases to adapt to this change.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Replace boost::copy() with the standard library's std::ranges::copy()
to reduce external dependencies and simplify the codebase. This change
eliminates the requirement for boost::range and makes the implementation
more maintainable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22789
This PR converts boost load balancer tests in preparation for load balancer changes
which add per-table tablet hints. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.
Topology is created in tests via group0 commands, which is abstracted by
the new `topology_builder` class.
Tests cannot modify token_metadata only in memory now as it needs to be
consistent with the schema and on-disk metadata. That's why modifications to
tablet metadata are now made under group0 guard and save back metadata to disk.
Closes scylladb/scylladb#22648
* github.com:scylladb/scylladb:
test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario
tests: tablets: Set initial tablets to 1 to exit growing mode
test: tablets_test: Create proper schema in load balancer tests
test: lib: Introduce topology_builder
test: cql_test_env: Expose topology_state_machine
topology_state_machine: Introduce lock transition
Add the possibility to run boost and unit tests with pytest
test.py should follow this paradigm: the ability to run all test cases sequentially with ONE pytest command.
With this paradigm, to get better performance, we can split this one command into 2, 3, 4, 5, 100, 200, or however many we want.
This is new functionality that does not touch the test.py way of executing the boost and unit tests.
It supports the main features of the test.py way of execution: automatic discovery of modes, and repeats.
There is an additional requirement for executing tests in parallel: pytest-xdist. To install it, execute `pip install pytest-xdist`.
To run tests with pytest, execute `pytest test/boost`. To execute only one file, provide its path, e.g. `pytest test/boost/aggregate_fcts_test.cc`; since it's a normal path, autocompletion will work in the terminal. To select a specific mode, use the `--mode dev` parameter; if the parameter is not provided, pytest will try to use `ninja mode_list` to find out the compiled modes.
Parallel execution is controlled by pytest-xdist and the `-n 12` parameter.
A useful command to discover the tests in a file or directory is `pytest --collect-only -q --mode dev test/boost/aggregate_fcts_test.cc`. It returns all test functions in the file. To execute only one function from the test, use the output of the previous command with the mode suffix dropped. For example, if the output is `test/boost/aggregate_fcts_test.cc::test_aggregate_avg.dev`, run that specific test function with `pytest --mode dev test/boost/aggregate_fcts_test.cc::test_aggregate_avg`.
There is a `--repeat` parameter that repeats the test case several times, in the same way as test.py did.
It's not possible to run both the boost and unit test directories with one command, so the directory to execute must be given explicitly, like this: `pytest --mode dev test/unit` or `pytest --mode dev test/boost`.
Fixes: https://github.com/scylladb/qa-tasks/issues/1775
Closes scylladb/scylladb#21108
* github.com:scylladb/scylladb:
test.py: Add possibility to run ldap tests from pytest
test.py: Add the possibility to run unit tests from pytest
test.py: Add the possibility to run boost test from pytest
test.py: Add discovery for C++ tests for pytest
test.py: Modify s3 server mock
test.py: Add method to get environment variables from MinIO wrapper
test.py: Move get configured modes to common lib
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.
* New feature, no backport required
Closes scylladb/scylladb#22090
* github.com:scylladb/scylladb:
tablets: resize_decision: get rid of initial_decision
tablet_allocator: consider tablet options for resize decision
tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
cql3/create_keyspace_statement: add deprecation warning for initial tablets
test: cqlpy: test_tablets: add tests for per-table tablet options
schema: add per-table tablet options
feature_service: add TABLET_OPTIONS cluster schema feature
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. The latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than the TTL (which is typical).
Disable the timeout before setting the TTL to prevent premature eviction.
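A toy model of the two timers shows why the order matters. The function and parameter names are hypothetical, and time is a plain number of seconds rather than a real clock.

```python
import math


def entry_expiry(now, permit_timeout_at, ttl, disable_timeout):
    """Time at which the cached entry is evicted in this simplified model."""
    if disable_timeout:
        # The fix: the permit's own timeout no longer applies once cached.
        permit_timeout_at = math.inf
    # Whichever timer fires first evicts the entry.
    return min(permit_timeout_at, now + ttl)


# The permit times out at t=2s, while the cache TTL is 10s.
before_fix = entry_expiry(now=0.0, permit_timeout_at=2.0, ttl=10.0, disable_timeout=False)
after_fix = entry_expiry(now=0.0, permit_timeout_at=2.0, ttl=10.0, disable_timeout=True)
print(before_fix, after_fix)  # 2.0 10.0
```

Without disabling the timeout the entry is evicted after 2 seconds; with the fix it lives for the full TTL.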
Fixes: https://github.com/scylladb/scylladb/issues/22629
Backport required to all active releases, they are all affected.
Closes scylladb/scylladb#22701
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: set_notify_handler(): disable timeout
reader_permit: mark check_abort() as const
Add the possibility to run boost test from pytest.
Boost facade based on code from https://github.com/pytest-dev/pytest-cpp, but enhanced and rewritten to suit our needs better.
This scenario is invoked in a loop in the
test_load_balancing_merge_colocation_with_random_load test case, which
will cause accumulation of tablet maps making each reload slower in
subsequent iterations.
It wasn't a problem before because we overwrote tablet_metadata in
each iteration to contain only tablets for the current table, but now
we need to keep it consistent with the schema and don't do that.
After tablet hints, there is no notion of leaving growing mode and
tablet count is sustained continuously by initial tablet option, so we
need to lower it for merge to happen.
This is in preparation for load balancer changes needed to respect
per-table tablet hints and respecting per-shard tablet count
goal. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. The latter was not disabling the
pre-existing timeout of the permit (if any) and this would lead to
premature eviction of the cache entry if the timeout was shorter than
the TTL (which is typical).
Disable the timeout before setting the TTL to prevent premature
eviction.
Fixes: scylladb/scylladb#22629
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.
The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
- add `collection_constructor::style_type::vector`
- rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation
Co-authored-by: @janpiotrlakomy
Fixes https://github.com/scylladb/scylladb/issues/19455
Closes scylladb/scylladb#22488
* github.com:scylladb/scylladb:
docs: add vector type documentation
cassandra_tests: translate tests covering the vector type
type_codec: add vector type encoding
boost/expr_test: add vector expression tests
expression: adjust collection constructor list style
expression: add vector style type
test/boost: add vector type cql_env boost tests
test/boost: add vector type_parser tests
type_parser: support vector type
cql3: add vector type syntax
types: implement vector_type_impl
Do not merge tablets if that would drop the tablet_count
below the minimum provided by hints.
Split tablets if the current tablet_count is less than
the minimum tablet count calculated using the table's tablet options.
TODO: override min_tablet_count if the tablet count per shard
is greater than the maximum allowed. In this case
the tables' tablet counts should be scaled down proportionally.
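The two rules above, combined with a size-based decision, might look like the sketch below. It rests on assumptions: the function name is hypothetical, and the split and merge thresholds (twice and half the target tablet size) are illustrative values not stated in this message. The TODO about scaling down is not modelled.

```python
def resize_decision(tablet_count, min_tablet_count, avg_tablet_size, target_tablet_size):
    """Return "split", "merge" or "none" for a table (illustrative only)."""
    # Split if the table has fewer tablets than its options ask for.
    if tablet_count < min_tablet_count:
        return "split"
    # Assumed size-based split threshold.
    if avg_tablet_size > 2 * target_tablet_size:
        return "split"
    # A merge halves the tablet count; refuse it if that would drop the
    # count below the minimum provided by the hints.
    if avg_tablet_size < target_tablet_size / 2 and tablet_count // 2 >= min_tablet_count:
        return "merge"
    return "none"


print(resize_decision(4, 8, 1, 5))   # split: below the minimum
print(resize_decision(8, 8, 1, 5))   # none: merging would drop to 4 < 8
print(resize_decision(16, 8, 1, 5))  # merge: 8 tablets still satisfy the minimum
```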
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This update introduces four types of credential providers:
1. Environment variables
2. Configuration file
3. AWS STS
4. EC2 Metadata service
The first two providers should only be used for testing and local runs. **They must NEVER be used in production.**
The last two providers are intended for use on real EC2 instances:
- **AWS STS**: Preferred method for obtaining temporary credentials using IAM roles.
- **EC2 Metadata Service**: Should be used as a last resort.
Additionally, a simple credentials provider chain is created. It queries each provider sequentially until valid credentials are obtained. If all providers fail, it returns an empty result.
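The chain behaviour described above can be sketched as follows. This is a minimal model, not the `aws_credentials_provider_chain` implementation: the function names are hypothetical, the stub providers replace the real STS and metadata lookups, and the environment variable names are the conventional AWS ones, assumed here rather than taken from the patch.

```python
import os


def environment_provider(env=os.environ):
    # Conventional AWS variable names; assumed for this example.
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"access_key_id": key, "secret_access_key": secret}
    return None


def failing_provider():
    # Stub: pretends the remote service is unreachable.
    raise ConnectionError("service unreachable")


def static_provider():
    # Stub standing in for a provider that succeeds.
    return {"access_key_id": "AKIDEXAMPLE", "secret_access_key": "secret"}


def resolve_credentials(providers):
    """Query each provider in order; return the first valid credentials."""
    for provider in providers:
        try:
            creds = provider()
        except Exception:
            continue  # a failing provider just hands over to the next one
        if creds:
            return creds
    return None  # empty result: every provider failed


creds = resolve_credentials(
    [lambda: environment_provider({}), failing_provider, static_provider]
)
nothing = resolve_credentials([failing_provider])
print(creds, nothing)
```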
Fixes: #21828
Closes scylladb/scylladb#21830
* github.com:scylladb/scylladb:
docs: update the `object_storage.md` and `admin.rst`
aws creds: add STS and Instance Metadata service credentials providers
aws creds: add env. and file credentials providers
s3 creds: move credentials out of endpoint config
Use the keyspace initial_tablets for min_tablet_count if the latter
isn't set, then take the maximum of the option-based tablet counts:
- min_tablet_count
- expected_data_size_in_gb / target_tablet_size
- min_per_shard_tablet_count (via
calculate_initial_tablets_from_topology)
If none of the hints produce a positive tablet_count,
fall back to calculate_initial_tablets_from_topology * initial_scale.
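A rough sketch of this selection is shown below. It is built on assumptions: the function signature, the `tablets_from_topology` callable (a stand-in for calculate_initial_tablets_from_topology that maps a per-shard tablet count to a table-level count) and the toy topology of 24 shards with RF=3 are all invented for illustration.

```python
import math


def initial_tablet_count(options, keyspace_initial_tablets, target_tablet_size_gb,
                         tablets_from_topology, initial_scale):
    # The keyspace's initial_tablets stands in for a missing min_tablet_count.
    min_count = options.get("min_tablet_count") or keyspace_initial_tablets or 0
    by_size = math.ceil(options.get("expected_data_size_in_gb", 0) / target_tablet_size_gb)
    per_shard = options.get("min_per_shard_tablet_count", 0)
    by_shards = math.ceil(tablets_from_topology(per_shard)) if per_shard > 0 else 0

    count = max(min_count, by_size, by_shards)
    if count <= 0:
        # No hint produced a positive count: fall back to the topology.
        count = tablets_from_topology(1) * initial_scale
    return count


def tablets_from_topology(per_shard):
    # Toy topology: 24 shards in total, replication factor 3.
    return per_shard * 24 // 3


print(initial_tablet_count({"expected_data_size_in_gb": 500}, 0, 5, tablets_from_topology, 10))
print(initial_tablet_count({"min_tablet_count": 16, "min_per_shard_tablet_count": 4}, 0, 5,
                           tablets_from_topology, 10))
print(initial_tablet_count({}, 0, 5, tablets_from_topology, 10))
```

The three calls exercise the size hint, the per-shard hint winning over min_tablet_count, and the fallback path. Rounding the result to a power of 2 is left out of this sketch.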
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, creates sstables whose token range is confined
to the tablet owning them.
On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).
Having too few tablets might limit performance due to not
being served by all shards or due to imbalance between shards
caused by quantization. The number of tablets per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small, (N+1)/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.
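The quantization effect can be checked with a few lines of arithmetic. The helper name is hypothetical, and it assumes replicas are spread as evenly as possible across shards.

```python
import math


def shard_imbalance(tablet_replicas: int, shards: int) -> float:
    """Relative extra load of the busiest shard over the least busy one."""
    low, extra = divmod(tablet_replicas, shards)
    if low == 0:
        # Fewer replicas than shards: some shards serve nothing at all.
        return math.inf
    high = low + 1 if extra else low
    return (high - low) / low


print(shard_imbalance(12, 8))   # 1.0: some shards serve 2 replicas, others 1
print(shard_imbalance(100, 8))  # about 0.083: 13 versus 12 replicas
print(shard_imbalance(16, 8))   # 0.0: evenly divisible
```

The first call reproduces the N=1 case from the text, an imbalance of 100%, while a larger N shrinks the imbalance to roughly 1/N.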
Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead, and having too many tablets
in a system with many tables and many tablets for each of them
may overwhelm the system and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to access, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file descriptor limit.
The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations. This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For example, nodes which are being decommissioned should not be
considered as available capacity for new tables. We don't allocate
tablets on such nodes.
Would result in higher per-shard load than planned.
Closes scylladb/scylladb#22657
In order to reduce the external header dependency, let's switch to
the standardized std::ranges::min_element().
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22572
Since mid December, tests started failing with ENOMEM while
submitting I/O requests.
Logs of failed tests show IO uring was used as backend, but we
never deliberately switched to IO uring. Investigation pointed
to it happening accidentally in commit 1bac6b75dc,
which turned on IO uring for allowing native tool in production,
and picked linux-aio backend explicitly when initializing Scylla.
But it missed that seastar-based tests would pick the default
backend, which is io_uring once enabled.
There's a reason we never made io_uring the default, which is
that it's not stable enough, and it turns out we made the right
choice back then: it apparently continues to be unstable,
causing flakiness in the tests.
Let's undo that accidental change in tests by explicitly
picking the linux-aio backend for seastar-based tests.
This should hopefully bring back stability.
Refs #21968.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#22695
This commit introduces two new credentials providers: STS and Instance Metadata Service. The S3 client's provider chain has been updated to incorporate these new providers. Additionally, unit tests have been added to ensure coverage of the new functionality.
This commit entirely removes credentials from the endpoint configuration. It also eliminates all instances of manually retrieving environment credentials. Instead, the construction of file and environment credentials has been moved to their respective providers. Additionally, a new aws_credentials_provider_chain class has been introduced to support chaining of multiple credential providers.
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.
Fixes: #22588
Closes scylladb/scylladb#22624
This commit refactors the way AWS credentials are managed in Scylla. Previously, credentials were included in the endpoint configuration. However, since credentials and endpoint configurations serve different purposes and may have different lifetimes, it’s more logical to manage them separately. Moving forward, credentials will be completely removed from the endpoint_config to ensure clear separation of concerns.