This is mostly a refactoring commit to make the test
more readable, as a byproduct of
scylladb/scylladb#17369 investigation.
We add the check for specific type of exceptions that
can be thrown (bad_property_file_error).
We also fix the potential race - the test may write
to res from multiple cores with no locks.
Closesscylladb/scylladb#17371
Adds a test reproducing https://github.com/scylladb/scylladb/issues/16759, and the instrumentation needed for it.
Closesscylladb/scylladb#17208
* github.com:scylladb/scylladb:
row_cache_test: test cache consistency during memtable-to-cache merge
row_cache: use preemption_source in update()
utils: preempt: add preemption_source
Commit 904bafd069 consolidated the two
existing for_each_tablet() overloads, to the one which has a future<>
returning callback. It also added yields to the bodies of said
callbacks. This is unnecessary, the loop in for_each_tablet() already
has a yield per tablet, which should be enough to prevent stalls.
This patch is a follow-up to #17118Closesscylladb/scylladb#17284
To facilitate testing the state of cache after the update is preempted
at various points, pass a preemption_source& to update() instead of
calling the reactor directly.
In release builds, the calls to preemption_source methods should compile
to the same direct reactor calls as today. Only in dev mode they should
add an extra branch. (However, the `preemption_source&` argument has
to be shoveled in any mode).
The table query param is added to get the describe_ring result for a
given table.
Both vnode table and tablet table can use this table param, so it is
easier for users to user.
If the table param is not provided by user and the keyspace contains
tablet table, the request will be rejected.
E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"
Refs #16509Closesscylladb/scylladb#17118
* github.com:scylladb/scylladb:
tablets: Convert to use the new version of for_each_tablet
storage_service: Add describe_ring support for tablet table
storage_service: Mark host2ip as const
tablets: Add for_each_tablet_gently
Reduces footprint from hundreds of MB to a very few MB.
Issue could be reproduced with:
./build/dev/test/boost/mutation_writer_test --run_test=test_token_group_based_splitting_mutation_writer -- -m 500M --smp 1 --random-seed 1848215131
Fixes#17076.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17187
When creating a keyspace, scylla allows setting RF value smaller than there are nodes in the DC. With vnodes, when new nodes are bootstrapped, new tokens are inserted thus catching up with RF. With tablets, it's not the case as replica set remains unchanged.
With tablets it's good chance not to mimic the vnodes behavior and require as many nodes to be up and running as the requested RF is. This patch implementes this in a lazy manned -- when creating a keyspace RF can be any, but when a new table is created the topology should meet RF requirements. If not met, user can bootstrap new nodes or ALTER KEYSPACE.
closes: #16529Closesscylladb/scylladb#17079
* github.com:scylladb/scylladb:
tablets: Make sure topology has enough endpoints for RF
cql-pytest: Disable tablets when RF > nodes-in-DC
test: Remove test that configures RF larger than the number of nodes
keyspace_metadata: Include tablets property in DESCRIBE
also disable some more warnings which are failing the build after `-Wextra` is enabled. we can fix them on a case-by-case basis, if they are geniune issues. but before that, we just disable them.
this goal of this change is to reduce the discrepancies between the compile options used by CMake and those used by configure.py. the side effect is that we enable some more warning enabeld by `-Wextra`, for instance, `-Wsign-compare` is enable now. for the full list of the enabled warnings when building with Clang, please see https://clang.llvm.org/docs/DiagnosticsReference.html#wextra.
Closesscylladb/scylladb#17131
* github.com:scylladb/scylladb:
configure.py: add -Wextra to cflags
test/tablets: do not compare signed and unsigned
get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
Closesscylladb/scylladb#17130
* github.com:scylladb/scylladb:
treewide: replace seastar::future::get0() with seastar::future::get()
sstable: capture return value of get0() using auto
utils: result_loop: define result_type with decayed type
[avi: add another one that snuck in while this was cooking]
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
this change should silence following warning:
```
test/boost/tablets_test.cc:1600:27: error: comparison of integers of different signs: 'int' and 'unsigned int' [-Werror,-Wsign-compare]
19:47:04 for (int i = 0; i < smp::count * 20; i++) {
19:47:04 ~ ^ ~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.
This makes scylla-manager mis-interpret the data_file_directories
somehow, issue #17078
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer.
If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint.
Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on.
A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split.
Tablet metadata gains 2 new fields for managing this:
resize_type: resize decision type, can be either of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).
A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready.
When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split").
The split monitor will start splitting of compaction groups (using mechanism introduced here: 081f30d149) for the table. And once splitting work is completed, the table updates its local state as having completed split.
When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step).
Fixes#16536.
Closesscylladb/scylladb#16580
* github.com:scylladb/scylladb:
test/topology_experimental_raft: Add tablet split test
replica: Bypass reshape on boot with tablets temporarily
replica: Fix table::compaction_group_for_sstable() for tablet streaming
test/topology_experimental_raft: Disable load balancer in test fencing
replica: Remap compaction groups when tablet split is finalized
service: Split tablet map when split request is finalized
replica: Update table split status if completed split compaction work
storage_service: Implement split monitor
topology_cordinator: Generate updates for resize decisions made by balancer
load_balancer: Introduce metrics for resize decisions
db: Make target tablet size a live-updateable config option
load_balancer: Implement resize decisions
service: Wire table_resize_plan into migration_plan
service: Introduce table_resize_plan
tablet_mutation_builder: Add set_resize_decision()
topology_coordinator: Wire load stats into load balancer
storage_service: Allow tablet split and migration to happen concurrently
topology_coordinator: Periodically retrieve table_load_stats
locator: Introduce topology::get_datacenter_nodes()
storage_service: Implement table_load_stats RPC
replica: Expose table_load_stats in table
replica: Introduce storage_group::live_disk_space_used()
locator: Introduce table_load_stats
tablets: Add resize decision metadata to tablet metadata
locator: Introduce resize_decision
`db::config` is a class, that is used in many places across the code base. When it is changed, its clients' code need to be recompiled. It represents the configuration of the database. Some fields of the configuration that describe the location of directories may be empty. In such cases `db::config::setup_directories()` function is called - it modifies the provided configuration. Such modification is not good - it is better to keep `db::config` intact.
This PR:
- extends the public interface of utils::directories class to provide required directory paths to the users
- removes 'db::config::setup_directories()' to avoid altering the fields of configuration object
- replaces usages of db::config object with utils::directories object in places that require obtaining paths to dirs
Fixes: scylladb#5626
Closesscylladb/scylladb#16787
* github.com:scylladb/scylladb:
utils/directories: make utils::directories::set an internal type
db::config: keep dir paths unchanged
cql_transport/controler: use utils::directories to get paths of dirs
service/storage_proxy: use utils::directories to get paths of dirs
api/storage_service.cc: use utils::directories to get paths of dirs
tools/scylla-sstable.cc: use utils::directories to get paths
db/commitlog: do not use db::config to get dirs
Use utils::directories to get dirs paths in replica::database
Allow utils::directories to provide paths to dirs
Clean-up of utils::directories
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.
One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.
In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).
Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
The PR adds the API to `raft::server` and a HTTP endpoint that uses it.
In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)
Closesscylladb/scylladb#16816
* github.com:scylladb/scylladb:
raft: remove `empty()` from `fsm_output`
test: add test for manual triggering of Raft snapshots
api: add HTTP endpoint to trigger Raft snapshots
raft: server: add `trigger_snapshot` API
raft: server: track last persisted snapshot descriptor index
raft: server: framework for handling server requests
raft: server: inline `poll_fsm_output`
raft: server: fix indentation
raft: server: move `io_fiber`'s processing of `batch` to a separate function
raft: move `poll_output()` from `fsm` to `server`
raft: move `_sm_events` from `fsm` to `server`
raft: fsm: remove constructor used only in tests
raft: fsm: move trace message from `poll_output` to `has_output`
raft: fsm: extract `has_output()`
raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
raft: server: pass `*_aborted` to `set_exception` call
This change extends utils::directories class in
the following way:
- adds new member variables that correspond to
fields from db::config that describe paths
of directories
- introduces a public interface to retrieve the
values of the new members
- allows construction of utils::directories
object based on db::config to setup internal
member variables related to paths to dirs
The new members of utils::directories are overriden
when the provided values are empty. The way of setting
paths is taken from db::config.
To ensure that the new logic works correctly
`utils_directories_test` has been created.
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `db::replay_position`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17014
This implements the ability in load balancer to emit split or merge
requests, cancel ongoing ones if they're no longer needed, and
also finalize those that are ready for the topology changes.
That's all based on average tablet size, collected by coordinator
from all nodes, and split and merge thresholds.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The new metadata describes the ongoing resize operation (can be either
of merge, split or none) that spans tablets of a given table.
That's managed by group0, so down nodes will be able to see the
decision when they come back up and see the changes to the
metadata.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.
This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.
Fixes#16752Closesscylladb/scylladb#16888
* github.com:scylladb/scylladb:
test: test_tablets: add a test for cleanup after migration
test: pylib: add ScyllaCluster.wipe_sstables
test: boost: add commitlog_cleanup_test
db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
replica: table: garbage-collect irrelevant system.commitlog_cleanups records
db: commitlog: add min_position()
replica: table: populate system.commitlog_cleanups on tablet cleanup
db: system_keyspace: add system.commitlog_cleanups
replica: table: refresh compound sstable set after tablet cleanup
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.
Fixes#15902Closesscylladb/scylladb#16264
* github.com:scylladb/scylladb:
tests: utils: error injection: print time duration instead of count
error_injection: do not cast to milliseconds when injecting timeout
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.
Tablet scheduler is responsible for scheduling tablet rebuilding transition which
changes the replicas set. The infrastructure for handling decommission
in tablet scheduler is reused for this.
Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by load calculation to
affect all tablet replicas.
New kind of tablet transition is introduced called "rebuild" which
adds new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as regular
migration to ensure safe synchronization with request coordinators.
In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.
Fixes https://github.com/scylladb/scylladb/issues/16690.
Closesscylladb/scylladb#16894
* github.com:scylladb/scylladb:
tests: tablets: Add tests for removenode and replace
tablets: Add support for removenode and replace handling
topology_coordinator: tablets: Do not fail in a tight loop
topology_coordinator: tablets: Avoid warnings about ignored failured future
storage_service, topology: Track excluded state in locator::topology
raft topology: Introduce param-less topology::get_excluded_nodes()
raft topology: Move get_excluded_nodes() to topology
tablets: load_balancer: Generalize load tracking
tablets: Introduce get_migration_streaming_info() which works on migration request
tablets: Move migration_to_transition_info() to tablets.hh
tablets: Extract get_new_replicas() which works on migraiton request
tablets: Move tablet_migration_info to tablets.hh
tablets: Store transition kind per tablet
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)
test/boost/error_injection_test.cc(167): fatal error:
in "test_inject_future_disabled":
critical check wait_time > sleep_msec has failed [23839ns <= 10ms]
Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loading schemas of views and indexes was not supported, with either `--schema-file`, or when loading schema from schema sstables.
This PR addresses both:
* When loading schema from CQL (file), `CREATE MATERIALIZED VIEW` and `CREATE INDEX` statements are now also processed correctly.
* When loading schema from schema tables, `system_schema.views` is also processed, when the table has no corresponding entry in `system_schema.tables`.
Tests are also added.
Fixes: #16492Closesscylladb/scylladb#16517
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add schema-loading tests for MV/SI
test/cql-pytest: test_tools.py: extract some fixture logic to functions
test/cql-pytest: test_tools.py: extract common schema-loading facilities into base-class
tools/schema_loader: load_schema_from_schema_tables(): add support for MV/SI schemas
tools/schema_loader: load_one_schema_from_file(): add support for view/index schemas
test/boost/schema_loader_test: add test for mvs and indexes
tools/schema_loader: load_schemas(): implement parsing views/indexes from CQL
replica/database: extract existing_index_names and get_available_index_name
tools/schema_loader: make real_db.tables the only source of truth on existing tables
tools/schema_loader: table(): store const keyspace&
tools/schema_loader: make database,keyspace,table non-movable
cql3/statements/create_index_statement: build_index_schema(): include index metadata in returned value
cql3/statements/create_index_statement: make build_index_schema() public
cql3/statements/create_index_statement: relax some method's dependence on qp
cql3/statements/create_view_statement: make prepare_view() public
There are currently two options how to "request" the number of initial tables for a table
1. specify it explicitly when creating a keyspace
2. let scylla calculate it on its own
Both are not very nice. The former doesn't take cluster layout into consideration. The latter does, but starts with one tablet per shard, which can be too low if the amount of data grows rapidly.
Here's a (maybe temporary) proposal to facilitate at least perf tests -- the --tablets-initial-scale-factor option that enhances the option number two above by multiplying the calculated number of tablets by the configured number. This is what we currently do to run perf tests by patching scylla, with the option it going to be more convenient.
Closesscylladb/scylladb#16919
* github.com:scylladb/scylladb:
config: Add --tablets-initial-scale-factor
tablet_allocator: Add initial tablets scale to config
tablet_allocator: Add config
When allocating tablets for table for the frist time their initial count
is calculated so that each shard in a cluster gets one tablet. It may
happen that more than one initial tablet per shard is better, e.g. perf
tests typically rely on that.
It's possible to specify the initial tablets count when creating a
keyspace, this number doesn't take the cluster topology into
consideration and may also be not very nice.
As a temporary solution (e.g. for perf tests) we may add a configurable
that scales the initial number of calculated tablets by some factor
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The name of the keyspace being part of the partition key is not useful,
the table_id already uniquely identifies the table. The keyspace name
being part of the key, means that code wanting to interact with this
table, often has to resolve the table id, just to be able to provide the
keyspace name. This is counter productive, so make the keyspace_name
just a static column instead, just like table_name already is.
Fixes: #16377Closesscylladb/scylladb#16881
Standard containers don't have constructors that take ranges;
instead people use boost::copy_range or C++23 std::ranges::to.
Make the API more uniform by removing this special constructor.
The only caller, in a test, is adjusted.
Closesscylladb/scylladb#16905
`server` was the only user of this function and it can now be
implemented using `fsm`'s public interface.
In later commits we'll extend the logic of `io_fiber` to also subscribe
to other events, triggered by `server` API calls, not only to outputs
from `fsm`.
Refs #16757
Allows waiting for all previous and pending segment deletes to finish.
Useful if a caller of `discard_completed_segments` (i.e. a memtable
flush target) not only wants to ensure segments are clean and released,
but thoroughly deleted/recycled, and hence no treat to resurrecting
data on crash+restart.
Test included.
Closesscylladb/scylladb#16801
To enable tablets replication one needs to turn on the (experimental) feature and specify the `initial_tablets: N` option when creating a keyspace. We want tablets to become default in the future and allow users to explicitly opt it out if they want to.
This PR solves this by changing the CREATE KEYSPACE syntax wrt tablets options. Now there's a new TABLETS options map and the usage is
* `CREATE KEYSPACE ...` will turn tablets on or off based on cluster feature being enabled/disabled
* `CREATE KEYSPACE ... WITH TABLETS = { 'enabled': false }` will turn tablets off regardless of what
* `CREATE KEYSPACE ... WITH TABLETS = { 'enabled': true }` will try to enable tablets with default configuration
* `CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }` is now the replacement for `REPLICATION = { ... 'initial_tablets': <int> }` thing
fixes: #16319Closesscylladb/scylladb#16364
* github.com:scylladb/scylladb:
code: Enable tablets if cluster feature is enabled
test: Turn off tablets feature by default
test: Move test_tablet_drain_failure_during_decommission to another suite
test/tablets: Enable tables for real on test keyspace
test/tablets: Make timestamp local
cql3: Add feature service to as_ks_metadata_update()
cql3: Add feature service to ks_prop_defs::as_ks_metadata()
cql3: Add feature service to get_keyspace_metadata()
cql: Add tablets on/off switch to CREATE KEYSPACE
cql: Move initial_tablets from REPLICATION to TABLETS in DDL
network_topology_strategy: Estimate initial_tablets if 0 is set
This change is intended to remove the dependency to
operator<<(std::ostream&, const std::unordered_set<seastar::sstring>&)
from test/boost/cql_auth_query_test.cc.
It prepares the test for removal of the templated helpers.
Such removal is one of goals of the referenced issue that is linked below.
Refs: #13245
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#16758
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we
* define a formatter for `auth::resource` and friends,
* update their callers of `operator<<` to use `fmt::print()`.
* drop `operator<<`, as they are not used anymore.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16765
When started cql_test_env creates a test keyspace. Some tablets test
cases create a table in this keyspace, but misuse the whole feature. The
thing is that while tablets feature is ON in those test cases, the
keyspace itself doesn _not_ have the initial_tables option and thus
tablets are not enabled for the ks' table for real. Currently test cases
work just because this table is only used as a transparent table ID
placeholder. If turning on tablets for the keyspace, several test cases
would get broken for two reasons.
First, the tables map will no longer be empty on test start.
Second, applying changes to tablet metadata may not be visible, becase
test case uses "ranom" timestamp, that can be less that the initial
metadata mutations' timestamp.
This patch fixes all three places:
1. enables tables for the test keyspace
2. removes assumption that the initial metadata is empty
3. uses large enough timestamp for subsequent mutations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch changes the syntax of enabling tablets from
CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> }
to be
CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }
and updates all tests accordingly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change is intended to remove the dependency to
operator<<(std::ostream&, const std::unordered_set<T>&)
from auth_resource_test.cc.
It prepares the test for removal of the templated helpers
from utils/to_string.hh, which is one of goals of the
referenced issue that is linked below.
Refs: #13245
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#16754
this is to mimic the formatting of `human_readable_value`, and to prepare for consolidating these two formatters, so we don't have two pretty printers in the tree.
Closesscylladb/scylladb#16726
* github.com:scylladb/scylladb:
utils/pretty_printers: add "I" specifier support
utils/pretty_printers: use the formatting of to_hr_size()