Previously, the prev_ip check caused problems for bootstrapping nodes.
Suppose a bootstrapping node A appears in the system.peers table of
some other node B. Its record contains only node A's ID and IP, due to
the special handling of bootstrapping nodes in raft_topology_update_ip.
Suppose node B gets temporarily isolated from the topology coordinator.
The topology coordinator fences out node B and successfully finishes
bootstrapping node A. Later, when connectivity is restored,
topology_state_load runs on node B; node A is already in normal state,
but the gossiper on B might not have any state for it yet. In this
case, raft_topology_update_ip would not update system.peers because
the gossiper state is missing. Subsequently, on_join/on_restart/on_alive
events would skip updates because the IP in the gossiper matches the IP
for that node in system.peers.
Removing the check avoids this issue, with negligible overhead:
* on_join/on_restart/on_alive happen only once in a
node’s lifetime
* topology_state_load already updates all nodes each time it runs.
This problem was found by a fencing test, which crashed a
node while another node was going through the bootstrapping
process. After the restart, the crashed node saw that the other node
was already in normal state, since the topology coordinator had fenced
out the crashed node and managed to finish the bootstrapping process
successfully. This test will be provided in a separate
fencing-for-paxos PR.
Closes scylladb/scylladb#25596
- Move the initialization of log_done inside the try block to catch any
exceptions it may throw.
- Relocate the failure warning log after sink.close() cleanup
to guarantee sink.close() is always called before logging errors.
Refs #25497
Closes scylladb/scylladb#25591
Commit 60d2cc886a changed
get_all_ranges to return start-bound ranges and pre-calculate
the wrapping range, and then changed construct_range_to_endpoint_map
to pass r.start() (which is now always engaged) as the vnode token.
However, as can be seen in token_metadata_impl::first_token,
the token ranges (a.k.a. vnodes) **end** with the sorted tokens,
not start with them, so an arbitrary token t belongs to the
vnode covering some range `sorted_tokens[i-1] < t <= sorted_tokens[i]`.
Fixes #25541
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#25580
The previous implementation did not handle topology changes well:
* In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing
`unavailable_exception`.
* If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty.
This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called.
backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet.
Closes scylladb/scylladb#25508
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_during_migration
storage_proxy: node_local_only: always use my_host_id
The http_context object carries a sharded<database> reference, and all handlers in the api/ code can use it the way they want. This creates a potential use-after-free, because the context is initialized very early and destroyed very late. All other services are used by handlers differently -- after a service is initialized, the relevant endpoints are registered and the service reference is captured by the handlers. Since endpoint deregistration is defer-scheduled at the same place, this guarantees that handlers cannot use the service after it's stopped.
This PR does the same for api/ handlers -- the sharded<database> reference is captured inside set_server_column_family() and then used by handlers lambdas.
Similar changes for other services: #21053, #19417, #15831, etc
This is part of the ongoing cleanup of service dependencies; no need to backport.
Closes scylladb/scylladb#25467
* github.com:scylladb/scylladb:
api/column_family: Capture sharded<database> to call get_cf_stats()
api: Patch get_cf_stats to get sharded<database>& argument
api: Drop CF map-reducers ability to work with http context
api: Patch callers of map_reduce_cf(_raw)? to use sharded<database>
api: Use captured sharded<database> reference in handlers
api/column_family: Make map_reduce_cf_time_histogram() use sharded<database>
api/column_family: Make sum_sstable() use sharded<database>
api/column_family: Make get_cf_unleveled_sstables() use sharded<database>
api/column_family: Make get_cf_stats_count() use sharded<database>
api/column_family: Make get_cf_rate_and_histogram() use sharded<database>
api/column_family: Make get_cf_histogram() use sharded<database>
api/column_family: Make get_cf_stats_sum() use sharded<database>
api/column_family: Make set_tables_tombstone_gc() use sharded<database>
api/column_family: Make set_tables_autocompaction() use sharded<database>
api/column_family: Make for_tables_on_all_shards() use sharded<database>
api: Capture sharded<database> for set_server_column_family()
api: Make CF map-reducers work on sharded<database> directly
api: Make map_reduce_cf_time_histogram() file-local
api: Remove unused ctx argument from run_toppartitions_query()
To avoid dependency proliferation, switch to forward declarations.
In one case, we introduce indirection via std::unique_ptr and
deinline the constructor and destructor.
Ref #1
Closes scylladb/scylladb#25584
Copy `commitlog_test.py` from the scylla-dtest test suite and make it work with `test.py`.
As a part of the porting process, remove unused imports and markers, and remove non-next_gating tests and tests marked with the `skip`, `skip_if`, and `xfail` markers.
test.py uses the `commitlog` directory instead of dtest's `commitlogs`.
Also, add the `commitlog_segment_size_in_mb: 32` option to test_stop_failure_policy to make _provoke_commitlog_failure
work.
Tests `test_total_space_limit_of_commitlog_with_large_limit` and `test_total_space_limit_of_commitlog_with_medium_limit` use too much disk space and take too long to execute. Keep them in scylla-dtest for now.
Enable the test in `suite.yaml` (run in dev mode only).
Additional modifications to test.py/dtest shim code:
- add ScyllaCluster.flush() method
- add ScyllaNode.stress() method
- add tools/files.py::corrupt_file() function
- add tools/data.py::run_query_with_data_processing() function
- copy some assertions from dtest
Also add a missing mode restriction for the auth_test.py file.
Closes scylladb/scylladb#24946
* github.com:scylladb/scylladb:
test.py: dtest: remove slow and greedy tests from commitlog_test.py
test.py: dtest: make commitlog_test.py run using test.py
test.py: dtest: add ScyllaCluster.flush() method
test.py: dtest: add ScyllaNode.stress() method
test.py: dtest: add tools/data.py::run_query_with_data_processing() function
test.py: dtest: add tools/files.py::corrupt_file() function
test.py: dtest: copy some assertions from dtest
test.py: dtest: copy unmodified commitlog_test.py
This PR extends the `tmpdir` class with an option to preserve the directory if the destructor is called during stack unwinding. It also uses this feature in KMIP tests, where the tmpdir contains PyKMIP server logs, which may be useful when diagnosing test failures.
Fixes #25339.
Not important enough to backport.
Closes scylladb/scylladb#25367
* github.com:scylladb/scylladb:
encryption_at_rest_test: Preserve tmpdir from failing KMIP tests
test/lib: Add option to preserve tmpdir on exception
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile, but less efficient, because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows entire SSTables to be omitted. This patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode mode is less
important to support. On the other hand, incremental repair for
vnodes is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
sstable is either fully repaired or not repaired at all, which is a huge
simplification.
This patch uses the repaired_at field from the sstables::statistics
component to mark an sstable as repaired. It uses a virtual clock as
the repair timestamp, i.e., a monotonically increasing number for the
repaired_at field of an SSTable and for the sstables_repaired_at column
in the system.tablets table. Note that when an sstable is not repaired,
the repaired_at field holds the default value 0. The being_repaired
in-memory field of an SSTable is used to explicitly mark that an
SSTable has been selected. The following variables are used for
incremental repair:
- The repaired_at on-disk field of an SSTable: a 64-bit number that
  increases sequentially.
- The sstables_repaired_at column, added to the system.tablets table:
  repaired_at <= sstables_repaired_at means the sstable is repaired.
- The being_repaired in-memory field of an SSTable: a repair UUID that
  tells which sstables have participated in the repair.
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes #22472
Closes scylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
When using automatic Rust build tools in an IDE,
the files generated in the `rust/target/` directory
have been treated by git as unstaged changes.
After this change, the generated files no longer
pollute the view of git changes.
Closes scylladb/scylladb#25389
endpoint_filter() is used by batchlog to select nodes to replicate
to.
It contains an unordered_multimap data structure that maps rack names
to nodes.
It misuses std::unordered_multimap::bucket_count() to count the number
of racks. While values that share a key in a multimap will definitely
be in the same bucket, it's possible for values that don't share a
key to share a bucket. Therefore bucket_count() undercounts the
number of racks.
Fix this by using a more accurate data structure: a map of sets.
The patch changes validated.bucket_count() to validated.size()
and validated.size() to a new variable nr_validated.
The patch does cause an extra two allocations per rack (one for the
unordered_map node, one for the unordered_set bucket vector), but
this is only used for logged batches, so it is amortized over all
the mutations in the logged batch.
Closes scylladb/scylladb#25493
When the user disables CDC on a table, the CDC log table is not removed.
Instead, it's detached from the base table, and it functions as a normal
table (with some differences). If that log table still exists at the point
when the user re-enables CDC on the base table, instead of creating a new
log table, the old one is re-attached to the base.
For more context on that, see commit:
scylladb/scylladb@adda43edc7.
In this commit, we add validation tests that check whether the changes
made on the base table after disabling CDC are reflected on the log table
after re-enabling CDC. The definition of the log table should be the same
as if CDC had never been disabled.
Closes scylladb/scylladb#25071
This pull request introduces minor code refactoring and aesthetic improvements to the S3 client and its associated test suite. The changes focus on enhancing readability, consistency, and maintainability without altering any functional behavior.
No backport is required, as the modifications are purely cosmetic and do not impact functionality or compatibility.
Closes scylladb/scylladb#25490
* github.com:scylladb/scylladb:
s3_client: relocate `req` creation closer to usage
s3_client: reformat long logging lines for readability
s3_test: extract file writing code to a function
Flush failure with seastar::named_gate_closed_exception is expected
if a respective compaction group was already stopped.
Lower the severity of the log message in dirty_memory_manager::flush_one
for this exception.
Fixes: https://github.com/scylladb/scylladb/issues/25037.
Closes scylladb/scylladb#25355
Currently, when a container or smart pointer holds a const payload
type, utils::clear_gently does not detect the object's clear_gently
method as the method is non-const and requires a mutable object,
as in the following example in class tablet_metadata:
```
using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>;
using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>;
```
That said, when a container is cleared gently, the elements it holds
are destroyed anyhow, so we'd like to allow clearing them gently before
destruction.
This change still doesn't allow calling utils::clear_gently directly
on const objects.
Respective unit tests are added.
Fixes #24605
Fixes #25026
* This is an optimization that is not strictly required to backport (as https://github.com/scylladb/scylladb/pull/24618 dealt with clear_gently of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>` well enough)
Closes scylladb/scylladb#24606
* github.com:scylladb/scylladb:
utils: stall_free: detect clear_gently method of const payload types
utils: stall_free: clear gently a foreign shared ptr only when use_count==1
Tests test_total_space_limit_of_commitlog_with_large_limit and
test_total_space_limit_of_commitlog_with_medium_limit use too much
disk space and take too long to execute. Keep them in
scylla-dtest for now.
As a part of the porting process, remove unused imports and
markers, and remove non-next_gating tests and tests marked with the
`skip`, `skip_if`, and `xfail` markers.
test.py uses the `commitlog` directory instead of dtest's
`commitlogs`.
Remove the test_stop_failure_policy test because the way it
provokes a commitlog failure (changing file permissions) doesn't
work on CI.
Enable the test in suite.yaml (run in dev mode only).
Implement repetition of files using the `pytest_collect_file` hook: run file collection as many times as needed to cover all `--mode`/`--repeat` combinations. Store the build mode and run ID in the stash of the repeated item.
Some additional changes done:
- Add `TestSuiteConfig` class to handle all operations with `test_config.yaml`
- Add support for `run_first` option in `test_config.yaml`
- Move disabled test logic to `pytest_collect_file` hook.
These changes allow removing the custom logic for `--mode`, `--repeat`, and disabled tests from the code for C++ tests, and prepare for switching the Python/CQLApproval/Topology tests to the pytest runner.
Also, this PR includes required refactoring changes and fixes:
- Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: `base.py`, `boost.py`, and `unit.py`
- Remove unused imports in `test.py`
- Use the constant for `"suite.yaml"` string
- Some test suites have their own test runners based on pytest, and they don't need all the machinery we use for `test.py`. Move all code related to the `test.py` framework to `test/pylib/runner.py` and use it as a plugin conditionally (via the `SCYLLA_TEST_RUNNER` env variable.)
- Add a `cwd` parameter to the `run_process()` methods in the `resource_gather` module to avoid using `os.chdir()` (and sort the parameters in the same order as in `subprocess.Popen`.)
- `extra_scylla_cmdline_options` is a list of command-line arguments, and each argument should actually be a separate item. A few configuration files had the `--reactor-backend` option added in a format that doesn't follow this rule.
This PR is a refactoring step for https://github.com/scylladb/scylladb/pull/25443
Closes scylladb/scylladb#25465
* github.com:scylladb/scylladb:
test.py: pytest: support --mode/--repeat in a common way for all tests
test.py: pytest: streamline suite configuration handling
test.py: refactor: remove unused imports in test.py
test.py: fix run with bare pytest after merge of scylladb/scylladb#24573
test.py: refactor: move framework-related code to test.pylib.runner
test.py: resource_gather: add cwd parameter to run_process()
test.py: refactor: use proper format for extra_scylla_cmdline_options
Update more handlers not to get the database from the context, but to
capture it directly in the handlers' lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it accepts the http context and immediately gets the database from
it to pass to map_reduce_cf. Callers are updated to pass the database
obtained from the context they already have.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25396
Next part: implementing sstable index writers and readers on top of the abstract trie writers/readers.
The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.
This series provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.
This translation is performed whenever a new decorated key or clustering block
is added to a BTI index, and whenever a BTI index is queried for a range of positions.
For a description of the encoding, see
fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)
The translation logic, with all the fragment awareness, lazy
evaluation and copy avoidance, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.
No backports needed, new functionality.
Closes scylladb/scylladb#25506
* github.com:scylladb/scylladb:
sstables/trie: add BTI key translation routines
tests/lib: extract generate_all_strings to test/lib
tests/lib: extract nondeterministic_choice_stack to test/lib
sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file
sstables/mx: move clustering_info from writer.cc to types.hh
sstables/trie: allow `comparable_bytes_iterator` to return a mutable span
dht/ring_position: add ring_position_view::weight()
This patch finalizes the change started by the previous patch of the
same title -- map_reduce_cf(_raw) is switched to work only with a
sharded<replica::database> reference. All callers were updated by
previous patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some handlers are left that still pass the http_context. These handlers
will eventually get their own captured sharded database reference, but
for now make them explicitly use the one from the context. This will
allow de-templatizing the map_reduce_cf... helpers, making the code simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Not all of them can switch from ctx to the database, so in a few places
both the database and the ctx are captured. However, the ctx.db
reference is no longer used by the column_family handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to other API handlers, instead of using a database from http
context, patch the setting methods to capture the database from main
code and pass it around to handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The next patches are going to change a bunch of map_reduce_cf_... callers
to pass a sharded<database> reference to it instead of the http context.
To avoid patching all the api/ code at once, keep the ability to call it
with ctx at hand. Eventually only the sharded<database>& overload will be kept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>