In test_exception_safety_of_update_from_memtable, we have a potential
throw from external_updater.
external_updater is supposed to be infallible.
Scylla currently aborts when an external_updater throws, so a throw from
there just fails the test.
This isn't intended. We aren't testing external_updater in this test.
Fixes #18163
Closes scylladb/scylladb#18171
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.
The feature can be manually verified by doing the following:
1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB
The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.
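The accounting described above can be sketched roughly as follows. This is a simplified model, not the sstables_manager code: the class name `component_memory_tracker`, its methods, and the policy of reclaiming from the largest holder first are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical model: tracks reclaimable (bloom filter) memory per sstable
// and reclaims from the largest holder until the total fits under the limit.
class component_memory_tracker {
    std::map<std::string, std::size_t> _reclaimable; // sstable name -> bytes
    std::size_t _total = 0;
    std::size_t _limit;
public:
    // threshold plays the role of components_memory_reclaim_threshold.
    component_memory_tracker(std::size_t available_memory, double threshold)
        : _limit(static_cast<std::size_t>(available_memory * threshold)) {
        assert(threshold > 0.0 && threshold <= 1.0);
    }

    void add(const std::string& sst, std::size_t bloom_filter_bytes) {
        _reclaimable[sst] += bloom_filter_bytes;
        _total += bloom_filter_bytes;
        while (_total > _limit) {
            reclaim_largest();
        }
    }

    std::size_t total() const { return _total; }
    bool holds(const std::string& sst) const { return _reclaimable.count(sst) > 0; }

private:
    // Assumed policy: drop the component that frees the most memory.
    void reclaim_largest() {
        auto largest = _reclaimable.begin();
        for (auto it = _reclaimable.begin(); it != _reclaimable.end(); ++it) {
            if (it->second > largest->second) {
                largest = it;
            }
        }
        _total -= largest->second;
        _reclaimable.erase(largest);
    }
};
```

With 1GB of available memory and a threshold of 0.1, the limit in this model works out to roughly 100MB, which matches the cap mentioned in the manual verification steps.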
Fixes #17747
Closes scylladb/scylladb#17771
* github.com:scylladb/scylladb:
test_bloom_filter.py: disable reclaiming memory from components
sstable_datafile_test: add tests to verify auto reclamation of components
test/lib: allow overriding available memory via test_env_config
sstables_manager: support reclaiming memory from components
sstables_manager: store available memory size
sstables_manager: add variable to track component memory usage
db/config: add a new variable to limit memory used by table components
sstable_datafile_test: add testcase to verify reclamation from sstables
sstables: support reclaiming memory from components
this series includes test related changes to enable us to drop `FMT_DEPRECATED_OSTREAM` deprecated in {fmt} v10.
Refs #13245
Closes scylladb/scylladb#18054
* github.com:scylladb/scylladb:
test: unit: add fmt::formatter for test_data in tests
test/lib: do not print with fmt::to_string()
test/boost: print runtime_error using e.what()
This series provides a reallocate_tablets function, that's initially called by allocate_tablets_for_new_table.
The new allocation implementation is independent of vnodes/token ownership.
Rather than using the natural_endpoints_tracker, it implements its own tracking
based on dc/rack load (== number of replicas in rack), with the additional benefit
that tablet allocation will balance the allocation across racks, using a heap structure,
similar to the one we use to balance tablet allocation across shards in each node.
reallocate_tablets may also be called with an optional parameter pointing to the current tablet_map.
In this case the function either allocates more tablet replicas in datacenters for which the replication factor was increased,
or it will deallocate tablet replicas from datacenters for which replication factor was decreased.
The NetworkTopologyStrategy_tablets_test unit test was extended to cover replication factor changes.
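The rack-load heap mentioned above can be illustrated with a small standalone sketch. It is not the network_topology_strategy code: `rack_load` and `allocate_replicas` are hypothetical names, and the model only tracks a replica count per rack, ignoring nodes and shards.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// (number of replicas already in the rack, rack name)
using rack_load = std::pair<std::size_t, std::string>;

// Hypothetical helper: places `count` new replicas, always picking the rack
// that currently holds the fewest replicas. Returns the chosen rack names.
std::vector<std::string> allocate_replicas(std::vector<rack_load> racks, std::size_t count) {
    assert(count == 0 || !racks.empty());
    using min_heap = std::priority_queue<rack_load, std::vector<rack_load>,
                                         std::greater<rack_load>>;
    min_heap heap(std::greater<rack_load>{}, std::move(racks));

    std::vector<std::string> chosen;
    while (chosen.size() < count) {
        rack_load least = heap.top(); // least-loaded rack so far
        heap.pop();
        chosen.push_back(least.second);
        ++least.first;                // account for the new replica
        heap.push(std::move(least));
    }
    return chosen;
}
```

Starting from loads of 2, 0 and 1, three allocations in this model leave every rack with two replicas.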
Closes scylladb/scylladb#17846
* github.com:scylladb/scylladb:
network_topology_strategy: reallocate_tablets: consider new_racks before existing racks
network_topology_startegy_test: add NetworkTopologyStrategy_tablet_allocation_balancing_test
network_topology_strategy: reallocate_tablets: support deallocation via rf change
network_topology_startegy_test: tablets_test: randomize cases
network_topology_strategy: allocate_tablets_for_new_table: do not rely on token ownership
network_topology_startegy_test: add NetworkTopologyStrategy_tablets_negative_test
network_topology_strategy_test: endpoints_check: use particular BOOST_CHECK_* functions
network_topology_strategy_test: endpoints_check: verify that replicas are placed on unique nodes
network_topology_strategy_test: endpoints_check: strictly check rf for tablets
network_topology_strategy_test: full_ring_check for tablets: drop unused options param
This change adds the missing Cassandra compaction option unchecked_tombstone_compaction.
Setting this option to true causes compaction to ignore tombstone_threshold and decide whether to compact an sstable based only on the value of tombstone_compaction_interval.
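The decision described above could look roughly like the sketch below. The struct, the function name, the default values and the exact comparison operators are assumptions for illustration; the real compaction strategy code may differ.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical option bundle; the defaults here are illustrative assumptions.
struct tombstone_options {
    double tombstone_threshold = 0.2;
    std::chrono::seconds tombstone_compaction_interval{86400};
    bool unchecked_tombstone_compaction = false;
};

// Returns true if an sstable should be considered for tombstone compaction.
bool worth_tombstone_compaction(const tombstone_options& opts,
                                std::chrono::seconds sstable_age,
                                double droppable_tombstone_ratio) {
    assert(droppable_tombstone_ratio >= 0.0 && droppable_tombstone_ratio <= 1.0);
    // The interval is always respected.
    if (sstable_age < opts.tombstone_compaction_interval) {
        return false;
    }
    // With the unchecked option set, the threshold is ignored.
    if (opts.unchecked_tombstone_compaction) {
        return true;
    }
    return droppable_tombstone_ratio >= opts.tombstone_threshold;
}
```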
Fixes #1487
Closes scylladb/scylladb#17976
* github.com:scylladb/scylladb:
removed forward declaration of resharding_descriptor
compaction options and troubleshooting docs
cql-pytest/test_compaction_strategy_validation.py
test/boost/sstable_compaction_test.cc
compaction: implement unchecked_tombstone_compaction
Calling `_next_row.get_iterator_in_latest()` is illegal when `_next_row` is not
pointing at a row. In particular, the iterator returned by such call might be
dangling.
We have observed this to cause a use-after-free in the field, when a reverse
read called `maybe_add_to_cache` after `_latest_it` was left dangling after
a dead row removal in `copy_from_cache_to_buffer`.
To fix this, we should ensure that we only call `_next_row.get_iterator_in_latest()`
when `_next_row` is pointing at a row.
Only the occurrences of this problem in `maybe_add_to_cache` are truly dangerous.
As far as I can see, other occurrences can't break anything as of now.
But we apply fixes to them anyway.
Closes scylladb/scylladb#18046
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter. but fortunately, fmt v10 brings the builtin
formatter for classes derived from `std::exception`. but before
switching to {fmt} v10, and after dropping `FMT_DEPRECATED_OSTREAM`
macro, we need to print out `std::runtime_error`. so far, we don't
have a shared place for formatter for `std::runtime_error`. so we
are addressing the needs on a case-by-case basis.
in this change, we just print it using `e.what()`. its behavior
is identical to what we have now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add support for deallocating tablet replicas when the
datacenter replication factor is decreased.
We deallocate replicas in back-to-front order to maintain
replica pairing between the base table and
its materialized views.
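A minimal sketch of back-to-front deallocation, assuming that base and view replicas are paired by their position in the replica list (the helper name `shrink_replicas` is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: shrinks a per-datacenter replica list to new_rf by
// dropping entries from the back, so the surviving replicas keep their
// indexes and therefore their pairing with view replicas.
void shrink_replicas(std::vector<std::string>& replicas, std::size_t new_rf) {
    assert(new_rf <= replicas.size()); // this path only handles a decrease
    while (replicas.size() > new_rf) {
        replicas.pop_back();
    }
}
```

Applying the same call to the base table's list and to the view's list leaves index 0 paired with index 0, index 1 with index 1, and so on.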
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of deterministically testing a very small set of cases,
randomize the shard_count per node, the cluster topology
and the NetworkTopologyStrategy options.
The next patch will extend the test to also test
`reallocate_tablets` with randomized options.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test that attempting to allocate tablets
throws an error when there are not enough nodes
for the configured replication factor.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Using e.g. `BOOST_CHECK_EQUAL(endpoints.size(), total_rf)`
rather than `BOOST_CHECK(endpoints.size() == total_rf)`
prints a more detailed error message that includes the
runtime values, if it fails.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With tablets we want to verify that the number of
replicas allocated per tablet per dc exactly matches
the replication strategy per-dc replication factor options.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds, specifically when retiring features.
Fix by removing feature_service.hh from headers, and supplying forward
declarations and includes in .cc files where needed.
Closes scylladb/scylladb#18005
without `FMT_DEPRECATED_OSTREAM` macro, `UUID::to_sstring()` is
implemented using its `fmt::formatter`, which is not available
at the end of this header file where `UUID` is defined. at this moment,
we still use `FMT_DEPRECATED_OSTREAM` and {fmt} v9, so we can
still use `UUID::to_sstring()`, but in {fmt} v10, we cannot.
so, in this change, we change all callers of `UUID::to_sstring()`
to `fmt::to_string()`, so that we don't depend on
`FMT_DEPRECATED_OSTREAM` and {fmt} v9 anymore.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The test creates ut4 with a lot of fields, which
may take a while in debug builds. To avoid a raft
operation timeout, set the threshold to some big value.
The error injector is disabled in release builds,
so this setting won't be applied to them.
This shouldn't be a problem since release builds
are fast enough, even on arm.
Fixes scylladb/scylladb#17987
Closes scylladb/scylladb#17997
This patch introduces raft-based service levels.
The difference to the current method of working is:
- service levels are stored in `system.service_levels_v2`
- reads are executed with `LOCAL_ONE`
- writes are done via raft group0 operation
Service levels are migrated to v2 in topology upgrade.
After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to the `system.scylla_local` table. If this row is present, the raft data accessor is created from the beginning and it handles the recovery mode procedure (service levels are then read from the v2 table even if consistent topology is disabled).
Fixes #17926
Closes scylladb/scylladb#16585
* github.com:scylladb/scylladb:
test: test service levels v2 works in recovery mode
test: add test for service levels migration
test: add test for service levels snapshot
test:topology: extract `trigger_snapshot` to utils
main: create raft dda if sl data was migrated
service:qos: store information about sl data migration
service:qos: service levels migration
main: assign standard service level DDA before starting group0
service:qos: fix `is_v2()` method
service:qos: add a method to upgrade data accessor
test: add unit_test_raft_service_levels_accessor
service:storage_service: add support for service levels raft snapshot
service:qos: add abort_source for group0 operations
service:qos: raft service level distributed data accessor
service:qos: use group0_guard in data accessor
cql3:statements: run service level statements on shard0 with raft guard
test: fix overrides in unit_test_service_levels_accessor
service:qos: fix indentation
service:qos: coroutinize some of the methods
db:system_keyspace: add `SERVICE_LEVELS_V2` table
service:qos: extract common service levels' table functions
In this PR, we ensure unpublished CDC generation's data is
never removed, which was theoretically possible. If it happened,
it could cause problems. CDC generation publisher would then try
to publish the generation with its data removed. In particular, the
precondition of calling `_sys_ks.read_cdc_generation` wouldn't be
satisfied.
We also add a test that passes only after the fix. However, this test
needs to block execution of the CDC generation publisher's loop
twice. Currently, error injections with handlers do not allow it
because handlers always share received messages. Apart from the
first created handler, all handlers would be instantly unblocked by
a message from the past that has already unblocked the first
handler. This seems like a general limitation that could cause
problems in the future, so in this PR, we extend injections with
handlers to solve it once and for all. We add the `share_messages`
parameter to the `inject` (with handler) function. Depending on its
value, handlers will share messages (as before) or not.
Fixes scylladb/scylladb#17497
Closes scylladb/scylladb#17934
* github.com:scylladb/scylladb:
topology_coordinator: clean_obsolete_cdc_generations: fix log
topology_coordinator: do not clear unpublished CDC generation's data
topology_coordinator: cdc_generation_publisher_fiber injection: make handlers share messages
error_injection: allow injection handlers to not share messages
Migrate data from `system_distributed.service_levels` to
`system.service_levels_v2` during raft topology upgrade.
Migration process reads data from old table with CL ALL
and inserts the data to the new table via raft.
For a single injection, all created injection handlers share all
received messages. In particular, it means that one received message
unblocks all handlers waiting for the first message. This behavior
is often desired, for example, if multiple fibers execute the
injected code and we want to unblock them all with a single message.
However, there is a problem if we want to block every execution
of the injected code. Apart from the first created handler, all
handlers will be instantly unblocked by messages from the past that
have already unblocked the first handler.
In one of the following commits, we add a test that needs to block
the CDC generation publisher's loop twice. Since it looks like there
are no good workarounds for this arguably general problem, we extend
injections with handlers in a way that solves it. We introduce the
new `share_messages` parameter. Depending on its value, handlers
will share messages or not. The details are described in the new
comments in `error_injection.hh`.
We also add some basic unit tests for the new functionality.
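The difference between the two modes can be modelled with a toy class. This is not the `error_injection.hh` implementation: the `injection` and `handler` types below are hypothetical, the real handlers wait asynchronously instead of polling, and the exact non-sharing semantics are an assumption (here a message is consumed by whichever handler reads it first).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model of one injection point and its handlers.
class injection {
    bool _share_messages;
    std::vector<std::string> _log; // every message ever received
    std::size_t _consumed = 0;     // common cursor, used when not sharing
public:
    class handler {
        injection& _inj;
        std::size_t _next = 0;     // private cursor, used when sharing
    public:
        explicit handler(injection& inj) : _inj(inj) {}

        // Returns true if a message was available; the real handler would
        // block until one arrives instead of returning false.
        bool try_wait_for_message() {
            std::size_t& cursor = _inj._share_messages ? _next : _inj._consumed;
            assert(cursor <= _inj._log.size());
            if (cursor == _inj._log.size()) {
                return false;
            }
            ++cursor;
            return true;
        }
    };

    explicit injection(bool share_messages) : _share_messages(share_messages) {}
    void receive_message(std::string msg) { _log.push_back(std::move(msg)); }
    handler make_handler() { return handler(*this); }
};
```

In the sharing mode, a handler created after a message was received is unblocked by that old message right away. In the non-sharing mode of this model, the second handler stays blocked until a second message arrives, which is what a test needs in order to block the same loop twice.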
The group0 state machine calls `merge_topology_snapshot` from
`transfer_snapshot`. It feeds it with `raft_topology_snapshot` returned
from `raft_pull_topology_snapshot`. This snapshot includes the entire
`system.cdc_generations_v3` table. It can be huge and break the
commitlog `max_record_size` limit.
The `system.cdc_generations_v3` is a single-partition table, so all the
data is contained in one mutation object. To fit the commitlog limit we
split this mutation into many smaller ones and apply them in separate
`database::apply` calls. That means we give up the atomicity guarantee,
but we actually don't need it for `system.cdc_generations_v3` and
`system.topology_requests`.
This PR fixes the dtest
`update_cluster_layout_tests.py::TestLargeScaleCluster::test_add_many_nodes_under_load`
Fixes scylladb/scylladb#17545
Closes scylladb/scylladb#17632
* github.com:scylladb/scylladb:
test_cdc_generation_data: test snapshot transfer
storage_service::merge_topology_snapshot: handle big cdc_generations_v3 mutations
mutation: add split_mutation function
storage_service::merge_topology_snapshot: fix indentation
If there is a bug in the tablet scheduler which makes it never
converge for a given state of topology, rebalance_tablets() will never
complete and will generate a huge amount of logs. This patch adds a
sanity limit so that we fail earlier.
This was observed in one of the test_load_balancing_with_random_load runs in CI.
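A sanity limit of this kind can be sketched as a bounded loop. The function name, the callback shape and the error text are assumptions for illustration, not the actual rebalance_tablets() code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical driver: `step` runs one scheduling round and returns true if
// it still found work to do. Returns the number of productive rounds, or
// throws once max_iterations rounds have run without converging.
std::size_t rebalance_until_converged(const std::function<bool()>& step,
                                      std::size_t max_iterations) {
    assert(max_iterations > 0);
    for (std::size_t i = 0; i < max_iterations; ++i) {
        if (!step()) {
            return i; // converged: nothing left to migrate
        }
    }
    throw std::runtime_error("tablet rebalancing did not converge within the iteration limit");
}
```

Failing with an exception keeps a non-converging scheduler from looping forever and flooding the logs.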
Fixes scylladb/scylladb#17894.
Closes scylladb/scylladb#17916
The function splits the source mutation into multiple
mutations so that their size does not exceed the
max_size limit. The size of a mutation is calculated
as the sum of the memory_usage() of its constituent
mutation_fragments.
The implementation is taken from view_updating_consumer.
We use mutation_rebuilder_v2 to reconstruct mutations from
a stream of mutation fragments and recreate the output
mutation whenever we reach the limit.
We'll need this function in the next commit.
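The splitting idea can be shown on plain data. This sketch is not the split_mutation implementation: `fragment` and `split_by_size` are hypothetical stand-ins, there is no mutation_rebuilder_v2 involved, and the sketch closes a chunk before the limit would be exceeded, which may differ from where the real code cuts.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a mutation fragment; only its size matters here.
struct fragment {
    std::size_t memory_usage;
};

// Hypothetical helper: groups consecutive fragments into chunks whose summed
// memory_usage stays within max_size. A single fragment larger than max_size
// still gets a chunk of its own.
std::vector<std::vector<fragment>> split_by_size(const std::vector<fragment>& source,
                                                 std::size_t max_size) {
    assert(max_size > 0);
    std::vector<std::vector<fragment>> out;
    std::vector<fragment> current;
    std::size_t current_size = 0;
    for (const fragment& f : source) {
        if (!current.empty() && current_size + f.memory_usage > max_size) {
            out.push_back(std::move(current));
            current.clear(); // moved-from vector: reset to a known empty state
            current_size = 0;
        }
        current.push_back(f);
        current_size += f.memory_usage;
    }
    if (!current.empty()) {
        out.push_back(std::move(current));
    }
    return out;
}
```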
Lots of BOOST_REQUIRE-s in this test require some integers to be in some
eq/gt/le relation to each other, and one place compares rack names
as strings. Using the more verbose Boost checkers is preferred in such cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#17866
This PR fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because tablet replica allocation for
rebuild will not be able to find a viable destination, as the replacing node
is not considered to be a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.
The solution taken here is to not drain tablet replicas from replaced
node during topology request but leave it to happen later after the
replaced node is in left state and replacing node is in normal state.
The replacing node waits for this draining to be complete on boot
before the node is considered booted.
Fixes https://github.com/scylladb/scylladb/issues/17025
Nodes in the left state will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those nodes' location (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacenter endpoints excludes them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet replica sets.
Currently left nodes are never removed from topology, so will
accumulate in memory. We could garbage-collect them from topology
coordinator if a left node is absent in any replica set. That means we
need a new state - left_for_real.
Closes scylladb/scylladb#17388
* github.com:scylladb/scylladb:
test: py: Add test for view replica pairing after replace
raft, api: Add RESTful API to query current leader of a raft group
test: test_tablets_removenode: Verify replacing when there is no spare node
doc: topology-on-raft: Document replace behavior with tablets
tablets, raft topology: Rebuild tablets after replacing node is normal
tablets: load_balancer: Access node attributes via node struct
tablets: load_balancer: Extract ensure_node()
mv: Switch to using host_id-based replica set
effective_replication_map: Introduce host_id-based get_replicas()
raft topology: Keep nodes in the left state to topology
tablets: Introduce read_required_hosts()
It was observed that some use cases might append old data constantly to
memtable, blocking GC of expired tombstones.
That's because timestamp of memtable is unconditionally used for
calculating max purgeable, even when the memtable doesn't contain the
key of the tombstone we're trying to GC.
The idea is to treat memtable as we treat L0 sstables, i.e. it will
only prevent GC if it contains data that is possibly shadowed by the
expired tombstone (after checking for key presence and timestamp).
Memtable will usually have a small subset of keys in largest tier,
so after this change, a large fraction of keys containing expired
tombstones can be GCed when memtable contains old data.
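The effect of the key-presence check can be sketched as follows. The types and function names are hypothetical and the model is much simpler than the real compaction code: each data source, whether a memtable or an sstable outside the compaction, only constrains purging of keys it actually contains.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <set>
#include <string>
#include <vector>

using timestamp = std::int64_t;

// Stand-in for a memtable or an sstable that is not being compacted.
struct data_source {
    std::set<std::string> keys;
    timestamp min_timestamp;
};

// Only sources that contain the key can hold data shadowed by its tombstone,
// so only those lower the max purgeable timestamp.
timestamp max_purgeable(const std::vector<data_source>& sources, const std::string& key) {
    assert(!key.empty());
    timestamp result = std::numeric_limits<timestamp>::max();
    for (const data_source& s : sources) {
        if (s.keys.count(key) > 0) {
            result = std::min(result, s.min_timestamp);
        }
    }
    return result;
}

// An expired tombstone may be dropped only if it is older than any data it
// could still shadow elsewhere.
bool can_purge(timestamp tombstone_ts, timestamp max_purgeable_ts) {
    return tombstone_ts < max_purgeable_ts;
}
```

In this model, a memtable holding old data for key `k2` no longer blocks garbage collection of an expired tombstone for key `k1`.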
Fixes #17599.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#17835
in gossiping_property_file_snitch_test, we use
`BOOST_REQUIRE_EQUAL(dc_racks[i], dc_racks[0])` to check the equality
of two instances of `pair<sstring, sstring>`, like:
```c++
BOOST_REQUIRE_EQUAL(dc_racks[i], dc_racks[0])
```
since the standard library does not provide the formatter for printing
`std::pair<>`, we rely on the homebrew generic formatter to
print `std::pair<>`, which in turn uses operator<< to format the
elements in the `pair`, but we intend to remove this formatter
in future, as the last step of #13245 .
so in order to enable Boost.test to print out lhs and rhs when
`BOOST_REQUIRE_EQUAL` check fails, we are adding
`boost_test_print_type()` for `pair<sstring,sstring>`. the helper
function uses {fmt} to print the `pair<>`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17831
Those nodes will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those nodes' location (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on
their location. For example materialized views code which pairs base
replicas with view replicas filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order
to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacenter endpoints excludes them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.
We load topology information only for left nodes which are actually
referenced by any tablet. To achieve that, topology loading code
queries system.tablet for the set of hosts. This set is then passed to
system.topology loading method which decides whether to load
replica_state for a left node or not.
Will be used by topology loading code to determine which hosts are
needed in topology, even if they're in the left state. We want to load
only left nodes if they are referenced by any tablet, which may happen
temporarily until the replacement replica is rebuilt.
before this change, we rely on the homebrew generic formatter to
print unordered_set<>, which in turn uses operator<< to format the
elements in the `unordered_set`, but we intend to remove this formatter
in future, as the last step of #13245 .
so to enable Boost.test to print out lhs and rhs when `BOOST_REQUIRE_EQUAL`
check fails, we are adding `boost_test_print_type()` for
`unordered_set<fruit>`. the helper function uses {fmt} to print the
`unordered_set<>`, so we are adding a fmt::formatter for `fruit`, the
operator<< for this type is dropped, as it is not used anymore.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17813
Tablet transition would get stuck anyway for dead nodes, so it's not worth trying.
refs: #16372 (not fixes, because there are also repair transitions with the same problem)
Closes scylladb/scylladb#17796
* github.com:scylladb/scylladb:
topology_coordinator: Skip dead nodes when balancing tablets
test: Add test for load_balancer skiplist
tablet_allocator: Add skiplist to load_balancer
`inject_with_handler()` can have a shorter name, `inject()`, that matches all other injections in this class, and it can be non-template.
Closes scylladb/scylladb#17734
* github.com:scylladb/scylladb:
error_injection: De-template inject() with handler
error_injection: Overload inject() instead of inject_with_handler()
The test is inspired by the test_load_balancing_with_empty_node one and
verifies that when a node is skiplisted, balancer doesn't put load on it
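What the test expects from the balancer can be pictured with a small target-selection sketch. `pick_target` is a hypothetical helper, not the load_balancer API, and it reduces the balancer to choosing the least-loaded eligible node.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>

// Hypothetical helper: returns the least-loaded node that is not in the
// skiplist, or an empty string when every node is skipped.
std::string pick_target(const std::map<std::string, std::size_t>& load,
                        const std::unordered_set<std::string>& skiplist) {
    std::string best;
    std::size_t best_load = 0;
    for (const auto& entry : load) {
        assert(!entry.first.empty()); // empty name is reserved for "no target"
        if (skiplist.count(entry.first) > 0) {
            continue; // skiplisted nodes never receive load
        }
        if (best.empty() || entry.second < best_load) {
            best = entry.first;
            best_load = entry.second;
        }
    }
    return best;
}
```

An empty node would normally be the most attractive target, so skiplisting it is the interesting case: the sketch falls back to the next least-loaded node instead.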
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The new MX-native validator, which validates the index in tandem with the data file, was discovered to print false-positive errors, related to range-tombstones and promoted-index positions.
This series fixes that. But first, it refactors the scrub-related tests. These are currently dominated by boiler-plate code. They are hard to read and hard to write. In the first half of the series, a new `scrub_test` is introduced, which moves all the boiler-plate to a central place, allowing the tests to focus on just the aspect of scrub that is tested.
Then, all the bugs found in validate are fixed, and finally a new test is introduced, checking validate with a valid sstable.
Fixes: #16326
Closes scylladb/scylladb#16327
* github.com:scylladb/scylladb:
test/boost/sstable_compaction_test: add validation test with valid sstable
sstablex/mx/reader: validate(): print trace message when finishing the PI block
sstablex/mx/reader: validate(): make index-data PI position check message consistent
sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
sstablex/mx/reader: validate(): reset the current PI block on partition-start
sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
test/boost/sstable_compaction_test: fix indentation
test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
test/boost/sstable_compaction_test: introduce scrub_test_framework
test/lib/random_schema: add uncompatible_timestamp_generator()
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* column_definition
* column_mapping
* ordinal_column_id
* raw_view_info
* schema
* view_ptr
their operator<<:s are dropped. but operator<< for schema is preserved,
as we are still printing `seastar::lw_shared_ptr<const schema>` with
our homebrew generic formatter for `seastar::lw_shared_ptr<>`, which
uses operator<< to print the pointee.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17768
Just a cleanup -- replace do_with_cql_env + async with do_with_cql_env_thread
Closes scylladb/scylladb#17758
* github.com:scylladb/scylladb:
test/storage_proxy: Restore indentation after previous patch
test/storage_proxy: Use do_with_cql_env_thread()
One of the test cases explicitly wraps itself into async, but there's a
convenience helper for that already.
Indentation is deliberately left broken
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>