The test is failing in CI sometimes due to performance reasons.
There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
reclaim, the `background_reclaim` scheduling group isn't actually
guaranteed to get any CPU, regardless of shares. If the process is
switched out inside the `background_reclaim` group, it might
accumulate so much vruntime that it won't get any more CPU again
for a long time.
We have seen both.
This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.
So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test). After the patch,
we only check that the background reclaim happens *eventually*.
Fixes https://github.com/scylladb/scylladb/issues/15677
Backporting this is optional. The test is flaky even in stable branches, but the failure is rare.
- (cherry picked from commit c47f438db3)
- (cherry picked from commit 1c1741cfbc)
Parent PR: #24030Closesscylladb/scylladb#24092
* github.com:scylladb/scylladb:
logalloc_test: don't test performance in test `background_reclaim`
logalloc: make background_reclaimer::free_memory_threshold publicly visible
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at a time.
Move stream_manager initialization over the storage_service initialization.
Fixes: #23207.
Closesscylladb/scylladb#24008
(cherry picked from commit 9c03255fd2)
Closesscylladb/scylladb#24189
The test is failing in CI sometimes due to performance reasons.
There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
reclaim, the `background_reclaim` scheduling group isn't actually
guaranteed to get any CPU, regardless of shares. If the process is
switched out inside the `background_reclaim` group, it might
accumulate so much vruntime that it won't get any more CPU again
for a long time.
We have seen both.
This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.
So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test).
(cherry picked from commit 1c1741cfbc)
Each view update is correlated to a write that generates it (aside from view
building which is throttled separately). These writes are limited by a throttling
mechanism, which effectively works by performing the writes with CL=ALL if
ongoing writes exceed some memory usage limit
When writes generate view updates, they usually also need to perform a read. This read
goes through a read concurrency semaphore where it can get delayed or killed. The
semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue.
If the number of queued reads exceeds a specific limit, the view update will fail on
the replica, causing inconsistencies.
This limit is not necessary. When a read gets queued on the semaphore, the write that's
causing the view update is paused, so the write takes part in the regular write throttling.
If too many writes get stuck on view update reads, they will get throttled, so their
number is limited and the number of queued reads is also limited to the same amount.
In this patch we remove the specified queue length limit for the view update read concurrency
semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write
throttling mechanism. This may allow the queue grow longer than with the previous limit, but
it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the
remaining ones that get queued use a tiny amount of memory, less than the writes that generated
them and which are getting limited directly.
Fixes https://github.com/scylladb/scylladb/issues/23319Closesscylladb/scylladb#24112
(cherry picked from commit 5920647617)
Closesscylladb/scylladb#24168
We're reducing the log level in case the provided property file is incomplete.
The rationale behind this change is related to how CCM interacts with Scylla:
* The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties`
configuration every 60 seconds.
* When a new node is added to the cluster, CCM recreates the
`cassandra-rackdc.properties` file for EVERY node.
If those two processes start happening at about the same time, it may lead
to Scylla trying to read a not-completely-recreated file, and an error will
be produced.
Although we would normally fix this issue and try to avoid the race, that
behavior will be no longer relevant as we're making the rack and DC values
immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem
in the older versions of Scylla could bring a more serious regression. Having
that in mind, this commit is a compromise between making CI less flaky and
having minimal impact when backported.
We do the same for when the format of the file is invalid: the rationale
is the same.
We also do that for when there is a double declaration. Although it seems
impossible that this can stem from the same scenario the other two errors
can (since if the format of the file is valid, the error is justified;
if the format is invalid, it should be detected sooner than a doubled
declaration), let's stay consistent with the logging level.
Fixesscylladb/scylladb#20092Closesscylladb/scylladb#23956
(cherry picked from commit 9ebd6df43a)
Closesscylladb/scylladb#24142
Materialized Views and Secondary Indexes are yet another features that
keyspaces with tablets do not support, but these were not listed in a
warning message returned to the user on CREATE KEYSPACE statement. This
commit adds the 2 missing features.
Fixes: #24006Closesscylladb/scylladb#23902
(cherry picked from commit f740f9f0e1)
Closesscylladb/scylladb#24079
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.
Make sure that all the try..catch blocks in the topology coordinator
which:
- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message
...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.
Fixes: scylladb/scylladb#22649Closesscylladb/scylladb#23962
(cherry picked from commit 156ff8798b)
Closesscylladb/scylladb#24078
This PR improves and refactors the test.topology.util new_test_keyspace generator
and adds a corresponding create_new_test_keyspace function to be used by most if not
all topology unit tests in order to standardize the way the tests create keyspaces
and to mitigate the python driver create keyspace retry issue: https://github.com/scylladb/python-driver/issues/317Fixes#22342Fixes#21905
Refs https://github.com/scylladb/scylla-enterprise/issues/5060Fixes#23699
- (cherry picked from commit 50ce0aaf1c)
- (cherry picked from commit 5d448f721e)
- (cherry picked from commit f946302369)
- (cherry picked from commit 0fd1b846fe)
- (cherry picked from commit a66ddb7c04)
- (cherry picked from commit df84097a4b)
- (cherry picked from commit 59687c25e0)
- (cherry picked from commit fdb339bf28)
- (cherry picked from commit 205ed113dd)
- (cherry picked from commit 57faab9ffa)
- (cherry picked from commit 4fefffe335)
- (cherry picked from commit 480a5837ab)
- (cherry picked from commit fed078a38a)
- (cherry picked from commit c6653e65ba)
- (cherry picked from commit 9c095b622b)
- (cherry picked from commit 0668c642a2)
- (cherry picked from commit 0e11aad9c5)
- (cherry picked from commit ef85c4b27e)
- (cherry picked from commit b13e48b648)
- (cherry picked from commit a82e734110)
- (cherry picked from commit 629ee3cb46)
- (cherry picked from commit 42a104038d)
- (cherry picked from commit d5e3c578f5)
- (cherry picked from commit c05794c156)
- (cherry picked from commit 966cf82dae)
- (cherry picked from commit 11005b10db)
- (cherry picked from commit ff9c8428df)
- (cherry picked from commit 55b35eb21c)
- (cherry picked from commit 5759a97eb4)
- (cherry picked from commit c68d2a471c)
- (cherry picked from commit e05372afa4)
- (cherry picked from commit 380c5e5ac8)
- (cherry picked from commit 3f35491264)
- (cherry picked from commit e72a9d3faa)
- (cherry picked from commit 47326d01b7)
- (cherry picked from commit 72bc4016e7)
- (cherry picked from commit 4fd6c2d24e)
- (cherry picked from commit 50a8f5c1c0)
- (cherry picked from commit 005ceb77d3)
- (cherry picked from commit 649e68c6db)
- (cherry picked from commit 0b88ea9798)
- (cherry picked from commit 6b37d04aa9)
- (cherry picked from commit e59aca66bf)
- (cherry picked from commit 5ff3153912)
- (cherry picked from commit 20f7eda16e)
- (cherry picked from commit f30e4c6917)
- (cherry picked from commit 96d327fb83)
- (cherry picked from commit 16ef78075c)
- (cherry picked from commit 2d4af01281)
- (cherry picked from commit b810791fbb)
- (cherry picked from commit 46b1850f0c)
- (cherry picked from commit 0564e95c51)
- (cherry picked from commit 12f85ce57c)
- (cherry picked from commit 9829b1594f)
- (cherry picked from commit cbe79b20f7)
- (cherry picked from commit cc281ff88d)
Parent PR: #22399Closesscylladb/scylladb#23408
* github.com:scylladb/scylladb:
test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace
test/repair: create_table_insert_data_for_repair: create keyspace with unique name
topology_tasks/test_tablet_tasks: use new_test_keyspace
topology_tasks/test_node_ops_tasks: use new_test_keyspace
topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace
topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace
topology_custom/test_view_build_status: use new_test_keyspace
topology_custom/test_truncate_with_tablets: use new_test_keyspace
topology_custom/test_topology_failure_recovery: use new_test_keyspace
topology_custom/test_tablets_removenode: use create_new_test_keyspace
topology_custom/test_tablets_migration: use new_test_keyspace
topology_custom/test_tablets_merge: use new_test_keyspace
topology_custom/test_tablets_intranode: use new_test_keyspace
topology_custom/test_tablets_cql: use new_test_keyspace
topology_custom/test_tablets2: use *new_test_keyspace
topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function
test/cluster/test_tablets.py: Fix test errorneous indentation
topology_custom/test_tablets: use new_test_keyspace
topology_custom/test_table_desc_read_barrier: use new_test_keyspace
topology_custom/test_shutdown_hang: use new_test_keyspace
topology_custom/test_select_from_mutation_fragments: use new_test_keyspace
topology_custom/test_rpc_compression: use new_test_keyspace
topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace
topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace
topology_custom/test_raft_no_quorum: use new_test_keyspace
topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace
topology_custom/test_query_rebounce: use new_test_keyspace
topology_custom/test_not_enough_token_owners: use new_test_keyspace
topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace
topology_custom/test_node_isolation: use create_new_test_keyspace
topology_custom/test_mv_topology_change: use new_test_keyspace
topology_custom/test_mv_tablets_replace: use new_test_keyspace
topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace
topology_custom/test_mv_tablets: use new_test_keyspace
topology_custom/test_mv_read_concurrency: use new_test_keyspace
topology_custom/test_mv_fail_building: use new_test_keyspace
topology_custom/test_mv_delete_partitions: use new_test_keyspace
topology_custom/test_mv_building: use new_test_keyspace
topology_custom/test_mv_backlog: use new_test_keyspace
topology_custom/test_mv_admission_control: use new_test_keyspace
topology_custom/test_major_compaction: use new_test_keyspace
topology_custom/test_maintenance_mode: use new_test_keyspace
topology_custom/test_lwt_semaphore: use new_test_keyspace
topology_custom/test_ip_mappings: use new_test_keyspace
topology_custom/test_hints: use new_test_keyspace
topology_custom/test_group0_schema_versioning: use new_test_keyspace
topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace
topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace
topology_custom/test_read_repair: use new_test_keyspace
topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace
topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace
topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace
topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace
test/topology/util: new_test_keyspace: drop keyspace only on success
test/topology/util: refactor new_test_keyspace
test/topology/util: CREATE KEYSPACE IF NOT EXISTS
test/topology/util: new_test_keyspace: accept ManagerClient
And create_new_test_keyspace when we need drop
to be explicit.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e59aca66bf)
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.
Closesscylladb/scylladb#23659
(cherry picked from commit 0ed21d9cc1)
Using the new_test_keyspace fixture is awkward for this test
as it is written to explicitly drop the created keyspaces
at certain points.
Therefore, just use create_new_test_keyspace to standardize the
creation procedure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e72a9d3faa)
new_test_keyspace is problematic here since
the presence of the banned node can fail the automatic drop of
the test keyspace due to NoHostAvailable (in debug mode for
some reason)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 55b35eb21c)