The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decomission command
with error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got empty nodes list for single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.
With this fix the test passes both with
--consistent-cluster-management option and
without it.
The specific unit test for this problem was added.
Fixes: #14184Closes#14673
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.
Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.
Fix it by using microseconds everywhere.
Fixes#14653Closes#14670
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.
The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.
At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
Closes#14663
* github.com:scylladb/scylladb:
test: test_node_isolation: use `ManagerClient.con_gen` to create CQL connection
test: manager_client: make `con_gen` for `ManagerClient.__init__` nonoptional
When repair writes a sstable to disk, we check if the sstable needs view
update processing. If yes, the sstable will be placed into the staging
dir for processing, with the _registration_sem semaphore to prevent too
many pending unprocessed sstables.
We have seen multiple cases in the field where view update processing is
inefficient and way too slow which blocks the base table repair to
finish on time.
This patch increases the registration_queue_size to a bigger number to
mitigate the problem that slow view update processing blocks repair.
It is better to have a consistent base table + inconsistent view table
than inconsistent base table + inconsistent view table.
Currently, sstables in staging dir are not compacted. So we could not
increase the _registration_sem with too big number to avoid accumulate
too many sstables.
The view_build_test.cc is updated to make the test pass.
Closes#14241
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.
The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.
At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
because `lw_shared_ptr::operator=(T&&)` was deprecated. we started to
have following waring:
```
/home/kefu/dev/scylladb/test/boost/statement_restrictions_test.cc:394:41: warning: 'operator=' is deprecated: call make_lw_shared<> and assign the result instead [-Wdeprecated-declarations]
394 | definition.column_specification = std::move(specification);
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:346:7: note: 'operator=' has been explicitly marked deprecated here
346 | [[deprecated("call make_lw_shared<> and assign the result instead")]]
| ^
1 warning generated.
```
so, in this change, we use the recommended way to update a lw_shared_ptr.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14648
`ManagerClient` is given a function that is used to create CQL
connections to the Scylla cluster. For some reason it was typed as
`Optional` even though it was never passed `None`. Fix it.
The DynamoDB documentation for the size() function claims that it only
works on paths (attribute names or references), but it actually works on
constants from the query (e.g., ":val") as well.
It turns out that Alternator supports this undocumented case already, but
gets the error path wrong: Usually, when size() is calculated on the data,
if the data has the wrong type of size() (e.g., an integer), the condition
simply doesn't match. But if the value comes from the query - it should
generate an error that the query is wrong - ValidationException.
This patch fixes this case, and also adds tests for it that pass on both
DynamoDB and Alternator (after this patch).
Fixes#14592
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14593
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14528
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes#14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14496
Due to wrong order of stopping of compaction services, shutdown needs
to wait until all compactions are complete, which may take really long.
Moreover, test version of compaction manager does not abort task manager,
which is strictly bounded to it, but stops its compaction module. This results
in tests waiting for compaction task manager's tasks to be unregistered,
which never happens.
Stopping and aborting of compaction manager and task manager's compaction
module are performed in a proper order.
Closes#14461
* github.com:scylladb/scylladb:
tasks: test: abort task manager when wrapped_compaction_manager is destructed
compaction: swap compaction manager stopping order
compaction: modify compaction_manager::stop()
The command is to print interesting and/or hard-to-get-by-hand info about individual tables
Closes#14635
* github.com:scylladb/scylladb:
test: Add 'scylla table' cmd test
scylla-gdb: Print table phased barriers
scylla-gdb: Add 'table' command
Performance tests such as `perf-fast-forward` are executed in our CI
environments in two steps (two invocations of the `scylla` process):
first by populating data directories (with `--populate` option), then by
running the actual test.
These tests are using `cql_test_env`, which did not load the previously
saved (in the populate step) Host ID of this node, but generated a new
one randomly instead.
In b39ca97919 we enabled
`consistent_cluster_management` by default. This caused the perf tests
to hang in `setup_group0` at `read_barrier` step. That's because Raft
group 0 was initialized with old configuration -- the one created during
the populate step -- but the Raft server was started with a newly
generated Host ID (which is used as the server's Raft ID), so the server
considered itself as being outside the configuration.
Fix this by reloading the Host ID from disk, simulating more closely the
behavior of main.cc initialization.
Fixes#14599Closes#14640
Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in its owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317.
To solve both problems, cleanup will now use multi range reader, to guarantee that it will only process the owned data and as a result skip unowned data. This results in cleanup scanning an owned range and then fast forwarding to the next one, until it's done with them all. This reduces significantly the amount of data in the index caching, as index will only be invoked at each range boundary instead.
Without further ado,
before:
`INFO 2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.`
after:
`INFO 2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.`
Fixes#12998.
Fixes#14317.
Closes#14469
* github.com:scylladb/scylladb:
test: Extend cleanup correctness test to cover more cases
compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
sstables: Close SSTable reader if index exhaustion is detected in fast forward call
sstables: Simplify sstable reader initialization
compaction: Extend make_sstable_reader() interface to work with mutation_source
test: Extend sstable partition skipping test to cover fast forward using token
This reverts commit 2a58b4a39a, reversing
changes made to dd63169077.
After patch 87c8d63b7a,
table_resharding_compaction_task_impl::run() performs the forbidden
action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard,
which is a data race that can cause a use-after-free, typically manifesting
as allocator corruption.
Note: before the bad patch, this was avoided by copying the _contents_ of the
lw_shared_ptr into a new, local lw_shared_ptr.
Fixes#14475Fixes#14618Closes#14641
with tagging ops, we will be able to attach kv pairs to an object.
this will allow us to mark sstable components with taggings, and
filter them based on them.
* test/pylib/minio_server.py: enable anonymous user to perform
more actions. because the tagging related ops are not enabled by
"mc anonymous set public", we have to enable them using "set-json"
subcommand.
* utils/s3/client: add methods to manipulate taggings.
* test/boost/s3_test: add a simple test accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14486
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.
Existing intended cases were annotated.
Closes#14607
Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.
Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).
Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
replaced or changed their IPs. These entries are still using status
`NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
`handle_state_normal` is also called for it). The
`_gossiper.wait_alive` call would wait for those entries too and
eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
may not be populated yet, which is required to do the IP<->Host ID
translations -- and populating `token_metadata` happens inside
`handle_state_normal`, so we have a chicken-and-egg problem here.
It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs. `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as parameter and is modified appropriately, it's now calculating
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).
Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chickend-and-egg problem.
This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.
We need a better way to calculate `sync_nodes` which detects ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.
In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.
We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.
We also introduce regression tests with necessary extensions
to the test framework.
Fixes#14468Fixes#14487Closes#14507
* github.com:scylladb/scylladb:
test: rename `test_topology_ip.py` to `test_replace.py`
test: test bootstrap after IP change
test: scylla_cluster: return the new IP from `change_ip` API
test: node replace with `ignore_dead_nodes` test
test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
storage_service: remove `get_nodes_to_sync_with`
storage_service: use `token_metadata` to calculate nodes waited for to be UP
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair we iterate over all (shard, table, range), read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for small num_tokens (16 in the tests) the number of affected ranges (those we need to consider) amounts to total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard, there are `parallel_for_each` in code, but they are throttled by the choosen memory constraints and in fact execute sequentially.
When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If sharder parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard, this is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contain data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly becuse of `smp::submit_to`.
In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. We do this if the estimated number of partitions is small (<=100).
This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`:
```
with this patch
20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2...
20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859
without this patch
21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2...
21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140
```
Fixes: #14093Closes#14178
* github.com:scylladb/scylladb:
repair_test: add test_reader_with_different_strategies
repair: extract repair_reader declaration into reader.hh
repair_meta: get_estimated_partitions fix
repair_meta: use multishard_filter reader if the number of partitions is small
repair_meta: delay _repair_reader creation
database.hh: make_multishard_streaming_reader with range parameter
database.cc: extract streaming_reader_lifecycle_policy
When task manager is not aborted, the tasks are stored in the memory,
not allowing the tasks' gate to be closed.
When wrapped_compaction_manager is destructed, task manager gets
aborted, so that system could shutdown.
The test has about 1/2500000 chance to fail due to a conflict of random
values. And it recently did, just to spite us.
Fight back.
Fixes#14563Closes#14576
The CDC generation data can be large and not fit in a single command.
This pr splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.
Commands are sent sequentially to avoid concurrency problems.
Topology snapshots contain only mutation of current CDC generation data
but don't contain any previous or future generations. If a new
generation of data is being broadcasted but hasn't been entirely applied
yet, the applied part won't be sent in a snapshot. New or delayed nodes
can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.
A mechanism to remove old CDC generations will be introduced as a
follow-up.
Closes#13962
* github.com:scylladb/scylladb:
test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
raft: add release_guard
raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
raft topology: make `mutation_size_threshold` depends on `max_command_size`
raft: reduce max batch size of raft commands and raft entries
raft: add description argument to add_entry_unguarded
raft: introduce `write_mutations` command
raft: refactor `topology_change` applying
Avoid pinging self in direct failure detector, this adds confusing noise and adds constant overhead.
Fixes#14388Closes#14558
* github.com:scylladb/scylladb:
direct_fd: do not ping self
raft: initialize raft_group_registry with host id early
raft: code cleanup
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size`
is limited to less than 1 MB. It adds an injection which copies mutations
generated by `get_cdc_generation_mutations` n times, where n is picked that
the memory size of all mutations exceeds `max_command_size`.
This test passes if cdc generation data is committed by raft in multiple commands.
If all the data is committed in a single command, the leader node will loop trying
to send raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information, it
can exceed this threshold resulting in throwing an exception when writing
mutation to the commitlog.
A batch of raft commands has the size at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It doesn't have additional metadata, but
it may have a size of exactly `max_mutation_size`. It shouldn't make any trouble,
but it is prefered to be careful.
Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.
Fixed typo "1204" => "1024".
Earlier, when local query processor wasn't available at
the beginning of system start, we couldn't query our own
host id when initializing the raft group registry. The local
host id is needed by the registry since it is responsible
to route RPC messages to specific raft groups, and needs
to reject messages destined to a different host.
Now that the host id is known early at boot, remove the optional
and pass host id in the constructor. Resolves an earlier fixme.
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.
This PR adds the `inject_with_handler` method that executes injected function and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.
Closes#14357.
Closes#14460
* github.com:scylladb/scylladb:
tests: introduce InjectionHandler class for communicating with injected code
api/error_injection: add message_injection endpoint
tests: utils: error injections: add test for inject_with_handler
utils: error injection: add inject_with_handler for interactions with injected code
utils: error injection: create structure for error injections data
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
Regression test for #14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.
In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Fixes https://github.com/scylladb/scylladb/issues/14503Closes#14502
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
in this series, test/object_storage is restructured into a pytest based test. this paves the road to a test suites covers more use cases. so we can some more lower-level tests for tiered/caching-store.
Closes#14165
* github.com:scylladb/scylladb:
s3/test: do not return ip in managed_cluster()
s3/test: verify the behavior with asserts
s3/test: restructure object_store/run into a pytest
s3/test: extract get_scylla_with_s3_cmd() out
s3/test: s/restart_with_dir/kill_with_dir/
s3/test: vendor run_with_dir() and friends
s3/test: remove get_tempdir()
s3/test: extract managed_cluster() out
let's just use cluster.contact_points for retrieving the IP address
of the scylla node in this single-node cluster. so the name of
managed_cluster() is less weird.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using a single run to perform the test, restructure
it into a pytest based test suite with a single test case.
this should allow us to add more tests exercising the object-storage
and cached/tierd storage in future.
* add fixtures so they can be reused by tests
* use tmpdir fixture for managing the tmpdir, see
https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture
* perform part of the teardown in the "test_tempdir()" fixture
* change the type of test from "Run" to "Python"
* rename "run" to "test_basic.py"
* optionally start the minio server if the settings are not
found in command line or env variables, so that the tests are
self-contained without the fixture setup by test.py.
* instead of sys.exit(), use assert statement, as this is
what pytest uses.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>