Fixes https://github.com/scylladb/scylladb/issues/14565
This commit improves the description of ScyllaDB configuration
via User Data on AWS.
- The info about experimental features and developer mode is removed.
- The description of User Data is fixed.
- The example in User Data is updated.
- The broken link is fixed.
Closes #14569
The CDC generation data can be large and not fit in a single command.
This PR splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate
group 0 command.
Commands are sent sequentially to avoid concurrency problems.
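The splitting-and-sequential-send idea can be sketched as follows (a minimal Python illustration; `split_by_threshold` and `send_command` are hypothetical names, not Scylla's API):

```python
# Pack items into batches so that each batch stays under a size threshold,
# then "send" the batches one at a time to avoid concurrency problems.

def split_by_threshold(mutations, threshold):
    """Group (size, payload) pairs into batches whose total size <= threshold.

    A single mutation larger than the threshold still gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for size, payload in mutations:
        if current and current_size + size > threshold:
            batches.append(current)
            current, current_size = [], 0
        current.append(payload)
        current_size += size
    if current:
        batches.append(current)
    return batches

def send_sequentially(batches, send_command):
    # One command at a time: the next batch is sent only after the
    # previous one has been handed off.
    for batch in batches:
        send_command(batch)

batches = split_by_threshold([(3, 'a'), (3, 'b'), (5, 'c'), (2, 'd')], 6)
# 'a' and 'b' fit together under the threshold; 'c' and 'd' each start a new batch.
```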
Topology snapshots contain only the mutation of the current CDC generation
data but don't contain any previous or future generations. If a new
generation of data is being broadcast but hasn't been entirely applied
yet, the applied part won't be sent in a snapshot. New or delayed nodes
can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.
A mechanism to remove old CDC generations will be introduced as a
follow-up.
Closes #13962
* github.com:scylladb/scylladb:
test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
raft: add release_guard
raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
raft topology: make `mutation_size_threshold` depend on `max_command_size`
raft: reduce max batch size of raft commands and raft entries
raft: add description argument to add_entry_unguarded
raft: introduce `write_mutations` command
raft: refactor `topology_change` applying
Avoid pinging self in the direct failure detector; it adds confusing noise and constant overhead.
Fixes #14388
Closes #14558
* github.com:scylladb/scylladb:
direct_fd: do not ping self
raft: initialize raft_group_registry with host id early
raft: code cleanup
This test limits `commitlog_segment_size_in_mb` to 2, so `max_command_size`
is limited to less than 1 MB. It adds an injection that copies the mutations
generated by `get_cdc_generation_mutations` n times, where n is chosen so that
the total memory size of all mutations exceeds `max_command_size`.
The test passes if the CDC generation data is committed by Raft in multiple
commands. If all the data is committed in a single command, the leader node
will loop trying to send the raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
When the topology_coordinator fiber gets `raft::commit_status_unknown`, it
prints an error. This exception is not an error in this case: it can be
thrown when the leader has changed, for example in `add_entry_unguarded`
while sending a part of the CDC generation data in the `write_mutations` command.
Catch this exception in `topology_coordinator::run` and print a warning instead.
Broadcasts all mutations returned from `prepare_new_cdc_generation_data`
except the last one. Each mutation is sent in a separate raft command. It takes
a `group0_guard`; if the number of mutations is greater than one, the guard
is dropped and a new one is created and returned, otherwise the old one is
returned. Commands are sent in parallel and unguarded (the guard used for
sending the last mutation guarantees that the term hasn't changed).
Returns the generation's UUID, the guard and the last mutation, which will be
sent with additional topology data by the caller.
If we sent the last mutation in a `write_mutations` command too, we would use a
total of `n + 1` commands instead of `(n - 1) + 1` (where `n` is the number of
mutations), so it's better to send it in `topology_change` (it has to be sent
after all `write_mutations` anyway) together with some small metadata.
With the default commitlog segment size, `mutation_size_threshold` will be 4 MB.
In large clusters (e.g. 100 nodes, 64 shards per node, 256 vnodes), CDC generation
data can reach a size of 30 MB, so there will be no more than 8 commands.
In a multi-DC cluster with 100 ms latencies between DCs, this operation should
take about 200 ms, since we send the commands concurrently. Even if the commands
were replicated sequentially by Raft, it should take no more than 1.6 s, which is
incomparably smaller than the bootstrap operation (bootstrapping is quick if there
is no data in the cluster, but a cluster of 100 nodes usually holds tons of data,
so streaming/repair will take much longer: hours or days).
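A quick sanity check of the arithmetic above, using the sizes and latencies quoted in the text:

```python
import math

MB = 1024 * 1024
mutation_size_threshold = 4 * MB   # with the default commitlog segment size
cdc_generation_data = 30 * MB      # large-cluster estimate from the text

# 30 MB of data at 4 MB per command needs ceil(30 / 4) = 8 commands.
commands = math.ceil(cdc_generation_data / mutation_size_threshold)
assert commands == 8

round_trip = 0.2                          # seconds, for 100 ms inter-DC latency
concurrent_cost = round_trip              # commands replicated in parallel
sequential_cost = commands * round_trip   # worst case: one after another
assert sequential_cost == 1.6
```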
Fixes a FIXME in PR #13683.
If `group0_state_machine` applies all commands individually (without batching),
the resulting current `state_id` -- which will be compared with the
`prev_state_id` of the next command if it is a guarded command -- equals the
maximum of the `next_state_id` of all commands applied up to this point.
That's because the current `state_id` is obtained from the history table by
taking the row with the largest clustering key.
When `group0_state_machine::apply` is called with a batch of commands, the
current `state_id` is loaded from `system.group0_history` to `merger::last_group0_state_id`
only once. When a command is merged, its `next_state_id` overwrites
`last_group0_state_id`, regardless of their order.
Let's consider the following situation:
The leader sends two unguarded `write_mutations` commands concurrently, with
timeuuids T1 and T2, where T1 < T2. The leader waits for them to be applied and
then sends a guarded `topology_change` with `prev_state_id` equal to T2.
Suppose that the command with timeuuid T2 is committed first, and these commands
are small enough that both `write_mutations` can be merged into one command.
Some followers can get all three of these commands before their `fsm` polls them.
In this situation, `group0_state_machine::apply` is called with all three of
them and `merger` will merge both `write_mutations` into one command. After that,
`merger::last_group0_state_id` will be equal to T1 (that command was committed
second). When it processes the `topology_change` command, it will compare its
`prev_state_id` with `merger::last_group0_state_id` and make this command a
no-op (which wouldn't happen if the commands were applied individually).
Such a scenario leads to inconsistent results: one replica applies `topology_change`,
but another makes it a no-op.
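The difference between overwriting the state id and taking the maximum can be modelled in a few lines (an illustrative model, not Scylla code; timeuuids are modelled as ordered integers):

```python
# Two unguarded commands with state ids T1 < T2 are committed in the
# order [T2, T1]. Taking the *last* merged state id yields T1, so a guarded
# command with prev_state_id == T2 is wrongly rejected; taking the *max*
# matches the unbatched behaviour, where the current state_id comes from
# the largest clustering key in the history table.

T1, T2 = 1, 2
commit_order = [T2, T1]            # T2 wins the commit race

last_merged = commit_order[-1]     # buggy: each merged command overwrites
max_merged = max(commit_order)     # fixed: maximum over all merged commands

prev_state_id = T2                 # the guarded topology_change expects T2
assert last_merged != prev_state_id    # batched apply turns it into a no-op
assert max_merged == prev_state_id     # max restores per-command semantics
```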
Topology snapshots contain only the mutation of the current CDC generation data
but don't contain any previous or future generations. If a new generation of data
is being broadcast but hasn't been entirely applied yet, the applied part won't
be sent in a snapshot. In this scenario, new or delayed nodes can never get the
applied part.
Send the entire cdc_generations_v3 table in the snapshot to resolve this problem.
As a follow-up, a mechanism to remove old CDC generations will be introduced.
`get_cdc_generation_mutations` splits data into mutations of maximal size
`mutation_size_threshold`. Before this commit, it was hardcoded to 2 MB.
Calculate `mutation_size_threshold` to leave space for the CDC generation
data without exceeding `max_command_size`.
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information a
mutation can exceed this threshold, resulting in an exception when writing the
mutation to the commitlog.
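The constraint being enforced can be sketched as follows (illustrative numbers; the 1 MB metadata margin and the `fits` helper are assumptions, not the commit's actual formula):

```python
# Whatever threshold is chosen, a command carrying one mutation plus its
# metadata must stay within max_command_size:
#     threshold + metadata_overhead <= max_command_size

MB = 1024 * 1024

def fits(threshold, metadata_overhead, max_command_size):
    return threshold + metadata_overhead <= max_command_size

max_command_size = 16 * MB   # half of a 32 MB commitlog segment

# A conservative threshold leaves ample room for metadata.
assert fits(4 * MB, 1 * MB, max_command_size)
# Letting a mutation grow to the full command size leaves no room at all,
# which is the failure mode the commit guards against.
assert not fits(16 * MB, 1 * MB, max_command_size)
```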
A batch of raft commands has a size of at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It carries no additional metadata, but
it may have a size of exactly `max_mutation_size`. That shouldn't cause any
trouble, but it is preferable to be careful.
Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.
Fixed typo "1204" => "1024".
Provide useful description for `write_mutations` and
`broadcast_tables_query` that is stored in `system.group0_history`.
Reduces scope of issue #13370.
Fixes https://github.com/scylladb/scylladb/issues/13877
This commit adds the information about Rust CDC Connector
to the documentation. All relevant pages are updated:
the ScyllaDB Rust Driver page, and other places in
the docs where Java and Go CDC connectors are mentioned.
In addition, the drivers table is updated to indicate
Rust driver support for CDC.
Closes #14530
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory, so it won't be empty and
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:
```
WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```
It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.
And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.
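The errno subtlety is easy to demonstrate on Linux with plain Python, independent of Scylla:

```python
# On Linux, removing a non-empty directory fails with ENOTEMPTY, not EEXIST,
# although POSIX permits either. Robust cleanup code has to accept both.

import errno
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "snapshot"), "w").close()   # the directory is not empty

try:
    os.rmdir(d)
    raised = None
except OSError as e:
    raised = e.errno                             # ENOTEMPTY on Linux

# Treat both errno values as the benign "directory still has snapshots" case.
benign = raised in (errno.ENOTEMPTY, errno.EEXIST)
assert benign

# Tidy up the temporary directory.
os.remove(os.path.join(d, "snapshot"))
os.rmdir(d)
```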
To confirm that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, then run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.
Fixes#13538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14557
DEFAULT_MIN_SSTABLE_SIZE is defined as `50L * 1024L * 1024L`
which is 50 MB, not 50 bytes.
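For reference, the arithmetic:

```python
# 50L * 1024L * 1024L is 50 MiB expressed in bytes, not 50 bytes.
DEFAULT_MIN_SSTABLE_SIZE = 50 * 1024 * 1024
assert DEFAULT_MIN_SSTABLE_SIZE == 52_428_800
```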
Fixes #14413
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14414
We have plenty of code marked with #if 0. Once it was an indication
of missing functionality, but the code has evolved so much it's
useless as an indication and only a distraction.
Delete it.
Closes #14511
Fixes https://github.com/scylladb/scylladb/issues/14459
This PR removes the (dead) link to the unirestore tool in a private repository. In addition, it adds minor language improvements.
Closes #14519
* github.com:scylladb/scylladb:
doc: minor language improvements on the Migration Tools page
doc: remove the link to the private repository
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific endpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.
This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:
1. Empty string (the default) leaves the current behavior -
DescribeEndpoints echos the request's "Host" header.
2. The string "disabled" disables the DescribeEndpoints (it will return
an UnknownOperationException). This is how DynamoDB Local behaves,
and the AWS C++ SDK and the Dynamodb Shell work well in this mode.
3. Any other string is a fixed string to be returned by DescribeEndpoints.
It can be useful in setups that should return a known address.
Note that this patch does not, by default, change the current behavior
of DescribeEndpoints. But it lets us override its behavior in the future,
without code changes, if a user experiences problems in the field.
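The three modes can be summarized as a small dispatch function (a sketch of the behavior described above, not the actual implementation):

```python
# Model of the "alternator_describe_endpoints" configuration option.

class UnknownOperationException(Exception):
    """What DynamoDB Local returns when DescribeEndpoints is disabled."""

def describe_endpoints(config_value, request_host):
    if config_value == "":           # default: echo the request's Host header
        return request_host
    if config_value == "disabled":   # behave like DynamoDB Local
        raise UnknownOperationException()
    return config_value              # any other string: a fixed address

assert describe_endpoints("", "db.example.com:8000") == "db.example.com:8000"
assert describe_endpoints("alternator.internal:8000", "x") == "alternator.internal:8000"
```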
Fixes #14410.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14432
Earlier, when the local query processor wasn't available at
the beginning of system start, we couldn't query our own
host id when initializing the raft group registry. The local
host id is needed by the registry since it is responsible
for routing RPC messages to specific raft groups and needs
to reject messages destined for a different host.
Now that the host id is known early at boot, remove the optional
and pass host id in the constructor. Resolves an earlier fixme.
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.
This PR adds the `inject_with_handler` method that executes the injected function and passes an `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.
Closes #14357.
Closes #14460
* github.com:scylladb/scylladb:
tests: introduce InjectionHandler class for communicating with injected code
api/error_injection: add message_injection endpoint
tests: utils: error injections: add test for inject_with_handler
utils: error injection: add inject_with_handler for interactions with injected code
utils: error injection: create structure for error injections data
Currently, it is hard for injected code to wait for some events, for example,
requests on some REST endpoint.
This commit adds the `inject_with_handler` method that executes the injected
function and passes an `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers
associated with the injection on all shards about the received message.
There is a counter of received messages, `received_messages_counter`; it is shared
between the `injection_data`, which is created once when enabling an injection on
a given shard, and all `injection_handler`s, which are created separately for each
firing of this injection. The counter is incremented when a message is received from
the REST endpoint, and the condition variable is signaled.
Each `injection_handler` (separate for each firing) stores its own private counter,
`_read_messages_counter`. That private counter is incremented whenever we wait for a
message and is compared to the received counter; we sleep on the condition variable
if not enough messages were received.
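The counter scheme can be modelled in a simplified, single-threaded form (class and field names mirror the text; the real implementation uses a seastar condition variable rather than polling):

```python
# Shared state created once per shard when the injection is enabled.
class InjectionData:
    def __init__(self):
        self.received_messages_counter = 0

    def receive_message(self):
        # Called when the REST endpoint delivers a message; the real code
        # also signals a condition variable here.
        self.received_messages_counter += 1

# Created separately for each firing of the injection.
class InjectionHandler:
    def __init__(self, data):
        self.data = data
        self._read_messages_counter = 0   # private, per-firing counter

    def try_wait_for_message(self):
        # The real code sleeps on the condition variable until enough
        # messages arrive; here we just report whether one is available.
        if self._read_messages_counter < self.data.received_messages_counter:
            self._read_messages_counter += 1
            return True
        return False

data = InjectionData()
handler = InjectionHandler(data)
assert not handler.try_wait_for_message()   # nothing received yet
data.receive_message()
assert handler.try_wait_for_message()       # consumes the received message
assert not handler.try_wait_for_message()   # no further messages pending
```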
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way: when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
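A toy model of the close-and-reopen behavior (positions are plain integers; the fragment kinds and the fixed-count flush policy are simplified assumptions, not the real buffering logic):

```python
def split_fragments(fragments, flush_after):
    """Split a fragment stream into chunks of roughly flush_after fragments.

    fragments are ('row', pos), ('rt_start', pos) or ('rt_end', pos) tuples.
    If a flush happens while a range tombstone is open, close it at the last
    seen position and reopen it at the same position in the next chunk, so
    every chunk is a well-formed fragment sequence on its own.
    """
    chunks, current, open_rt = [], [], None
    for kind, pos in fragments:
        if kind == 'rt_start':
            open_rt = pos
        elif kind == 'rt_end':
            open_rt = None
        current.append((kind, pos))
        if len(current) >= flush_after:
            if open_rt is not None:
                current.append(('rt_end', pos))   # close before the flush
                chunks.append(current)
                current = [('rt_start', pos)]     # reopen at the same position
            else:
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    return chunks

chunks = split_fragments(
    [('rt_start', 1), ('row', 2), ('row', 3), ('rt_end', 4)], flush_after=2)
# Every chunk now opens and closes its own copy of the range tombstone.
```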
Fixes https://github.com/scylladb/scylladb/issues/14503
Closes #14502
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_update_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
This change has no impact on `build.ninja` generated by `configure.py`,
as we are using a `set` for tracking the tests to be built. But it's
still an improvement, as we should not add duplicated entries to a set
when initializing it.
There are two occurrences of `test/boost/double_decker_test`; the one
grouped with the local cluster of collection tests (bptree, btree,
radix_tree and double_decker) is preserved.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14478
In this series, test/object_storage is restructured into a pytest-based test. This paves the road to a test suite covering more use cases, so we can add more lower-level tests for the tiered/caching store.
Closes #14165
* github.com:scylladb/scylladb:
s3/test: do not return ip in managed_cluster()
s3/test: verify the behavior with asserts
s3/test: restructure object_store/run into a pytest
s3/test: extract get_scylla_with_s3_cmd() out
s3/test: s/restart_with_dir/kill_with_dir/
s3/test: vendor run_with_dir() and friends
s3/test: remove get_tempdir()
s3/test: extract managed_cluster() out
Currently we hold the group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls, which are made
by the query_processor before it calls execute(). We need to cover those
calls with the group0_guard as well, and also move the retry loop up. This
patch does it by introducing a new function, take_guard(), to the
cql_statement class. Schema-altering statements return a group0 guard while
others do not return any guard. The query processor takes this guard at the
beginning of statement execution and retries if service::group0_concurrent_modification
is thrown. The guard is passed to execute() in the query_state structure.
Fixes: #13942
Message-Id: <ZJ2aeNIBQCtnTaE2@scylladb.com>
There is a chance that minio_server is not ready to serve right after
launching the server executable process, so we need to retry until
the first "mc" command is able to talk to it.
In this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out; it also allows us to
ignore the failure or specify the timeout. This should make sure the
minio server is ready before tests start to connect to it.
Also, in this change, instead of hardwiring the alias "local" in the code,
define a variable for it; less repetition this way.
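The retry idea, reduced to a generic sketch (`run_mc` stands in for invoking the minio client; all names are illustrative, not the test/pylib API):

```python
import time

def retry_until_ready(run_mc, timeout=30.0, interval=0.5, sleep=time.sleep):
    """Keep retrying run_mc until it succeeds or the timeout elapses.

    A ConnectionError is treated as "server not ready yet"; any attempt
    after the deadline re-raises the last error.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return run_mc()
        except ConnectionError:
            if time.monotonic() >= deadline:
                raise
            sleep(interval)

# Simulated server that starts answering on the third attempt.
attempts = {'n': 0}
def flaky_mc():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError("server not ready")
    return "ok"

# sleep is stubbed out so the demonstration runs instantly.
assert retry_until_ready(flaky_mc, timeout=10, sleep=lambda s: None) == "ok"
assert attempts['n'] == 3
```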
Fixes https://github.com/scylladb/scylladb/issues/1719
Closes #14517
* github.com:scylladb/scylladb:
test/pylib: do not hardwire alias to "local"
test/pylib: retry if minio_server is not ready
Let's just use cluster.contact_points for retrieving the IP address
of the scylla node in this single-node cluster, so the name of
managed_cluster() is less weird.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Instead of using a single run to perform the test, restructure
it into a pytest-based test suite with a single test case.
This should allow us to add more tests exercising the object storage
and cached/tiered storage in the future.
* add fixtures so they can be reused by tests
* use tmpdir fixture for managing the tmpdir, see
https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture
* perform part of the teardown in the "test_tempdir()" fixture
* change the type of test from "Run" to "Python"
* rename "run" to "test_basic.py"
* optionally start the minio server if the settings are not
  found on the command line or in env variables, so that the tests are
  self-contained without the fixtures set up by test.py.
* instead of sys.exit(), use assert statements, as this is
  what pytest uses.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* define a dedicated S3_server class which duck-types MinioServer.
  It will be used to represent an S3 server in place of MinioServer
  if S3 is used for testing.
* prepare object_storage.yaml in get_scylla_with_s3(), so it is
  clearer that we are using the same set of settings for launching
  scylla.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace restart_with_dir() with kill_with_dir(), so
that we can simplify the usage of managed_cluster() by enabling it
to start and stop the single-node cluster. With this change, the caller
no longer needs to run scylla and pass its pid to this function.
Since the restart_with_dir() call is superseded by managed_cluster(),
which tears down the cluster, teardown() is now only responsible for
printing out the log file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
To match the other call of managed_cluster(), so it's clear that
we are just reusing test_tempdir.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For setting up the cluster and tearing it down.
This helps to indent the code so that the lifecycle of the cluster
is visually explicit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is a chance that minio_server is not ready to serve right after
launching the server executable process, so we need to retry until
the first "mc" command is able to talk to it.
In this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out; it also allows us to
ignore the failure or specify the timeout. This should make sure the
minio server is ready before tests start to connect to it.
Fixes #1719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>