Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_new_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.
This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.
If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.
If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
Like in the non-raft topology path, during the new handshake, the
joining node will wait until all normal nodes are alive. The timeout
used during the wait is extracted to a constant so that it will be
reused in the handshake code, to be introduced in later commits.
Currently, the raft_group0 uses GROUP0_MODIFY_CONFIG RPC to ask an
existing group 0 member to add this node to the group, in case the
joining node was not a discovery leader. The new handshake verbs
(JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a
preparation, this commit abstracts away the handshake process.
We will want to conditionally register some verbs based on whether we
are using raft topology or not. This commit serves as a preparation,
passing the `raft_topology_change_enabled` to the function which
initializes the verbs (although there is _raft_topology_change_enabled
field already, it's only initialized on shard 0 later).
The `join_node_request` and `join_node_response` RPCs are added:
- `join_node_request` is sent from the joining node to any node in the
cluster. It contains some initial parameters that will be verified by
the receiving node, or the topology coordinator - notably, it contains
a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
joining node to tell it about the outcome of the verification.
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.
In order not to duplicate logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features and users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features is quite awkward now.
This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we now can
refer to node's supported features in a more natural way.
The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In the latter case, this triggers on_internal_error.
This commit adds protection to the existing verbs in storage_service:
they check whether the group 0 is running and whether the received
host_id matches the actual recipient's host_id.
None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
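
The two checks described above can be sketched as follows. This is a minimal Python illustration only, not the storage_service code: `Node`, `RpcRejected` and `check_raft_rpc` are hypothetical names, and the real verbs are C++ RPC handlers.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


class RpcRejected(Exception):
    """Raised instead of hitting an internal error (hypothetical name)."""


@dataclass
class Node:
    host_id: UUID
    group0_running: bool = False

    def check_raft_rpc(self, dst_host_id: UUID) -> None:
        # Check 1: the local group 0 server must already be running.
        if not self.group0_running:
            raise RpcRejected("group 0 is not running on this node yet")
        # Check 2: the sender must have addressed this node, not a node
        # that previously used the same IP address.
        if dst_host_id != self.host_id:
            raise RpcRejected(
                f"request addressed to {dst_host_id}, but this node is {self.host_id}"
            )


# A replaced node and its replacement share an IP but not a host ID.
old_node = Node(host_id=uuid4(), group0_running=True)
new_node = Node(host_id=uuid4(), group0_running=False)
old_node.check_raft_rpc(old_node.host_id)      # accepted
try:
    new_node.check_raft_rpc(old_node.host_id)  # misdirected and too early
except RpcRejected as e:
    print("rejected:", e)
```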
There can be 2 waiters now (coordinator and CDC generation publisher),
so signal() is not enough.
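
Why waking a single waiter is insufficient with two waiters can be shown with a small model. The sketch below uses Python's `asyncio.Condition` as a stand-in for the condition variable in question (an assumption made for illustration; the waiter names are just labels): `notify(1)` wakes one waiter, `notify_all()` wakes both.

```python
import asyncio


async def count_woken(wake_all: bool) -> int:
    cond = asyncio.Condition()
    woken = []

    async def waiter(name: str) -> None:
        async with cond:
            await cond.wait()
            woken.append(name)

    tasks = [asyncio.create_task(waiter(n))
             for n in ("coordinator", "cdc_generation_publisher")]
    for _ in range(3):
        await asyncio.sleep(0)  # let both waiters block on the condition

    async with cond:
        if wake_all:
            cond.notify_all()   # analogous to a broadcast
        else:
            cond.notify(1)      # analogous to a single signal

    # Give woken waiters a moment to finish, then cancel any still blocked.
    _, pending = await asyncio.wait(tasks, timeout=0.1)
    for task in pending:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return len(woken)


print(asyncio.run(count_woken(wake_all=False)))  # one waiter stays asleep
print(asyncio.run(count_woken(wake_all=True)))   # both waiters wake up
```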
The change made in c416c9ff33 failed to
update this site.
Closes scylladb/scylladb#15527
When preparing a `field_selection`, we need to prepare the UDT value,
and then verify that it has this field.
`field_selection_test_assignment` prepares the UDT value using the same
receiver as the whole `field_selection`. This is wrong, this receiver
has the type of the field, and not the UDT.
It's impossible to create a receiver for the UDT. Many different UDTs
can produce an `int` value when the field `a` is selected.
Therefore the receiver should be `nullptr`.
No unit test is added, as this bug doesn't currently cause any issues.
Preparing a column value doesn't do any type checks, so nothing fails.
Still it's good to fix it, just to be correct.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#14788
more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to CMake based building system.
Refs https://github.com/scylladb/scylladb/issues/15379
Closes scylladb/scylladb#15515
* github.com:scylladb/scylladb:
build: extract get_os_ids() out
build: extract find_ninja() out
build: extract thrift_uses_boost_share_ptr() out
Currently, the tools loosely follow the following convention on
error-codes:
* return 1 if the error is with any of the command-line arguments
* return 2 on other errors
This patch changes the returned error-code on unknown operation/command
to 100 (instead of the previous 1). The intent is to allow any wrapper
script to determine that the tool failed because the operation is
unrecognized and not because of something else. In particular this
should enable us to write a wrapper script for scylla-nodetool, which
dispatches commands still un-implemented in scylla-nodetool, to the java
nodetool.
Note that the tool will still print an error message on an unknown
operation. So such wrapper script would have to make sure to not let
this bleed-through when it decides to forward the operation.
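
A wrapper of the kind described above could look roughly like the following sketch. It is an assumption-laden illustration, not an existing script: `run_with_fallback` is a hypothetical helper, the command lists are supplied by the caller, and only the exit code 100 comes from this patch. The native tool's stderr is captured so that its "unknown operation" message does not leak when the call is forwarded.

```python
import subprocess
import sys
from typing import Callable, Sequence

UNKNOWN_OPERATION = 100  # exit code for an unrecognized operation, per this patch


def run_with_fallback(
    native: Sequence[str],
    fallback: Sequence[str],
    args: Sequence[str],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run the native tool; forward to the fallback only on exit code 100."""
    result = run([*native, *args], stderr=subprocess.PIPE)
    if result.returncode != UNKNOWN_OPERATION:
        # Success or a genuine failure: replay the captured stderr as-is.
        if result.stderr:
            sys.stderr.write(result.stderr.decode(errors="replace"))
        return result.returncode
    # Unknown operation: drop the native error message and forward the call.
    return run([*fallback, *args]).returncode
```

A caller would pass the two command lines it wants to chain and exit with the returned code. One trade-off of this design is that stderr of the native tool is buffered until the tool exits.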
Closes scylladb/scylladb#15517
Currently the datadir is ignored.
Use it to construct the table's base path.
Fixes scylladb/scylladb#15418
Closes scylladb/scylladb#15480
* github.com:scylladb/scylladb:
distributed_loader: populate_keyspace: access cf by ref
distributed_loader: table_populator: use datadir for base_path
distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed
distributed_loader: populate_keyspace: fixup indentation
distributed_loader: populate_keyspace: iterate over datadirs in the inner loop
test: sstable_directory_test: add test_multiple_data_dirs
table: init_storage: create upload and staging subdirs on all datadirs
Issue #10357 is about a SELECT with a filter on a regular column which
incorrectly returns a static row without regular columns set (so the
filter would not have matched). We already have four tests reproducing
this issue, but each of them is a small part of a large test translated
from Cassandra, making it hard to understand the scope of this bug.
So in this patch we add two new tests, one passing and one xfailing,
which clarify the scope of this bug. It turns out that the bug only
occurs when a partition has no clustering rows and only has a static
row. If the partition does have clustering rows - even if those don't
match the filter - the bug doesn't happen. The xfailing test is just
two statements long - a single INSERT and a single SELECT.
Refs #10357.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#15120
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, and the inactive read registration path.
1) service level drop invokes stop of the reader concurrency semaphore,
which will wait for in-flight requests
2) it turns out that stop first closes the gate used for closing readers
that will become inactive.
3) it then proceeds to wait for in-flight reads by closing the reader
permit gate.
4) one of the evictable reads takes the inactive read registration path, and
finds the gate for closing readers closed.
5) the flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing the permit gate first, evictable readers becoming inactive will
be able to properly close the underlying reader, therefore avoiding the
crash.
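
The ordering problem can be illustrated with a deliberately simplified, synchronous toy model. Everything in it is hypothetical (`Gate`, `stop_semaphore`, the returned strings); it mirrors only the sequence of events listed above, not the reader concurrency semaphore implementation.

```python
class GateClosed(Exception):
    pass


class Gate:
    def __init__(self) -> None:
        self.closed = False

    def enter(self) -> None:
        if self.closed:
            raise GateClosed()

    def close(self) -> None:
        self.closed = True


def stop_semaphore(close_permit_gate_first: bool) -> str:
    permit_gate = Gate()         # tracks in-flight reads
    close_readers_gate = Gate()  # used to close readers that become inactive

    def read_becomes_inactive() -> str:
        # An in-flight evictable read registers as inactive during shutdown;
        # closing its reader requires entering close_readers_gate.
        try:
            close_readers_gate.enter()
        except GateClosed:
            return "reader destroyed without close"
        return "reader closed gracefully"

    if close_permit_gate_first:
        # Fixed order: in-flight reads drain while readers can still be closed.
        outcome = read_becomes_inactive()
        permit_gate.close()
        close_readers_gate.close()
    else:
        # Old order: the close-readers gate is already closed when the
        # in-flight read tries to use it.
        close_readers_gate.close()
        outcome = read_becomes_inactive()
        permit_gate.close()
    return outcome


print(stop_semaphore(close_permit_gate_first=False))
print(stop_semaphore(close_permit_gate_first=True))
```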
Fixes #15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15535
in this series, we rename s3 credential related variable and option names so they are more consistent with AWS's official documentation. this should help with maintainability.
Closes scylladb/scylladb#15529
* github.com:scylladb/scylladb:
main.cc: rename aws option
utils/s3/creds: rename aws_config member variables
So far generic describe (`DESC <name>`) followed Cassandra implementation and it only described keyspace/table/view/index.
This commit adds UDT/UDF/UDA to generic describe.
Fixes: #14170
Closes scylladb/scylladb#14334
* github.com:scylladb/scylladb:
docs:cql: add information about generic describe
cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc
cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe
- s/aws_key/aws_access_key_id/
- s/aws_secret/aws_secret_access_key/
- s/aws_token/aws_session_token/
rename them to more popular names; these names are also used by
boto's API. this should improve the readability and consistency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is no need to hold on to the table's
shared ptr since it's held by the global table ptr
we got in the outer loop.
Simplify the code by just getting the local table reference
from `gtable`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the datadir is ignored.
Use it to construct the table's base path.
Fixes scylladb/scylladb#15418
Note that scylla still doesn't work correctly
with multiple data directories due to #15510.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, mark_ready_for_writes is called too early,
after the first data dir is processed, so the next
datadir will hit an assert in `table::mark_ready_for_writes`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to iterate over multiple data directories
in the inner loop rather than the outer loop.
Following patch will make use of the datadir in
table_populator.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a basic regression test that starts the cql test env
with multiple data directories.
It fails without the previous patch:
table: init_storage: create upload and staging subdirs on all datadirs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Also, while at it, add copyright/license blurbs for tests that were missing them.
Closes scylladb/scylladb#15495
* github.com:scylladb/scylladb:
test/topology_custom: add copyright/license blurb to tests
test/topology_custom: test_select_from_mutation_fragments.py: use async query api
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
Fixes: #14710.
Closes scylladb/scylladb#15091
* github.com:scylladb/scylladb:
cql3: statements: delete execute override
cql3: statements: call check_restricted_table_properties in prepare
cql3: statements: pass data_dictionary::database to check_restricted_table_properties
more structured this way. and the data dependency is more clear
with this change. this also allows us to quickly identify the parts
which should/can be reused when migrating to the CMake based building
system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to CMake based
building system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix fromJson(null) to return null, not an error as it did before this patch.
We use "null" as the default value when unwrapping optionals
to avoid bad optional access errors.
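
The idea of substituting "null" for a missing argument can be sketched in a few lines. This is a Python illustration under assumptions, not the CQL function: `from_json` is a hypothetical stand-in, and Python's `None` plays the role of both the empty optional and the CQL null.

```python
import json
from typing import Any, Optional


def from_json(arg: Optional[str]) -> Any:
    # Use the JSON literal "null" as the default when the argument is absent,
    # so an empty optional is never unwrapped directly.
    text = arg if arg is not None else "null"
    return json.loads(text)


print(from_json(None))        # None
print(from_json("null"))      # None
print(from_json('{"a": 1}'))  # {'a': 1}
```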
Fixes: scylladb#7912
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closes scylladb/scylladb#15481
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Closes scylladb/scylladb#15400
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
Make sure that all writes started by the old coordinator are completed or
will eventually fail before starting a new coordinator.
Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal uploadable part size is 5 MiB. When the necessary number of bytes has accumulated, a part-uploading fiber starts in the background. On flush, the sink waits for all the fibers to complete and handles the failure of any of them.
Uploading parallelism is currently limited by means of the http client's max-connections parameter. However, while a part-uploading fiber waits for its connection, it keeps the 5 MiB+ buffers in the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.
This PR adds a shard-wide limiter on the background buffer memory that S3 clients (and their http clients) may use.
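
A byte-counting limiter of this kind can be sketched as below. This is a Python `asyncio` model written for illustration, not the s3::client code: `MemoryLimiter`, `upload_part` and `upload_all` are hypothetical names, and the `asyncio.sleep(0)` call merely stands in for waiting on a connection and sending the part.

```python
import asyncio

PART_SIZE = 5 * 1024 * 1024  # minimal S3 part size, 5 MiB


class MemoryLimiter:
    """Counting limiter whose units are bytes of buffered upload data."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._used = 0
        self.peak = 0
        self._cond = asyncio.Condition()

    async def acquire(self, nbytes: int) -> None:
        # Note: a request larger than the limit would wait forever.
        async with self._cond:
            await self._cond.wait_for(lambda: self._used + nbytes <= self._limit)
            self._used += nbytes
            self.peak = max(self.peak, self._used)

    async def release(self, nbytes: int) -> None:
        async with self._cond:
            self._used -= nbytes
            self._cond.notify_all()


async def upload_part(limiter: MemoryLimiter, nbytes: int) -> None:
    # Take the units before the buffer is handed to the background upload.
    await limiter.acquire(nbytes)
    try:
        await asyncio.sleep(0)  # stand-in for "wait for a connection and send"
    finally:
        await limiter.release(nbytes)


async def upload_all(parts: int, limit: int) -> int:
    limiter = MemoryLimiter(limit)
    await asyncio.gather(*(upload_part(limiter, PART_SIZE) for _ in range(parts)))
    return limiter.peak


peak = asyncio.run(upload_all(parts=8, limit=2 * PART_SIZE))
print(peak <= 2 * PART_SIZE)
```

Sharing one limiter between all clients on a shard bounds the total buffered memory regardless of how many uploads are queued behind the connection limit.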
Closes scylladb/scylladb#15497
* github.com:scylladb/scylladb:
s3::client: Track memory in client uploads
code: Configure s3 clients' memory usage
s3::client: Construct client with shared semaphore
sstables::storage_manager: Introduce config
This new exception type inherits from std::bad_alloc and allows logalloc
code to add additional information about why the allocation failed. We
currently have 3 different throw sites for std::bad_alloc in logalloc.cc
and when investigating a coredump produced by --abort-on-lsa-bad-alloc,
it is impossible to determine which throw-site activated last,
triggering the abort.
This patch fixes that by disambiguating the throw-sites and including the
throw-site in the error message printed right before the abort.
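
The pattern of deriving a more descriptive exception from the generic allocation failure can be sketched with a Python analog. `MemoryError` stands in for std::bad_alloc here, and `BadAllocWithReason` and `allocate` are hypothetical names, not the logalloc types.

```python
class BadAllocWithReason(MemoryError):
    """Allocation failure that records which throw site raised it."""

    def __init__(self, throw_site: str) -> None:
        super().__init__(f"allocation failed at: {throw_site}")
        self.throw_site = throw_site


def allocate(nbytes: int, free_bytes: int) -> int:
    # One of several hypothetical throw sites; each passes its own description.
    if nbytes > free_bytes:
        raise BadAllocWithReason("segment pool exhausted")
    return free_bytes - nbytes


try:
    allocate(nbytes=10, free_bytes=4)
except MemoryError as e:  # existing handlers for the base type still match
    print(e)
```

Because the new type inherits from the base allocation error, existing catch sites keep working, while the message identifies the throw site.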
Refs: #15373
Closes scylladb/scylladb#15503
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It was not a problem until now,
when we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes #14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting regular_compaction, which implements
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation for the next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The view_update_generator (v.u.g.) start/stop is now spread heavily over the main() code.
1. sharded<v.u.g.>.start() happens early enough to allow dependent services to register staging sstables on it
2. after the system is "more-or-less" alive, invoke_on_all(v.u.g.::start()) is called (conditionally) to activate the generator background fiber. Not 100% sure why it happens _that_ late, but somehow it's required that the generation doesn't happen while scylla is joining the cluster
3. early on stop, the v.u.g. is fully stopped
The 3rd step is pretty nasty. It may happen that v.u.g. is not stopped if scylla start aborts before the last action is defer-scheduled. Also, when it happens, it leaves stopping dependencies with non-initialized v.u.g.'s local instances, which is not symmetrical to how they start.
That said, this PR fixes the stopping sequence to happen later, i.e. to be defer-scheduled right after sharded<v.u.g.> is started. It also makes sure that terminating the background fiber happens as early as it does now. This is done compaction_manager-style: the v.u.g. subscribes to the stop signal abort source and kicks the fiber to stop when it fires.
Closes scylladb/scylladb#15466
* github.com:scylladb/scylladb:
view_update_generator: Stop for real later
view_update_generator: Add logging to do_abort()
view_update_generator: Move abort kicking to do_abort()
view_update_generator: Add early abort subscription