Extract decorated_key.hh and ring_position.hh
out of i_partitioner.hh so they can be included
selectively, since i_partitioner.hh contains too much
baggage that is not always needed in full.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the `token_comparator` definition and
implementation to token.{hh,cc}, respectively
since they are independent of i_partitioner.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sender and receiver nodes of a streaming session may have different
views on whether a table has been dropped.
For example:
- n1, n2 and n3 in the cluster
- n4 started to join the cluster and stream data from n1, n2, n3
- a table was dropped
- n4 failed to write data from n2 to sstable because a table was dropped
- n4 ended the streaming
- n2 checked if the table was present and would ignore the error if the table was dropped
- however n2 found the table was still present and was not dropped
- n2 marked the streaming as failed
This will fail the streaming when a table is dropped. We want streaming to
ignore such dropped tables.
In this patch, a status code is sent back to the sender to notify it that
the table was dropped, so the sender can ignore the dropped table.
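The idea can be illustrated with a minimal Python sketch. It is not the actual streaming code: `StreamStatus`, `receive_table` and `stream_tables` are hypothetical names, and the receiver's set of tables stands in for its local schema view.

```python
from enum import Enum


class StreamStatus(Enum):
    """Hypothetical status codes the receiver sends back to the sender."""
    OK = 0
    TABLE_DROPPED = 1
    FAILED = 2


def receive_table(table, receiver_tables):
    """Receiver side: report a dropped table explicitly instead of a generic error."""
    if table not in receiver_tables:
        return StreamStatus.TABLE_DROPPED
    return StreamStatus.OK


def stream_tables(tables, receiver_tables):
    """Sender side: skip tables reported as dropped, fail on any other error."""
    streamed, skipped = [], []
    for table in tables:
        status = receive_table(table, receiver_tables)
        if status is StreamStatus.TABLE_DROPPED:
            # Trust the receiver's view even if the sender still sees the table.
            skipped.append(table)
        elif status is StreamStatus.OK:
            streamed.append(table)
        else:
            raise RuntimeError(f"streaming of {table} failed")
    return streamed, skipped
```

With this shape the sender no longer needs to consult its own, possibly stale, view of the schema to decide whether an error is benign.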
Fixes #15370
Closes scylladb/scylladb#15912
After starting the associated node, ScyllaServer waits until the node
starts serving CQL requests. It does that by periodically trying to
establish a python driver session to the node.
During session establishment, the driver tries to fetch some metadata
from the system tables, and uses a pretty short timeout to do so (by
default it's 2 seconds). When running tests in debug mode, this timeout
can prove to be too short and may prevent the testing framework from
noticing that the node came up.
Fix the problem by increasing the timeout. Currently, after the session
is established, a query is sent in order to further verify that the
session works and it uses a very generous timeout of 1000 seconds to do
so - use the same timeout for internal queries in the python driver.
Fixes: scylladb/scylladb#15898
Closes scylladb/scylladb#15929
The follow-up to #15594.
We retry every automatic `migration_manager::announce` if
`group0_concurrent_modification` occurs. Concurrent operations can
happen during concurrent bootstrap in Raft-based topology, so we need
this change to enable support for concurrent bootstrap.
This PR adds retry loops in 4 places:
- `service::create_keyspace_if_missing`,
- `system_distributed_keyspace::start`,
- `redis::create_keyspace_if_not_exists_impl`,
- `table_helper::setup_keyspace` (used for creating the `system_traces` keyspace).
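A retry loop of this kind can be sketched in Python as follows. This is only an illustration: `Group0ConcurrentModification` is a stand-in for `group0_concurrent_modification`, and `announce_with_retry` together with its `max_attempts` bound are assumptions, not part of the patch.

```python
class Group0ConcurrentModification(Exception):
    """Stand-in for the group0_concurrent_modification error."""


def announce_with_retry(prepare_and_announce, max_attempts=10):
    """Re-run the whole prepare-and-announce step when a concurrent change wins."""
    for attempt in range(1, max_attempts + 1):
        try:
            # The callable must rebuild its mutations from fresh state each time.
            return prepare_and_announce()
        except Group0ConcurrentModification:
            if attempt == max_attempts:
                raise
```

The important detail is that the whole step is retried, not just the final announce, since the state the mutations were based on may have changed.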
Fixes #15435
Closes scylladb/scylladb#15613
* github.com:scylladb/scylladb:
table_helper: fix indentation
table_helper: retry in setup_keyspace on concurrent operation
table_helper: add logger
redis/keyspace_utils: fix indentation
redis: retry creating default databases on concurrent operation
db/system_distributed_keyspace: fix indentation
db/system_distributed_keyspace: retry start on concurrent operation
auth/service: retry creating system_auth on concurrent operation
Topology on raft is still an experimental feature. The RPC verbs
introduced in that mode shouldn't be used when it's disabled, otherwise
we lose the right to make breaking changes to those verbs.
First, make sure that the aforementioned verbs are not sent outside the
mode. It turns out that `raft_pull_topology_snapshot` could be sent
outside topology-on-raft mode - after the PR, it no longer can.
Second, topology-on-raft mode verbs are now not registered at all on the
receiving side when the mode is disabled.
Additionally tested by running `topology/` tests with
`consistent_cluster_management: True` but with experimental features
disabled.
Fixes: scylladb/scylladb#15862
Closes scylladb/scylladb#15917
* github.com:scylladb/scylladb:
storage_service: fix indentation
raft: topology: only register verbs in topology-on-raft mode
raft: topology: only pull topology snapshot in topology-on-raft mode
move the code which updates the third-party library closer to where
the library is found, for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15915
this mirrors what we already have in `configure.py`,
so that Seastar can report [[nodiscard]] violations as errors.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15914
Currently, table_helper::setup_keyspace is used only for starting
the system_traces keyspace. We need to handle concurrent group 0
operations possible during concurrent bootstrap in the Raft-based
topology.
Some pytest-based tests start the scylla binary by hand
instead of relying on test.py's "clusters". In an automatic run (e.g. via
test.py itself) the correct scylla binary is the one pointed to by the
SCYLLA environment variable, but when run from the shell via pytest directly
the test tries to be smart and looks at the build/*/scylla binaries, picking
the one with the greatest mtime.
That guess is not very nice, because if the developer switches between
build modes with configure.py and rebuilds binaries, binaries from
"older" or "previous" builds get in the way and confuse the guessing
code. It's better to be explicit.
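The guessing described above can be sketched roughly like this. It is an illustration, not the test suite's code: the function name and its parameters are hypothetical.

```python
import glob
import os


def guess_scylla_binary(build_dir="build", environ=os.environ):
    """Return $SCYLLA if set, else the build/*/scylla file with the newest mtime."""
    explicit = environ.get("SCYLLA")
    if explicit:
        return explicit
    candidates = glob.glob(os.path.join(build_dir, "*", "scylla"))
    if not candidates:
        raise FileNotFoundError(f"no scylla binary under {build_dir}")
    # The guess: whichever mode was rebuilt last wins, regardless of intent.
    return max(candidates, key=os.path.getmtime)
```

The last line is where the confusion comes from: after switching build modes, a binary from another mode can still be the most recently modified one.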
refs: #15679
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15684
This patch series adds error handling for streaming failure during
topology operations instead of an infinite retry. If streaming fails the
operation is rolled back: bootstrap/replace nodes move to left and
decommissioned/remove nodes move back to normal state.
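The rollback rule stated above amounts to a small lookup table. The Python sketch below uses hypothetical names (`ROLLBACK_STATE`, `state_after_streaming_failure`) and plain strings for operations and node states.

```python
# Target node state after a streaming failure, per topology operation.
ROLLBACK_STATE = {
    "bootstrap": "left",
    "replace": "left",
    "decommission": "normal",
    "remove": "normal",
}


def state_after_streaming_failure(operation):
    """Return the state a node is moved to when streaming fails."""
    return ROLLBACK_STATE[operation]
```

Nodes that were joining end up out of the ring, while nodes that were leaving are restored to where they started.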
* 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev:
raft: make sure that all operation forwarded to a leader are completed before destroying raft server
storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier
tests: add tests for streaming failure in bootstrap/replace/remove/decommission
test/pylib: do not stop node if decommission failed with an expected error
storage_service: raft topology: fix typo in "decommission" everywhere
storage_service: raft topology: add streaming error injection
storage_service: raft topology: do not increase topology version during CDC repair
storage_service: raft topology: rollback topology operation on streaming failure.
storage_service: raft topology: load request parameters in left_token_ring state as well
storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error
storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch
storage_service: raft topology: make global_token_metadata_barrier node independent
storage_service: raft topology: split get_excluded_nodes from exec_global_command
storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true
storage_service: raft topology: simplify streaming RPC failure handling
There are some schema modifications performed automatically (during
bootstrap, upgrade etc.) by Scylla that are announced by multiple calls
to `migration_manager::announce` even though they are logically one
change. Precisely, they appear in:
- `system_distributed_keyspace::start`,
- `redis::create_keyspace_if_not_exists_impl`,
- `table_helper::setup_keyspace` (for the `system_traces` keyspace).
All these places contain a FIXME telling us to `announce` only once.
There are a few reasons for this:
- calling `migration_manager::announce` with Raft is quite expensive --
taking a `read_barrier` is necessary, and that requires contacting a
leader, which then must contact a quorum,
- we must implement a retrying mechanism for every automatic `announce`
if `group0_concurrent_modification` occurs to enable support for
concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs
mentioned above would be harder, and fixing the FIXMEs later would also
be harder.
This PR fixes the first two FIXMEs and improves the situation with the
last one by reducing the number of the `announce` calls to two.
Unfortunately, reducing this number to one requires a big refactor. We
can do it as a follow-up to a new, more specific issue. Also, we leave a
new FIXME.
Fixing the first two FIXMEs required enabling the announcement of a
keyspace together with its tables. Until now, the code responsible for
preparing mutations for a new table could assume the existence of the
keyspace. This assumption wasn't necessary, but removing it required
some refactoring.
Fixes scylladb/scylladb#15437
Closes scylladb/scylladb#15897
* github.com:scylladb/scylladb:
table_helper: announce twice in setup_keyspace
table_helper: refactor setup_table
redis: create_keyspace_if_not_exists_impl: fix indentation
redis: announce once in create_keyspace_if_not_exists_impl
db: system_distributed_keyspace: fix indentation
db: system_distributed_keyspace: announce once in start
tablet_allocator: update on_before_create_column_family
migration_listener: add parameter to on_before_create_column_family
alternator: executor: use new prepare_new_column_family_announcement
alternator: executor: introduce create_keyspace_metadata
migration_manager: add new prepare_new_column_family_announcement
Verbs related to topology on raft should not be sent outside the
topology on raft mode - and, after the previous commit, they aren't.
Make sure not to register handlers for those verbs if topology on raft
mode is not enabled.
Currently, during group0 snapshot transfer, the node pulling
the snapshot will send the `raft_pull_topology_snapshot` verb even if
the cluster is not in topology-on-raft mode. The RPC handler returns an
empty snapshot in that case. However, using the verb outside topology on
raft causes problems:
- It can cause issues during rolling upgrade as the snapshot transfer
will keep failing on the upgraded nodes until the leader node is
upgraded,
- Topology changes on raft are still experimental, and using the RPC
outside experimental mode will prevent us from doing breaking changes
to it.
Solve the issue by passing the "topology changes on raft enabled" flag
to group0_state_machine and send the RPC only in topology on raft mode.
We can opt out from installing suggested packages. Mainly those related to Java and friends that we do not seem to need.
Fixes: #15579
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closes scylladb/scylladb#15580
There's a compaction_manager_for_testing wrapper class in test_services. After #15889 this class resembles the test_env_compaction_manager and can be replaced with it. However, two users of the former wrapper class need it just to construct a table object, and the way they do it is a re-implementation of the table_for_tests class.
This PR patches the test cases to make use of table_for_tests and removes the compaction_manager_for_testing that becomes unused after it.
Closes scylladb/scylladb#15909
* github.com:scylladb/scylladb:
test_services: Ditch compaction_manager_for_testing
test/sstable_compaction_test: Make use of make_table_for_tests()
test/sstable_3_x_test: Make use of make_table_for_tests()
table_for_tests: Add const operator-> overload
sstable_test_env: Add test_env_compaction_manager() getter
sstable_test_env: Tune up maybe_start_compaction_manager() method
test/sstable_compaction_test: Remove unused tracker allocation
Now this wrapper is unused: both test cases that needed it were
patched to use make_table_for_tests().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The max_ongoing_compaction_test test case constructs table object by
hand. For that it needs tracker, compaction manager and stats. Similarly
to previous patch, the test_env::make_table_for_tests() helper does
exactly that, so the test case can be simplified as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The compacted_sstable_reader() helper constructs table object and all
its "dependencies" by hand. The test_env::make_table_for_tests() helper
does the same, so the test code can be simplified.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make maybe_start_compaction_manager() public and add a `bool enable` flag so that test cases could start
the compaction manager (to call make_table_for_tests() later) but keep
it disabled for their testing purposes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable_run_based_compaction_test case allocates the tracker but
doesn't use it. Probably was left after the case was patched to use
make_table_for_tests() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
on debian derivatives, librapidxml-dev installs rapidxml.h as
rapidxml/rapidxml.hpp, so let's use it as a fallback.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15814
before this change the argument passed to --date-stamp option is
ignored, as we don't reference the date-stamp specified with this option
at all. instead, we always overwrite it with the output of
`date --utc +%Y%m%d`, if we are going to reference this value.
so, in this change instead of unconditionally overwriting it, we
keep its value intact if it is already set.
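the intended behaviour can be sketched in Python as below. this is an illustration only, not the shell logic of the actual script; `resolve_date_stamp` and its `now` parameter are hypothetical.

```python
from datetime import datetime, timezone


def resolve_date_stamp(date_stamp=None, now=None):
    """Keep a caller-supplied date stamp; otherwise fall back to today's UTC date."""
    if date_stamp:
        return date_stamp
    now = now or datetime.now(timezone.utc)
    # Same format as `date --utc +%Y%m%d`.
    return now.strftime("%Y%m%d")
```

the regression corresponds to dropping the first `if` and always taking the fallback branch.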
the change which introduced this regression was 839d8f40e6
Fixes #15894
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15895
The object in question is used to facilitate creation of table objects for compaction tests. Currently the table_for_tests carries a bunch of auxiliary objects that are needed for table creation, such as stats of all sorts and table state. However, there's also some "infrastructure" stuff on board, namely:
- reader concurrency semaphore
- cache tracker
- task manager
- compaction manager
And those four are excessive because all the tests in question run inside the sstables::test_env that has most of it.
This PR removes the mentioned objects from table_for_tests and re-uses those from test_env. While at it, it also removes the table::config object from table_for_tests so that it looks more like the core code that creates a table.
Closes scylladb/scylladb#15889
* github.com:scylladb/scylladb:
table_for_tests: Use test_env's compaction manager
sstables::test_env: Carry compaction manager on board
table_for_tests: Stop table on stop
table_for_tests: Get compaction manager from table
table_for_tests: Ditch on-board concurrency semaphore
table_for_tests: Require config argument to make table
table_for_tests: Create table config locally
table_for_tests: Get concurrency semaphore from table
table_for_tests: Get table directory from table itself
table_for_tests: Reuse cache tracker from sstables manager
table_for_tests: Remove unused constructor
tests: Split the compaction backlog test case
sstable_test_env: Coroutinize and move to .cc test_env::stop()
Replacing `restrict_replication_simplestrategy` config option with
2 config options: `replication_strategy_{warn,fail}_list`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on replication strategy when creating/altering a keyspace.
The reason to replace rather than extend the `restrict_replication_simplestrategy` config
option is that it was not used and we wanted to generalize it.
Only soft guardrail is enabled by default and it is set to SimpleStrategy,
which means that we'll generate a CQL warning whenever replication strategy
is set to SimpleStrategy. For new cloud deployments we'll move
SimpleStrategy from warn to the fail list.
Guardrail violations will be tracked by metrics.
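The soft/hard limit semantics can be sketched in Python as follows. This is an illustration, not the CQL layer's code: `check_replication_strategy` is a hypothetical name, and giving the fail list precedence over the warn list is an assumption.

```python
def check_replication_strategy(strategy, warn_list=("SimpleStrategy",), fail_list=()):
    """Return ("fail" | "warn" | "ok", message) for a requested replication strategy."""
    if strategy in fail_list:
        # Hard limit: the CQL statement is not executed.
        return "fail", f"{strategy} is forbidden by replication_strategy_fail_list"
    if strategy in warn_list:
        # Soft limit: the statement runs, but a CQL warning is generated.
        return "warn", f"{strategy} is discouraged by replication_strategy_warn_list"
    return "ok", ""
```

The default arguments mirror the defaults described above: only the warn list is populated, and it contains SimpleStrategy.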
Resolves #5224
Refs #8892 (the replication strategy part, not the RF part)
Closes scylladb/scylladb#15399
Handler of STREAM_MUTATION_FRAGMENTS verb creates and starts reader. The
resulting future is then checked for being exceptional and an error
message is printed in logs.
However, if reader fails because of socket being closed by peer, the
error looks excessive. In that case the exception is just regular
handling of the socket/stream closure and can be demoted down to debug
level.
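The log level selection can be sketched in Python like this. It is only an analogy: Python's `ConnectionResetError` and `BrokenPipeError` stand in for the "closed by peer" exceptions, and both function names are hypothetical.

```python
import logging


def log_level_for_stream_error(exc):
    """Pick DEBUG for a peer closing the connection, ERROR for anything else."""
    if isinstance(exc, (ConnectionResetError, BrokenPipeError)):
        return logging.DEBUG
    return logging.ERROR


def report_reader_failure(logger, exc):
    """Log a failed reader at a level matching how unexpected the failure is."""
    logger.log(log_level_for_stream_error(exc), "stream_mutation_fragments failed: %s", exc)
```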
fixes: #15891
Similar cherry-picking of log level exists in e.g. storage proxy, see
for example 56bd9b5d (service: storage_proxy: do not report abort
requests in handle_write)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15892
before this change, we feed `build_reloc.sh` with hardwired arguments
when building python3 submodule. but this is not flexible, and hurts
the maintainability.
in this change, we mirror the behavior of `configure.py`, and collect
the arguments from the output of `install-dependencies.sh`, and feed
the collected argument to `build_reloc.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15885
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repair on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls because there are no opportunities to
yield. Solve this by adding an explicit yield on that path.
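The effect of the explicit yield can be illustrated with a Python asyncio analogy. This is a simplified, sequential sketch rather than the Seastar code: `repair_ranges`, `should_skip` and `repair_one` are hypothetical, and `asyncio.sleep(0)` plays the role of the yield.

```python
import asyncio


async def repair_ranges(ranges, should_skip, repair_one):
    """Repair every non-skipped range and return how many were repaired."""
    repaired = 0
    for r in ranges:
        if should_skip(r):
            # The skip path does no I/O, so without this the loop would never
            # give other tasks a chance to run when most ranges are skipped.
            await asyncio.sleep(0)
            continue
        await repair_one(r)
        repaired += 1
    return repaired
```

Without the yield on the skip path, a long run of skipped ranges would execute as one uninterrupted stretch of work.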
Fixes: #14330
Closes scylladb/scylladb#15879
The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal
with legacy materialized view schemas used for secondary indexes,
schemas which were created before the notion of "computed columns" was
introduced. Back then, secondary index schemas would use a regular
"token" column. Later it became a computed column and old schemas would
be migrated during rolling upgrade.
The migration code was introduced in 2019
(db8d4a0cc6) and then fixed in 2020
(d473bc9b06).
The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming
that users don't try crazy things like upgrading from 2021.X to 2023.X
(which we do not support), all clusters will have already executed the
migration code once they upgrade to 2023.X, meaning we can get rid of
it.
The main motivation of this PR is to get rid of the
`db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft
mode this was the only call to `merge_schema` outside "group 0 code" and
in fact it is unsafe -- it uses locally generated mutations with locally
generated timestamp (`api::new_timestamp()`), so if we actually did it,
we would permanently diverge the group 0 state machine across nodes
(the schema pulling code is disabled in Raft mode). Fortunately, this
should be dead code by now, as explained in the previous paragraph.
The migration code is now turned into a sanity check: if users
try something crazy, they will get an error instead of silent data
corruption.
Closes scylladb/scylladb#15695
* github.com:scylladb/scylladb:
view: remove unused `_backing_secondary_index`
schema_tables: turn view schema fixing code into a sanity check
schema_tables: make comment more precise
feature_service: make COMPUTED_COLUMNS feature unconditionally true
to be compatible with `configure.py` which allows us to optionally
specify the --date-stamp option for SCYLLA-VERSION-GEN. this option
is used by our CI workflow.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15896
We refactor table_helper::setup_keyspace so that it calls
migration_manager::announce at most twice. We achieve it by
announcing all tables at once.
The number of announcements should further be reduced to one, but
it requires a big refactor. The CQL code used in
parse_new_cf_statement assumes the keyspace has already been
created. We cannot have such an assumption if we want to announce
a keyspace and its tables together. However, we shouldn't touch
the CQL code as it would impact user requests, too.
One solution is using schema_builder instead of the CQL statements
to create tables in table_helper.
Another approach is removing table_helper completely. It is used
only for the system_traces keyspace, which Scylla creates
automatically. We could refactor the way Scylla handles this
keyspace and make table_helper unneeded.