Commit Graph

5380 Commits

Author SHA1 Message Date
Botond Dénes
4d538e1363 Merge 'Task manager tasks covering compaction group compaction' from Aleksandra Martyniuk
All compaction task executors, except for the regular compaction one,
become task manager compaction tasks.

Creation and starting of major_compaction_task_executor is modified
to be consistent with other compaction task executors.

Closes #14505

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to cover compaction group tasks
  compaction: turn custom_task_executor into compaction_task_impl
  compaction: turn sstables_task_executor into sstables_compaction_task_impl
  compaction: change sstables compaction tasks type
  compaction: move table_upgrade_sstables_compaction_task_impl
  compaction: pass task_info through sstables compaction
  compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
  compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
  compaction: use optional task info in major compaction
  compaction: use perform_compaction in compaction_manager::perform_major_compaction
2023-08-04 10:11:00 +03:00
Michał Jadwiszczak
b92d47362f schema::describe: print 'synchronous_updates' only if it was specified
When describing a materialized view, print the `synchronous_updates` option
only if the tag is present in the schema's extensions map. Previously, if the
key wasn't present, the default (false) value was printed.

Fixes: #14924

Closes #14928
2023-08-04 09:52:37 +03:00
Kefu Chai
d8d91379e7 test: remove unnecessary check in compaction_manager_basic_test
we wait for the same condition a couple of lines before, so no need to
check it again using `BOOST_CHECK_EQUAL()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14921
2023-08-04 09:26:22 +03:00
Kefu Chai
d4ee84ee1e s3/test: nuke tempdir but keep $tempdir/log
before this change, if the object_store test fails, the tempdir
will be preserved. and if our CI test pipeline is used to perform
the test, the test job would scan for the artifacts, and if the
test in question fails, it would take over 1 hour to scan the tempdir.

to alleviate the pain, let's just keep the scylla logging file
no matter whether the test fails or succeeds, so that jenkins can scan the
artifacts faster if the test fails.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14880
2023-08-03 11:07:59 +03:00
Konstantin Osipov
df97135583 test.py: forward the optional property file when creating a server
To support multi-DC tests we need to provide a property
file when creating a server.
Forward it from the test client to test.py.

Closes #14683
2023-08-02 13:45:19 +02:00
Kamil Braun
b835acf853 Merge 'Cluster features on raft: topology coordinator + check on boot' from Piotr Dulikowski
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.

Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:

1. First, a global barrier is performed to make sure that all nodes saw
   and persisted the same state of the `system.topology` table as the
   coordinator and see the same supported features of all nodes. When
   booting, nodes are now forbidden to revoke support for a feature if all
   nodes declare support for it, so a successful barrier makes sure that
   no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
   `system.topology` table.

The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).
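
The two-phase procedure above can be sketched roughly as follows. This is a
minimal illustration, not the actual implementation: the dictionary layout of
`topology`, the `barrier` callback and the `enable_features` helper are
hypothetical, and only the name `calculate_not_yet_enabled_features` is
borrowed from the commit list below.

```python
def calculate_not_yet_enabled_features(supported_by_node, enabled):
    """Return features that every node supports but that are not enabled yet.

    `supported_by_node` maps a node name to its set of supported features
    (hypothetical layout, chosen for this sketch).
    """
    if not supported_by_node:
        return set()
    common = set.intersection(*(set(f) for f in supported_by_node.values()))
    return common - set(enabled)


def enable_features(topology, barrier):
    """Two-phase enable: global barrier first, then mark features as enabled.

    `barrier` is a hypothetical callback that returns True only if all nodes
    saw and persisted the same topology state.
    """
    to_enable = calculate_not_yet_enabled_features(
        topology["supported"], topology["enabled"])
    if not to_enable:
        return set()
    if not barrier(topology):         # phase 1: global barrier
        return set()
    topology["enabled"] |= to_enable  # phase 2: mark as enabled
    return to_enable
```

In this sketch a failed barrier leaves the enabled set untouched, mirroring the
rule that features are only marked as enabled after a successful barrier.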

For now, the implementation relies on gossip shadow round check to
protect from nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify joining node's cluster features
before the new node is added to group 0 and to the cluster.

A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.

Closes #14722

* github.com:scylladb/scylladb:
  test: topology_experimental_raft: cluster feature tests
  test: topology: fix a skipped test
  storage_service: add injection to prevent enabling features
  storage_service: initialize enabled features from first node
  topology_state_machine: add size(), is_empty()
  group0_state_machine: enable features when applying cmds/snapshots
  persistent_feature_enabler: attach to gossip only if not using raft
  feature_service: enable and check raft cluster features on startup
  storage_service: provide raft_topology_change_enabled flag from outside
  storage_service: enable features in topology coordinator
  storage_service: add barrier_after_feature_update
  topology_coordinator: exec_global_command: make it optional to retake the guard
  topology_state_machine: add calculate_not_yet_enabled_features
2023-08-02 12:32:27 +02:00
Kefu Chai
d28c06b65b test: remove unused #include in sstable_*_test.cc
for faster build times and clear inter-module dependencies, we
should not #include headers that are not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed, except for "test/lib/scylla_test_case.hh"
as it brings some command line options used by scylla tests.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14922
2023-08-02 11:58:03 +03:00
Benny Halevy
949ea43034 topology: unindex_node: erase dc from datacenters when empty
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.

This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.
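
A minimal sketch of the restored behavior, assuming a simplified model in
which `_dc_endpoints` maps a datacenter name to a set of endpoints. The
`TopologySketch` class is hypothetical, and dropping the empty
`_dc_endpoints[dc]` entry as well is an assumption of this sketch, not
something the commit message states.

```python
from collections import defaultdict


class TopologySketch:
    """Toy model of the per-datacenter indexes (hypothetical, for illustration)."""

    def __init__(self):
        self._dc_endpoints = defaultdict(set)
        self._datacenters = set()

    def index_node(self, endpoint, dc):
        self._dc_endpoints[dc].add(endpoint)
        self._datacenters.add(dc)

    def unindex_node(self, endpoint, dc):
        endpoints = self._dc_endpoints.get(dc)
        if endpoints is None:
            return
        endpoints.discard(endpoint)
        if not endpoints:
            # No endpoints left in this dc: forget the dc itself as well.
            del self._dc_endpoints[dc]
            self._datacenters.discard(dc)
```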

Fixes scylladb/scylladb#14896

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14897
2023-08-02 09:08:24 +03:00
Piotr Dulikowski
d40bb0bacb test: topology_experimental_raft: cluster feature tests
Although the implementation of cluster features on raft is not complete
yet, it makes sense to add some tests for the existing implementation.
The `test_raft_cluster_features.py` file includes the same set of tests
as the file with non-raft-based cluster feature tests, plus one
additional test which checks that a node will not allow disabling a
feature if it sees that other nodes support it (even though the feature
is not enabled yet).
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
435005b6a5 test: topology: fix a skipped test
The `test_partial_upgrade_can_be_finished_with_removenode` test does not
work because the `cql` variable is used before it is declared. It was
not noticed because the test is marked as skipped, and does not work for
the non-raft cluster feature implementation. The variable declaration is
moved higher and the test now works; it will be used to test the raft
cluster feature implementation.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
61a44e0bc0 storage_service: provide raft_topology_change_enabled flag from outside
Information about whether we are using topology changes on raft or not
will be soon necessary for the persistent feature enabler, so that it
can do some additional checks based on the local raft topology state.
2023-08-01 18:54:57 +02:00
Kamil Braun
8bb3732d66 Merge 'storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal' from Patryk Jędrzejczak
We add the CDC generation optimality check in
`storage_service::raft_check_and_repair_cdc_streams` so that it doesn't
create new generations when unnecessary. Since
`generation_service::check_and_repair_cdc_streams` already has this
check, we extract it to the new `is_cdc_generation_optimal` function to
not duplicate the code.

After this change, multiple tasks could wait for a single generation
change. Calling `signal` on `topology_state_machine.event` wouldn't wake
them all. Moreover, we must ensure the topology coordinator wakes when
its logic expects it. Therefore, we change all `signal` calls on
`topology_state_machine.event` to `broadcast`.
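
The difference between waking one waiter and waking all of them can be
illustrated with Python's `asyncio.Condition`, used here only as an analogy
for the condition variable behind `topology_state_machine.event`. The
`count_woken` helper is hypothetical and exists only for this sketch.

```python
import asyncio


async def count_woken(wake_all, waiters=3):
    """Start several waiters, notify once, and count how many woke up."""
    cond = asyncio.Condition()
    woken = 0

    async def waiter():
        nonlocal woken
        async with cond:
            await cond.wait()
            woken += 1

    tasks = [asyncio.create_task(waiter()) for _ in range(waiters)]
    await asyncio.sleep(0.05)      # let every waiter block on the condition
    async with cond:
        if wake_all:
            cond.notify_all()      # analogous to broadcast
        else:
            cond.notify(1)         # analogous to signal
    await asyncio.sleep(0.05)      # let woken waiters run
    for task in tasks:
        task.cancel()              # clean up waiters that never woke
    await asyncio.gather(*tasks, return_exceptions=True)
    return woken
```

With a single-waiter notification only one of the tasks waiting for the same
generation change proceeds; the rest stay blocked until another notification
arrives.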

We delay the deletion of the `new_cdc_generation` request to the moment
when the topology transition reaches the `publish_cdc_generation` state.
We need this change to ensure the added CDC generation optimality check
in the next commit has an intended effect. If we didn't make it, it
would be possible that a task makes the `new_cdc_generation` request,
and then, after this request was removed but before committing the new
generation, another task also makes the `new_cdc_generation` request. In
such a scenario, two generations are created, but only one should be. After
delaying the deletion of `new_cdc_generation` requests, the second
request would have no effect.

Additionally, we modify the `test_topology_ops.py` test in a way that
verifies the new changes. We call
`storage_service::raft_check_and_repair_cdc_streams` multiple times
concurrently and verify that exactly one generation has been created.

Fixes #14055

Closes #14789

* github.com:scylladb/scylladb:
  storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
  storage_service: delay deletion of the new_cdc_generation request
  raft topology: broadcast on topology_state_machine.event instead of signal
  cdc: implement the is_cdc_generation_optimal function
2023-08-01 12:10:00 +02:00
Kamil Braun
84bb75ea0a Merge 'service: migration_manager: change the prepare_ methods to functions' from Patryk Jędrzejczak
The `migration_manager` service is responsible for schema convergence in
the cluster - pushing schema changes to other nodes and pulling schema
when a version mismatch is observed. However, there is also a part of
`migration_manager` that doesn't really belong there - creating
mutations for schema updates. These are the functions with `prepare_`
prefix. They don't modify any state and don't exchange any messages.
They only need to read the local database.

We take these functions out of `migration_manager` and make them
separate functions to reduce the dependency of other modules (especially
`query_processor` and CQL statements) on `migration_manager`. Since all
of these functions only need access to `storage_proxy` (or even only
`replica::database`), doing such a refactor is not complicated. We just
have to add one parameter, either `storage_proxy` or `database` and both
of them are easily accessible in the places where these functions are
called.

This refactor makes `migration_manager` unneeded in a few functions:
- `alternator::executor::create_keyspace`,
- `cql3::statements::alter_type_statement::prepare_announcement_mutations`,
- `cql3::statements::schema_altering_statement::prepare_schema_mutations`,
- `cql3::query_processor::execute_thrift_schema_command`,
- `thrift::handler::execute_schema_command`.

We remove the `migration_manager&` parameter from all these functions.

Fixes #14339

Closes #14875

* github.com:scylladb/scylladb:
  cql3: query_processor::execute_thrift_schema_command: remove an unused parameter
  cql3: schema_altering_statement::prepare_schema_mutations: remove an unused parameter
  cql3: alter_type_statement::prepare_announcement_mutations: change parameters
  alternator: executor::create_keyspace: remove an unused parameter
  service: migration_manager: change the prepare_ methods to functions
2023-08-01 11:56:56 +02:00
Avi Kivity
dac93b2096 Merge 'Concurrent tablet migration and balancing' from Tomasz Grabiec
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.

The migration track is interrupted by pending topology change operations.

The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.

Overload of shards is limited by the fact that the load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
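
As a rough illustration of the concurrency accounting described above, the
sketch below counts streaming load per shard from the active migrations and
only admits new migrations whose source and target shards are both below a
limit. The `plan_migrations` function, the `(source, target)` tuple
representation and the `limit` parameter are assumptions made for this
example; the real balancer is more involved.

```python
from collections import Counter


def plan_migrations(candidates, active, limit):
    """Pick candidate migrations without exceeding per-shard concurrency.

    `candidates` and `active` are lists of (source_shard, target_shard)
    tuples; `limit` is the maximum number of concurrent streams per shard.
    """
    load = Counter()
    for src, dst in active:
        load[src] += 1
        load[dst] += 1

    plan = []
    for src, dst in candidates:
        if load[src] < limit and load[dst] < limit:
            plan.append((src, dst))
            # Account for the newly planned migration right away.
            load[src] += 1
            load[dst] += 1
    return plan
```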

Closes #14851

* github.com:scylladb/scylladb:
  tablets: load_balancer: Remove double logging
  tests: tablets: Check that load balancing is interrupted by topology change
  tests: tablets: Add test for load balancing with active migrations
  tablets: Balance tablets concurrently with active migrations
  storage_service, tablets: Extract generate_migration_updates()
  storage_service, tablets: Move get_leaving_replica() to tablets.cc
  locator: tablets: Move std::hash definition earlier
  storage_service: Advance tablets independently
  topology_coordinator: Fix missed notification on abort
  tablets: Add formatter for tablet_migration_info
2023-07-31 16:44:33 +03:00
Botond Dénes
4a02865ea1 Merge 'Prevent invalidation of iterators over database::_column_families' from Aleksandra Martyniuk
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.

Fixes: #13290

Closes #13349

* github.com:scylladb/scylladb:
  replica: make tables_metadata's attributes private
  replica: add methods to get a filtered copy of tables map
  replica: add methods to check if given table exists
  replica: add methods to get table or table id
  replica: api: return table_id instead of const table_id&
  replica: iterate safely over tables related maps
  replica: pass tables_metadata to phased_barrier_top_10_counts
  replica: add methods to safely add and remove table
  replica: wrap column families related maps into tables_metadata
  replica: futurize database::add_column_family and database::remove
2023-07-31 15:31:59 +03:00
Botond Dénes
72043a6335 Merge 'Avoid using qctx in schema_tables' column-mapping queries' from Pavel Emelyanov
There are three methods in the system_keyspace namespace that run queries over the `system.scylla_table_schema_history` table. For that they use qctx, which is not nice.

Fortunately, all the callers already have a system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to the system keyspace, the latter declares the querying methods as "friends" to let them get the private `query_processor& _qp` member.

Closes #14876

* github.com:scylladb/scylladb:
  schema_tables: Extract query_processor from system_keyspace for querying
  schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
  migration_manager: Add system_keyspace argument to get_schema_mapping()
2023-07-31 15:00:59 +03:00
Botond Dénes
781721218f Merge 'storage_service: refresh_sync_nodes: restrict to normal token owners' from Benny Halevy
It is possible that topology will contain nodes that are no longer normal token owners, so they don't need to be sync'ed with.

Fixes scylladb/scylladb#14793

Closes #14798

* github.com:scylladb/scylladb:
  storage_service: refresh_sync_nodes: restrict to reachable token owners
  storage_service: refresh_sync_nodes: fix log message
  locator: topology: node::state: make fine grained
2023-07-31 14:52:19 +03:00
Benny Halevy
d903d03bf8 locator: topology: node::state: make fine grained
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 10:33:48 +03:00
Kefu Chai
47e27dd2d2 test: wait until there is no pending tasks in compaction_manager_basic_test
before this change, after triggering the compaction,
compaction_manager_basic_test waits until the triggered compaction
completes. but the regular compaction is run in a loop which
does not stop until either the daemon is stopping, or there are no
more sstables to be compacted, or the compaction is disabled.

but we only get the input sstables for compaction after switching
to the "pending" state, and acquiring the read lock of the
compaction_state, and acquiring the read lock is implemented as
a coroutine, so there is a chance that the coroutine is suspended,
and the execution switches to the test. in this case, the test
will find that even after the triggered compaction completes,
there are still one or more pending compactions. hence the test
fails.

to address this problem, instead of just waiting for the compaction
to complete, we also wait until the number of pending compaction tasks
is 0. so that even if the test manages to sneak into the time window,
it won't proceed and start checking the compaction manager's stats.

Fixes #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14889
2023-07-31 10:29:18 +03:00
Nadav Har'El
04e5082d52 alternator: limit expression length and recursion depth
DynamoDB limits the length of all expressions (ConditionExpression, UpdateExpression,
ProjectionExpression, FilterExpression, KeyConditionExpression) to just
4096 bytes. Until now, Alternator did not enforce this limit, and we had
an xfailing test showing this.

But it turns out that not enforcing this limit can be dangerous: The user
can pass arbitrarily-long and arbitrarily nested expressions, such as:

    a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...))))))

or

    (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((

and those can cause recursive algorithms in Alternator's parser and
later when applying expressions to recurse very deeply, overflow the
stack, and crash.

This patch includes new tests that demonstrate how Scylla crashes during
parsing before enforcing the 4096-byte length limit on expressions.
The patch then enforces this length limit, and these tests stop crashing.
We also verify that deeply-nested expressions shorter than the 4096-byte
limit are apparently short enough for our recursion ability, and work
as expected.

Unfortunately, running these tests many times showed that the 4096-byte
limit is not low enough to avoid all crashes so this patch needs to do
more:

The parsers created by ANTLR are recursive, and there is no way to limit
the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH).
Very deep recursion can overflow the stack and crash Scylla. After we
limited the length of expression strings to 4096 bytes this was *almost*
enough to prevent stack overflows. But unfortunately the tests revealed
that even limited to 4096 bytes, the expression can sometimes recurse
too deeply: Consider the expression "((((((....((((" with 4000 parentheses.
To realize this is a syntax error, the parser needs to do a recursive
call 4000 times. Or worse - because of other ANTLR limitations (see rants
in comments in expressions.g) it's actually 12000 recursive calls, and
each of these calls has a pretty large frame. In some cases, this
overflows the stack.

The solution used in this patch is not pretty, but works. We add to rules
in alternator/expressions.g that recurse (there are two of those - "value"
and "boolean_expression") an integer "depth" parameter, which we increase
when the rule recurses. Moreover, we add a so-called predicate
"{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached.
When the parsing is stopped, the user will see a special kind of parse
error, saying "expression nested too deeply".
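
The depth-parameter technique described above can be sketched with a small
hand-written recursive-descent parser. This is not the ANTLR grammar from
alternator/expressions.g: the toy grammar, the `parse_value` function, the
`ExpressionSyntaxError` class and the `MAX_DEPTH` value of 100 are all
assumptions made for this illustration.

```python
MAX_DEPTH = 100  # hypothetical limit, chosen only for this sketch


class ExpressionSyntaxError(Exception):
    pass


def parse_value(text, pos=0, depth=0):
    """Parse the toy grammar  value := NAME | '(' value ')'.

    Returns the position just past the parsed value. `depth` grows by one
    on every recursive call, and parsing stops once it reaches MAX_DEPTH.
    """
    if depth >= MAX_DEPTH:
        raise ExpressionSyntaxError(
            f"expression nested too deeply at char {pos}")
    if pos < len(text) and text[pos] == "(":
        end = parse_value(text, pos + 1, depth + 1)
        if end >= len(text) or text[end] != ")":
            raise ExpressionSyntaxError(f"expected ')' at char {end}")
        return end + 1
    end = pos
    while end < len(text) and text[end].isalnum():
        end += 1
    if end == pos:
        raise ExpressionSyntaxError(f"expected a name at char {pos}")
    return end
```

Because the depth check runs before each recursive step, an input such as
4000 opening parentheses is rejected after a bounded number of calls instead
of recursing once per parenthesis.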

With this last modification to expressions.g, the tests for deeply-nested but
still-below-4096-bytes expressions
(test_limits.py::test_deeply_nested_expression_*) would not fail sporadically
as they did without it.

While adding the "expression nested too deeply" case, I also made the
general syntax-error reporting in Alternator nicer: It no longer prints
the internal "expression_syntax_error" type name (an exception type will
only be printed if some sort of unexpected exception happens), and it
prints the character position where the syntax error (or too deep
nested expression) was recognized.

Fixes #14473

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14477
2023-07-31 08:57:54 +03:00
Tomasz Grabiec
96d06b58df tests: tablets: Check that load balancing is interrupted by topology change
We add a special mode of load balancing, enabled through error
injection, which causes it to continuously generate plans. This
should keep the topology coordinator continuously in the tablet
migration track.

We enable this mode in test_tablets.py:test_bootstrap before
bootstrapping nodes to see that a bootstrap request interrupts
the tablet migration track. If this were not the case, the
test would hang.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
8fdbc42e71 tests: tablets: Add test for load balancing with active migrations
2023-07-31 01:45:23 +02:00
Nadav Har'El
b55b8f29b9 test/cql-pytest: test confirming that casting to counter doesn't work
In the previous patch we implemented CAST operations from the COUNTER
type to various other types. We did not implement the reverse cast,
from different types to the counter type. Should we? In this patch
we add a test that shows we don't need to bother - Cassandra does not
support such casts, so it's fine that we don't either - and indeed the
test shows we don't support them.
It's not a useful operation anyway.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Nadav Har'El
b513bba201 cql: support casting of counter to other types
We were missing support in the "CAST(x AS type)" function for the counter
type. This patch adds this support, as well as extensive testing that it
works in Scylla the same as Cassandra.

We also un-xfail an existing test translated from Cassandra's unit
test. But note that this old test did not cover all the edge-cases that
the new test checks - some missing cases in the implementation were
not caught by the old test.

Fixes #14501

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Nadav Har'El
c1762750ed cql: implement missing counterasblob() and blobascounter() functions
Code in functions.cc creates the different TYPEasblob() and blobasTYPE()
functions for all type names TYPE. The functions for the "counter" type
were skipped, supposedly because "counters are not supported yet". But
counters are supported, so let's add the missing functions.

The code fix is trivial, the tests that verify that the result behaves
like Cassandra took more work.

After this patch, unimplemented::cause::COUNTERS is no longer used
anywhere in the code. I wanted to remove it, but noticed that
unimplemented::cause is a graveyard of unused causes, so decided not
to remove this one either. We should clean it up in a separate patch.

Fixes #14742

Also includes tests for tangentially-related issues:
Refs #12607
Refs #14319

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Tomasz Grabiec
4e9d95d78c Merge 'Compact data before streaming' from Botond Dénes
Currently, streaming and repair process and send data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair additionally could be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference is in how these choose the compaction time to use:
* Load'n'stream and streaming use the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to all repair followers. This is to ensure all repair participants work with the exact same state of compactness.

 Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).
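
Why a shared compaction time matters for repair can be shown with a toy
model. The row layout (dictionaries with `key`, `expires_at` and `tombstone`
fields) and the `compact_for_streaming` function are hypothetical; the sketch
only models expiry and the rule that tombstones are kept, not real mutation
compaction.

```python
def compact_for_streaming(rows, compaction_time):
    """Drop data that has expired as of `compaction_time`; keep tombstones."""
    out = []
    for row in rows:
        if row.get("tombstone"):
            out.append(row)  # tombstones are never purged here
        elif row.get("expires_at") is None or row["expires_at"] > compaction_time:
            out.append(row)
    return out


rows = [
    {"key": "a", "expires_at": 100},
    {"key": "b", "expires_at": None},
    {"key": "c", "tombstone": True},
]

# Two replicas holding the same rows agree when they use the same time...
same_time = compact_for_streaming(rows, 150) == compact_for_streaming(rows, 150)
# ...but may disagree when each one picks its own local time.
different_time = compact_for_streaming(rows, 50) == compact_for_streaming(rows, 150)
```

In this model `same_time` is true and `different_time` is false, which is the
kind of artificial difference a centrally chosen compaction time avoids.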

Fixes: https://github.com/scylladb/scylladb/issues/3561

Closes #14756

* github.com:scylladb/scylladb:
  replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
  repair/row_level: opt in to compacting the stream
  streaming: opt-in to compacting the stream
  sstables_loader: opt-in for compacting the stream
  replica/table: add optional compacting to make_multishard_streaming_reader()
  replica/table: add optional compacting to make_streaming_reader()
  db/config: add config item for enabling compaction for streaming and repair
  repair: log the error which caused the repair to fail
  readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
  mutation/mutation_compactor: allow user to abandon current partition
2023-07-28 16:42:13 +02:00
Pavel Emelyanov
cf4d4d7e9b schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
The callers all have local sys_ks argument:

- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()

And a test that can get it from cql_test_env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 15:55:13 +03:00
Kefu Chai
cc2bbde8f1 test: use BOOST_CHECK_EQUAL when appropriate in compaction_manager_basic_test
compaction_manager_basic_test checks the stats of compaction_manager to
verify that there are no ongoing or pending compactions after triggering
the compaction and waiting for its completion. but in #14865, there are
still active compaction(s) after the compaction_manager's stats show there
is at least one task completed.

to understand this issue better, let's use `BOOST_CHECK_EQUAL()` instead
of `BOOST_REQUIRE()`, so that the test does not error out when the check
fails, and we can have better understanding of the status when the test
fails.

Refs #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14872
2023-07-28 15:45:07 +03:00
Patryk Jędrzejczak
3468cbd66b service: migration_manager: change the prepare_ methods to functions
The migration_manager service is responsible for schema convergence
in the cluster - pushing schema changes to other nodes and pulling
schema when a version mismatch is observed. However, there is also
a part of migration_manager that doesn't really belong there -
creating mutations for schema updates. These are the functions with
prepare_ prefix. They don't modify any state and don't exchange any
messages. They only need to read the local database.

We take these functions out of migration_manager and make them
separate functions to reduce the dependency of other modules
(especially query_processor and CQL statements) on
migration_manager. Since all of these functions only need access
to storage_proxy (or even only replica::database), doing such a
refactor is not complicated. We just have to add one parameter,
either storage_proxy or database and both of them are easily
accessible in the places where these functions are called.
2023-07-28 13:55:27 +02:00
Patryk Jędrzejczak
3f29c98394 storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
We add the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams so that it
doesn't create new generations when unnecessary.

Additionally, we modify the test_topology_ops.py test in a way
that verifies the new changes. We call
storage_service::raft_check_and_repair_cdc_streams multiple
times concurrently and verify that exactly one generation has been
created.
2023-07-28 11:04:30 +02:00
Aleksandra Martyniuk
bfa3a7325a test: extend test_compaction_task.py to cover compaction group tasks
2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
1decf86d71 compaction: change sstables compaction tasks type
2023-07-28 10:51:55 +02:00
Avi Kivity
cf81eef370 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs the
schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change introduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
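
The effect of ignoring empty partitions can be sketched with a toy digest.
The `table_digest` function, the use of SHA-256 and the representation of a
partition as a `(key, live_cells)` pair are assumptions for illustration and
do not reflect the actual digest algorithm.

```python
import hashlib


def table_digest(partitions, ignore_empty):
    """Hash partition keys and live cells of the schema-table mutations.

    With `ignore_empty` set, partitions that became empty after compaction
    do not contribute their key to the digest.
    """
    h = hashlib.sha256()
    for key, live_cells in partitions:
        if ignore_empty and not live_cells:
            continue
        h.update(key.encode())
        for cell in live_cells:
            h.update(cell.encode())
    return h.hexdigest()


# Before tombstone expiry the query still returns an empty partition;
# after expiry and compaction that partition is gone entirely.
before_expiry = [("tables", ["ks.t"]), ("computed_columns", [])]
after_expiry = [("tables", ["ks.t"])]
```

In this model the digests of `before_expiry` and `after_expiry` differ when
empty partitions are fed and match when they are ignored.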

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables
2023-07-28 00:01:33 +03:00
Patryk Jędrzejczak
b81a6037f1 test: pylib: ensure ScyllaCluster.add_server does not start a second cluster
If the cluster isn't empty and all servers are stopped, calling
ScyllaCluster.add_server can start a new cluster. That's because
ScyllaCluster._seeds uses the running servers to calculate the
seed node list, so if all nodes are down, the new node would
select only itself as a seed, starting a new cluster.

As a single ScyllaCluster should describe a single cluster, we
make ScyllaCluster.add_server fail when called on a non-empty
cluster with all its nodes stopped.
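
A minimal sketch of the guard, assuming a toy cluster model. The
`ClusterSketch` class, its `running` and `stopped` dictionaries and the
`stop_server` helper are hypothetical and only mirror the behavior described
above; they are not the real ScyllaCluster API.

```python
class ClusterSketch:
    """Toy stand-in for a test cluster (hypothetical, for illustration)."""

    def __init__(self):
        self.running = {}   # server name -> seeds it was started with
        self.stopped = {}

    def _seeds(self):
        # Seeds are computed from the running servers only.
        return list(self.running)

    def add_server(self, name):
        if self.stopped and not self.running:
            # Non-empty cluster with all nodes down: the new node would
            # seed itself and start a second cluster, so refuse.
            raise RuntimeError(
                "cannot add a server: cluster is not empty "
                "and all its servers are stopped")
        seeds = self._seeds() or [name]
        self.running[name] = seeds
        return seeds

    def stop_server(self, name):
        self.stopped[name] = self.running.pop(name)
```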

Closes #14804
2023-07-27 13:27:23 +02:00
Alexey Novikov
ff721ec3e3 make timestamp string format cassandra compatible
When we convert a timestamp into a string, it must look like: '2017-12-27T11:57:42.500Z'.
This concerns every conversion except the JSON timestamp format.
A JSON string has a space as the time separator and must look like: '2017-12-27 11:57:42.500Z'.
Both formats always contain milliseconds and a timezone specification.

Fixes #14518
Fixes #7997

Closes #14726
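
The two formats can be reproduced with a short Python sketch. `format_timestamp` is a hypothetical helper written for illustration and is unrelated to the actual C++ change.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def format_timestamp(ms_since_epoch, json_format=False):
    """Format milliseconds since the Unix epoch as a UTC string.

    The regular format uses 'T' as the date/time separator, the JSON
    format uses a space. Both always carry milliseconds and 'Z'.
    """
    dt = EPOCH + timedelta(milliseconds=ms_since_epoch)
    sep = " " if json_format else "T"
    millis = dt.microsecond // 1000
    return dt.strftime(f"%Y-%m-%d{sep}%H:%M:%S") + f".{millis:03d}Z"

print(format_timestamp(1514375862500))                    # 2017-12-27T11:57:42.500Z
print(format_timestamp(1514375862500, json_format=True))  # 2017-12-27 11:57:42.500Z
```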
2023-07-27 12:01:09 +03:00
Botond Dénes
fdaf908967 repair/row_level: opt in to compacting the stream
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being on some undefined state of compaction.
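
Why a shared compaction time matters can be shown with a toy model. The sketch below assumes that compaction simply drops entries whose expiry is at or before the compaction time; `compact` and the `(key, expiry)` tuples are hypothetical and far simpler than real tombstone handling.

```python
def compact(rows, compaction_time):
    """Keep only the keys whose expiry lies after compaction_time."""
    return [key for key, expiry in rows if expiry > compaction_time]

# Two replicas holding identical data.
replica_a = [("k1", 100), ("k2", 200)]
replica_b = [("k1", 100), ("k2", 200)]

# Locally generated times: the replicas disagree although their data matches.
local_a = compact(replica_a, compaction_time=99)   # ["k1", "k2"]
local_b = compact(replica_b, compaction_time=101)  # ["k2"]

# One time chosen by the master and propagated: no artificial difference.
shared_a = compact(replica_a, compaction_time=101)
shared_b = compact(replica_b, compaction_time=101)
```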
2023-07-27 04:57:50 -04:00
Botond Dénes
2f8d77e97b replica/table: add optional compacting to make_multishard_streaming_reader()
Doing to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.

Call sites are updated, but none opt in just yet.
2023-07-27 03:22:11 -04:00
Raphael S. Carvalho
050ce9ef1d cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.

cached_file's dtor iterates through all entries and evicts those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

With this change, a page read ahead is linked into LRU so it
doesn't sit in memory indefinitely. This also allows cached_file's
dtor to clear all cache when some of the pages brought in advance
are never fetched later.

A reproducer was added.

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818
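
The problem and the fix can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the cached_file code: `PageCacheSketch`, its refcount dictionary, and the `link_read_ahead` switch are all invented for the illustration.

```python
class PageCacheSketch:
    """Toy page cache: unused pages must be in the LRU to be evictable."""

    def __init__(self, link_read_ahead):
        self.cache = {}   # page index -> refcount
        self.lru = []     # unused pages eligible for eviction
        self.link_read_ahead = link_read_ahead

    def get_page(self, idx, read_ahead=1):
        # Populate the requested page plus read-ahead pages.
        for i in range(idx, idx + 1 + read_ahead):
            if i not in self.cache:
                self.cache[i] = 0
                if i != idx and self.link_read_ahead:
                    self.lru.append(i)  # the fix: link read-ahead pages at once
        # Only the first page is shared (taken out of the LRU, refcount bumped).
        if idx in self.lru:
            self.lru.remove(idx)
        self.cache[idx] += 1
        return idx

    def release(self, idx):
        self.cache[idx] -= 1
        if self.cache[idx] == 0:
            self.lru.append(idx)  # back into the LRU once unused

    def close(self):
        # Like the destructor: evict whatever is linked into the LRU.
        for i in self.lru:
            del self.cache[i]
        self.lru.clear()
        return not self.cache  # True if the cache was fully cleared
```

With `link_read_ahead=False`, a read that touches only the first page leaves the read-ahead page with refcount 0 and outside the LRU, so `close()` returns `False`; with `link_read_ahead=True` it returns `True`.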
2023-07-27 00:01:46 +02:00
Nadav Har'El
59c1498338 test/alternator: don't forget to delete tables on test failures
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.

We missed a few such cases in test_gsi.py, and this patch corrects them.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.

There are no functional changes in this patch - most of the lines
changed are just reindents.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14835
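
The difference between the two styles comes down to a `try`/`finally` around the test body. The sketch below uses a plain set in place of a real database, and `new_test_table_sketch` is a hypothetical stand-in rather than the helper from the Alternator test suite.

```python
from contextlib import contextmanager

@contextmanager
def new_test_table_sketch(tables, name):
    """Create a table, hand it to the test, and always delete it afterwards."""
    tables.add(name)
    try:
        yield name
    finally:
        tables.discard(name)  # runs even if the test body raises

tables = set()
try:
    with new_test_table_sketch(tables, "t1") as t:
        assert t in tables
        raise AssertionError("simulated test failure")
except AssertionError:
    pass
# The table is gone despite the failure inside the with-block.
```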
2023-07-26 21:51:22 +03:00
Nadav Har'El
056d04954c Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size, containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()
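
The accounting flaw can be sketched with a toy consumer. Everything here is hypothetical: `ViewConsumerSketch`, the 64-byte per-mutation size, and the 1000-byte limit are chosen only to make the effect visible.

```python
class ViewConsumerSketch:
    """Toy buffer that flushes once buffer_size reaches a limit."""

    def __init__(self, limit, account_partition_start):
        self.limit = limit
        self.account_partition_start = account_partition_start
        self.buffer = []
        self.buffer_size = 0
        self.flushes = 0

    def consume_new_partition(self, key, mutation_size):
        self.buffer.append(key)
        if self.account_partition_start:
            # The fix: count the freshly created mutation object itself.
            self.buffer_size += mutation_size
        self._maybe_flush()

    def _maybe_flush(self):
        if self.buffer_size >= self.limit:
            self.flushes += 1
            self.buffer.clear()
            self.buffer_size = 0

def feed_empty_partitions(consumer, count, mutation_size=64):
    for i in range(count):
        consumer.consume_new_partition(f"pk{i}", mutation_size)
    return consumer

old = feed_empty_partitions(ViewConsumerSketch(1000, False), 100)
new = feed_empty_partitions(ViewConsumerSketch(1000, True), 100)
```

Without the accounting, `old` never flushes and holds all 100 empty partitions; with it, `new` flushes every 16 partitions and the buffer stays bounded.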
2023-07-26 20:04:28 +03:00
Nadav Har'El
d2ca600eec test/*/run: kill Scylla with SIGTERM
Today, test/*/run always kills Scylla at the end of the test with
SIGKILL (kill -9), so the Scylla shutdown code doesn't run. It was
believed that a clean shutdown would take a long time, but in fact,
it turns out that 99% of the shutdown time was a silly sleep in the
gossip code, which this patch disables with the "--shutdown-announce-in-ms"
option.

After enabling this option, clean shutdown takes (in a dev build on
my laptop) just 0.02 seconds. It's worth noting that this shutdown
has no real work to do - no tables to flush, and so on, because the
pytest framework removes all the tables in its own fixture cleanup
phase.

So in this patch, to kill Scylla we use SIGTERM (15) instead of SIGKILL.
We then wait until a timeout of 10 seconds (much much more than 0.02
seconds!) for Scylla to exit. If for some reason it didn't exit (e.g.,
it hung during the shutdown), it is killed again with SIGKILL, which
is guaranteed to succeed.

This change gives us two advantages:

1. Every test run with test/*/run exercises the shutdown path. It is perhaps
   excessive, but since the shutdown is so quick, there is no big downside.

2. In a test-coverage run, a clean shutdown allows flushing the counter
   files, which wasn't possible when Scylla was killed with kill -9.

Fixes #8543

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14825
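
The SIGTERM-then-SIGKILL pattern described above can be sketched with Python's `subprocess` module. `stop_process` is a hypothetical helper, not the code in test/*/run; only the 10-second timeout is taken from the commit message.

```python
import subprocess

def stop_process(proc, timeout=10):
    """Ask a child process to exit cleanly, then force-kill it if it hangs."""
    proc.terminate()  # sends SIGTERM on POSIX, so shutdown code can run
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # sends SIGKILL on POSIX, which cannot be ignored
        proc.wait()
    return proc.returncode
```

On POSIX, a child that dies from the first signal reports a negative return code equal to the signal number, which a caller can use to tell a clean exit from a forced one.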
2023-07-26 14:06:24 +03:00
Avi Kivity
ff1f461a42 Merge 'Introduce tablet load balancer' from Tomasz Grabiec
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by means of a load balancer, which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.

The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.

The load balancer will also be kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.

Tablet data is streamed using the existing `range_streamer`, which is the infrastructure for "the old streaming". This will later be replaced by sstable transfer once integration of tablets with compaction groups is finished. Cleanup is not wired yet either; it is also blocked by compaction group integration.

Closes #14601

* github.com:scylladb/scylladb:
  tests: test_tablets: Add test for bootstraping a node
  storage_service: topology_coordinator: Implement tablet migration state machine
  tablets: Introduce tablet_mutation_builder
  service: tablet_allocator: Introduce tablet load balancer
  tablets: Introduce tablet_map::for_each_tablet()
  topology: Introduce get_node()
  token_metadata: Add non-const getter of tablet_metadata
  storage_service: Notify topology state machine after applying schema change
  storage_service: Implement stream_tablet RPC
  tablets: Introduce global_tablet_id
  stream_transfer_task, multishard_writer: Work with table sharder
  tablets: Turn tablet_id into a struct
  db: Do not create per-keyspace erm for tablet-based tables
  tablets: effective_replication_map: Take transition stage into account when computing replicas
  tablets: Store "stage" in transition info
  doc: Document tablet migration state machine and load balancer
  locator: erm: Make get_endpoints_for_reading() always return read replicas
  storage_service: topology_coordinator: Sleep on failure between retries
  storage_service: topology_coordinator: Simplify coordinator loop
  main: Require experimental raft to enable tablets
2023-07-26 12:30:29 +03:00
Botond Dénes
d0f725c1b9 test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
A test reproducing #14819, that is, the view update builder not flushing
the buffer when only empty partitions are consumed (with only a
tombstone in them).
2023-07-26 03:09:53 -04:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get a system_keyspace reference from. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of the global qctx.

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Tomasz Grabiec
ae8ffe23fc tests: test_tablets: Add test for bootstraping a node 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
5c681a1d63 tablets: Introduce tablet_mutation_builder 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
6f4a35f9ae service: tablet_allocator: Introduce tablet load balancer
Will be invoked by the topology coordinator later to decide
which tablets to migrate.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f88220aeee stream_transfer_task, multishard_writer: Work with table sharder
So that we can use it on tablet-based tables.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
8cf92d4c86 tablets: Turn tablet_id into a struct
The IDL compiler cannot deal with enum classes like this.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
2023-07-25 21:08:02 +02:00