When a `topology_change` command is applied, the topology state is
reloaded and `cdc::generation_service::handle_cdc_generation` is called.
This creates a dependency of group0 on `cdc::generation_service`.
Currently, the group0 server is stopped during `raft_group_registry`
shutdown. However, it is called after `cdc::generation_service`
shutdown, which can result in a segfault.
To prevent this issue, this commit stops the group0 server and removes it
from `raft_group_registry` during `group0_service` shutdown.
Fixes#14397.
Closes#14779
Reproducer:
97d6946e31
It creates two nodes. The second one is forced to stop after joining
group0. It sleeps before calling handle_cdc_generation and sleeps just
before raft_group_registry is stopped. It ensures that
handle_cdc_generation wakes up after starting the second sleep. If the
cdc_generation_service shutdown waits for raft_group_registry to stop,
handle_cdc_generation will be called without any issue. Otherwise, it
will crash since cdc_generation_service won't exist. The test passes
always. If the crash happens it can be seen in the log file of the
second node.
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.
The migration track is interrupted by pending topology change operations.
The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.
Overload of shards is limited by the fact that load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
Closes#14851
* github.com:scylladb/scylladb:
tablets: load_balancer: Remove double logging
tests: tablets: Check that load balancing is interrupted by topology change
tests: tablets: Add test for load balancing with active migrations
tablets: Balance tablets concurrently with active migrations
storage_service, tablets: Extract generate_migration_updates()
storage_service, tablets: Move get_leaving_replica() to tablets.cc
locator: tablets: Move std::hash definition earlier
storage_service: Advance tablets independently
topology_coordinator: Fix missed notification on abort
tablets: Add formatter for tablet_migration_info
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.
Fixes: #13290Closes#13349
* github.com:scylladb/scylladb:
replica: make tables_metadata's attributes private
replica: add methods to get a filtered copy of tables map
replica: add methods to check if given table exists
replica: add methods to get table or table id
replica: api: return table_id instead of const table_id&
replica: iterate safely over tables related maps
replica: pass tables_metadata to phased_barrier_top_10_counts
replica: add methods to safely add and remove table
replica: wrap column families related maps into tables_metadata
replica: futurize database::add_column_family and database::remove
There are three methods in system_keyspace namespace that run queries over `system.scylla_table_schema_history` table. For that they use qctx which's not nice.
Fortunately, all the callers already have the system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to system keyspace, the latter declares the querying methods as "friends" to let them get private `query_processor& _qp` member
Closes#14876
* github.com:scylladb/scylladb:
schema_tables: Extract query_processor from system_keyspace for querying
schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
migration_manager: Add system_keyspace argument to get_schema_mapping()
It is possible that topology will contain nodes that are no longer normal token owners, so they don't need to be sync'ed with.
Fixes scylladb/scylladb#14793
Closes#14798
* github.com:scylladb/scylladb:
storage_service: refresh_sync_nodes: restrict to reachable token owners
storage_service: refresh_sync_nodes: fix log message
locator: topology: node::state: make fine grained
It is possible that topology would contain nodes that do no longer
own tokens or that are unreachable, so they can't be sync'ed with.
Restrict the list to nodes in a normal or being_decommissioned state.
Fixesscylladb/scylladb#14793
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, after triggering the compaction,
compaction_manager_basic_test waits until the triggered compaction
completes. but since the regular compaction is run in a loop which
does not stop until either the daemon is stopping, or there is no
more sstables to be compacted, or the compaction is disabled.
but we only get the input sstables for compaction after swiching
to the "pending" state, and acquiring the read lock of the
compaction_state, and acquiring the read lock is implemented as
an coroutine, so there is chance that coroutine is suspended,
and the execution switches to the test. in this case, the test
will find that even after the triggered compaction completes,
there are still one or more pending compactions. hence the test
fails.
to address this problem, instead of just waiting for the compaction
to complete, we also wait until the number of pending compaction tasks
is 0. so that even if the test manages to sneak into the time window,
it won't proceed and starting check the compaction manager's stats.
Fixes#14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14889
Add calls to `maybe_yield` in the per-range loops to prevent stalls
if the loop never yields.
Note: originally the stalls were detected in nested calls
to `query_partition_key_range_concurrent` (see #14008).
This series turned the tail-recursion into iteration,
but still the inner loop(s) never yield and do quite
a lot of computations - so they mioght stall when called
with a large number of ranges.
Fixes#14008
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
instead of using a loop of std::swap(), let's use std::shift_left()
when appropriate. simpler and more readable this way.
moreover, the pattern of looking for a command and consume it from
the command line resembles what we have in main(), so let's use
similar logic to handle both of them. probably we can consolidate
them in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14888
this change change `const` to `constexpr`. because the string literal
defined here is not only immutable, but also initialized at
compile-time, and can be used by constexpr expressions and functions.
this change is introduced to reduce the size of the change when moving
to compile-time format string in future. so far, seastar::format() does
not use the compile-time format string, but we have patches pending on
review implementing this. and the author of this change has local
branches implementing the changes on scylla side to support compile-time
format string, which practically replaces most of the `format()` calls
with `seastar::format()`.
to reduce the size of the change and the pain of rebasing, some of the
less controversial changes are extracted and upstreamed. this one is one
of them.
this change also addresses following compilation failure:
```
/home/kefu/dev/scylladb/tools/scylla-sstable.cc:2836:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
2836 | .description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
| ^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
3148 | FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
| ^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14887
DynamoDB limits of all expressions (ConditionExpression, UpdateExpression,
ProjectionExpression, FilterExpression, KeyConditionExpression) to just
4096 bytes. Until now, Alternator did not enforce this limit, and we had
an xfailing test showing this.
But it turns out that not enforcing this limit can be dangerous: The user
can pass arbitrarily-long and arbitrarily nested expressions, such as:
a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...))))))
or
(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
and those can cause recursive algorithms in Alternator's parser and
later when applying expressions to recurse very deeply, overflow the
stack, and crash.
This patch includes new tests that demonstrate how Scylla crashes during
parsing before enforcing the 4096-byte length limit on expressions.
The patch then enforces this length limit, and these tests stop crashing.
We also verify that deeply-nested expressions shorter than the 4096-byte
limit are apparently short enough for our recursion ability, and work
as expected.
Unforuntately, running these tests many times showed that the 4096-byte
limit is not low enough to avoid all crashes so this patch needs to do
more:
The parsers created by ANTLR are recursive, and there is no way to limit
the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH).
Very deep recursion can overflow the stack and crash Scylla. After we
limited the length of expression strings to 4096 bytes this was *almost*
enough to prevent stack overflows. But unfortunetely the tests revealed
that even limited to 4096 bytes, the expression can sometimes recurse
too deeply: Consider the expression "((((((....((((" with 4000 parentheses.
To realize this is a syntax error, the parser needs to do a recursive
call 4000 times. Or worse - because of other Antlr limitations (see rants
in comments in expressions.g) it's actually 12000 recursive calls, and
each of these calls have a pretty large frame. In some cases, this
overflows the stack.
The solution used in this patch is not pretty, but works. We add to rules
in alternator/expressions.g that recurse (there are two of those - "value"
and "boolean_expression") an integer "depth" parameter, which we increase
when the rule recurses. Moreover, we add a so-called predicate
"{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached.
When the parsing is stopped, the user will see a special kind of parse
error, saying "expression nested too deeply".
With this last modification to expressions.g, the tests for deeply-nested but
still-below-4096-bytes expressions
(test_limits.py::test_deeply_nested_expression_*) would not fail sporadically
as they did without it.
While adding the "expression nested too deeply" case, I also made the
general syntax-error reporting in Alternator nicer: It no longer prints
the internal "expression_syntax_error" type name (an exception type will
only be printed if some sort of unexpected exception happens), and it
prints the character position where the syntax error (or too deep
nested expression) was recognized.
Fixes#14473
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14477
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742).
As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.
Fixes#14501Fixes#14742Closes#14745
* github.com:scylladb/scylladb:
test/cql-pytest: test confirming that casting to counter doesn't work
cql: support casting of counter to other types
cql: implement missing counterasblob() and blobascounter() functions
cql: implement missing type functions for "counters" type
We add a special mode of load balancing, enabled through error
injection, which causes it to continuously generate plans. This
should keep the topology coordinator continuously in the tablet
migration track.
We enable this mode in test_tablets.py:test_bootstrap before
bootstrapping nodes to see that bootstrap request interrupts
tablet migration track. If this would not be the case, the
test will hang.
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decision which when executed with active migrations will produce the
desired result.
Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
The coordinator executes the load balancer on edges of tablet state
machine stransitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.
This change makes the topology state machine advance each tablet
independently which allows them to finish migrations at different
speeds, not at the speed of the slowest tablet.
It will also open the possibility of starting new transitions concurrently
with already active ones.
This is implemented by having a single transition state "tablet
migration", and handling it by scanning all the transitions and
advancing tablet state machines. Updates and barriers are batched for
all tablets in each cycle.
One complication is the tracking of streaming sessions. The operations
are no longer nested in the scope of a single handle method, and
cannot be waited on explicitly, as that would inhibit progress of the
coordinator, which starts later migrations. They live as independent
fibers, which associated with tablets in a transient data structure
which lives within the coordinator instance. This data structure is
consulted for a given tablet in each cycle of the
handle_tablet_migration() pump to check if streaming has finished and
we can move the tablet to the next stage. If the pump has no work,
only then it waits for any streaming to finish by blocking on the
_topo_sm.event.
If _as is aborted while the coordinator is in the middle of handling,
and decides to go to sleep, it may go to sleep without noticing that
it was aborted. Fix by checking before blocking on the condition
variable.
In general, every condition which can cause signal() should be checked
before when(). This patch doesn't fix all the cases. For example,
signal() can be called when there arrives a new topology request. This
can happen after the coordinator checked because it releases the guard
before calling when().
In the previous patch we implemented CAST operations from the COUNTER
type to various other types. We did not implement the reverse cast,
from different types to the counter type. Should we? In this patch
we add a test that shows we don't need to bother - Cassandra does not
support such casts, so it's fine that we don't too - and indeed the
test shows we don't support them.
It's not a useful operation anyway.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We were missing support in the "CAST(x AS type)" function for the counter
type. This patch adds this support, as well as extensive testing that it
works in Scylla the same as Cassandra.
We also un-xfail an existing test translated from Cassandra's unit
test. But note that this old test did not cover all the edge-cases that
the new test checks - some missing cases in the implementation were
not caught by the old test.
Fixes#14501
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Code in functions.cc creates the different TYPEasblob() and blobasTYPE()
functions for all type names TYPE. The functions for the "counter" type
were skipped, supposedly because "counters are not supported yet". But
counters are supported, so let's add the missing functions.
The code fix is trivial, the tests that verify that the result behaves
like Cassandra took more work.
After this patch, unimplemented::cause::COUNTERS is no longer used
anywhere in the code. I wanted to remove it, but noticed that
unimplemented::cause is a graveyard of unused causes, so decided not
to remove this one either. We should clean it up in a separate patch.
Fixes#14742
Also includes tests for tangently-related issues:
Refs #12607
Refs #14319
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
types.cc had eight of its functions unimplemented for the "counters"
types, throwing an "unimplemented::cause::COUNTERS" when used.
A ninth function (validate) was unimplemented for counters but did not
even throw.
Many code paths did not use any of these functions so didn't care, but
some do - e.g., the silly do-nothing "SELECT CAST(c AS counter)" when
c is already a counter column, which causes this operation to fail.
When the types.cc code encounters a counter value, it is (if I understand
it correctly) already a single uint64_t ("long_type") value, so we fall
back to the long_type implementation of all the functions. To avoid mistakes,
I simply copied the reversed_type implementation for all these functions -
whereas the reversed_type implementation falls back to using the underlying
type, the counter_type implementation always falls back to long_type.
After this patch, "SELECT CAST(c AS counter)" for a counter column works.
We'll introduce a test that verifies this (and other things) in a later
patch in this series.
The following patches will also need more of these functions to be
implemented correctly (e.g., blobascounter() fails to validate the size
of the input blob if the validate function isn't implemented for the
counter type).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The scaffolding required to have a working scylla tool app, is considerable, leading to a large amount of boilerplate code in each such app. This logic is also very similar across the two tool apps we have and would presumably be very similar in any future app. This PR extracts this logic into `tools/utils.hh` and introduces `tool_app_template`, which is similar to `seastar::app_template` in that it centralizes all the option handling and more in a single class, that each tool has to just instantiate and then call `run()` to run the app.
This cuts down on the repetition and boilerplate in our current tool apps and make prototyping new tool apps much easier.
Closes#14855
* github.com:scylladb/scylladb:
tools/utils.hh: remove unused headers
tools/utils: make get_selected_operation() and configure_tool_mode() private
tools/utils.hh: de-template get_selected_operation()
tools/scylla-types: migrate to tools_app_template
tools/scylla-types: prepare for migration to tool_app_template
tools/scylla-sstable.cc: fix indentation
tools/scylla-sstables: migrate to tool_app_template
tools/scylla-sstables: prepare for migration to tool_app_template
tools: extract tool app skeleton to utils.hh
The selector keeps selected format in system.local and uses static
db::system_keyspace::(get|set)_scylla_local_param() helpers to access
it. The helpers are turning into non-static so the selector should call
those on system_keyspace object, not class
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14871
This refreshes clang to 16.0.6 and libstdc++ to 13.1.1.
compiler-rt, libasan, and libubsan are added to install-dependencies.sh
since they are no longer pulled in as depdendencies.
Closes#13730
This reverts commit fb05fddd7d. After
1554b5cb61 ("Update seastar submodule"),
which fixed a coroutine bug in Seastar, it is no longer necessary.
Also revert the related "build: drop the warning on -O0 might fail tests"
(894039d444).
* seastar c0e618bbb...0784da876 (11):
> Revert "metrics: Remove registered_metric::operator()"
> build: use new behavior defined by CMP0127
> build: pass -DBOOST_NO_CXX98_FUNCTION_BASE to C++ compiler
> coroutine: fix a use-after-free in maybe_yield
Ref #13730.
> Merge 'sstring: add more accessors' from Kefu Chai
> Merge 'semaphore: semaphore_units: return units when reassigned' from Benny Halevy
> metrics: do not define defaulted copy assignment operator
> HTTP headers in http_response are now case insensitive
> rpc: Make server._proto a reference
> Merge 'Cleanup class metrics::registered_metrics' from Pavel Emelyanov
> core: undefine fallthrough to fix compilation error
Closes#14862
Currently, streaming and repair processes and sends data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair additionally could be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference being in how these choose the current compaction time to use:
* Load'n'stream and streaming uses the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to al repair followers. This is to ensure all repair participants work with the exact state of compactness.
Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).
Fixes: https://github.com/scylladb/scylladb/issues/3561Closes#14756
* github.com:scylladb/scylladb:
replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
repair/row_level: opt in to compacting the stream
streaming: opt-in to compacting the stream
sstables_loader: opt-in for compacting the stream
replica/table: add optional compacting to make_multishard_streaming_reader()
replica/table: add optional compacting to make_streaming_reader()
db/config: add config item for enabling compaction for streaming and repair
repair: log the error which caused the repair to fail
readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
mutation/mutation_compactor: allow user to abandon current partition
... instead of global qctx. The now used qctx->execute_cql() just calls
the query_processor::execute_internal with cache_internal::yes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14874
The schema_tables() column-mapping code runs queries over system. table,
but it needs LOCAL_ONE CL and cherry-pick on caching, so regular
system_keyspace::execute_cql() won't work here.
However, since schema_tables is somewhat part of system_keyspace, it's
natural to let the former fetch private query_processor& from the latter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The callers all have local sys_ks argument:
- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()
And a test that can get it from cql_test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will need one to pass to db::schema_tables code. The caller is paxos
code with sys_ks local variable at hand
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction_manager_basic_test checks the stats of compaction_manager to
verify that there are no ongoing or pending compactions after the triggering
the compaction and waiting for its completion. but in #14865, there are
still active compaction(s) after the compaction_manager's stats shows there
is at least one task completed.
to understand this issue better, let's use `BOOST_CHECK_EQUAL()` instead
of `BOOST_REQUIRE()`, so that the test does not error out when the check
fails, and we can have better understanding of the status when the test
fails.
Refs #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14872
It now has a single user, so it doesn't have to be a template.
For now, make the method inline, so it can stay in the header. It will
be moved to utils.cc in the next patch.