Make sure the compaction_state:s are idle before
they are destroyed. Although all tasks are stopped
in stop_ongoing_compactions, make sure there is
fiber holding the compaction_state gate.
compaction_manager::remove now needs to close the
compaction_state gate and to stop_ongoing_compactions
only if the gate is not closed yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Check if the compaction_state gate is closed
along with _state != state::enabled and return early
in this case.
At this point entering the gate is guaranteed to succeed.
So enter the gate before calling `perform_compaction`
keeping the std::optional<gate_holder> throughout
the compaction task.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Passing the gate_closed_exception to the task promise in start()
ends up with abandoned exception since no-one is waiting
for it.
Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.
Fixesscylladb/scylladb#15211
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Have a private about_source for every module
and request abort on stop() to signal all outstanding
tasks to abort (especially when they are sleeping
for the task_ttl).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather to the top-level task_manager about_source,
to provide separation between task_manager modules
so each one can be aborted and stopped independentally
of the others (in the next patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replacing `minimum_keyspace_rf` config option with 4 config options:
`{minimum,maximum}_replication_factor_{warn,fail}_threshold`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on RF when creating/altering a keyspace.
The reason to rather replace than extend `minimum_keyspace_rf` config
option is to be aligned with Cassandra, which did the same, and has the
same parameters' names.
Only min soft limit is enabled by default and it is set to 3, which means
that we'll generate a CQL warning whenever RF is set to either 1 or 2.
RF's value of 0 is always allowed and means that there will not be any
replicas on a given DC. This was agreed with PM.
Because we don't allow to change guardrails' values when scylla is
running (per PM), there're no tests provided with this PR, and dtests will be
provided separately.
Exceeding guardrails' thresholds will be tracked by metrics.
Resolves#8619
Refs #8892 (the RF part, not the replication-strategy part)
Closes#14262
Currently, we start group 0 leadership monitor fiber only during
a normal bootstrap. However, we should also do it when we restart
a node (either with or without upgrading it to Raft).
Fixes#15166Closes#15204
in a4eb3c6e0f, we passed the path of
"image" to `dbuild`, but that was wrong. we should pass its content
to this script. so in this change, it is fixed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15247
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See https://github.com/scylladb/scylladb/issues/11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.
This series adds such a configurable limit, sets it to to 0.2 of total cache memory by default,
and re-enables index caching by default.
Fixes#15118Closes#14994
* github.com:scylladb/scylladb:
test: boost/cache_algorithm_test: add cache_algorithm_test
sstables: partition_index_cache: deglobalize stats
utils: cached_file: deglobalize cached_file metrics
db: config: enable index caching by default
config: add index_cache_fraction
utils: lru: add move semantics to list links
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.
This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.
But it also makes sense even without that motive.
Move cached_file metrics from a thread_local variable
to cache_tracker.
This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.
But it also makes sense even without that motive.
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See #11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophical regressions in
small-partition workloads, but big enough to accomodate workloads where
index cache is obviously beneficial.
This patch sets such a limit to 0.2 of total cache memory, and re-enables
index caching by default.
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.
This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggerred. We will change the default in a future
patch.
Before the patch, fixing list links is done manually in the move constructor of
`evictable`. After the patch, it is done by the move constructors of the links
themselves.
This makes for slightly cleaner code, especially after we add more links in an
upcoming patch.
This series ensures that endpoint state changes (for each single endpoint) are applied to the gossiper endpoint_state_map as a whole and on all shards.
Any failure in the process will keep the existing endpoint state intact.
Note that verbs that modify the endpoint states of multiple endpoints may still succeed to modify some of them before hitting an error and those changes are committed to the endpoint_state_map, so we don't ensure atomicity when updating multiple endpoints' states.
Fixes scylladb/scylladb#14794
Fixes scylladb/scylladb#14799
Closes#15073
* github.com:scylladb/scylladb:
gossiper: move endpoint_state by value to apply it
gossiper: replicate: make exception safe
gms: pass endpoint_state_ptr to endpoint_state change subscribers
gossiper: modify endpoint state only via replicate
gossiper: keep and serve shared endpoint_state_ptr in map
gossiper: get_max_endpoint_state_version: get state by reference
api/failure_detector: get_all_endpoint_states: reduce allocations
cdc/generation: get_generation_id_for: get endpoint_state&
gossiper: add for_each_endpoint_state helpers
gossiper: add num_endpoints
gossiper: add my_endpoint_state
This PR collects followups described in #14972:
- The `system.topology` table is now flushed every time feature-related
columns are modified. This is done because of the feature check that
happens before the schema commitlog is replayed.
- The implementation now guarantees that, if all nodes support some
feature as described by the `supported_features` column, then support
for that feature will not be revoked by any node. Previously, in an
edge case where a node is the last one to add support for some feature
`X` in `supported_features` column, crashes before applying/persisting
it and then restarts without supporting `X`, it would be allowed to boot
anyway and would revoke support for the `X` in `system.topology`.
The existing behavior, although counterintuitive, was safe - the
topology coordinator is responsible for explicitly marking features as
enabled, and in order to enable a feature it needs to perform a special
kind of a global barrier (`barrier_after_feature_update`) which only
succeeds after the node has updated its features column - so there is no
risk of enabling an unsupported feature. In order to make the behavior
less confusing, the node now will perform a second check when it tries
to update its `supported_features` column in `system.topology`.
- The `barrier_after_feature_update` is removed and the regular global
`barrier` topology command is used instead. The `barrier` handler now
performs a feature check if the node did not have a chance to verify and
update its cluster features for the second time.
JOIN_NODE rpc will be sent separately as it is a big item on its own.
Fixes: #14972Closes#15168
* github.com:scylladb/scylladb:
test: topology{_experimental_raft}: don't stop gracefully in feature tests
storage_service: remove _topology_updated_with_local_metadata
topology_coordinator: remove barrier_after_feature_update
topology_coordinator: perform feature check during barrier
storage_service: repeat the feature check after read barrier
feature_service: introduce unsupported_feature_exception
feature_service: move startup feature check to a separate function
topology_coordinator: account for features to enable in should_preempt_balancing
group0_state_machine: flush system.topology when updating features columns
Add a document describing in detail how to use clang's "-ftime-trace"
option, and the ClangBuildAnalyzer tool, to find the source files,
header files and templates which slow down Scylla's build the most.
I've used this tool in the past to reduce Scylla build time - see
commits:
fa7a302130 (reduced 6.5%)
f84094320d (reduced 0.1%)
6ebf32f4d7 (reduced 1%)
d01e1a774b (reduced 4%)
I'm hoping that documenting how to use this tool will allow other
developers to suggest similar commits.
Refs #1.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#15209
to lower the programmer's cognitive load. as programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.
Refs #15187Closes#15188
* github.com:scylladb/scylladb:
sstable: add samples of fname to be matched by regex
sstables: change make_descriptor() to accept fs::path
sstables: switch entry_descriptor(sstring..) to std::string_view
sstables: change make_descriptor() to accept fs::path
the dbuild script provided by the branch being debugged might not
include the recent fixes included by current branch from which
`open-coredump.sh` is launched.
so, instead of using the dbuild script in the repo being debugged,
let's use the dbuild provided by current branch. also, wrap the
dbuild command line. for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15240
Scrub tests use a lot of temporary directories. This is suspected to
cause problems in some cases. To improve the situation, this patch:
* Creates a single root temporary directory for all scrub tests
* All further fixtures create their files/directories inside this root
dir.
* All scrub tests create their temporary directories within this root
dir.
* All temporary directories now use an appropriate "prefix", so we can
tell which temporary directory is part of the problem if a test fails.
Refs: #14309Closes#15117
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
* seastar 0784da87...6e80e84a (29):
> Revert "shared_token_bucket: Make duration->tokens conversion more solid"
> Merge 'chunked_fifo: let incremetal operator return iterator not basic_iterator' from Kefu Chai
> memory: diable transparent hugepages if --overprovisioned is specified
Ref https://github.com/scylladb/scylladb/issues/15095
> http/exception: s/<TAB>/ /
> install-dependencies.sh: re-add protobuf
> Merge 'Keep capacity on fair_queue_entry' from Pavel Emelyanov
> Merge 'Fix server-side RPC stream shutdown' from Pavel Emelyanov
Fixes https://github.com/scylladb/scylladb/issues/13100
> smp: make service management semaphore thread local
> tls_test: abort_accept() after getting server socket
> Merge 'Print more IO info with ioinfo app' from Pavel Emelyanov
> rpc: Fix client-side stream registration race
Ref https://github.com/scylladb/scylladb/issues/13100
> tests: perf: shard_token_bucket: avoid capturing unused variables in lambdas
> build: pass -DBoost_NO_CXX98_FUNCTION_BASE to C++ compiler
> reactor: Drop some dangling friend declarations
> fair_queue: Do not re-evaluate request capacity twice
> build: do not use serial number file when signing a cert
> shared_token_bucket: Make duration->tokens conversion more solid
> tests: Add perf test for shard_token_bucket
> Merge 'Make make_file_impl() less yielding' from Pavel Emelyanov
> fair_queue: Remove individual requests counting
> reactor, linux-aio: print value of aio-max-nr on error
> Merge 'build, net: disable implicit fallthough' from Kefu Chai
> shared_token_bucket: Fix duration_for() underflow
> rpc: Generalize get_stats_internal() method
> doc/building-dpdk.md: fix invalid file path of README-DPDK.md
> install-dependencies: add centos9
> Merge 'log: report scheduling group along with shard id' from Kefu Chai
> dns: handle exception in do_sendv for udp
> Merge 'Add a stall detector histogram' from Amnon Heiman
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#15218
change another overload of `make_descriptor()` to accept `fs::path`,
in the same spirit of a previous change in this area. so we have
a more consistent API for creating sstable descriptor. and this
new API is simpler to use.
Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so its callers don't need to construct a temporary `sstring` if
the parameter's type is not `sstring`. for instance, before
this change, `entry_descriptor::make_descriptor(const std::filesystem::path...)`
would have to construct two temporary instances of `sstring`
for calling this function.
after this change, it does not have to do so.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to lower the programmer's cognitive load. as programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.
Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The current cluster feature tests are stopping nodes in a graceful way.
Doing it gracefully isn't strictly necessary for the test scenarios and
we can switch `server_stop_gracefully` calls to `server_stop`. This only
became possible after a previous commit which causes `system.topology`
table to be flushed when cluster feature columns are modified, and will
server as a good test for it.
After removing barrier_after_feature_update, the flag is no longer
needed by anybody. The field in storage_service is removed and
do_update_topology_with_local_metadata is inlined.
The `barrier_after_feature_update` was introduced as a variant of the
`barrier` command, meant to be used by the topology coordinator when
enabling a feature. It was meant to give more guarantees to the topology
coordinator than the regular barrier, but the regular barrier has been
adjusted in the previous commits so that it can be used instead of the
special barrier.
This commit gets rid of `barrier_after_feature_update` and replaces its
uses with `barrier`.
Due to the possible situation where a node applies a command that
advertises support for a feature but crashes before applying it, there
is a period of time where a node might have its group 0 server running
but does not support all of the features. Currently, we solve the issue
by using a special `barrier_after_feature_update` which will not succeed
until the node makes sure to update its `supported_features` column (or,
since the previous commit, shuts down if it doesn't support all required
features).
However, we can make it work with regular barrier after adjusting it
slightly. In case the local metadata was not updated yet, it will
perform a feature check. This will make sure that the global barrier
issued by the topology coordinator before enabling features will not
succeed if the problematic situation occurs.
We would like to guarantee the following property: if all nodes have
some feature X in their `supported_features` column in
`system.topology`, then it's no longer possible for anybody to revoke
support for it. Currently, it is not guaranteed because the following
can happen:
1. A node commits a command that updates its `supported_features`,
marking feature X as supported. It is the last node to do so and now
all nodes support X.
2. Node crashes before applying the command locally.
3. Node is downgraded not to support X and restarted.
4. The feature check in `enable_features_on_startup` passes because it
happens before starting the group 0 server.
5. The `supported_features` column is updated in
`update_topology_with_local_metadata`, removing support for X.
Even though the guarantee does not hold, it's not a problem because the
`barrier_after_metadata_update` is required to succeed on all nodes
before topology coordinator moves to enable a feature, and - as the name
suggests - it requires `update_topology_with_local_metadata` to finish.
However, choosing to give this guarantee makes it simpler to reason
about how cluster features on raft work and removes some pathological
cases (e.g. trying to downgrade some other node after step 1 will fail,
but will be again possible after step 5). Therefore, this commit adds a
second check to `update_topology_with_local_metadata` which disallows
removing support for a feature that is supported by everybody - and
stops the boot process if necessary.
The logic responsible for checking supported features agains the
currently enabled features (and features that are unsafe to disable) is
moved to a separate function, `check_features`. Currently, it is only
used from `enable_features_on_startup`, but more checks against features
in raft will be added in the commits that follow.
First replicate the new endpoint_state on all shards
before applying the replicated endpoint_state objects
to _endpoint_state_map.
Fixesscylladb/scylladb#14794
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the endpoint_state isn't change in place
we do not need to copy it to each subscriber.
We can rather just pass the lw_shared_ptr holding
a snapshot of it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And restrict the accessor methods to return const pointers
or refrences.
With that, the endpoint_state_ptr:s held in the _endpoint_state_map
point to immutable endpoint_state objects - with one exception:
the endpoint_state update_timestamp may be updated in place,
but the endpoint_state_map is immutable.
replicate() replaces the endpoint_state_ptr in the map
with a new one to maintain immutability.
A later change will also make this exception safe so
replicate will guarantee strong exception safety so that all shards
are updated or none of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in-place anyhow.
While internally, gossiper still has the legacy helpers
to manage the endpoint_state.
Fixesscylladb/scylladb#14799
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reserve the result vector based on the known
number of endpoints and then move-construct each entry
rather than copying it.
Also, use refrences to traverse the application_state_map
rather than copying each of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to lookup the application_state again using the
endpoint, as both callers already have a reference to
the endpoint_state handy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before changing _endpoint_state_map to hold a
lw_shared_ptr<endpoint_state>, provide synchronous helpers
for users to traverse all endpoint_states with no need
to copy them (as long as the called func does not yield).
With that, gossiper::get_endpoint_states() can be made private.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>