We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
The function storage_service::update_pending_ranges is
turned to update_topology_changes_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
We already use the new pending_endpoints from erm though
the get_pending_ranges virtual function, in this commit
we update all the remaining places to use the new
implementation in erm, as well as remove the old implementation
in token_metadata.
We want to switch token_metadata_test to the new
implementation of pending_endpoints and read_endpoints in erm.
To do this, it is convenient to have token_metadata and
replication_strategy as shared pointers, as it fits better with the signature
of calculate_effective_replication_map. In this commit we don't
change the logic of the tests, we just migrate them to use pointers.
In this commit we introduce functions to erm for accessing
pending_endpoints and read_endpoints similar to the
corresponding functions in token_metadata. The only
difference - we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.
The get_pending_endpoints already existed. For tablets it
remained unchanged, while for vnode we just changed
it from calling on token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.
For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.
In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
We optimise memory usage of replication_map by
storing endpoints list only once in case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with
special key default_replication_map_key.
We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them as empty sorted_tokens() is not a problem, we
store endpoints with a special key.
Function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by val,
but for_each_natural_endpoint_until reference in sufficient.
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.
It's possible that the callers of this function
are ok with single range in case of
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going to that direction.
We need to account for the new fields in the clone implementation.
The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
Refactor ~vnode_effective_replication_map, use
our new clear_gently overload for rvalue references.
Add new fields _pending_endpoints and _read_endpoints
to the call.
vnode_efficient_replication_map::clear_gently is removed as
it was not used.
We don't use the return value of erase, so
we can allow it to return anything. We'll
need this for ring_mapping, since
boost::icl::interval_map::erase(it)
returns void.
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.
At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data add create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.
The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.
The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
Without the feature, the system schema doesn't have the table, and the
read will fail with:
Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)
We should not attempt to read tablet metadata in the experimental
feature is not enabled.
Fixes#13946Closes#13947
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.
Once g.c. code moved, it allows to clean the class sstable list of static helpers a bit.
refs: #13024
refs: #13020
refs: #12707Closes#13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbace collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.
`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.
`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.
The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.
It's cumbersome and people forget about it,
so let's change the default to `user`.
There issues about expressions being printed
in a `strange` way, this fixes them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13916
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes#13933
This implicit link it pretty bad, because feature service is a low-level
one which lots of other services depend on. System keyspace is opposite
-- a high-level one that needs e.g. query processor and database to
operate. This inverse dependency is created by the feature service need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place when enabled features are
committed is when gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Despite gossiper already has system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837Closes#13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
Some state that is used to fill in 'peeers' table is still propagated
over gossiper. When moving a node into the normal state in raft
topology code use the data from the gossiper to populate peers table because
storage_service::on_change() will not do it in case the node was not in
normal state at the time it was called.
Fixes: #13911
Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870Closes#13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13908
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13853
There are several places that need to carry a pointer to a table that's shard-wide accessible -- database snapshot and truncate code and distributed loader. The database code uses `get_table_on_all_shards()` returning a vector of foreign lw-pointers, the loader code uses its own global_column_family_ptr class.
This PR generalizes both into global_table_ptr facility.
Closes#13909
* github.com:scylladb/scylladb:
replica: Use global_table_ptr in distributed loader
replica: Make global_table_ptr a class
replica: Add type alias for vector of foreign lw-pointers
replica: Put get_table_on_all_shards() to header
replica: Rewrite get_table_on_all_shards()
instead of encoding the fact that we are using generation identifier
as a hint where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. this helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13846
`ep` is std::move'ed to get_endpoint_state_for_endpoint_ptr
but it's used later for logger.warn()
Fixes#13921
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13920
The loader has very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now all users of global_table know it's a vector and reference its
elements with this_shard_id() index. Making the global_table_ptr a class
makes it possible to stop using operator[] and "index" this_shard_id()
in its -> and * operators.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use sharded<database>::invoke_on_all() instead of open-coded analogy.
Also don't access database's _column_families directly, use the
find_column_family() method instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tempdir for filesystem-based sstables is {generation}.sstable one.
There are two places that need to know the ".sstable" extention -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself with
the path being pending-delete dir for sure. So the method can be made
private and the is_pending_delete_dir() can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When non static the call can use sstable_directory::_sstable_dir path,
not the provided argument. The main benefit is that the method can later
be moved onto lister so that filesystem and ownership-table listers can
process dangling bits differently.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check if the deletion log exists in the handling helper, not outside of
it. This makes next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.
First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert ssting dir into fs::path dir and switch logger.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the loader starts it first scans the directory for sstables'
tempdirs and pending deletion logs. Put both into one call so that it
can be moved more easily later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes#13917
string_format_test was added in 1b5d5205c8,
so let's add it to CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13912
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13907