before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for db::operation_type, and
remove their operator<<().
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16832
This short series prevents the creation of compaction tasks when we know in advance that they have nothing to do.
This is possible in the clean path by:
- improve the detection of candidates for cleanup by skipping sstables that require cleanup but are already being compacted
- checking that list of sstables selected for cleanup isn't empty before creating the cleanup task
For upgrade sstables, and generally when rewriting all sstable: launch the task only if the list off candidate sstables isn't empty.
For regular compaction, when triggered via `table::add_sstable_and_update_cache`, we currently trigger compaction (by calling `submit`) on all compaction groups while the sstable is added only to one of them.
Also, it is typically called for maintenance sstables that are awaiting offstrategy compaction, in which case we can skip calling `submit` entirely since the caller triggers offstrategy compaction at a later stage.
Refs scylladb/scylladb#15673
Refs scylladb/scylladb#16694Fixesscylladb/scylladb#16803Closesscylladb/scylladb#16808
* github.com:scylladb/scylladb:
table: add_sstable_and_update_cache: trigger compaction only in compaction group
compaction_manager: perform_task_on_all_files: return early when there are no sstables to compact
compaction_manager: perform_cleanup: use compaction_manager::eligible_for_compaction
There is no need to trigger compaction in all compaction
groups when an sstable is added to only one of them.
And with that level of control, if the caller passes
sstables::offstrategy::yes, we know it will
trigger offstrategy compaction later on so there
is no need to trigger compaction at all
for this sstable at this time.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sstables replay_position in stats_metadata is
valid only on the originating node and shard.
Therefore, validate the originating host and shard
before using it in compaction or table truncate.
Fixes#10080
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#16550
And call on_internal_error if process_upload_dir
is called for tablets-enabled keyspace as it isn't
supported at the moment (maybe it could be in the future
if we make sure that the sstables are confined to tablets
boundaries).
Refs #12775Fixes#16743
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#16788
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for replica::memtable and
replica::memtable_entry, and remove their operator<<().
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16793
keyspace objects are heavyweight and copies are immediately our-of-date,
so copying them is bad.
Fix by deleting the copy constructor and copy assignment operator. One
call site is fixed. This call site is safe since the it's only used
for accessing a few attributes (introduced in f70c4127c6).
Closesscylladb/scylladb#16782
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we
* define a formatter for `db::consistency_level`
* drop its `operator<<`, as it is not used anymore
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16755
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
In 7d5e22b43b ("replica: memtable: don't forget memtable
memory allocation statistics") we taught memtable_list to remember
learned memory allocation reserves so a new memtable inherits these
statistics from an older memtable. Share it now further across tablets
that belong to the same table as well. This helps the statistics be more
accurate for tablets that are migrated in, as they can share existing
tablet's memory allocation history.
Closesscylladb/scylladb#16571
* github.com:scylladb/scylladb:
table, memtable: share log-structured allocator statistics across all memtables in a table
memtable: consolidate _read_section, _allocating_section in a struct
Previously, the tablet information was sent to the drivers
in two pieces within the custom_payload. We had information
about the replicas under the `tablet_replicas` key and token range
information under `token_range`. These names were quite generic
and might have caused problems for other custom_payload users.
Additionally, dividing the information into two pieces raised
the question of what to do if one key is present while the other
is missing.
This commit changes the serialization mechanism to pack all information
under one specific name, `tablets-routing-v1`.
From: Sylwia Szunejko <sylwia.szunejko@scylladb.com>
Closesscylladb/scylladb#16148
The log-structured allocator collects allocation statistics (which it
uses to manage memory reserves) in some objects kept in
memtable_table_shared_data. Right now, this object is local to memtable_list,
which itself is local to a tablet replica. Move it to table scope so
different tablets in the shard share the statistics. This helps a
newly-migrated tablet adjust more quickly.
Those two members are passed from memtable_list to memtable. Since we
wish to pass them from table, it becomes awkward to pass them as two
separate variables as their contents are specific to memtable internals.
Wrap them in a name that indicates their role (being table-wide shared
data for memtables) and pass them as a unit.
Right now the initial_tablets is kept as replication strategy option in the legacy system_schema.keyspaces table. However, r.s. options are all considered to be replication factors, not anything else. Other than being confusing, this also makes it impossible to extend keyspace configuration with non-integer tablets-related values.
This PR moves the initial_tablets into scylla-specific part of the schema. This opens a way to more ~~ugly~~ flexible ways of configuring tablets for keyspace, in particular it should be possible to use boolean on/off switch in CREATE KEYSPACE or some other trick we find appropriate.
Mos of what this PR does is extends arguments passed around keyspace_metadata and abstract_replication_strategy. The essence of the change is in last patches
* schema_tables: Relax extract_scylla_specific_ks_info() check
* locator,schema: Move initial tablets from r.s. options to params
refs: #16319
refs: #16364Closesscylladb/scylladb#16555
* github.com:scylladb/scylladb:
test: Add sanity tests for tablets initialization and altering
locator,schema: Move initial tablets from r.s. options to params
schema_tables: Relax extract_scylla_specific_ks_info() check
locator: Keep optional initial_tablets on r.s. params
ks_prop_defs: Add initial_tablets& arg to prepare_options()
keyspace_metadata: Carry optional<initial_tablets> on board
locator: Pass abstract_replication_strategy& into validate_tablet_options()
locator: Carry r.s. params into process_tablet_options()
locator: Call create_replication_strategy() with r.s. params
locator: Wrap replication_strategy_config_options into replication_strategy_params
locator: Use local members in ..._replication_strategy constructors
Now all the callers have it at hands (spoiler: not yet initialized, but
still) so the params can also have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The object in question fully describes the keyspace to be created and,
among other things, contains replication strategy options. Next patches
move the "initial_tablets" option out of those options and keep it
separately, so the ks metadata should also carry this option separately.
This patch is _just_ extending the metadata creation API, in fact the
new field is unused (write-only) so all the places that need to provide
this data keep it disengaged and are explicitly marked with FIXME
comment. Next patches will fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patch added params to r.s. classes' constructors, but callers
don't construct those directly, instead they use the create_r.s.()
wrapper. This patch adds params to the wrapper too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When altering a keyspace several keyspace_metadata objects are created
along the way. The last one, that is then kept on the keyspace_metadata
object, forgets to get its copy of storage options thus transparently
converting to LOCAL type.
The bug surfaces itself when altering replication strategy class for
S3-backed storage -- the 2nd attempt fails, because after the 1st one
the keyspace_metadata gets LOCAL storage options and changing storage
options is not allowed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16524
Both virtual tables and schema registry contain thread_local caches that are destroyed
at thread exit. after a Seastar change[1], these destructions can happen after the reactor
is destroyed, triggering a use-after-free.
Fix by scoping the destruction so it takes place earlier.
[1] 101b245ed7Closesscylladb/scylladb#16510
* github.com:scylladb/scylladb:
schema_registry, database: flush entries when no longer in use
virtual_tables: scope virtual tables registry in system_keyspace
The schema registry disarms internal timers when it is destroyed.
This accesses the Seastar reactor. However, after [1] we don't have ordering
between the reactor destruction and the thread_local registry destruction.
Fix this by flushing all entries when the database is destroyed. The
database object is fundamental so it's unlikely we'll have anything
using the registry after it's gone.
[1] 101b245ed7
truncating is an unusual operation, and we write a logging message
when the truncate op starts with INFO level, it would be great if
we can have a matching logging messge indicating the end of truncate
on the server side. this would help with investigation the TRUNCATE
timeout spotted on the client. at least we can rule out the problem
happening we server is performing truncate.
Refs #15610
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16247
Consider this:
1) file streaming takes storage snapshot = list of sstables
2) concurrent compaction unlink some of those sstables from file system
3) file streaming tries to send unlinked sstables, but files other
than data and index cannot be read as only data and index have file
descriptors opened
To fix it, the snapshot now returns a set of files, one per sstable
component, for each sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#16476
This introduces the ability to split a storage group.
The main compaction group is split into left and right groups.
set_split() is used to set the storage group to splitting mode, which
will create left and right compaction groups. Incoming writes will
now be placed into memtable of either left or right groups.
split() is used to complete the splitting of a group. It only
returns when all preexisting data is split. That means main
compaction group will be empty and all the data will be stored
in either left or right group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Storage group is the storage of tablets. This new concept is helpful
for tablet splitting, where the storage of tablet will be split
in multiple compaction groups, where each can be compacted
independently.
The reason for not going with arena concept is that it added
complexity, and it felt much more elegant to keep compaction
group unchanged which at the end of the day abstracts the concept
of a set of sstables that can be compacted and operated
independently.
When splitting, the storage group for a tablet may therefore own
multiple compaction groups, left, right, and main, where main
keeps the data that needs splitting. When splitting completes,
only left and right compaction groups will be populated.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With off-strategy, we allow sstables to be moved into a new sstable
set even if they didn't undergo reshape compaction.
That's done by specifying a sstable is present both in input and
output, with the completion desc.
We want to do the same with other compaction types.
Think for example of split compaction: compaction manager may decide
a sstable doesn't need splitting, yet it wants that sstable to be
moved into a new sstable set.
Theoretically, we could introduce new code to do this movement,
but more code means increased maintenance burden and higher chances
of bugs. It makes sense to reuse the compaction completion path,
as we do today with off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
reader_concurrency_sempaphore are triplicated: each metrics is registered
for streaming, user, and system classes.
To fix, just move the metrics registration from database to
reader_concurrency_sempaphore, so each reader_concurrency_sempaphore
instantiated will register its metrics (if its creator asked for it).
Adjust the names given to reader_concurrency_sempaphore so we don't
change the labels.
scylla-gdb is adjusted to support the new names.
To be used in the next patch to control whether the semaphore registers
and exports metrics or not. We want to move metric registration to the
semaphore but we don't want all semaphores to export metrics. The
decision on whether a semaphore should or shouldn't export metrics
should be made on a case-by-case basis so this new parameter has no
default value (except for the for_tests constructor).
Soon, the reader_concurrency_semaphore will require a unique
and meaningful name in order to label its metrics. To prepare
for that, name sstable_manager instances. This will be used
to generate a name for sstable_manager's reader_concurrency_semaphore.
To make sure a table object is kept valid throughout the lifetime
of compaction a following patch will enter the table's
_async_gate when the compaction task starts.
This change defers awaiting the gate.close future
till after stopping ongoing compaction so that
closing the gate will prevent starting new compactions
while ongoing compaction can be stopped and finally
awaiting the close() future will wait for them to
unwind and exit the gate after being stopped.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a table is truncated or dropped it can be auto-snapshotted if the respective config option is set (by default it is). Non local storages don't implement snapshotting yet and emit on_internal_error() in that case aborting the whole process. It's better to skip snapshot with a warning instead.
Closesscylladb/scylladb#16220
* github.com:scylladb/scylladb:
database: Do not auto snapshot non-local storages' tables
database: Simplify snapshot booleans in truncate_table_on_all_shards()
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Snapshotting is not yet supported for those (see #13025) and
auto-snapshot would step on internal error. Skip it and print a warning
into logs
fixes#16078
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three of them in this function -- with_snapshot argument,
auto_snapshot local copy of db::config option and the should_snapshot
local variable that's && of the above two. The code can go with just one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The following error was seen:
[shard 0] table - compaction_group_for_token: compaction_group idx=0 range=(minimum
token,-6917529027641081857] does not contain token=minimum token
Since minimum_token or maximum_token will not be inside a token range. Skip
the in token range check.
Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors.
Refs: https://github.com/scylladb/scylladb/issues/16255Closesscylladb/scylladb#16289
* github.com:scylladb/scylladb:
Update unified/build_unified.sh
Update main.cc
Update dist/common/scripts/scylla-housekeeping
Typos: fix typos in code
utils::fb_utilities is a global in-memory registry for storing and retrieving broadcast_address and broadcat_rpc_address.
As part of the effort to get rid of all global state, this series gets rid of fb_utilities.
This will eventually allow e.g. cql_test_env to instantiate multiple scylla server nodes, each serving on its own address.
Closesscylladb/scylladb#16250
* github.com:scylladb/scylladb:
treewide: get rid of now unused fb_utilities
tracing: use locator::topology rather than fb_utilities
streaming: use locator::topology rather than fb_utilities
raft: use locator::topology/messaging rather than fb_utilities
storage_service: use locator::topology rather than fb_utilities
storage_proxy: use locator::topology rather than fb_utilities
service_level_controller: use locator::topology rather than fb_utilities
misc_services: use locator::topology rather than fb_utilities
migration_manager: use messaging rather than fb_utilities
forward_service: use messaging rather than fb_utilities
messaging_service: accept broadcast_addr in config rather than via fb_utilities
messaging_service: move listen_address and port getters inline
test: manual: modernize message test
table: use gossiper rather than fb_utilities
repair: use locator::topology rather than fb_utilities
dht/range_streamer: use locator::topology rather than fb_utilities
db/view: use locator::topology rather than fb_utilities
database: use locator::topology rather than fb_utilities
db/system_keyspace: use topology via db rather than fb_utilities
db/system_keyspace: save_local_info: get broadcast addresses from caller
db/hints/manager: use locator::topology rather than fb_utilities
db/consistency_level: use locator::topology rather than fb_utilities
api: use locator::topology rather than fb_utilities
alternator: ttl: use locator::topology rather than fb_utilities
gossiper: use locator::topology rather than fb_utilities
gossiper: add get_this_endpoint_state_ptr
test: lib: cql_test_env: pass broadcast_address in cql_test_config
init: get_seeds_from_db_config: accept broadcast_address
locator: replication strategies: use locator::topology rather than fb_utilities
locator: topology: add helpers to retrieve this host_id and address
snitch: pass broadcast_address in snitch_config
snitch: add optional get_broadcast_address method
locator: ec2_multi_region_snitch: keep local public address as member
ec2_multi_region_snitch: reindent load_config
ec2_multi_region_snitch: coroutinize load_config
ec2_snitch: reindent load_config
ec2_snitch: coroutinize load_config
thrift: thrift_validation: use std::numeric_limits rather than fb_utilities
Since abort callbacks are fired synchronously, we must change the
table's erm before we do that so that the callbacks obtain the new
erm.
Otherwise, we will block barriers.
When collected sstables are deleted each is passed into
sstables_manager.delete_atomically(). For on-disk sstables this creates
a deletion log for each removed stable, which is quite an overkill. The
atomic deletion callback already accepts vector of shared sstables, so
it's simpler (and a bit faster) to remove them all in a batch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>