before this change, we relied on the fmt::formatter that fmt generated
by default from operator<<, but fmt v10 dropped this default-generated
formatter.
in this change, we
* define a formatter for `db::consistency_level`
* drop its `operator<<`, as it is not used anymore
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16755
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
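The ownership change can be sketched like this. The types here are stand-ins: Scylla's real `schema_ptr` is a reference-counted pointer to `const schema`, and `reader_permit` has a much richer interface.

```cpp
#include <memory>

// Stand-ins for illustration only.
struct schema {
    int version;
};
using schema_ptr = std::shared_ptr<const schema>;

// Holding schema_ptr (not a raw const schema*) means the permit keeps
// the schema alive: it cannot be destroyed or swapped out from under
// the permit while the permit exists.
class reader_permit {
    schema_ptr _schema;
public:
    explicit reader_permit(schema_ptr s) : _schema(std::move(s)) {}
    const schema& get_schema() const { return *_schema; }
};
```

With a raw pointer, dropping the last external reference to the schema would leave the permit dangling; with the shared pointer, the permit's copy pins the object.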
Fixes #16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#16658
In 7d5e22b43b ("replica: memtable: don't forget memtable
memory allocation statistics") we taught memtable_list to remember
learned memory allocation reserves so a new memtable inherits these
statistics from an older memtable. Now share it further across tablets
that belong to the same table as well. This helps make the statistics
more accurate for tablets that are migrated in, as they can share an
existing tablet's memory allocation history.
Closes scylladb/scylladb#16571
* github.com:scylladb/scylladb:
table, memtable: share log-structured allocator statistics across all memtables in a table
memtable: consolidate _read_section, _allocating_section in a struct
Previously, the tablet information was sent to the drivers
in two pieces within the custom_payload. We had information
about the replicas under the `tablet_replicas` key and token range
information under `token_range`. These names were quite generic
and might have caused problems for other custom_payload users.
Additionally, dividing the information into two pieces raised
the question of what to do if one key is present while the other
is missing.
This commit changes the serialization mechanism to pack all information
under one specific name, `tablets-routing-v1`.
From: Sylwia Szunejko <sylwia.szunejko@scylladb.com>
Closes scylladb/scylladb#16148
The log-structured allocator collects allocation statistics (which it
uses to manage memory reserves) in some objects kept in
memtable_table_shared_data. Right now, this object is local to memtable_list,
which itself is local to a tablet replica. Move it to table scope so
different tablets in the shard share the statistics. This helps a
newly-migrated tablet adjust more quickly.
Those two members are passed from memtable_list to memtable. Since we
wish to pass them from table, it becomes awkward to pass them as two
separate variables, as their contents are specific to memtable internals.
Wrap them in a struct whose name indicates their role (table-wide shared
data for memtables) and pass them as a unit.
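The wrapper's rough shape could look like the following. The member types are stand-ins for `logalloc::allocating_section`; only the grouping idea mirrors the commit.

```cpp
// Stand-in for logalloc::allocating_section, which carries learned
// allocator reserve statistics (the type is an assumption for illustration).
struct allocating_section {
    long learned_reserve = 0;
};

// The two members previously passed separately from memtable_list to
// memtable, grouped so the table can own them and pass them as one unit
// shared by all of its memtables.
struct memtable_table_shared_data {
    allocating_section _read_section;
    allocating_section _allocating_section;
};
```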
Right now initial_tablets is kept as a replication strategy option in the legacy system_schema.keyspaces table. However, r.s. options are all considered to be replication factors, not anything else. Other than being confusing, this also makes it impossible to extend keyspace configuration with non-integer tablets-related values.
This PR moves initial_tablets into the scylla-specific part of the schema. This opens the way to more ~~ugly~~ flexible ways of configuring tablets for a keyspace; in particular, it should be possible to use a boolean on/off switch in CREATE KEYSPACE or some other trick we find appropriate.
Most of what this PR does is extend the arguments passed around keyspace_metadata and abstract_replication_strategy. The essence of the change is in the last patches:
* schema_tables: Relax extract_scylla_specific_ks_info() check
* locator,schema: Move initial tablets from r.s. options to params
refs: #16319
refs: #16364
Closes scylladb/scylladb#16555
* github.com:scylladb/scylladb:
test: Add sanity tests for tablets initialization and altering
locator,schema: Move initial tablets from r.s. options to params
schema_tables: Relax extract_scylla_specific_ks_info() check
locator: Keep optional initial_tablets on r.s. params
ks_prop_defs: Add initial_tablets& arg to prepare_options()
keyspace_metadata: Carry optional<initial_tablets> on board
locator: Pass abstract_replication_strategy& into validate_tablet_options()
locator: Carry r.s. params into process_tablet_options()
locator: Call create_replication_strategy() with r.s. params
locator: Wrap replication_strategy_config_options into replication_strategy_params
locator: Use local members in ..._replication_strategy constructors
Now all the callers have it at hand (spoiler: not yet initialized, but
still), so the params can also have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The object in question fully describes the keyspace to be created and,
among other things, contains replication strategy options. Next patches
move the "initial_tablets" option out of those options and keep it
separately, so the ks metadata should also carry this option separately.
This patch is _just_ extending the metadata creation API, in fact the
new field is unused (write-only) so all the places that need to provide
this data keep it disengaged and are explicitly marked with FIXME
comment. Next patches will fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patch added params to r.s. classes' constructors, but callers
don't construct those directly, instead they use the create_r.s.()
wrapper. This patch adds params to the wrapper too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When altering a keyspace, several keyspace_metadata objects are created
along the way. The last one, which is then kept on the keyspace_metadata
object, forgets to take its copy of the storage options, thus
transparently converting to the LOCAL type.
The bug surfaces when altering the replication strategy class for
S3-backed storage -- the 2nd attempt fails, because after the 1st one
the keyspace_metadata gets LOCAL storage options, and changing storage
options is not allowed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#16524
Both virtual tables and the schema registry contain thread_local caches
that are destroyed at thread exit. After a Seastar change [1], these
destructions can happen after the reactor is destroyed, triggering a
use-after-free.
Fix by scoping the destruction so it takes place earlier.
[1] 101b245ed7
Closes scylladb/scylladb#16510
* github.com:scylladb/scylladb:
schema_registry, database: flush entries when no longer in use
virtual_tables: scope virtual tables registry in system_keyspace
The schema registry disarms internal timers when it is destroyed.
This accesses the Seastar reactor. However, after [1] we don't have ordering
between the reactor destruction and the thread_local registry destruction.
Fix this by flushing all entries when the database is destroyed. The
database object is fundamental so it's unlikely we'll have anything
using the registry after it's gone.
[1] 101b245ed7
truncating is an unusual operation, and we write a log message at INFO
level when the truncate op starts. it would be great if we had a matching
log message indicating the end of truncate on the server side. this would
help with investigating TRUNCATE timeouts spotted on the client: at least
we could rule out problems happening while the server is performing the
truncate.
Refs #15610
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16247
Consider this:
1) file streaming takes storage snapshot = list of sstables
2) concurrent compaction unlinks some of those sstables from the file system
3) file streaming tries to send the unlinked sstables, but components
other than data and index cannot be read, as only the data and index
files have open file descriptors
To fix it, the snapshot now returns a set of files, one per sstable
component, for each sstable.
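The fix leans on a POSIX guarantee worth spelling out: file data stays reachable through any open descriptor even after the last directory entry is unlinked. Holding one descriptor per sstable component therefore makes the snapshot immune to concurrent compaction deleting the files. A small demonstration (the path and contents are illustrative):

```cpp
#include <cstring>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Reads a file through a descriptor that was opened before the file was
// unlinked, mimicking a streaming snapshot racing with compaction.
std::string read_component_after_unlink(const char* path) {
    int wfd = ::open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    ::write(wfd, "data", 4);
    ::close(wfd);

    int rfd = ::open(path, O_RDONLY); // snapshot captures the descriptor...
    ::unlink(path);                   // ...then compaction removes the file
    char buf[5] = {};
    ::read(rfd, buf, 4);              // the data is still readable via rfd
    ::close(rfd);
    return buf;
}
```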
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#16476
This introduces the ability to split a storage group.
The main compaction group is split into left and right groups.
set_split() is used to put the storage group into splitting mode, which
creates the left and right compaction groups. Incoming writes are
then placed into the memtable of either the left or right group.
split() is used to complete the splitting of a group. It only
returns when all preexisting data is split. That means the main
compaction group will be empty and all the data will be stored
in either the left or right group.
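The described flow can be modeled with a toy: tokens are plain ints, and a compaction group is just a vector of tokens. The names mirror the description; the real classes are far richer.

```cpp
#include <vector>

struct compaction_group {
    std::vector<int> tokens;
};

class storage_group {
    compaction_group _main, _left, _right;
    bool _splitting = false;
    int _split_point = 0;
public:
    // Incoming writes go to main, or to left/right once splitting started.
    void write(int token) {
        if (!_splitting) {
            _main.tokens.push_back(token);
        } else {
            (token < _split_point ? _left : _right).tokens.push_back(token);
        }
    }
    // set_split(): enter splitting mode; new writes go to left/right.
    void set_split(int split_point) {
        _splitting = true;
        _split_point = split_point;
    }
    // split(): move preexisting data out of main; main ends up empty.
    void split() {
        for (int t : _main.tokens) {
            (t < _split_point ? _left : _right).tokens.push_back(t);
        }
        _main.tokens.clear();
    }
    const compaction_group& main() const { return _main; }
    const compaction_group& left() const { return _left; }
    const compaction_group& right() const { return _right; }
};
```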
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A storage group is the storage of a tablet. This new concept is helpful
for tablet splitting, where the storage of a tablet will be split
into multiple compaction groups, each of which can be compacted
independently.
The reason for not going with the arena concept is that it added
complexity, and it felt much more elegant to keep compaction
group unchanged, since at the end of the day it abstracts the concept
of a set of sstables that can be compacted and operated on
independently.
When splitting, the storage group for a tablet may therefore own
multiple compaction groups, left, right, and main, where main
keeps the data that needs splitting. When splitting completes,
only left and right compaction groups will be populated.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With off-strategy, we allow sstables to be moved into a new sstable
set even if they didn't undergo reshape compaction.
That's done by specifying, in the completion descriptor, that an sstable
is present in both the input and the output.
We want to do the same with other compaction types.
Think, for example, of split compaction: the compaction manager may
decide an sstable doesn't need splitting, yet it wants that sstable to
be moved into a new sstable set.
Theoretically, we could introduce new code to do this movement,
but more code means increased maintenance burden and higher chances
of bugs. It makes sense to reuse the compaction completion path,
as we do today with off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
reader_concurrency_semaphore metrics are triplicated: each metric is
registered for the streaming, user, and system classes.
To fix this, just move the metrics registration from database to
reader_concurrency_semaphore, so each reader_concurrency_semaphore
instantiated registers its own metrics (if its creator asked for it).
Adjust the names given to each reader_concurrency_semaphore so we don't
change the labels.
scylla-gdb is adjusted to support the new names.
To be used in the next patch to control whether the semaphore registers
and exports metrics or not. We want to move metric registration to the
semaphore but we don't want all semaphores to export metrics. The
decision on whether a semaphore should or shouldn't export metrics
should be made on a case-by-case basis so this new parameter has no
default value (except for the for_tests constructor).
Soon, the reader_concurrency_semaphore will require a unique
and meaningful name in order to label its metrics. To prepare
for that, name sstable_manager instances. This will be used
to generate a name for sstable_manager's reader_concurrency_semaphore.
To make sure a table object is kept valid throughout the lifetime
of compaction, a following patch will enter the table's
_async_gate when the compaction task starts.
This change defers awaiting the gate.close() future until after
stopping ongoing compactions: closing the gate prevents new
compactions from starting while ongoing compactions are stopped,
and finally awaiting the close() future waits for them to unwind
and exit the gate after being stopped.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a table is truncated or dropped, it can be auto-snapshotted if the respective config option is set (by default it is). Non-local storages don't implement snapshotting yet and call on_internal_error() in that case, aborting the whole process. It's better to skip the snapshot with a warning instead.
Closes scylladb/scylladb#16220
* github.com:scylladb/scylladb:
database: Do not auto snapshot non-local storages' tables
database: Simplify snapshot booleans in truncate_table_on_all_shards()
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing are implemented using global "sessions", created for the streaming of a single tablet. A session is globally identified by a UUID. The identifier is assigned by the topology change coordinator and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier blocks only if there is some session with work that was left behind by an unsuccessful streaming, in which case it should not be blocked for long, because the streaming process checks frequently whether the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
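A synchronous toy of the session mechanism may help: work is admitted only while its session is open, and a barrier considers a closed session drained once no admitted work remains. The real implementation is asynchronous, the ids are UUIDs stored in system.tablets, and the names below are illustrative.

```cpp
#include <map>
#include <set>
#include <string>

using session_id = std::string;

class session_registry {
    std::set<session_id> _open;
    std::map<session_id, int> _active; // admitted operations per session
public:
    void open(const session_id& id) { _open.insert(id); }
    void close(const session_id& id) { _open.erase(id); } // fences new work

    // Streaming work is admitted only while the session is open.
    bool enter(const session_id& id) {
        if (!_open.count(id)) {
            return false;
        }
        ++_active[id];
        return true;
    }
    void leave(const session_id& id) { --_active[id]; }

    // A barrier proceeds once a closed session has no work left behind.
    bool drained(const session_id& id) const {
        auto it = _active.find(id);
        return !_open.count(id) && (it == _active.end() || it->second == 0);
    }
};
```

Closing the session fences stragglers (their `enter` fails), while `drained` gives the barrier its wait condition for work admitted before the close.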
Closes scylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Snapshotting is not yet supported for those (see #13025), and
auto-snapshot would step on an internal error. Skip it and print a
warning to the logs instead.
Fixes #16078
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three of them in this function -- the with_snapshot argument,
the auto_snapshot local copy of the db::config option, and the
should_snapshot local variable that's the && of the above two. The code
can go with just one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The following error was seen:
[shard 0] table - compaction_group_for_token: compaction_group idx=0 range=(minimum
token,-6917529027641081857] does not contain token=minimum token
Since minimum_token and maximum_token will never be inside a token
range, skip the token range containment check.
Fixes some more typos as found by a codespell run on the code. In this commit, more user-visible errors are fixed.
Refs: https://github.com/scylladb/scylladb/issues/16255
Closes scylladb/scylladb#16289
* github.com:scylladb/scylladb:
Update unified/build_unified.sh
Update main.cc
Update dist/common/scripts/scylla-housekeeping
Typos: fix typos in code
utils::fb_utilities is a global in-memory registry for storing and retrieving broadcast_address and broadcast_rpc_address.
As part of the effort to get rid of all global state, this series gets rid of fb_utilities.
This will eventually allow e.g. cql_test_env to instantiate multiple scylla server nodes, each serving on its own address.
Closes scylladb/scylladb#16250
* github.com:scylladb/scylladb:
treewide: get rid of now unused fb_utilities
tracing: use locator::topology rather than fb_utilities
streaming: use locator::topology rather than fb_utilities
raft: use locator::topology/messaging rather than fb_utilities
storage_service: use locator::topology rather than fb_utilities
storage_proxy: use locator::topology rather than fb_utilities
service_level_controller: use locator::topology rather than fb_utilities
misc_services: use locator::topology rather than fb_utilities
migration_manager: use messaging rather than fb_utilities
forward_service: use messaging rather than fb_utilities
messaging_service: accept broadcast_addr in config rather than via fb_utilities
messaging_service: move listen_address and port getters inline
test: manual: modernize message test
table: use gossiper rather than fb_utilities
repair: use locator::topology rather than fb_utilities
dht/range_streamer: use locator::topology rather than fb_utilities
db/view: use locator::topology rather than fb_utilities
database: use locator::topology rather than fb_utilities
db/system_keyspace: use topology via db rather than fb_utilities
db/system_keyspace: save_local_info: get broadcast addresses from caller
db/hints/manager: use locator::topology rather than fb_utilities
db/consistency_level: use locator::topology rather than fb_utilities
api: use locator::topology rather than fb_utilities
alternator: ttl: use locator::topology rather than fb_utilities
gossiper: use locator::topology rather than fb_utilities
gossiper: add get_this_endpoint_state_ptr
test: lib: cql_test_env: pass broadcast_address in cql_test_config
init: get_seeds_from_db_config: accept broadcast_address
locator: replication strategies: use locator::topology rather than fb_utilities
locator: topology: add helpers to retrieve this host_id and address
snitch: pass broadcast_address in snitch_config
snitch: add optional get_broadcast_address method
locator: ec2_multi_region_snitch: keep local public address as member
ec2_multi_region_snitch: reindent load_config
ec2_multi_region_snitch: coroutinize load_config
ec2_snitch: reindent load_config
ec2_snitch: coroutinize load_config
thrift: thrift_validation: use std::numeric_limits rather than fb_utilities
Since abort callbacks are fired synchronously, we must change the
table's erm before firing them, so that the callbacks obtain the new
erm.
Otherwise, we will block barriers.
When collected sstables are deleted, each is passed into
sstables_manager::delete_atomically(). For on-disk sstables this creates
a deletion log for each removed sstable, which is quite an overkill. The
atomic deletion callback already accepts a vector of shared sstables, so
it's simpler (and a bit faster) to remove them all in one batch.
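The effect of batching can be illustrated with a toy manager that counts deletion logs. The class and method names follow the description; the signatures are illustrative, not Scylla's actual API.

```cpp
#include <string>
#include <vector>

struct sstable {
    std::string name;
};

class sstables_manager {
    int _deletion_logs = 0;
public:
    // One call writes one deletion log for the whole batch, so passing
    // the collected vector at once replaces N logs with one.
    void delete_atomically(const std::vector<sstable>& ssts) {
        if (!ssts.empty()) {
            ++_deletion_logs;
        }
    }
    int deletion_logs() const { return _deletion_logs; }
};
```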
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By "formatting" fix I mean -- remove the temporary on-stack references
that were left in for ease of patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lambda in question was the struct pruner's method and was left there
for ease of patching. Now that this lambda is only called once,
inside the function it is declared in, it can be open-coded into the
place where it's called.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This allocation remained from the pre-coroutine times of the method. Now
the contents of the pruner -- a reference to the table, a vector, and a
replay_position -- can reside on the coroutine frame.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Using consistent cluster management and not using schema commitlog
ends with a bad configuration throw during bootstrap. Soon, we
will make consistent cluster management mandatory. This forces us
to also make schema commitlog mandatory, which we do in this patch.
A booting node decides to use schema commitlog if at least one of
the two statements below is true:
- the node has `force_schema_commitlog=true` config,
- the node knows that the cluster supports the `SCHEMA_COMMITLOG`
cluster feature.
The `SCHEMA_COMMITLOG` cluster feature has been added in version
5.1. This patch is supposed to be a part of version 6.0. We don't
support a direct upgrade from 5.1 to 6.0 because it skips two
versions - 5.2 and 5.4. So, in a supported upgrade we can assume
that the version which we upgrade from has schema commitlog. This
means that we don't need to check the `SCHEMA_COMMITLOG` feature
during an upgrade.
The reasoning above also applies to Scylla Enterprise. Version
2024.2 will be based on 6.0. Probably, we will only support
an upgrade to 2024.2 from 2024.1, which is based on 5.4. But even
if we support an upgrade from 2023.x, this patch won't break
anything because 2023.1 is based on 5.2, which has schema
commitlog. Upgrades from 2022.x definitely won't be supported.
When we populate a new cluster, we can use the
`force_schema_commitlog=true` config to use schema commitlog
unconditionally. Then, the cluster feature check is irrelevant.
This check could fail because we initiate schema commitlog before
we learn about the features. The `force_schema_commitlog=true`
config is especially useful when we want to use consistent cluster
management. Failing feature checks would lead to crashes during
initial bootstraps. Moreover, there is no point in creating a new
cluster with `consistent_cluster_management=true` and
`force_schema_commitlog=false`. It would just cause some initial
bootstraps to fail, and after successful restarts, the result would
be the same as if we used `force_schema_commitlog=true` from the
start.
In conclusion, we can unconditionally use schema commitlog without
any checks in 6.0 because we can always safely upgrade a cluster
and start a new cluster.
Apart from making schema commitlog mandatory, this patch adds two
changes that are its consequences:
- making the unneeded `force_schema_commitlog` config unused,
- deprecating the `SCHEMA_COMMITLOG` feature, which is always
assumed to be true.
Closes scylladb/scylladb#16254
In the view update code, the function get_view_natural_endpoint()
determines which view replica this base replica should send an update
to. It currently gets the *view* table's replication map (i.e., the map
from view tokens to lists of replicas holding the token), but assumes
that this is also the *base* table's replication map.
This assumption was true with vnodes, but is no longer true with
tablets - the base table's replication map can be completely different
from the view table's. By looking at the wrong mapping,
get_view_natural_endpoint() can believe that this node isn't really
a base-replica and drop the view update. Alternatively, it can think
it is a base replica - but use the wrong base-view pairing and create
base-view inconsistencies.
This patch solves this bug - get_view_natural_endpoint() now gets two
separate replication maps - the base's and the view's. The callers
need to remember what the base table was (in some cases they didn't
care at the point of the call), and pass it to the function call.
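The fixed lookup can be sketched as a pairing over two independent replica lists. This is a simplification: the function name matches the commit, but the by-index pairing, the string endpoints, and the signature are illustrative stand-ins for the real token-based logic.

```cpp
#include <optional>
#include <string>
#include <vector>

// With tablets, the base and view tables can have entirely different
// replica sets, so the base replica list must be supplied separately
// instead of being derived from the view's replication map.
std::optional<std::string> view_natural_endpoint(
        const std::vector<std::string>& base_replicas,
        const std::vector<std::string>& view_replicas,
        const std::string& this_node) {
    for (size_t i = 0; i < base_replicas.size() && i < view_replicas.size(); ++i) {
        if (base_replicas[i] == this_node) {
            return view_replicas[i]; // this base replica pairs with view replica i
        }
    }
    return std::nullopt; // not a base replica: no view update to send
}
```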
This patch also includes a simple test that reproduces the bug, and
confirms it is fixed: The test has a 6-node cluster using tablets
and a base table with RF=1, and writes one row to it. Before this
patch, the code usually gets confused, thinking the base replica
isn't a replica and loses the view update. With this patch, the
view update works.
Fixes #16227.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#16228
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>