dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
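A minimal self-contained sketch of the intended pattern, with stand-in types rather than the real Scylla classes (the exact `get_sharder()`/`shard_of()` signatures are assumptions):

```cpp
#include <cstdint>

// Stand-ins for the real classes. The point: code that must work with both
// vnode- and tablet-based tables asks the effective replication map for the
// sharder instead of calling the static dht::shard_of().
struct token { std::uint64_t raw; };

struct sharder {                      // stand-in for dht::sharder
    virtual unsigned shard_of(token t) const = 0;
    virtual ~sharder() = default;
};

struct effective_replication_map {    // stand-in for the erm
    const sharder* table_sharder;     // vnode or tablet sharder, per table
    const sharder& get_sharder() const { return *table_sharder; }
};

unsigned shard_for(const effective_replication_map& erm, token t) {
    return erm.get_sharder().shard_of(t); // tablet-aware resolution
}
```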
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is to avoid its
complexity, and to avoid adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
This is not strictly necessary, as the multishard reader will later
be avoided altogether for tablet-based tables, but it is a step
towards converting all code to use erm->get_sharder() instead of
schema::get_sharder().
This will make it easier to access table properties in places which
only have a schema_ptr. This is particularly useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.
Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.
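A self-contained sketch of the guard idea (stand-in code; uses_tablets() and the member names are assumptions, not the real API):

```cpp
#include <cassert>

// The legacy static-sharder accessor refuses to run for tablet-based
// tables, catching invalid callers early.
struct static_sharder {};

struct schema_like {
    bool tablets = false;
    static_sharder sharder;

    bool uses_tablets() const { return tablets; }

    const static_sharder& get_sharder() const {
        // Guard: only vnode-based tables have a valid static sharder.
        assert(!uses_tablets() && "get_sharder() called on a tablet-based table");
        return sharder;
    }
};
```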
It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
`system_keyspace_make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directories for `system` and
`system_schema`).
Extract this part into `database::create_local_system_table`.
Make `database::add_column_family` private.
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what a sched group is can be relaxed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14266
After recent changes, all wasm-related logic has been moved from
the database class to the query_processor. As a result, the wasm
headers no longer need to be included there, and in particular,
files that include replica/database.hh no longer need to wait
on the generated header rust/wasmtime_bindings.hh to compile.
Fixes #14224
Closes #14223
At that level no io_priority_class-es exist. Instead, all the IO happens
in the context of the current sched-group. The file API no longer accepts
a prio class argument (and makes the io_intent arg mandatory for impls).
So the change consists of:
- removing all usage of io_priority_class
- patching file_impl's inheritors to the updated API
- the priority manager goes away altogether
- the IO bandwidth update is performed on the respective sched group
- tuning up the scylla-gdb.py io_queues command
The first change is huge and was made semi-automatically by:
- grep for io_priority_class | default_priority_class
- remove all the found calls, method args and class fields
Patching the file_impl-s is smaller, but also mechanical (see the
sketch after this list):
- replace the io_priority_class& argument with an io_intent* one
- pass the intent to the lower file (if applicable)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
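A self-contained sketch of the mechanical file_impl change (stand-in types, not the real seastar classes; the overload shown is illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for the seastar types; only the shape of the change matters.
struct io_intent {};

struct file_impl {
    // After the change: no io_priority_class parameter, and the io_intent*
    // argument is a mandatory part of the signature (it may be nullptr).
    virtual std::size_t read_dma(std::uint64_t pos, void* buf,
                                 std::size_t len, io_intent* intent) = 0;
    virtual ~file_impl() = default;
};

// An inheritor wrapping a lower file forwards the intent where applicable.
struct wrapping_file_impl : file_impl {
    file_impl* lower = nullptr;
    std::size_t read_dma(std::uint64_t pos, void* buf,
                         std::size_t len, io_intent* intent) override {
        return lower->read_dma(pos, buf, len, intent);
    }
};
```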
The scylla-gdb.py update is a bit hairy -- it needs to use the task
queues list for the IO class names and shares, but to detect whether it
should, it checks that the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13963
Similarly to class table, the keyspace class also needs to create a
directory for itself for some reason. It looks excessive, as table
creation would call recursive_touch_directory() and would create the ks
directory too, but this call is there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's no longer used outside of make_column_family_config(). So as not
to encourage people to use it -- drop it and open-code it into that
single caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a table is DROP-ed, the directory with all its sstables is removed
(unless it contains snapshots). Wrap this into a table.destroy_storage()
method; later it will need to become sstable::storage-specific.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's no need to copy the datadirs vector just to call
parallel_for_each on it. datadirs[0] is in fact the datadir field.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method initializes storage for a table, so it naturally belongs to
that class. Rename it while moving it. Also, there's no longer a need
to carry the table name and uuid as arguments; being a table method, it
can just get the paths to work on from the config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The metrics that are being deregistered (in this PR) caused Scylla to
crash when a table was dropped, its corresponding in-memory table
object was not yet deallocated, and a new table with the same name was
created. This caused a double-metrics-registration exception to be
thrown. To avoid it, we now deregister a table's metrics as soon as the
table is marked to be disposed of by the database. The table's
in-memory representation can still live on, but it shouldn't prevent
another table with the same name from being created.
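A rough sketch of the mechanism, assuming a seastar metric_groups member and a hypothetical mark_disposed() hook (names are assumptions):

```cpp
#include <seastar/core/metrics_registration.hh>

// Sketch: the table drops its metric registrations as soon as it is
// marked for disposal, instead of waiting for the object's destruction.
class table_like {
    seastar::metrics::metric_groups _metrics;
public:
    void mark_disposed() {
        // clear() unregisters the metrics, so a new table with the same
        // name can register its own while this object is still alive.
        _metrics.clear();
    }
};
```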
Fixes #13548
Closes #13971
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone functions in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalents in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get the database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from the storage proxy dependency, but this is not pursued in this PR; it is instead left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which became a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870
Closes #13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
The loader has a very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now all users of global_table know it's a vector and reference
its elements with a this_shard_id() index. Making global_table_ptr a
class makes it possible to stop using operator[] and instead hide the
this_shard_id() indexing in its -> and * operators.
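Something along these lines (a simplified self-contained sketch; this_shard_id() stands in for seastar::this_shard_id(), and the real class holds foreign pointers rather than raw ones):

```cpp
#include <utility>
#include <vector>

extern unsigned this_shard_id(); // provided by seastar in the real code

// The per-shard vector stays inside the class, and operator->/operator*
// index it with the current shard id, so callers no longer write
// ptr[this_shard_id()].
template <typename T>
class global_ptr {
    std::vector<T*> _per_shard; // one element per shard
public:
    explicit global_ptr(std::vector<T*> v) : _per_shard(std::move(v)) {}
    T* operator->() const { return _per_shard[this_shard_id()]; }
    T& operator*() const { return *_per_shard[this_shard_id()]; }
};
```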
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use sharded<database>::invoke_on_all() instead of an open-coded
analogue. Also don't access database's _column_families directly; use
the find_column_family() method instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The methods that take storage_proxy as an argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
Containing utility methods to query data from the local replica.
Intended to be used to read system tables, completely bypassing storage
proxy in the process.
This duplicates some code already found in storage proxy, but that is a
small price to pay for being able to break some circular dependencies
involving storage proxy that have been plaguing us since time
immemorial.
One thing we lose with this is the smp service level used in storage
proxy. If this becomes a problem, we can create one in database and use
it in these methods too.
Another thing we lose is incrementing the `replica_cross_shard_ops`
storage proxy stat. I think this is not a problem at all, as these new
functions are meant to be used by internal users; not counting them
will reduce the internal noise in this metric, which is meant to
indicate users not using shard-aware clients.
Right now the map<endpoint, config> sits on the sstables manager and
its update is governed by the database (because it's peering and can
kick other shards to update it as well).
Having the sharded<storage_manager> at hand lets us free the database
from the need to update configs and keeps sstables_manager a bit
smaller.
Also, this will allow keeping the s3 clients shared between sstables
via this map in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager in question keeps track of whatever sstables_manager needs
to work with the storage (spoiler: only the S3 one). It's a main-local
sharded peering service, so that the container() call can be used by
the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date.
Closes #13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
The user sstables manager will need to provide the endpoint config for
sstables' storage drivers. For that it needs to get it from db::config
and keep it in sync with its updates.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is the intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. work (not fail) in the presence of keyspaces with per-table erms;
they will just not be replicated using those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where
get_non_local_strategy_keyspaces() is currently used, but which should
work only with keyspaces that use a vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which the
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per-keyspace. This way the client code can be uniform and
doesn't have to check whether the replication strategy is per-table.
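A tiny sketch of the uniformity argument, with stand-in types (the accessor name is an assumption):

```cpp
#include <memory>

// Every table exposes an erm, whether the underlying strategy is
// per-keyspace (vnodes) or per-table (tablets), so client code never
// branches on the strategy kind.
struct effective_replication_map {};

struct table_like {
    // For vnode keyspaces this points at the shared per-keyspace erm;
    // for tablet tables it is the table's own erm.
    std::shared_ptr<effective_replication_map> erm;
    const effective_replication_map& get_effective_replication_map() const {
        return *erm;
    }
};
```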
Not all users of the per-keyspace get_effective_replication_map() are
adapted to work per-table yet. Those algorithms will throw an
exception when invoked on a keyspace which uses a per-table replication
strategy.
The only reason why it's there (right next to compaction_fwd.hh) is
that the database::table_truncate_state subclass needs the definition
of the compaction_manager::compaction_reenabler subclass.
However, the former is not used outside of database.cc and can be
defined in the .cc file. Keeping it out of the header allows dropping
compaction_manager.hh from database.hh, thus greatly reducing its
fanout over the code (from ~180 indirect inclusions down to ~20).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13622
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print classes fulfilling the requirements of the
`FragmentedView` concept without the help of the `to_hex()` template
function. that function is dropped in this change, as all its callers
now use fmtlib for formatting. the `fragment_to_hex()` helper is
dropped as well; its only caller was `to_hex()`.
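a minimal sketch of what such fmtlib support can look like, with a simplified stand-in for the real `FragmentedView` types:

```cpp
#include <fmt/format.h>
#include <string_view>
#include <vector>

// simplified stand-in for a FragmentedView-like type: a sequence of
// non-contiguous buffer fragments.
struct fragmented_buffer {
    std::vector<std::string_view> fragments;
};

// fmtlib formatter printing the fragments as one hex string, which is
// what the dropped to_hex() helper used to produce.
template <>
struct fmt::formatter<fragmented_buffer> {
    constexpr auto parse(fmt::format_parse_context& ctx) { return ctx.begin(); }
    auto format(const fragmented_buffer& buf, fmt::format_context& ctx) const {
        auto out = ctx.out();
        for (auto frag : buf.fragments) {
            for (unsigned char c : frag) {
                out = fmt::format_to(out, "{:02x}", c);
            }
        }
        return out;
    }
};
```

with this in place, callers write `fmt::format("{}", buf)` where they used to call `to_hex(buf)`.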
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13471
The sharded<sys_ks> instances are plugged into the large data handler
and the compaction manager to handle the circular dependency between
these components via the interposing database instance. Do the same for
the user sstables manager, because the S3 driver will need to update
the local ownership table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The commitlog api originally implied that the commitlog_directory would contain files from a single commitlog instance. This is checked in segment_manager::list_descriptors: if it encounters a file with an unknown prefix, an exception occurs in `commitlog::descriptor::descriptor`, which is logged at the `WARN` level.
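A self-contained sketch of the check in question (stand-in code, not the real `commitlog::descriptor`; the prefix and message are illustrative):

```cpp
#include <stdexcept>
#include <string>

// Constructing a descriptor from a filename that does not carry this
// commitlog instance's prefix throws; segment_manager::list_descriptors
// logs such failures at WARN level.
struct descriptor {
    descriptor(const std::string& filename, const std::string& prefix) {
        if (filename.rfind(prefix, 0) != 0) { // does not start with prefix
            throw std::invalid_argument(
                    "not a segment of this commitlog: " + filename);
        }
        // ... parse version and segment id from the rest of the name ...
    }
};
```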
A new schema commitlog was added recently, which shares the filesystem directory with the main commitlog. This causes warnings to be emitted on each boot. This patch solves the warnings problem by moving the schema commitlog to a separate directory. In addition, the user can employ the new `schema_commitlog_directory` parameter to move the schema commitlog to another disk drive.
This is expected to be released in 5.3.
As #13134 (raft tables->schema commitlog) is also scheduled for 5.3, and it already requires a clean rolling restart (no cl segments to replay), we don't need to specifically handle upgrade here.
Fixes: #11867
Closes #13263
* github.com:scylladb/scylladb:
commitlog: use separate directory for schema commitlog
schema commitlog: fix commitlog_total_space_in_mb initialization
By default, the schema commitlog directory is nested in the
commitlog_directory. This can help avoid problems during an upgrade if
the commitlog_directory in the custom scylla.yaml is located on a
separate disk partition.
Fixes: #11867
schema commitlog: fix commitlog_total_space_in_mb initialization
It seems there was a typo here, which caused commitlog_total_space_in_mb
to always be zero and the schema commitlog to be effectively unlimited
in size.
The wasm engine is moved from replica::database to the query_processor.
The wasm instance cache and compilation thread runner were already there,
but now they're also initialized in the query_processor constructor.
By moving the initialization to the constructor, we can now
be certain that all wasm-related objects (wasm instance cache,
compilation thread runner, and wasm engine, which was already
passed in the constructor) are initialized when we try to use
them, because we have to use the query processor to access them
anyway.
The change is also motivated by the fact that we're planning
to take Wasm UDFs out of experimental, after which they should
stop getting special treatment.
Closes #13311
* github.com:scylladb/scylladb:
wasm: move wasm initialization to query_processor constructor
wasm: return wasm instance cache as a reference instead of a pointer
wasm: move wasm engine to query_processor
The latter is the place where mutate_MV is called, and it needs the
view updates generator nearby.
The call stack starts at database::do_apply(). As was described in one
of the previous patches, applying mutations that need view updates
happens late enough, so if the view updates generator is not yet
plugged into the database, it's OK to bail out with an exception. If it
is plugged in, it's carried over, thus keeping the generator instance
alive and waited for on its stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database is a low-level service, and currently the view update
generator implicitly depends on it via storage proxy. However, the
database does need to push view updates with the help of the mutate_MV
helper, thus adding a dependency loop.
This patch exploits the fact that view updates start being pushed late
enough; by that time all other services, including the proxy and the
view update generator, seem to be up and running. This allows a "weak
dependency" from the database to the view update generator, like the
one that already exists from the database to the system keyspace.
So in this patch the v.u.g. puts its shared-from-this pointer onto the
database at the time it starts. On stop it removes this pointer after
the database is drained and (hopefully) all view updates are pushed.
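Schematically, with stand-in types (the plug/unplug mechanism is the commit's idea; the names here are assumptions):

```cpp
#include <memory>

struct view_update_generator;

struct database_like {
    // The database only holds the pointer while the generator is running.
    std::shared_ptr<view_update_generator> vug;
};

struct view_update_generator
        : std::enable_shared_from_this<view_update_generator> {
    database_like& db;
    explicit view_update_generator(database_like& d) : db(d) {}

    void start() {
        db.vug = shared_from_this(); // plug: database may now push updates
    }
    void stop() {
        // Unplug after the database is drained, so all in-flight view
        // updates are pushed before the generator goes away.
        db.vug = nullptr;
    }
};
```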
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wasm engine is used for compiling and executing Wasm UDFs, so
the query_processor is a more appropriate location for it than
replica::database, especially because the wasm instance cache
and the wasm alien thread runner are already there.
This patch also reduces the number of wasm engines to 1, shared by
all shards, as recommended by the wasmtime developers.