This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. to work (not fail) in the presence of keyspaces with per-table
erms, they will just not be replicated using those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
The only reason why it's there (right next to compaction_fwd.hh) is
because the database::table_truncate_state subclass needs the definition
of compaction_manager::compaction_reenabler subclass.
However, the former sub is not used outside of database.cc and can be
defined in .cc. Keeping it outside of the header allows dropping the
compaction_manager.hh from database.hh thus greatly reducing its fanout
over the code (from ~180 indirect inclusions down to ~20).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13622
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print classes fulfill the requirement of `FragmentedView` concept
without the help of template function of `to_hex()`, this function is
dropped in this change, as all its callers are now using fmtlib
for formatting now. the helper of `fragment_to_hex()` is dropped
as well, its only caller is `to_hex()`.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13471
The sharded<sys_ks> instances are plugged to large data handler and
compaction manager to maintain the circular dependency between these
components via the interposing database instance. Do the same for user
sstables manager, because S3 driver will need to update the local
ownership table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The commitlog api originally implied that the commitlog_directory would contain files from a single commitlog instance. This is checked in segment_manager::list_descriptors, if it encounters a file with an unknown prefix, an exception occurs in `commitlog::descriptor::descriptor`, which is logged with the `WARN` level.
A new schema commitlog was added recently, which shares the filesystem directory with the main commitlog. This causes warnings to be emitted on each boot. This patch solves the warnings problem by moving the schema commitlog to a separate directory. In addition, the user can employ the new `schema_commitlog_directory` parameter to move the schema commitlog to another disk drive.
This is expected to be released in 5.3.
As #13134 (raft tables->schema commitlog) is also scheduled for 5.3, and it already requires a clean rolling restart (no cl segments to replay), we don't need to specifically handle upgrade here.
Fixes: #11867Closes#13263
* github.com:scylladb/scylladb:
commitlog: use separate directory for schema commitlog
schema commitlog: fix commitlog_total_space_in_mb initialization
The commitlog api originally implied that
the commitlog_directory would contain files
from a single commitlog instance. This is
checked in segment_manager::list_descriptors,
if it encounters a file with an unknown
prefix, an exception occurs in
commitlog::descriptor::descriptor, which is
logged with the WARN level.
A new schema commitlog was added recently,
which shares the filesystem directory with
the main commitlog. This causes warnings
to be emitted on each boot. This patch
solves the warnings problem by moving
the schema commitlog to a separate directory.
In addition, the user can employ the new
schema_commitlog_directory parameter to move
the schema commitlog to another disk drive.
By default, the schema commitlog directory is
nested in the commitlog_directory. This can help
avoid problems during an upgrade if the
commitlog_directory in the custom scylla.yaml
is located on a separate disk partition.
This is expected to be released in 5.3.
As #13134 (raft tables->schema commitlog)
is also scheduled for 5.3, and it already
requires a clean rolling restart (no cl
segments to replay), we don't need to
specifically handle upgrade here.
Fixes: #11867
initialization
It seems there was a typo here, which caused
commitlog_total_space_in_mb to always be zero
and the schema commitlog to be effectively
unlimited in size.
The wasm engine is moved from replica::database to the query_processor.
The wasm instance cache and compilation thread runner were already there,
but now they're also initialized in the query_processor constructor.
By moving the initialization to the constructor, we can now
be certain that all wasm-related objects (wasm instance cache,
compilation thread runner, and wasm engine, which was already
passed in the constructor) are initialized when we try to use
them because we have to use the query processor to access them
anyway.
The change is also motivated by the fact that we're planning
to take Wasm UDFs out of experimental, after which they should
stop getting special treatment.
Closes#13311
* github.com:scylladb/scylladb:
wasm: move wasm initialization to query_processor constructor
wasm: return wasm instance cache as a reference instead of a pointer
wasm: move wasm engine to query_processor
The latter is the place where mutate_MV is called and it needs the
view updates generator nearby.
The call-stack starts at database::do_apply(). As was described in one
of the previous patches, applying mutations that need updating views
happen late enough, so if the view updates generator is not plugged to
the database yet, it's OK to bail out with exception. If it's plugged,
it's carried over thus keeping the generator instance alive and waited
for on its stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database is low-level service and currently view update generator
implicitly depend on it via storage proxy. However, database does need
to push view updates with the help of mutate_MV helper, thus adding the
dependency loop.
This patch exploits the fact that view updates start being pushed late
enough, by that time all other service, including proxy and view update
generator, seem to be up and running. This allows a "weak dependency"
from database to view update generator, like there's one from database
to system keyspace already.
So in this patch the v.u.g. puts the shared-from-this pointer onto the
database at the time it starts. On stop it removes this pointer after
database is drained and (hopefully) all view updates are pushed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wasm engine is used for compiling and executing Wasm UDFs, so
the query_processor is a more appropriate location for it than
replica::database, especially because the wasm instance cache
and the wasm alien thread runner are already there.
This patch also reduces the number of wasm engines to 1, shared by
all shards, as recommended by the wasmtime developers.
The concept is needed by enterprise functionality, but in the hunt for globals this sticks out and should be removed.
This is also partially prompted by the need to handle the keyspaces in the above set special on shutdown as well as startup. I.e. we need to ensure all user keyspaces are flushed/closed earlier then these. I.e. treat as "system" keyspace for this purpose.
These changes adds a "extension internal" keyspace set instead, which for now (until enterprise branches are updated) also included the "load_prio" set. However, it changes distributed loader to use the extension API interface instead, as well as adds shutdown special treatment to replica::database.
Closes#13335
* github.com:scylladb/scylladb:
datasbase: Flush/close "extension internal" keyspaces after other user ks
distributed_loader: Use extensions set of "extension internal" keyspaces
db::extentions: Add "extensions internal" keyspace set
Refs #13334
Effectively treats keyspaces listed in "extension internal" as system keyspaces
w.r.t. shutdown/drain. This ensures all user keyspaces are fully flushed before
we disable these "internal" ones.
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog.
Fixes: #12642Closes#13134
* github.com:scylladb/scylladb:
test_raft_upgrade: add a test for schema commit log feature
scylla_cluster.py: add start flag to server_add
ServerInfo: drop host_id
scylla_cluster.py: add config to server_add
scylla_cluster.py: add expected_error to server_start
scylla_cluster.py: ScyllaServer.start, refactor error reporting
scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
raft: check if schema commitlog is initialized Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set. For the upgrade procedure the user should wait until the schema commitlog feature is enabled before enabling consistent_cluster_management.
raft: move raft initialization after init_system_keyspace
database: rename before_schema_keyspace_init->maybe_init_schema_commitlog
raft: use schema commitlog for raft tables
init_system_keyspace: refactoring towards explicit load phases
We are going to move the raft tables from the first
load phase to the second. This means the second
init_system_keyspace call will load raft tables along
with the schema, making the name of this function imprecise.
To make sure all tracing done on a certain page will make its way into
the appropriate trace session.
This is a contination of the previous patch (which added trace pointer
to the permit).
And propagate it down to where it is created. This will be used to add
trace points for semaphore related events, but this will come in the
next patches.
The `storage_options` describes where sstables should be located. Currently the object reside on keyspace_metadata, but is thus not available at the place it's needed the most -- the `table::make_sstable()` call. This set converts keyspace_metadata::storage_opts to be lw-shared-ptr and shares the ptr with class table.
refs: #12523 (detached small change from large PR)
Closes#13212
* github.com:scylladb/scylladb:
table: Keep storage options lw-shared-ptr
keyspace_metadata: Make storage options lw-shared-ptr
this series intends to deprecate `::join()`, as it always materializes a range into a concrete string. but what we always want is to print the elements in the given range to stream, or to a seastar logger, which is backed by fmtlib. also, because fmtlib offers exactly the same set of features implemented by to_string.hh, this change would allow us to use fmtlib to replace to_string.hh for better maintainability, and potentially better performance. as fmtlib is lazy evaluated, and claims to be performant under most circumstances.
Closes#13163
* github.com:scylladb/scylladb:
utils: to_string: move join to namespace utils
treewide: use fmt::join() when appropriate
row_cache: pass "const cache_entry" to operator<<
Tables need to know which storage their sstables need to be located at,
so class table needs to have itw reference of the storage options. The
thing can be inherited from the keyspace metadata.
Tests sometimes create table without keyspace at hand. For those use
default-initialized storage options (which is local storage).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().
as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.
please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.
some noteworthy things related to this change:
* unlike the existing `join()`, `fmt::join()` returns a view. so we
have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
for instance, a `const std::string*`. but operator<<() always print
a typed pointer. so if we want to format a typed pointer, we either
need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
`operator<<(std::ostream& os, const column_definition* cd)`, so we
have to use a wrapper class of `maybe_column_definition` for printing
a pointer to `column_definition`. since the overload is only used
by the two overloads of
`statement_restrictions::add_single_column_parition_key_restriction()`,
the operator<< for `const column_definition*` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
A read that requested memory and has to wait for it can be registered as inactive. This can happen for example if the memory request originated from a background I/O operation (a read-ahead maybe).
Handling this case is currently very difficult. What we want to do is evict such a read on-the-spot: the fact that there is a read waiting on memory means memory is in demand and so inactive reads should be evicted. To evict this reader, we'd first have to remove it from the memory wait list, which is almost impossible currently, because `expiring_fifo<>`, the type used for the wait list, doesn't allow for that. So in this PR we set out to make this possible first, by transforming all current queues to be intrusive lists of permits. Permits are already linked into an intrusive list, to allow for enumerating all existing permits. We use these existing hooks to link the permits into the appropriate queue, and back to `_permit_list` when they are not in any special queue. To make this possible we first have to make all lists store naked permits, moving all auxiliary data fields currently stored in wrappers like `entry` into the permit itself. With this, all queues and lists in the semaphore are intrusive lists, storing permits directly, which has the following implications:
* queues no longer take extra memory, as all of them are intrusive
* permits are completely self-sufficient w.r.t to queuing: code can queue or dequeue permits just with a reference to a permit at hand, no other wrapper, iterator, pointer, etc. is necessary.
* queues don't keep permits alive anymore; destroying a permit will automatically unlink it from the respective queue, although this might lead to use-after-free. Not a problem in practice, only one code-path (`reader_concurrenc_semaphore::with_permit()`) had to be adjusted.
After all that extensive preparations, we can now handle the case of evicting a reader which is queued on memory.
Fixes: #12700Closes#12777
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: handle reader blocked on memory becoming inactive
reader_concurrency_semaphore: move _permit_list next to the other lists
reader_permit: evict inactive read on timeout
reader_concurrency_semaphore: move inactive_read to .cc
reader_concurrency_semaphore: store permits in _inactive_reads
reader_concurrency_semaphore: inactive_read: de-inline more methods
reader_concurrency_semaphore: make _ready_list intrusive
reader_permit: add wait_for_execution state
reader_concurrency_semaphore: make wait lists intrusive
reader_concurrency_semaphore: move most wait_queue methods out-of-line
reader_concurrency_semaphore: store permits directly in queues
reader_permit: introduce (private) operator * and ->
reader_concurrency_semaphore: remove redundant waiters() member
reader_concurrency_semaphore: add waiters counter
reader_permit: use check_abort() for timeout
reader_concurrency_semaphore: maybe_dump_permit_diagnostics(): remove permit list param
reader_concurrency_semaphroe: make foreach_permit() const
reader_permit: add get_schema() and get_op_name() accessors
reader_concurrency_semaphore: mark maybe_dump_permit_diagnostics as noexcept
Our end goal (#12642) is to mark raft tables to use
schema commitlog. There are two similar
cases in code right now - `with_null_sharder`
and `set_wait_for_sync_to_commitlog` `schema_builder`
methods. The problem is that if we need to
mark some new schema with one of these methods
we need to do this twice - first in
a method describing the schema
(e.g. `system_keyspace::raft()`) and second in the
function `create_table_from_mutations`, which is not
obvious and easy to forget.
`create_table_from_mutations` is called when schema object
is reconstructed from mutations, `with_null_sharder`
and `set_wait_for_sync_to_commitlog` must be called from it
since the schema properties they describe are
not included in the mutation representation of the schema.
This series proposes to distinguish between the schema
properties that get into mutations and those that do not.
The former are described with `schema_builder`, while for
the latter we introduce `schema_static_props` struct and
the `schema_builder::register_static_configurator` method.
This way we can formulate a rule once in the code about
which schemas should have a null sharder/be synced, and it will
be enforced in all cases.
Closes#13170
* github.com:scylladb/scylladb:
schema.hh: choose schema_commitlog based on schema_static_props flag
schema.hh: use schema_static_props for wait_for_sync_to_commitlog
schema.hh: introduce schema_static_props, use it for null_sharder
database.cc: drop ensure_populated and mark_as_populated
This patch finishes the refactoring. We introduce the
use_schema_commitlog flag in schema_static_props
and use it to choose the commitlog in
database::add_column_family. The only
configurator added declares what was originally in
database::add_column_family - all
tables from schema_tables keyspace
should use schema_commitlog.
There was some logic to call mark_as_populate at
the appropriate places, but the _populated field
and the ensure_populated function were
not used by anyone.
The `database::stop` method is sometimes hanging and it's always hard to spot where exactly it sleeps. Few more logging messages would make this much simpler.
refs: #13100
refs: #10941Closes#13141
* github.com:scylladb/scylladb:
database: Increase verbosity of database::stop() method
large_data_handler: Increase verbosity on shutdown
large_data_handler: Coroutinize .stop() method
The compilation of wasm UDFs is performed by a call to a foreign
function, which cannot be divided with yielding points and, as a
result, causes long reactor stalls for big UDFs.
We avoid them by submitting the compilation task to a non-seastar
std::thread, and retrieving the result using seastar::alien.
The thread is created at the start of the program. It executes
tasks from a queue in an infinite loop.
All seastar shards reference the thread through a std::shared_ptr
to a `alien_thread_runner`.
Considering that the compilation takes a long time anyway, the
alien_thread_runner is implemented with focus on simplicity more
than on performance. The tasks are stored in an std::queue, reading
and writing to it is synchronized using an std::mutex for reading/
writing to the queue, and an std::condition_variable waiting until
the queue has elements.
When the destructor of the alien runner is called, an std::nullopt
sentinel is pushed to the queue, and after all remaining tasks are
finished and the sentinel is read, the thread finishes.
Fixes#12904Closes#13051
* github.com:scylladb/scylladb:
wasm: move compilation to an alien thread
wasm: convert compilation to a future
After we move the compilation to a alien thread, the completion
of the compilation will be signaled by fulfilling a seastar promise.
As a result, the `precompile` function will return a future, and
because of that, other functions that use the `precompile` functions
will also become futures.
We can do all the neccessary adjustments beforehand, so that the actual
patch that moves the compilation will contain less irrelevant changes.
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so to reduce the number of conditional banches (apparently the compiler isn't doing a good enough job at optimizing it in this case)
2. avoid sstring copy in topology::get_{datacenter,rack}
Closes#12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently the code will assert because cl pointer will be null and it
will be null because there is no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
There's another one that accepts explicit basedir first argument and
that's used by the rest of the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12643
Currently, UDAs can't be reused if Scylla has been
restarted since they have been created. This is
caused by the missing initialization of saved
UDAs that should have inserted them to the
cql3::functions::functions::_declared map, that
should store all (user-)created functions and
aggregates.
This patch adds the missing implementation in a way
that's analogous to the method of inserting UDF to
the _declared map.
Fixes#11309
Using lambda coroutines as arguments can lead to a use-after-free.
Currently, the way these lambdas were used in do_parse_schema_tables
did not lead to such a problem, but it's better to be safe and wrap
them in coroutine::lambda(), so that they can't lead to this problem
as long as we ensure that the lambda finishes in the
do_parse_schema_tables() statement (for example using co_await).
Closes#12487
The CQL binary protocol version 3 was introduced in 2014. All Scylla
version support it, and Cassandra versions 2.1 and newer.
Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.
Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.
Fixes#10607Closes#12432
* github.com:scylladb/scylladb:
treewide: drop cql_serialization_format
cql: modification_statement: drop protocol check for LWT
transport: drop cql protocol versions 1 and 2
Sstable read related metrics are broken for a long time now. First, the introduction of inactive reads (https://github.com/scylladb/scylladb/issues/1865) diluted this metric, as it now also contained inactive reads (contrary to the metric's name). Then, after moving the semaphore in front of the cache (3d816b7c1) this metric became completely broken as this metric now contains all kinds of reads: disk, in-memory and inactive ones too.
This series aims to remedy this:
* `scylla_database_active_reads` is fixed to only include active reads.
* `scylla_database_active_reads_memory_consumption` is renamed to `scylla_database_reads_memory_consumption` and its description is brought up-to-date.
* `scylla_database_disk_reads` is added to track current reads that are gone to disk.
* `scylla_database_sstables_read` is added to track the number of sstables read currently.
Fixes: https://github.com/scylladb/scylladb/issues/10065Closes#12437
* github.com:scylladb/scylladb:
replica/database: add disk_reads and sstables_read metrics
sstables: wire in the reader_permit's sstable read count tracking
reader_concurrency_semaphore: add disk_reads and sstables_read stats
replica/database: fix active_reads_memory_consumption_metric
replica/database: fix active_reads metric