Currently, UDAs can't be reused if Scylla has been
restarted since they were created. This is
caused by the missing initialization of saved
UDAs, which should have inserted them into the
cql3::functions::functions::_declared map, which
should store all (user-)created functions and
aggregates.
This patch adds the missing implementation in a way
that's analogous to the method of inserting UDFs into
the _declared map.
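A minimal sketch of the idea, assuming a heavily simplified registry. The names function_registry, saved_function and load_saved are hypothetical and do not reflect the actual cql3::functions::functions interface:

```c++
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a function or aggregate that was
// persisted before the restart.
struct saved_function {
    std::string name;
    bool is_aggregate;
};

// Hypothetical registry mirroring the role of the _declared map.
class function_registry {
    std::multimap<std::string, saved_function> _declared;
public:
    // Called at startup: re-insert everything that was saved, UDFs and
    // UDAs alike. Skipping the aggregates here would reproduce the bug.
    void load_saved(const std::vector<saved_function>& saved) {
        for (const auto& f : saved) {
            _declared.emplace(f.name, f);
        }
    }
    bool is_declared(const std::string& name) const {
        return _declared.count(name) > 0;
    }
};
```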
Fixes #11309
(cherry picked from commit e558c7d988)
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes #11508
Closes #11536
(cherry picked from commit 9b6fc553b4)
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.
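The broadened eviction could be sketched as follows. This is an illustration under simplifying assumptions: read_semaphore and inactive_read are hypothetical types, and only the name evict_inactive_reads_for_table() is taken from this series.

```c++
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an inactive read parked in the semaphore.
struct inactive_read {
    std::string table; // table the reader refers to
    int id;
};

// Hypothetical, simplified semaphore that only tracks inactive reads.
class read_semaphore {
    std::vector<inactive_read> _inactive;
public:
    void register_inactive_read(inactive_read r) {
        _inactive.push_back(std::move(r));
    }
    // Evict every inactive read of the given table, no matter who
    // registered it, and return how many were evicted.
    std::size_t evict_inactive_reads_for_table(const std::string& table) {
        auto it = std::remove_if(_inactive.begin(), _inactive.end(),
                [&] (const inactive_read& r) { return r.table == table; });
        auto evicted = static_cast<std::size_t>(std::distance(it, _inactive.end()));
        _inactive.erase(it, _inactive.end());
        return evicted;
    }
    std::size_t inactive_count() const { return _inactive.size(); }
};
```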
Fixes: https://github.com/scylladb/scylladb/issues/11264
Closes #11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
(cherry picked from commit afa7960926)
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.
Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.
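A rough sketch of a single retry loop shared by all callers. The helper seal_with_retry and the max_attempts limit are assumptions made for the illustration; the patch does not state a retry limit or a backoff policy.

```c++
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical sketch: one retry loop used by every flush path.
// flush_once stands in for a single flush attempt and may throw.
// Returns the number of attempts that were needed.
inline int seal_with_retry(const std::function<void()>& flush_once, int max_attempts) {
    for (int attempt = 1; ; ++attempt) {
        try {
            flush_once();
            return attempt;
        } catch (const std::exception&) {
            if (attempt >= max_attempts) {
                throw; // give up: propagate the last error to the caller
            }
            // A real implementation would likely back off before retrying.
        }
    }
}
```

With the loop in one place, a caller such as table::flush gets the same error handling as the dirty memory manager path.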
Refs #4174
Refs #10498
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that everything prior to flush_one is noexcept,
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().
The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
region_group is an abstraction that allows accounting for groups of
regions, but the cost/benefit ratio of maintaining the abstraction
is poor. Each time we need to change the decision algorithm of memtable
flushing (admittedly rarely), we need to distill that into an abstraction
for region_groups and then use it. An example is virtual region groups;
we wanted to account for the partially flushed memtables and had to
invent region groups to stand in their place.
Rather than continuing to invest in the abstraction, break it now
and move it to the memtable dirty memory manager which is responsible
for making those decisions. The relevant code is moved to
dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and
a new unit test file is added as well.
A downside of the change is that unit testing will be more difficult.
This patch makes memtable_flush_static_shares live-updateable
to avoid having to restart the cluster after updating
this config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch makes compaction_static_shares live-updateable
to avoid having to restart the cluster after updating
this config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch adds the _static_shares variable to the backlog_controller so that
instead of having to use a separate constructor when the controller is disabled,
we can use a single constructor and periodically check in the adjust method
whether we should use the static shares or the controller. This will be useful in
the next patches to make compaction_static_shares and memtable_flush_static_shares
live-updateable.
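A simplified sketch of the approach. The class backlog_controller_sketch, the "greater than zero means disabled" convention and the linear backlog-to-shares mapping are assumptions for the illustration, not the actual controller:

```c++
// Hypothetical, simplified controller. A positive static_shares value
// means the controller is disabled and the fixed value is used instead.
class backlog_controller_sketch {
    float _static_shares;
    float _current_shares = 0;
public:
    explicit backlog_controller_sketch(float static_shares)
        : _static_shares(static_shares) {}

    // Can be called at any time, e.g. from a config observer.
    void set_static_shares(float shares) { _static_shares = shares; }

    // Called periodically. backlog is in [0, 1]; the linear mapping to
    // shares is a placeholder for the real control function.
    void adjust(float backlog) {
        if (_static_shares > 0) {
            _current_shares = _static_shares;
        } else {
            _current_shares = 1000 * backlog;
        }
    }
    float shares() const { return _current_shares; }
};
```

Because the choice is re-evaluated on every adjust() call, changing the static shares takes effect without constructing a new controller.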
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This work gets us a step closer to compaction groups.
Everything in compaction layer but compaction_manager was converted to table_state.
After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any.
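How a table-wide operation could be composed from per-group operations, as a sketch only. The types and functions below (table_state_sketch, rewrite_group, rewrite_table) are hypothetical, and summing is just one possible way of merging results:

```c++
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-group view the compaction layer works on.
struct table_state_sketch {
    std::uint64_t sstables_to_rewrite;
};

// Hypothetical per-group operation returning some statistic.
inline std::uint64_t rewrite_group(const table_state_sketch& ts) {
    return ts.sstables_to_rewrite;
}

// A table-wide operation: run the per-group operation on behalf of each
// group and merge (here: sum) the results.
inline std::uint64_t rewrite_table(const std::vector<table_state_sketch>& groups) {
    std::uint64_t total = 0;
    for (const auto& g : groups) {
        total += rewrite_group(g);
    }
    return total;
}
```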
Closes #11028
* github.com:scylladb/scylla:
compaction: remove forward declaration of replica::table
compaction_manager: make add() and remove() switch to table_state
compaction_manager: make run_custom_job() switch to table_state
compaction_manager: major: switch to table_state
compaction_manager: scrub: switch to table_state
compaction_manager: upgrade: switch to table_state
compaction: table_state: add get_sstables_manager()
compaction_manager: cleanup: switch to table_state
compaction_manager: offstrategy: switch to table_state()
compaction_manager: rewrite_sstables(): switch to table_state
compaction_manager: make run_with_compaction_disabled() switch to table_state
compaction_manager: compaction_reenabler: switch to table_state
compaction_manager: make submit(T) switch to table_state
compaction_manager: task: switch to table_state
compaction: table_state: Add is_auto_compaction_disabled_by_user()
compaction: table_state: Add on_compaction_completion()
compaction: table_state: Add make_sstable()
compaction_manager: make can_proceed switch to table_state
compaction_manager: make stop compaction procedures switch to table_state
compaction_manager: make get_compactions() switch to table_state
compaction_manager: change task::update_history() to use table_state instead
compaction_manager: make can_register_compaction() switch to table_state
compaction_manager: make get_candidates() switch to table_state
compaction_manager: make propagate_replacement() switch to table_state
compaction: Move table::in_strategy_sstables() and switch to table_state
compaction: table_state: Add maintenance sstable set
compaction_manager: make has_table_ongoing_compaction() switch to table_state
compaction_manager: make compaction_disabled() switch to table_state
compaction_manager: switch to table_state for mapping of compaction_state
compaction_manager: move task ctor into source
and use an async thread around `directory_lister`
rather than `lister::scan_dir` to simplify the implementation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor test_drop_table_with_auto_snapshot out of
drop_table_with_snapshots, adding an auto_snapshot param,
controlling how to configure the cql_test_env db::config::auto_snapshot,
so we can test both cases - auto_snapshot enabled and disabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the table to remove has no snapshots then
completely remove its directory on storage
as the left-over directory slows down operations on the keyspace
and makes searching for live tables harder.
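A sketch of the decision using std::filesystem. The helper name and the assumption that snapshots live in a "snapshots" subdirectory of the table directory are illustrative; the actual code uses Scylla's own asynchronous file APIs.

```c++
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical sketch: after dropping a table, remove its directory unless
// a non-empty "snapshots" subdirectory exists. Returns true if removed.
inline bool remove_table_dir_if_no_snapshots(const fs::path& table_dir) {
    const fs::path snapshots = table_dir / "snapshots";
    std::error_code ec;
    if (fs::exists(snapshots, ec) && !fs::is_empty(snapshots, ec)) {
        return false; // keep the directory: snapshots still live there
    }
    // remove_all reports errors through ec, so check both.
    return fs::remove_all(table_dir, ec) > 0 && !ec;
}
```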
Fixes #10896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Runs drop_column_family on all database shards.
Will be extended later to consider removing the table directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, storage_service/keyspaces?type=user returned, along with the
user keyspaces, the keyspaces that were internal but non-system.
Now the list of keyspaces for the user option
(storage_service/keyspaces?type=user) contains neither system nor
internal keyspaces, only user keyspaces.
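The filtering rule can be sketched like this. keyspace_info and user_keyspaces are hypothetical names used only to illustrate the predicate:

```c++
#include <string>
#include <vector>

// Hypothetical, simplified keyspace descriptor.
struct keyspace_info {
    std::string name;
    bool is_system;
    bool is_internal;
};

// Keyspaces reported for type=user: neither system nor internal.
inline std::vector<std::string> user_keyspaces(const std::vector<keyspace_info>& all) {
    std::vector<std::string> result;
    for (const auto& ks : all) {
        if (!ks.is_system && !ks.is_internal) {
            result.push_back(ks.name);
        }
    }
    return result;
}
```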
Fixes: #11042
Closes #11049
Soon, the currently two distinct types of queriers will be merged, as
the template parameter differentiating them will be gone. This will make
using type based overload for insert() impossible, as 2 out of the 3
types will be the same. Use different names instead.
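A sketch of why distinct names are needed. The types and method names below are illustrative assumptions rather than the actual querier_cache interface:

```c++
#include <string>
#include <vector>

// Hypothetical sketch. Once the template parameter is gone, two kinds of
// querier share one C++ type, so overloads of insert() could no longer
// tell them apart. Distinct function names carry the intent instead.
struct querier_sketch { int id; };
struct shard_querier_sketch { int id; };

class querier_cache_sketch {
    std::vector<std::string> _log;
public:
    void insert_data_querier(querier_sketch) { _log.push_back("data"); }
    void insert_mutation_querier(querier_sketch) { _log.push_back("mutation"); }
    void insert_shard_querier(shard_querier_sketch) { _log.push_back("shard"); }
    const std::vector<std::string>& log() const { return _log; }
};
```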
This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions:
1. `try_catch` allows to inspect the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This is based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260.
This function allows to replace the current `try..catch` chains which inspect the exception type and account it in the metrics.
Example:
```c++
// Before
try {
    std::rethrow_exception(eptr);
} catch (std::runtime_error& ex) {
    // 1
} catch (...) {
    // 2
}
// After
if (auto* ex = try_catch<std::runtime_error>(eptr)) {
    // 1
} else {
    // 2
}
```
2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself.
Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write).
Results from `perf_simple_query`:
```
Before (719724e4df):
Writes:
Normal:
127841.40 tps ( 56.2 allocs/op, 13.2 tasks/op, 50042 insns/op, 0 errors)
Timeouts:
94770.81 tps ( 53.1 allocs/op, 5.1 tasks/op, 78678 insns/op, 1000000 errors)
Reads:
Normal:
138902.31 tps ( 65.1 allocs/op, 12.1 tasks/op, 43106 insns/op, 0 errors)
Timeouts:
62447.01 tps ( 49.7 allocs/op, 12.1 tasks/op, 135984 insns/op, 936846 errors)
After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f):
Writes:
Normal:
127359.12 tps ( 56.2 allocs/op, 13.2 tasks/op, 49782 insns/op, 0 errors)
Timeouts:
163068.38 tps ( 52.1 allocs/op, 5.1 tasks/op, 40615 insns/op, 1000000 errors)
Reads:
Normal:
151221.15 tps ( 65.1 allocs/op, 12.1 tasks/op, 43028 insns/op, 0 errors)
Timeouts:
192094.11 tps ( 41.2 allocs/op, 12.1 tasks/op, 33403 insns/op, 960604 errors)
```
Closes #10368
* github.com:scylladb/scylla:
database: avoid rethrows when handling exceptions from commitlog
database: convert throw_commitlog_add_error to use make_nested_exception_ptr
utils: add make_nested_exception_ptr
storage_proxy: don't rethrow when inspecting replica exceptions on write path
database: don't rethrow rate_limit_exception
storage_proxy: don't rethrow the exception in abstract_read_resolver::error
utils/exceptions.cc: don't rethrow in is_timeout_exception
utils/exceptions: add try_catch
utils: add abi/eh_ia64.hh
storage_proxy: don't rethrow exceptions from replicas when accounting read stats
message: get rid of throws in send_message{,_timeout,_abortable}
database/{query,query_mutations}: don't rethrow read semaphore exceptions
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
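The ordering requirement can be illustrated with a toy model. Everything below (replica_sketch, schema_change, user_write, replay) is hypothetical and only models the dependency of writes on the schema version:

```c++
#include <stdexcept>
#include <vector>

// Hypothetical, simplified model: a write records the schema version it
// was made against and can only be applied on that version or a newer one.
struct schema_change { int new_version; };
struct user_write { int required_version; };

struct replica_sketch {
    int schema_version = 0;
    int applied_writes = 0;

    void apply(const schema_change& c) { schema_version = c.new_version; }
    void apply(const user_write& w) {
        if (w.required_version > schema_version) {
            throw std::runtime_error("write depends on a newer schema");
        }
        ++applied_writes;
    }
};

// Replay the schema commitlog first, then the regular one, so every
// replayed write sees the latest schema.
inline void replay(replica_sketch& r,
                   const std::vector<schema_change>& schema_log,
                   const std::vector<user_write>& data_log) {
    for (const auto& c : schema_log) { r.apply(c); }
    for (const auto& w : data_log) { r.apply(w); }
}
```

Replaying in the opposite order would, in this model, reject any write that requires a schema version newer than the one recovered so far.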
I reproduced bad behavior locally on my machine with a tired (high latency)
SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking
up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement).
Without load, it is 300ms vs 50ms.
Fixes #8272
Fixes #8309
Fixes #1459
Closes #10333
* github.com:scylladb/scylla:
config: Introduce force_schema_commit_log option
config: Introduce unsafe_ignore_truncation_record
db: Avoid memtable flush latency on schema merge
db: Allow splitting initiatlization of system tables
db: Flush system.scylla_local on change
migration_manager: Do not drop system.IndexInfo on keyspace drop
Introduce SCHEMA_COMMITLOG cluster feature
frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations
db/commitlog: Improve error messages in case of unknown column mapping
db/commitlog: Fix error format string to print the version
db: Introduce multi-table atomic apply()
"
The option controls the IO bandwidth of the compaction sched class.
It's now set to be 16MB/s, but is unused. This set makes it 0 by
default (which means unlimited), live-updateable and plugs it to the
seastar sched group IO throttling.
branch: https://github.com/xemul/scylla/tree/br-compaction-throttling-3
tests: unit(dev),
v2: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1010/ ,
v2: manual config update
"
* 'br-compaction-throttling-3-a' of https://github.com/xemul/scylla:
compaction_manager: Add compaction throughput limit
updateable_value: Support dummy observing
serialized_action: Allow being observer for updateable_value
config: Tune the config option
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
One complication with this change is that replay_position markers are
commitlog-domain specific and cannot cross domains. They are recorded
in various places which survive node restart: sstables are annotated
with the maximum replay position, and they are present inside
truncation records. The former annotation is used by "truncate"
operation to drop sstables. To prevent old replay positions from being
interpreted in the context of the new schema commitlog domain, the
change refuses to boot if there are truncation records, and also
prohibits truncation of schema tables.
The boot sequence needs to know whether the cluster feature associated
with this change was enabled on all nodes. Features are stored in
system.scylla_local. Because we need to read it before initializing
schema tables, the initialization of tables now has to be split into
two phases. The first phase initializes all system tables except
schema tables, and later we initialize schema tables, after reading
stored cluster features.
The commitlog domain is switched only when all nodes are upgraded, and
only after the upgraded node is restarted. This is so that we don't have to add
risky code to deal with hot-switching of the commitlog domain. Cold
switching is safer. This means that after upgrade there is a need for
yet another rolling restart round.
Fixes #8272
Fixes #8309
Fixes #1459
Re-use the existing compaction_throughput_mb_per_sec option, push it down to
the compaction manager via config and update the underlying compaction sched
class when the option is (live)updated.
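A sketch of how a live-updated option could drive the limit. The updateable_u32 class and throughput_limit_bytes helper are hypothetical, and treating 1 MB as 2^20 bytes is an assumption; the "0 means unlimited" convention is the one described for this option.

```c++
#include <cstdint>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical, minimal live-updateable option with observers.
class updateable_u32 {
    std::uint32_t _value;
    std::vector<std::function<void(std::uint32_t)>> _observers;
public:
    explicit updateable_u32(std::uint32_t v) : _value(v) {}
    void observe(std::function<void(std::uint32_t)> cb) {
        _observers.push_back(std::move(cb));
    }
    // Store the new value and notify every observer.
    void set(std::uint32_t v) {
        _value = v;
        for (auto& cb : _observers) { cb(v); }
    }
    std::uint32_t get() const { return _value; }
};

// Translate the option (MB/s, 0 = unlimited) into a bytes-per-second cap.
inline std::uint64_t throughput_limit_bytes(std::uint32_t mb_per_sec) {
    if (mb_per_sec == 0) {
        return std::numeric_limits<std::uint64_t>::max();
    }
    return std::uint64_t(mb_per_sec) << 20;
}
```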
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now, throw_commitlog_add_error is renamed to wrap_commitlog_add_error.
Instead of wrapping the currently executing exception and rethrowing it,
it takes an std::exception_ptr, wraps it and also returns
std::exception_ptr.
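The shape of such a function, as a sketch. add_error_sketch and wrap_add_error_sketch are hypothetical names, and this version keeps the cause as a plain member instead of using make_nested_exception_ptr:

```c++
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical error type that keeps the original failure as a member
// instead of relying on a currently thrown exception.
class add_error_sketch : public std::runtime_error {
    std::exception_ptr _cause;
public:
    add_error_sketch(const std::string& what, std::exception_ptr cause)
        : std::runtime_error(what), _cause(std::move(cause)) {}
    const std::exception_ptr& cause() const noexcept { return _cause; }
};

// Takes the failure as an exception_ptr and returns the wrapped failure
// as an exception_ptr, so the caller never has to catch and rethrow.
// Note: some standard libraries implement std::make_exception_ptr with
// an internal throw/catch.
inline std::exception_ptr wrap_add_error_sketch(const std::string& table, std::exception_ptr cause) {
    return std::make_exception_ptr(
        add_error_sketch("could not add mutation for " + table, std::move(cause)));
}
```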
Now it happens before calling database::drain(), but drain does not only
flush; it does lots of other things. More elaborate logging is better.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Adds statistics which count how many times a replica has decided to
reject a write ("total_writes_rate_limited") or a read
("total_reads_rate_limited").
Adds the `db::rate_limiter` to the `database` class and modifies the
`query` and `apply` methods so that they account the read/write
operations in the rate limiter and optionally reject them.
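A heavily simplified sketch of accounting and rejecting. The fixed-window counter below is an assumption chosen for brevity and is not how db::rate_limiter decides; only the statistic name total_writes_rate_limited comes from this series.

```c++
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-partition limiter: it counts operations within one
// time window and rejects those above the limit.
class rate_limiter_sketch {
    std::unordered_map<std::string, std::uint32_t> _ops_in_window;
public:
    // Account one operation; return false if it should be rejected.
    bool account(const std::string& partition_key, std::uint32_t limit_per_window) {
        std::uint32_t& n = _ops_in_window[partition_key];
        ++n;
        return n <= limit_per_window;
    }
    void start_new_window() { _ops_in_window.clear(); }
};

struct replica_stats_sketch {
    std::uint64_t total_writes_rate_limited = 0;
};

// Sketch of a write path that consults the limiter before applying.
inline bool apply_write(rate_limiter_sketch& rl, replica_stats_sketch& stats,
                        const std::string& key, std::uint32_t limit) {
    if (!rl.account(key, limit)) {
        ++stats.total_writes_rate_limited;
        return false;
    }
    return true;
}
```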
Moves the classify_query higher and renames it to classify_request. The
function will be reused in further commits to protect non-user queries
from accidentally being rate limited.
Adds a CQL protocol extension which introduces the rate_limit_error. The
new error code will be used to indicate that the operation failed due to
it exceeding the allowed per-partition rate limit.
The error code is supposed to be returned only if the corresponding CQL
extension is enabled by the client - if it's not enabled, then
Config_error will be returned in its stead.
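The fallback rule in code form, as a sketch. The enum and its numeric values are placeholders, not the values used on the wire:

```c++
// Hypothetical error codes for the sketch.
enum class error_code_sketch { config_error = 1, rate_limit_error = 2 };

// Return rate_limit_error only to clients that enabled the extension;
// everyone else gets Config_error instead.
inline error_code_sketch rate_limit_failure_code(bool client_enabled_extension) {
    return client_enabled_extension ? error_code_sketch::rate_limit_error
                                    : error_code_sketch::config_error;
}
```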
This is to make it constructible in a way most other services are -- all
the "scalar" parameters are passed via a config.
With this it will be much shorter to add compaction bandwidth throttling
option by just extending the config itself, not the list of constructor
arguments (and all its callers).
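A sketch of the pattern. The class name and the config fields are hypothetical and chosen only to show why adding an option becomes a one-line change:

```c++
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: scalar parameters travel in one config struct, so
// adding an option later does not change the constructor signature.
class compaction_manager_sketch {
public:
    struct config {
        std::size_t available_memory = 0;
        std::uint64_t static_shares = 0;
        // A later option (e.g. a throughput limit) is just one more field:
        std::uint32_t throughput_mb_per_sec = 0;
    };
    explicit compaction_manager_sketch(config cfg) : _cfg(cfg) {}
    const config& cfg() const { return _cfg; }
private:
    config _cfg;
};
```

Callers that do not care about the new field keep compiling unchanged, since it has a default value.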
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make struct scheduling_group be sub-class of the backlog controller. Its
new meaning is now -- the group under controller maintenance. Both
database and compaction manager derive their sched groups from this one.
This makes backlog controller construction simpler, prepares the ground
for sched groups unification in seastar and facilitates next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to the previous patch that did the same for the compaction manager. The
newly introduced private scheduling_group class is temporary and will go
away in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only difference between those two is in the way the backlog controller
is created. It's much simpler to have the controller construction logic
in compaction manager instead. Similar "trick" is used to construct
flush controller for the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurrences (i.e. caller is a coroutine).
One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.
Closes#10699
The manager reference is already available in constructor and thus
can be copied to on-table member.
The code that chooses the manager (user/system one) should be moved
from make_column_family_config() into add_column_family() method.
Once this happens, the get_sstables_manager() should be fixed to
return the reference from its new location. While at it -- mark the
method in question noexcept and add its mutable overload.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In core code there's only one place that constructs table -- in
database.cc -- and this place currently has the sstables_manager pointer
sitting on table config (although it's a pointer, it's always non-null).
All the tests always use the manager from one of _env's out there.
For now the new constructor arg is unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
database::truncate already flushes the table
on auto_snapshot so there is never a reason
to flush it again in table::snapshot.
Note that cf.can_flush() is false only if memtables
are empty so there is nothing to flush, or there
is no seal_immediate_fn and then table::snapshot
wouldn't be able to flush either.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>