bytes_ostream is an incremental builder for a discontiguous byte container.
managed_bytes is a non-incremental (size must be known up front) byte
container, that is also compatible with LSA. So far, conversion between
them involves copying. This is unfortunate, since query_result is generated
as a bytes_ostream, but is later converted to managed_bytes (today, this
is done in cql3::expr::get_non_pk_values() and
compound_view_wrapper::explode(). If the two types could be made compatible,
we could use managed_bytes_view instead of creating new objects and avoid
a copy. It's also nicer to have one less vocabulary type.
This patch makes bytes_ostream use managed_bytes' internal representation
(blob_storage instead of bytes_ostream::chunk) and provides a conversion
to managed_bytes. All bytes_ostream users are left in place, but the goal
is to make bytes_ostream a write-only type with the only observer a conversion
to managed_bytes.
It turns out to be relatively simple. The internal representations were
already similar. I made blob_storage::ref_type self-initializing to
reduce churn (good practice anyway) and added a private constructor
to managed_bytes for the conversion.
Note that bytes_ostream can only be used to construct a non-LSA managed_bytes,
but LSA uses of managed_bytes are very strictly controlled (the entry
points to memtable and cache) so that's not a problem.
A unit test is added.
Closes#10986
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.
Also, run major compaction in maintenance scheduling group.
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Fixes#10961Closes#10984
* github.com:scylladb/scylla:
compaction_manager: major_compaction_task: run in maintenance scheduling groupt
compaction_manager: allow regular compaction to run in parallel to major
This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions:
1.`try_catch` allows to inspect the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260.
This function allows to replace the current `try..catch` chains which inspect the exception type and account it in the metrics.
Example:
```c++
// Before
try {
std::rethrow_exception(eptr);
} catch (std::runtime_exception& ex) {
// 1
} catch (...) {
// 2
}
// After
if (auto* ex = try_catch<std::runtime_exception>(eptr)) {
// 1
} else {
// 2
}
```
2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself.
Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write).
Results from `perf_simple_query`:
```
Before (719724e4df):
Writes:
Normal:
127841.40 tps ( 56.2 allocs/op, 13.2 tasks/op, 50042 insns/op, 0 errors)
Timeouts:
94770.81 tps ( 53.1 allocs/op, 5.1 tasks/op, 78678 insns/op, 1000000 errors)
Reads:
Normal:
138902.31 tps ( 65.1 allocs/op, 12.1 tasks/op, 43106 insns/op, 0 errors)
Timeouts:
62447.01 tps ( 49.7 allocs/op, 12.1 tasks/op, 135984 insns/op, 936846 errors)
After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f):
Writes:
Normal:
127359.12 tps ( 56.2 allocs/op, 13.2 tasks/op, 49782 insns/op, 0 errors)
Timeouts:
163068.38 tps ( 52.1 allocs/op, 5.1 tasks/op, 40615 insns/op, 1000000 errors)
Reads:
Normal:
151221.15 tps ( 65.1 allocs/op, 12.1 tasks/op, 43028 insns/op, 0 errors)
Timeouts:
192094.11 tps ( 41.2 allocs/op, 12.1 tasks/op, 33403 insns/op, 960604 errors)
```
Closes#10368
* github.com:scylladb/scylla:
database: avoid rethrows when handling exceptions from commitlog
database: convert throw_commitlog_add_error to use make_nested_exception_ptr
utils: add make_nested_exception_ptr
storage_proxy: don't rethrow when inspecting replica exceptions on write path
database: don't rethrow rate_limit_exception
storage_proxy: don't rethrow the exception in abstract_read_resolver::error
utils/exceptions.cc: don't rethrow in is_timeout_exception
utils/exceptions: add try_catch
utils: add abi/eh_ia64.hh
storage_proxy: don't rethrow exceptions from replicas when accounting read stats
message: get rid of throws in send_message{,_timeout,_abortable}
database/{query,query_mutations}: don't rethrow read semaphore exceptions
Recently we noticed a regression where with certain versions of the fmt
library,
SELECT value FROM system.config WHERE name = 'experimental_features'
returns string numbers, like "5", instead of feature names like "raft".
It turns out that the fmt library keep changing their overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have to conflicting ways to print it:
1. We have an explicit operator<<.
2. We have an *implicit* convertor to the type held by T.
We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertable to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.
The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.
I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).
Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.
This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.
Fixes#11003.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11004
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This reverts commit aa8f135f64, reversing
changes made to 9a88bc260c. The patch
causes hangs during flush.
Also reverts parts of 411231da75 that impacted the unit test.
Fixes#10897.
The utils::make_nested_exception_ptr function works similar to
std::throw_with_nested, but instead of storing the currently thrown
exception as the nested exception and then immediately throwing the new
exception, it receives the nested exception as an std::exception_ptr and
also returns an std::exception_ptr.
If the standard library supports it, the function does not perform any
throws. Otherwise the fallback logic performs two throws.
Introduces a utility function which allows obtaining a pointer to the
exception data held behind an std::exception_ptr if the data matches the
requested type. It can be used to implement manual but concise
try..catch chains.
The `try_catch` has the best performance when used with libstdc++ as it
uses the stdlib specific functions for simulating a try..catch without
having to actually throw. For other stdlibs, the implementation falls
back to a throw surrounded by an actual try..catch.
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip one a range tombstone for a certain pattern of deletions
and under certain sequence of events.
_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.
The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.
For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:
[a, c] @ t0
[a, b) @ t1, [b, d] @ t2
c > b
The proper sequence for such versions is (assuming d > c):
[a, b) @ t1,
[b, d] @ t2
Due to the bug, the reader will produce:
[a, b) @ t1,
[b, c] @ t0
The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).
The problem goes away once MVCC versions merge.
Fixes#10913Fixes#10830Closes#10914
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.
They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.
Closes#10864
* github.com:scylladb/scylla:
service/raft: raft_group_registry: add assertions when fetching servers for groups
service/raft: raft_group_registry: remove `_raft_support_listener`
service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
service/raft: messaging: extract raft_addr/inet_addr conversion functions
service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
test/boost: memtable_test: perform schema operations on shard 0
test/boost: cdc_test: remove test_cdc_across_shards
message: rename `send_message_abortable` to `send_message_cancellable`
message: change parameter order in `send_message_oneway_timeout`
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache (e.g. Adding a user and changing their permissions)
can become pretty annoying.
This patch series make permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.
It also adds an API for flushing the cache to make it easier for users who
don't want to modify their permissions_cache config.
branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable
CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/
dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache
* https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable:
loading_cache_test: Test loading_cache::reset and loading_cache::update_config
api: Add API for resetting authorization cache
authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
utils/loading_cache.hh: Add update_config method
utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
utils/loading_cache.hh: Add reset method
Validate that the size of the cache is zero after calling the
reset method and that the config is being updated correctly
after calling update_config.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch renames the permissions_cache_config struct to loading_cache_config
and moves it to utils/loading_cache.hh. This will make it easier to handle
config updates to the authorization caches on the next patches
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
- Use `sstables::generation_type` in more places
- Enforce conceptual separation of `sstables::generation_type` and `int64_t`
- Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible
Fixes#10796.
Closes#10844
* github.com:scylladb/scylla:
sstables: make generation_type an actual separate type
sstables: use generation_type more soundly
extremum_tracker: do not require default-constructible value types
Fixes#9367
The CL counters pending_allocations and requests_blocked_memory are
exposed in graphana (etc) and often referred to as metrics on whether
we are blocking on commit log. But they don't really show this, as
they only measure whether or not we are blocked on the memory bandwidth
semaphore that provides rate back pressure (fixed num bytes/s - sortof).
However, actual tasks in allocation or segment wait is not exposed, so
if we are blocked on disk IO or waiting for segments to become available,
we have no visible metrics.
While the "old" counters certainly are valid, I have yet to ever see them
be non-zero in modern life.
Closes#9368
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters.
Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933
Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.
Closes#10829
* github.com:scylladb/scylla:
query: have replica provide the last position
idl/query: add last_position to query_result
mutlishard_mutation_query: propagate compaction state to result builder
multishard_mutation_query: defer creating result builder until needed
querier: use full_position instead of ad-hoc struct
querier: rely on compactor for position tracking
mutation_compactor: add current_full_position() convenience accessor
mutation_compactor: s/_last_clustering_pos/_last_pos/
mutation_compactor: add state accessor to compact_mutation
introduce full_position
idl: move position_in_partition into own header
service/paging: use position_in_partition instead of clustering_key for last row
alternator/serialization: extract value object parsing logic
service/pagers/query_pagers.cc: fix indentation
position_in_partition: add to_string(partition_region) and parse_partition_region()
mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
As reported in #10867, newer versions of the fmt library
format %Y using 4-characters width, 0-padding the prefix
when needed, while older versions don't do that.
This change moves away from using %Y and friends
fmt specifiers to using explicit numeric-based formatting
conforming to ISO 8601 and making sure the year field
has at least 4 digits and is zero padded. When
negative, the width is upped to 5 so it would show as -0001
rather than -001.
The unit test was updated respectively.
Fixes#10867
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10870
This series moves the logic to not perform off-strategy compaction if the maintenance set is empty from the table layer down to the compaction_manager layer since it is the one that needs to make the decision.
With that compaction_manager::perform_offstrategy will return a future<bool> which resolves to true
iff off-strategy compaction was required and performed.
The sstable_compaction_test was adjusted and a new compaction_manager_for_testing class was added
to make sure the compaction manager is enabled when constructed (it wasn't so test_offstrategy_sstable_compaction didn't perform any off-strategy compactions!) and stopped before destroyed.
Closes#10848
* github.com:scylladb/scylla:
table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager
compaction_manager: offstrategy_compaction_task: refactor log printouts
test: sstable_compaction: compaction_manager_for_testing
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.
This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:
```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
'max_writes_per_second': 100,
'max_reads_per_second': 200
};
```
Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.
Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.
The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:
- Write mode:
```
8f690fdd47 (PR base):
129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op)
This PR:
125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op)
```
- Read mode:
```
8f690fdd47 (PR base):
150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op)
This PR:
151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op)
```
Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected
Fixes: #4703Closes#9810
* github.com:scylladb/scylla:
storage_proxy: metrics for per-partition rate limiting of reads
storage_proxy: metrics for per-partition rate limiting of writes
database: add stats for per partition rate limiting
tests: add per_partition_rate_limit_test
config: add add_per_partition_rate_limit_extension function for testing
cf_prop_defs: guard per-partition rate limit with a feature
query-request: add allow_limit flag
storage_proxy: add allow rate limit flag to get_read_executor
storage_proxy: resultize return type of get_read_executor
storage_proxy: add per partition rate limit info to read RPC
storage_proxy: add per partition rate limit info to query_result_local(_digest)
storage_proxy: add allow rate limit flag to mutate/mutate_result
storage_proxy: add allow rate limit flag to mutate_internal
storage_proxy: add allow rate limit flag to mutate_begin
storage_proxy: choose the right per partition rate limit info in write handler
storage_proxy: resultize return types of write handler creation path
storage_proxy: add per partition rate limit to mutation_holders
storage_proxy: add per partition rate limit info to write RPC
storage_proxy: add per partition rate limit info to mutate_locally
database: apply per-partition rate limiting for reads/writes
database: move and rename: classify_query -> classify_request
schema: add per_partition_rate_limit schema extension
db: add rate_limiter
storage_proxy: propagate rate_limit_exception through read RPC
gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
storage_proxy: pass rate_limit_exception through write RPC
replica: add rate_limit_exception and a simple serialization framework
docs: design doc for per-partition rate limiting
transport: add rate_limit_error
The test checked if creating a table with CDC enabled on shard other
than 0 would create the CDC log table as well; it was a regression test
for #5582. However we will soon bounce all schema change requests to
shard 0, so the test's purpose is gone.
I need to remove this test because `cquery_nofail` does not handle the
bouncing correctly: it silently accepts the bounce message, assumes that
the query was successful and returns. So after we change the code to
start bouncing all requests to shard 0, if a query was ran inside test
code using `cquery_nofail` on a shard different than 0 it would do
nothing and following queries executed on shard 0 would fail because they
depended on the effect of the aforementioned query.
The former allows for expressing more positions, like a position
before/after a clustering key. This practically enables the coordinator
side paging logic, for a query to be stopped at a tombstone (which can
have said positions).
Make the compaction manager for testing using
this class.
Makes sure to enable the compaction manager
and to stop it before it's destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adds the per_partition_rate_limit_test.cc file. Currently, it only
contains a test which verifies that the feature correctly switches off
rate limiting for internal queries (!allow_limit || internal sg).
Introduces the rate_limiter, a replica-side data structure meant for
tracking the frequence with which each partition is being accessed
(separately for reads and writes) and deciding whether the request
should be accepted and processed further or rejected.
The limiter is implemented as a statically allocated hashmap which keeps
track of the frequency with which partitions are accessed. Its entries
are incremented when an operation is admitted and are decayed
exponentially over time.
If a partition is detected to be accessed more than its limit allows,
requests are rejected with a probability calculated in such a way that,
on average, the number of accepted requests is kept at the limit.
The structure currently weights a bit above 1MB and each shard is meant
to keep a separate instance. All operations are O(1), including the
periodic timer.
The time point is multiplied by an adjustment factor of 1000
for boost::posix_time::time_duration::ticks_per_second() = 1000000
when calling boost::posix_time::milliseconds(count)
and that may lead to integer overflow as reported
by the UndefinedBehaviorSanitizer.
See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187
This change checks for possible overflow in advance and
prints the raw counter value in this case, along with
an explanation.
Refs #10830
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10831
* github.com:scylladb/scylla:
test: types: add test cases for timestamp_type to_string format
types: time_point_to_string: harden against out of range timestamps
Following the previous patch that changed time_point_to_string
we should cement the different edge cases for the next
time this function changes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the len2 argument to crc32_combine() is zero, then the crc2
argument must also be zero.
fast_crc32_combine() explicitly checks for len2==0, in which case it
ignores crc2 (which is the same as if it were zero).
zlib's crc32_combine() used to have that check prior to version
1.2.12, but then lost it, making its necessary for callers to be more
careful.
Also add the len2==0 check to the dummy fast_crc32_combine()
implementation, because it delegates to zlib's.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes#10731
"
In order to wire-in the compaction_throughput_mb_per_sec the compaction
creation and stopping will need to be patched. Right now both places are
quite hairy, this set coroutinizes stop() for simpler adding of stopping
bits, unifies all the compaction manager constructors and adds the
compaction_manager::config for simpler future extending.
As a side effect the backlog_controller class gets an "abstract" sched
group it controlls which in turn will facilitate seastar sched groups
unification some day.
"
* 'br-compaction-manager-start-stop-cleanup' of https://github.com/xemul/scylla:
compaction_manager: Introduce compaction_manager::config
backlog_controller: Generalize scheduling groups
database: Keep compound flushing sched group
compaction_manager: Swap groups and controller
compaction_manager: Keep compaction_sg on board
compaction_manager: Unify scheduling_group structures
compaction_manager: Merge static/dynamic constructors
compaction_manager: Coroutinuze really_do_stop()
compaction_manager: Shuffle really_do_stop()
compaction_manager: Remove try-catch around logger
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.
we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
`generation_type` is (supposed to be) conceptually different from
`int64_t` (even if physically they are the same), but at present
Scylla code still largely treats them interchangeably.
In addition to using `generation_type` in more places, we
provide (no-op) `generation_value()` and `generation_from_value()`
operations to make the smoke-and-mirrors more believable.
The churn is considerable, but all mechanical. To avoid even
more (way, way more) churn, unit test code is left untreated for
now, except where it uses the affected core APIs directly.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Commit e739f2b779 ("cql3: expr: make evaluate() return a
cql3::raw_value rather than an expr::constant") introduced
raw_value::view() as a synonym to raw_value::to_view() to reduce
churn. To fix this duplication, we now remove raw_value::to_view().
raw_value::to_view() was picked for removal because is has fewer
call sites, reducing churn again.
Closes#10819
A named bind-variable can be reused:
SELECT * FROM tab
WHERE a = :var AND b = :var
Currently, the grammar just ignores the possibility and creates
a new variable with the same name. The new variable cannot be
referenced by name since the first one shadows it.
Catch variable reuse by maintaining a map from bind variable names
to indexed, and check that when reusing a bind variable the types
match.
A unit test is added.
Fixes#10810Closes#10813
This is to make it constructible in a way most other services are -- all
the "scalar" parameters are passed via a config.
With this it will be much shorter to add compaction bandwidth throttling
option by just extending the config itself, not the list of constructor
arguments (and all its callers).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only difference between those two are in the way backlog controller
is created. It's much simpler to have the controller construction logic
in compaction manager instead. Similar "trick" is used to construct
flush controller for the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes#652.
Closes#10807
* github.com:scylladb/scylla:
memtable: Add counters for tombstone compaction
memtable, cache: Eagerly compact data with tombstones
memtable: Subtract from flushed memory when cleaning
mvcc: Introduce apply_resume to hold state for partition version merging
test: mutation: Compare against compacted mutations
compacting_reader: Drop irrelevant tombstones
mutation_partition: Extract deletable_row::compact_and_expire()
mvcc: Apply mutations in memtable with preemption enabled
test: memtable: Make failed_flush_prevents_writes() immune to background merging
`sstable_directory::process_sstables_dir` may hit an exception when calling `handle_component`.
In this case we currently destroy the `sstable_dir_lister` variable without closing the `directory_lister` first -
leading to terminate in `~directory_lister` as seen in #10697.
This mini-series handles this exception and always closes the `directory_lister`.
Add unit test to reproduce this issue.
Fixes#10697Closes#10754
* github.com:scylladb/scylla:
sstable_directory: process_sstable_dir: fixup indentation
sstable_directory: process_sstable_dir: close directory_lister on error
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
Fixes#10801
Refs #10793
(cherry picked from commit 0e78ad50ea)
Closes#10802
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
Fixes https://github.com/scylladb/scylla/issues/4116
Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop
v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset
v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=
v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.
Closes#10717
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
random_mutation_generator: Add option to specify ks_name and cf_name
table: Prevent creating unbounded number of sstables
Otherwise, if we don't consume all lister's entries,
~directory_lister terminates since the
directory_lister is destroyed without being closed.
Add unit test to reproduce this issue.
Fixes#10697
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes#652.
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.
This will start happening after tombstones are compacted with rows on
partition version merging.
This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
Memtables and cache will compact eagerly, so tests should not expect
readers to produce exact mutations written, only those which are
equivalant after applying copmaction.
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.
Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
Preerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, apply path to the memtable was not
preemptible.
Because merging can now be defered, we need to involve snapshots to
kick-off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.