Currently there are four helpers; this patch reduces them to two, and
one of them becomes private to the table, thus making the API small and
neat (and easy to patch further).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
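A minimal sketch of the non-throwing lookup and the group listing described above. `group_registry`, `group_id`, `server`, `find_server` and `all_groups` are hypothetical stand-ins chosen for illustration; they are not the actual `raft_group_registry` API.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real Raft types.
using group_id = std::string;
struct server { int term = 0; };

class group_registry {
    std::unordered_map<group_id, std::unique_ptr<server>> _servers;
public:
    void add(group_id id) {
        _servers.emplace(std::move(id), std::make_unique<server>());
    }

    // Lists the ids of all registered groups.
    std::vector<group_id> all_groups() const {
        std::vector<group_id> ids;
        ids.reserve(_servers.size());
        for (const auto& entry : _servers) {
            ids.push_back(entry.first);
        }
        return ids;
    }

    // Returns nullptr instead of throwing when the group is absent.
    server* find_server(const group_id& id) {
        auto it = _servers.find(id);
        return it == _servers.end() ? nullptr : it->second.get();
    }
};
```

A caller such as a virtual table reader can then skip missing groups with a plain null check rather than a try/catch.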
Inferring shard from generation is long gone. We still use it in
some scripts, but that's no longer needed in Scylla, when loading
the SSTables, and it also conflicts with ongoing work of UUID-based
generations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12476
This patch replaces all dependencies on the wasmtime
C++ bindings with our new ones.
The wasmtime.hh and wasm_engine.hh files are deleted.
The libwasmtime.a library is no longer required by
configure.py. The SCYLLA_ENABLE_WASMTIME macro is
removed and wasm udfs are now compiled by default
on all architectures.
In terms of implementation, most of the code using
wasmtime was moved to the Rust source files. The
remaining code uses names from the new bindings
(which are mostly unchanged). Most wasmtime objects
are now stored as a rust::Box<>, to make them compatible
with Rust lifetime requirements.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
This new option allows the user to control the number of compaction
groups per table per shard. It's 0 by default, which implies a single
compaction group, as it is today.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is the initial support for multiple groups.
_x_log2_compaction_groups controls the number of compaction groups
and the partitioning strategy within a single table.
The value in _x_log2_compaction_groups refers to log base 2 of the
actual number of groups.
0 means 1 compaction group.
1 means 2 groups, with the most significant bit of the token being
used to pick the target group.
The group partitioner should be later abstracted for making tablet
integration easier in the future.
_x_log2_compaction_groups is still a constant but a config option
will come next.
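A minimal sketch of picking a group from the most significant bits of a token, assuming a signed 64-bit token. The helper name `group_for_token` and the sign-bit flip are assumptions made for illustration, not the actual implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: maps a signed 64-bit token to a group index using the
// top `log2_groups` bits of the token. Assumes log2_groups < 64.
inline std::size_t group_for_token(int64_t token, unsigned log2_groups) {
    if (log2_groups == 0) {
        return 0;  // single group; also avoids an undefined 64-bit shift
    }
    // Flip the sign bit so that signed token order maps to unsigned order.
    const uint64_t unbiased = static_cast<uint64_t>(token) ^ (uint64_t(1) << 63);
    return static_cast<std::size_t>(unbiased >> (64 - log2_groups));
}
```

With this scheme each group owns a contiguous slice of the token range, so the token ranges of different groups never overlap.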
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Estimates the number of compaction jobs to be performed on a table.
It is adapted to compaction groups by summing the estimates from all
groups.
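As a rough illustration of the aggregation, assuming each group can report its own estimate (`group_stub` and `estimate_pending_compactions` are hypothetical names):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in: each group knows its own pending job estimate.
struct group_stub { int64_t pending_jobs; };

// The table-wide estimate is the sum of the per-group estimates.
inline int64_t estimate_pending_compactions(const std::vector<group_stub>& groups) {
    return std::accumulate(groups.begin(), groups.end(), int64_t(0),
        [](int64_t sum, const group_stub& g) { return sum + g.pending_jobs; });
}
```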
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will replace table::as_table_state(). The latter will be
killed once its usage drops to zero.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Allow table-wide compaction trigger, as well as fine-grained trigger
like after flushing a memtable on behalf of a single group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This variant will be useful when iterating through groups
and performing async actions on each. It guarantees that all
groups are alive by the time they're reached in the loop.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
add_memtables_to_reader_list() will be adapted to compaction groups.
For point queries, it will add memtables of a single group.
With the callback, add_memtables_to_reader_list() can tell its
caller the exact amount of memtable readers to be added, so it
can reserve precisely the readers capacity.
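A sketch of the callback idea under simplifying assumptions: readers are plain structs and memtables are identified by integers. `reader_stub`, `add_memtable_readers` and `make_readers` are hypothetical names, not the real signatures.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct reader_stub { int memtable_id; };  // hypothetical stand-in for a reader

// Reports how many memtable readers will be added before adding them,
// so the caller can size its container exactly once.
inline void add_memtable_readers(const std::vector<int>& memtable_ids,
                                 std::vector<reader_stub>& readers,
                                 const std::function<void(std::size_t)>& reserve_fn) {
    reserve_fn(memtable_ids.size());
    for (int id : memtable_ids) {
        readers.push_back(reader_stub{id});
    }
}

// Caller side: reserves room for the memtable readers plus one sstable reader.
inline std::vector<reader_stub> make_readers(const std::vector<int>& memtable_ids) {
    std::vector<reader_stub> readers;
    add_memtable_readers(memtable_ids, readers, [&readers](std::size_t memtable_count) {
        readers.reserve(memtable_count + 1);
    });
    readers.push_back(reader_stub{-1});  // placeholder for the sstable reader
    return readers;
}
```

The caller no longer needs to know how many memtables the selected group(s) hold; only the callee does, and it passes the count back through the callback.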
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
active_memtable() was fine for a single group, but with multiple groups,
there will be one active memtable per group. Let's change the
interface to reflect that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Useful for iterating through all groups. This is an intermediary
implementation, which requires allocation as only one group
is supported today.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change removes sstables.hh from some other headers, replacing it
with version.hh and shared_sstable.hh. It also drops
sstables_manager.hh from some more headers, because this header
pulls in sstables.hh itself. The change is pretty straightforward,
but has a ricochet in database.hh, which needs disk-error-handler.hh.
Without the patch, touching sstables/sstable.hh results in recompilation
of 409 targets; with the patch, 299 targets.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12222
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelism.
This patch wraps both together into a directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
Procedures that call this function happen to be in compaction_group,
so let's move it to group. Simplifies the change where the procedure
retrieves tracker from the group itself.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All callers of do_add_sstable() live in compaction_group, so it
should be moved into compaction_group too. It also makes it easier
for the function to retrieve the backlog tracker from the group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Move the integration with compaction_manager
from the api layer to the table class so
it can also make sure the memtable is cleaned up in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Fixes scylladb/scylladb#11793
Note: table_population_metadata::start_subdir is called
in a seastar thread to facilitate backporting to old versions
that do not support coroutines yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execute local requests via
query_processor. The latter needs the former via the compaction manager
and the large data handler: database depends on both, and both need to
insert their entries into the system keyspace.
To cut this loop, the compaction manager and the large data handler both
get a weak reference to the system keyspace. Once the system keyspace
starts, it activates this reference via the database call. When the
system keyspace is shut down on stop, it deactivates the reference.
Technically, the weak reference is implemented by marking the
system_keyspace object as async_sharded_service, and the "reference" in
question is the shared_from_this() pointer. When the compaction manager
or the large data handler needs to update a system keyspace table, it
holds an extra reference on the system keyspace until the entry is
committed, thus making sure that system_keyspace doesn't stop from under
its feet. At the same time, unplugging the reference on shutdown makes
sure that no new entry updates will appear and the system_keyspace will
eventually be released.
It's not a C++ classical reference, because system_keyspace starts after
and stops before database.
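The plug/unplug scheme can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Seastar-based implementation: `sys_ks_stub` stands in for system_keyspace, `client_stub` for the compaction manager or large data handler, and `plug`, `unplug` and `update` are hypothetical names.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for system_keyspace.
class sys_ks_stub : public std::enable_shared_from_this<sys_ks_stub> {
public:
    std::vector<std::string> entries;
    void record(std::string e) { entries.push_back(std::move(e)); }
};

// Hypothetical stand-in for the compaction manager / large data handler.
class client_stub {
    std::shared_ptr<sys_ks_stub> _sys_ks;  // empty while unplugged
public:
    // Called once the system keyspace has started.
    void plug(sys_ks_stub& ks) { _sys_ks = ks.shared_from_this(); }
    // Called when the system keyspace is being stopped.
    void unplug() { _sys_ks.reset(); }

    // Returns false when unplugged, so no new updates start after shutdown.
    bool update(std::string entry) {
        auto ks = _sys_ks;  // extra reference held until the entry is committed
        if (!ks) {
            return false;
        }
        ks->record(std::move(entry));
        return true;
    }
};
```

An update that is already in flight keeps its local copy of the pointer, so the keyspace object outlives it even if `unplug()` runs concurrently in the real, asynchronous setting.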
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Yet another user of the global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env.
Closes #11738
* github.com:scylladb/scylladb:
system_keyspace: Make get_{local|saved}_tokens non static
size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
cql_test_env: Keep sharded<system_keyspace> reference
size_estimate_virtual_reader: Keep system_keyspace reference
system_keyspace: Pass sys_ks argument to install_virtual_readers()
system_keyspace: Make make() non-static
distributed_loader: Pass sys_ks argument to init_system_keyspace()
system_keyspace: Remove dangling forward declaration
The size-estimate-virtual-reader will need it. It's now available as
"this" from the system_keyspace::make() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compacted undeleted sstables are relevant for avoiding data resurrection
in the purge path. As token ranges of groups won't overlap, it's
better to isolate this data, so as to prevent one group from interfering
with another.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The logic to reject explicit snapshot of views/indexes
was improved in aa127a2dbb.
However, we never implemented auto-snapshot of
views/indexes when taking a snapshot of the base table.
This is implemented in this patch.
The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.
Fixes #11612
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the querier reads a page with more tombstones than the `tombstone_warn_threshold` limit, a warning message appears in the logs.
Setting `tombstone_warn_threshold: 0` disables the feature.
Refs scylladb#11410
Now memtables live in compaction_group. Also introduced a function
that selects a group based on token, but today table always returns
the single group managed by it. Once multiple groups are supported,
the function should interpret the token content to select the
group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The group is now responsible for providing the compound set.
table still has one compound set, which will span all groups for
the cases we want to ignore the group isolation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving main set into compaction_group.
Next, we'll move maintenance set into it and finally the memtable.
A method is introduced to figure out which group an sstable belongs
to, but it's still unimplemented as table is still limited to
a single group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is a new abstraction used to group SSTables
that are eligible to be compacted together. By this definition,
a table in a given shard has a single compaction group.
The problem with this approach is that data from different vnodes
is intermixed in the same sstable, making it hard to move data
in a given sstable around.
Therefore, we'll want to have multiple groups per table.
A group can be thought of as an isolated LSM tree where its memtable
and sstable files are isolated from other groups.
As for the implementation, the idea is to take a very incremental
approach.
In this commit, we're introducing a single compaction group to
table.
Next, we'll migrate sstable and maintenance set from table
into that single compaction group. And finally, the memtable.
Cache will be shared among the groups, for simplicity.
It works due to its ability to invalidate a subset of the
token range.
There will be a 1:1 relationship between compaction_group and
table_state.
We can later rename table_state to compaction_group_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
on_compaction_completion() is not very descriptive. Let's rename
it, following the example of
update_sstable_lists_on_off_strategy_completion().
Also, let's coroutinize it, so as to remove the restriction of running
it inside a thread only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11407
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as a short one, even if the page is currently empty. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this, no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.
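A toy model of the page-cutting rule described above, assuming a flat list of items and integer positions. `item`, `page` and `build_page` are hypothetical, and treating a limit of 0 as "unlimited" is an assumption of this sketch.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct item { bool is_tombstone; int position; };  // hypothetical input element

struct page {
    std::vector<int> live_rows;        // positions of live rows in the page
    bool is_short = false;             // page was cut before being filled
    std::optional<int> last_position;  // last processed item, resume after it
};

// Builds one page starting at index `start`. The page is cut when either the
// row limit or the tombstone limit is reached.
inline page build_page(const std::vector<item>& items, std::size_t start,
                       std::size_t row_limit, std::size_t tombstone_limit) {
    page p;
    std::size_t tombstones = 0;
    for (std::size_t i = start; i < items.size(); ++i) {
        p.last_position = items[i].position;
        if (items[i].is_tombstone) {
            if (tombstone_limit != 0 && ++tombstones >= tombstone_limit) {
                p.is_short = true;  // cut the page, possibly with no live rows
                return p;
            }
        } else {
            p.live_rows.push_back(items[i].position);
            if (p.live_rows.size() >= row_limit) {
                return p;
            }
        }
    }
    return p;
}
```

Because `last_position` is set even when `live_rows` is empty, the next page can resume after the tombstones already processed instead of re-reading them.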
Upgrade sanity test was conducted as follows:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started with `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.
Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instructions/op.
Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672
Closes #11053
* github.com:scylladb/scylladb:
test/cql-pytest: add test for query tombstone page limit
query-result-writer: stop when tombstone-limit is reached
service/pager: prepare for empty pages
service/storage_proxy: set smallest continue pos as query's continue pos
service/storage_proxy: propagate last position on digest reads
query: result_merger::get() don't reset last-pos on short-reads and last pages
query: add tombstone-limit to read-command
service/storage_proxy: add get_tombstone_limit()
query: add tombstone_limit type
db/config: add config item for query tombstone limit
gms: add cluster feature for empty replica pages
tree: don't use query::read_command's IDL constructor
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.
To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold an effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).
Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.
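A sketch of how a caller might use the predicate to skip redundant calls. `strategy_stub` and `endpoints_for` are hypothetical simplifications; only the method names `has_static_natural_endpoints` and `calculate_natural_endpoints` come from the text, and their signatures here are assumed.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using token = long;
using endpoint_set = std::vector<std::string>;

// Hypothetical strategy: counts how often endpoints are calculated.
struct strategy_stub {
    bool static_endpoints;
    mutable int calls = 0;

    bool has_static_natural_endpoints() const { return static_endpoints; }

    endpoint_set calculate_natural_endpoints(token t) const {
        ++calls;
        if (static_endpoints) {
            return endpoint_set{"n1", "n2"};  // same answer for every token
        }
        return endpoint_set{"n" + std::to_string(t)};
    }
};

// Calls calculate_natural_endpoints once when the result cannot depend on
// the token, and once per token otherwise.
inline std::unordered_map<token, endpoint_set>
endpoints_for(const strategy_stub& rs, const std::vector<token>& tokens) {
    std::unordered_map<token, endpoint_set> result;
    if (tokens.empty()) {
        return result;
    }
    if (rs.has_static_natural_endpoints()) {
        const endpoint_set eps = rs.calculate_natural_endpoints(tokens.front());
        for (token t : tokens) {
            result.emplace(t, eps);
        }
        return result;
    }
    for (token t : tokens) {
        result.emplace(t, rs.calculate_natural_endpoints(t));
    }
    return result;
}
```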
Refs #11005
Doesn't fix the issue, as the large allocation still remains until we make dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment, since we also require the ability to push to the front when unwrapping).
Closes #11009
* github.com:scylladb/scylladb:
effective_replication_map: make get_range_addresses asynchronous
range_streamer: add_ranges and friends: get erm as param
storage_service: get_new_source_ranges: get erm as param
storage_service: get_changed_ranges_for_leaving: get erm as param
storage_service: get_ranges_for_endpoint: get erm as param
repair: use get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces
storage_service: coroutinize update_pending_ranges
effective_replication_map: add get_replication_strategy
effective_replication_map: get_range_addresses: use the precalculated replication_map
abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
abstract_replication_strategy: reindent
utils: sequenced_set: expose set and `contains` method
abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
utils: sequenced_set: templatize VectorType
utils: sanitize sequenced_set
utils: sequenced_set: delete mutable get_vector method
To be used by coordinator side code to determine the correct tombstone
limit to pass to read-command (tombstone limit field added in the next
commit). When this limit is non-zero, the replica will start cutting
pages after the tombstone limit is surpassed.
This getter works similarly to `get_max_result_size()`: if the cluster
feature for empty replica pages is set, it will return the value
configured via db::config::query_tombstone_limit. System queries always
use a limit of 0 (unlimited tombstones).
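The selection logic can be sketched as follows. `limits_config` and this free-function form of `get_tombstone_limit` are hypothetical, and returning 0 when the cluster feature is not set is an assumption of the sketch rather than something stated above.

```cpp
#include <cstdint>

// Hypothetical inputs: the cluster feature state and the configured limit.
struct limits_config {
    bool empty_pages_feature;
    uint64_t query_tombstone_limit;
};

// Returns the tombstone limit to put into the read command; 0 means unlimited.
inline uint64_t get_tombstone_limit(const limits_config& cfg, bool is_system_query) {
    if (is_system_query || !cfg.empty_pages_feature) {
        return 0;
    }
    return cfg.query_tombstone_limit;
}
```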
To be used for getting a coherent set of all keyspaces
with non-local replication strategy and their respective
effective_replication_map.
As an example, use it in this patch in
storage_service::update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For node operations, we currently call get_non_system_keyspaces
but really want to work on all keyspaces that have non-local
replication strategy as they are replicated on other nodes.
Reflect that in the replica::database function name.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_schema_version as a distinct tagged_uuid class,
so it can be differentiated from other uuid-class types,
in particular table_id.
Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
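The tagging idea can be illustrated with a small template. This is a sketch, assuming a 64-bit payload in place of a real UUID; `tagged_id` and the tag struct names are hypothetical and do not mirror utils::tagged_uuid exactly.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical tagged wrapper: the Tag parameter only serves to make
// otherwise identical id types distinct to the compiler.
template <typename Tag>
struct tagged_id {
    uint64_t value = 0;  // stand-in for a real UUID payload

    friend bool operator==(const tagged_id& a, const tagged_id& b) {
        return a.value == b.value;
    }
};

struct table_id_tag {};
struct table_schema_version_tag {};

using table_id = tagged_id<table_id_tag>;
using table_schema_version = tagged_id<table_schema_version_tag>;

// The two types cannot be mixed up or implicitly converted into each other.
static_assert(!std::is_same_v<table_id, table_schema_version>);
static_assert(!std::is_convertible_v<table_id, table_schema_version>);
static_assert(!std::is_convertible_v<table_schema_version, table_id>);
```

Passing a `table_schema_version` where a `table_id` is expected then fails at compile time instead of silently compiling, as it would with a shared plain UUID type.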
Fixes#11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that snapshot orchestration in snapshot_on_all_shards
doesn't use snapshot_manager, get rid of the data structure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that snapshot orchestration is done solely
in snapshot_on_all_shards, the per-shard
snapshot function can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call take_snapshot on each shard and collect the
returned snapshot_file_set.
When all are done, move the vector<snapshot_file_set>
to finalize_snapshot.
All that without resorting to using the snapshot_manager
nor calling table::snapshot.
Both will be deleted in the following patches.
Fixes#11132
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and pass it to seal_snapshot, so that the latter won't
need to lookup and access the snapshot_manager object.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>