scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 01:50:35 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	be8512d7cc	sstables, code: Wrap directory semaphore with concurrency Currently this is a sharded<semaphore> started/stopped in main and referenced by database in order to be fed into sstables code. This semaphore always comes with the "concurrency" parameter that limits the parallel_for_each parallelism. This patch wraps both together into directory_semaphore class. This makes its usage simpler and will allow extending it in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 11:59:30 +03:00
Avi Kivity	f73a51250c	database: abort on illegal per partition rate limit operation Without memory corruption it's not possible for the switch to fall through, and the compiler will error if we forget to add a case. The compiler however is obliged to consider that we might store some other value in the variable.	2022-11-28 21:58:30 +02:00
Raphael S. Carvalho	9031dc3199	replica: Move table::backlog_tracker_adjust_charges() to compaction_group Procedures that call this function happen to be in compaction_group, so let's move it to group. Simplifies the change where the procedure retrieves tracker from the group itself. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	f37a05b559	replica: Move table::do_add_sstable() to compaction_group All callers of do_add_sstable() live in compaction_group, so it should be moved into compaction_group too. It also makes easier for the function to retrieve the backlog tracker from the group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Benny Halevy	fc278be6c4	table: add perform_cleanup_compaction Move the integration with compaction_manager from the api layer to the table class so it can also make sure the memtable is cleaned up in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-06 19:41:33 +02:00
Benny Halevy	119c0f3983	distributed_loader: pre-load all sstables metadata for table before populating it We should scan all sstables in the table directory and its subdirectories to determine the highest sstable version and generation before using it for creating new sstables (via reshard or reshape). Fixes scylladb/scylladb#11793 Note: table_population_metadata::start_subdir is called in a seastar thread to facilitate backporting to old versions that do not support coroutines yet. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-19 14:16:57 +03:00
Pavel Emelyanov	f9b57df471	database: Plug/unplug system_keyspace There's a circular dependency between system_keyspace and database. The former needs the latter because it needs to execute local requests via query_processor. The latter needs the former via compaction manager and large data handler, database depends on both and these too need to insert their entries into system keyspace. To cut this loop the compaction manager and large data handler both get a weak reference on the system keyspace. Once system keyspace starts, it activates this reference via the database call. When system keyspace is shut down on stop, it deactivates the reference. Technically the weak reference is implemented by marking the system_k.s. object as async_sharded_service, and the "reference" in question is the shared_from_this() pointer. When compaction manager or large data handler need to update a system keyspace's table, they both hold an extra reference on the system keyspace until the entry is committed, thus making sure that sys._k.s. doesn't stop from under their feet. At the same time, unplugging the reference on shutdown makes sure that no new entry updates will appear and the system_k.s. will eventually be released. It's not a C++ classical reference, because system_keyspace starts after and stops before database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Botond Dénes	b247f29881	Merge 'De-static system_keyspace::get_{saved\|local}_tokens()' from Pavel Emelyanov Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env Closes #11738 * github.com:scylladb/scylladb: system_keyspace: Make get_{local\|saved}_tokens non static size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges() cql_test_env: Keep sharded<system_keyspace> reference size_estimate_virtual_reader: Keep system_keyspace reference system_keyspace: Pass sys_ks argument to install_virtual_readers() system_keyspace: Make make() non-static distributed_loader: Pass sys_ks argument to init_system_keyspace() system_keyspace: Remove dangling forward declaration	2022-10-07 11:28:32 +03:00
Pavel Emelyanov	04552f2d58	system_keyspace: Pass sys_ks argument to install_virtual_readers() The size-estimate-virtual-reader will need it, now it's available as "this" from system_keyspace::make() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:57:13 +03:00
Raphael S. Carvalho	cf3f93304e	replica: Move compacted_undeleted_sstables into compaction group Compacted undeleted sstables are relevant for avoiding data resurrection in the purge path. As token ranges of groups won't overlap, it's better to isolate this data, so to prevent one group from interfering with another. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Pavel Emelyanov	9cd1f777a5	database.hh: Remove unused headers Use forward declarations when needed Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11667	2022-10-04 09:01:38 +03:00
Benny Halevy	d32c497cd9	database: automatically take snapshot of base table views The logic to reject explicit snapshot of views/indexes was improved in `aa127a2dbb`. However, we never implemented auto-snapshot of view/indexes when taking a snapshot of the base table. This is implemented in this patch. The implementation is built on top of `ba42852b0e` so it would be hard to backport to 5.1 or earlier releases. Fixes #11612 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 11:02:54 +03:00
TarasBor	1f4a93da78	Show warn message if `tombstone_warn_threshold` is reached on querier. When the querier reads a page with more tombstones than the `tombstone_warn_threshold` limit, a warning message appears in the logs. If `tombstone_warn_threshold` is 0, the feature is disabled. Refs scylladb#11410	2022-09-22 16:42:31 +03:00
Raphael S. Carvalho	f5715d3f0b	replica: Move memtables to compaction_group Now memtables live in compaction_group. Also introduced function that selects group based on token, but today table always return the single group managed by it. Once multiple groups are supported, then the function should interpret token content to select the group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	f4579795e6	replica: move compound SSTable set to compaction group The group is now responsible for providing the compound set. table still has one compound set, which will span all groups for the cases we want to ignore the group isolation. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	6717d96684	replica: move maintenance SSTable set to compaction_group This commit is restricted to moving maintenance set into compaction_group. Next, we'll introduce compound set into it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	ce8e5f354c	replica: move main SSTable set to compaction_group This commit is restricted to moving main set into compaction_group. Next, we'll move maintenance set into it and finally the memtable. A method is introduced to figure out which group a sstable belongs to, but it's still unimplemented as table is still limited to a single group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	4871f1c97c	replica: Introduce compaction_group Compaction group is a new abstraction used to group SSTables that are eligible to be compacted together. By this definition, a table in a given shard has a single compaction group. The problem with this approach is that data from different vnodes is intermixed in the same sstable, making it hard to move data in a given sstable around. Therefore, we'll want to have multiple groups per table. A group can be thought of as an isolated LSM tree where its memtable and sstable files are isolated from other groups. As for the implementation, the idea is to take a very incremental approach. In this commit, we're introducing a single compaction group to table. Next, we'll migrate sstable and maintenance set from table into that single compaction group. And finally, the memtable. Cache will be shared among the groups, for simplicity. It works due to its ability to invalidate a subset of the token range. There will be a 1:1 relationship between compaction_group and table_state. We can later rename table_state to compaction_group_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Benny Halevy	1ce50439af	replica: table: add get_compaction_manager function so to let a view get the tombstone_gc_state via the compaction_manager of the base table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Raphael S. Carvalho	631b2d8bdb	replica: rename table::on_compaction_completion and coroutinize it on_compaction_completion() is not very descriptive. let's rename it, following the example of update_sstable_lists_on_off_strategy_completion(). Also let's coroutinize it, so to remove the restriction of running it inside a thread only. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11407	2022-08-31 06:17:20 +03:00
Avi Kivity	e9cbc9ee85	Merge 'Add support for empty replica pages' from Botond Dénes Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones. The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by `3131cbea62`, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones. The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set. Upgrade sanity test was conducted as following: * Created cluster of 3 nodes with RF=3 with master version * Wrote small dataset of 1000 rows. * Deleted prefix of 980 rows. * Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100` * Also did some manual queries via `cqlsh` with smaller page size and tracing on. 
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`. * Confirmed there are no errors or read-repairs. Perf regression test: ``` build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60 ``` Before: ``` median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors) median absolute deviation: 973.40 maximum: 135511.63 minimum: 104978.74 ``` After: ``` median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors) median absolute deviation: 2979.13 maximum: 134538.13 minimum: 114688.07 ``` Diff: +~200 instruction/op. Fixes: https://github.com/scylladb/scylla/issues/7689 Fixes: https://github.com/scylladb/scylla/issues/3914 Fixes: https://github.com/scylladb/scylla/issues/7933 Refs: https://github.com/scylladb/scylla/issues/3672 Closes #11053 * github.com:scylladb/scylladb: test/cql-pytest: add test for query tombstone page limit query-result-writer: stop when tombstone-limit is reached service/pager: prepare for empty pages service/storage_proxy: set smallest continue pos as query's continue pos service/storage_proxy: propagate last position on digest reads query: result_merger::get() don't reset last-pos on short-reads and last pages query: add tombstone-limit to read-command service/storage_proxy: add get_tombstone_limit() query: add tombstone_limit type db/config: add config item for query tombstone limit gms: add cluster feature for empty replica pages tree: don't use query::read_command's IDL constructor	2022-08-10 13:38:06 +03:00
Raphael S. Carvalho	ace6334619	replica: table: kill unused _sstables_staging Good change as it's one less thing to worry about in compaction group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-08-10 12:32:13 +03:00
Avi Kivity	be44fd63f9	Merge 'Make get_range_addresses async and hold effective_replication_map_ptr around it' from Benny Halevy This series converts the synchronous `effective_replication_map::get_range_addresses` to async by calling the replication strategy async entry point with the same name, as its callers are already async or can be made so easily. To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map, let the callers of this patch hold a effective_replication_map per keyspace and pass it down to the (now asynchronous) functions that use it (making affected storage_service methods static where possible if they no longer depend on the storage_service instance). Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate that is true for local_strategy and everywhere_replication_strategy, and is false otherwise. With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow. 
Refs #11005 Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping) Closes #11009 * github.com:scylladb/scylladb: effective_replication_map: make get_range_addresses asynchronous range_streamer: add_ranges and friends: get erm as param storage_service: get_new_source_ranges: get erm as param storage_service: get_changed_ranges_for_leaving: get erm as param storage_service: get_ranges_for_endpoint: get erm as param repair: use get_non_local_strategy_keyspaces_erms database: add get_non_local_strategy_keyspaces_erms database: add get_non_local_strategy_keyspaces storage_service: coroutinize update_pending_ranges effective_replication_map: add get_replication_strategy effective_replication_map: get_range_addresses: use the precalculated replication_map abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies abstract_replication_strategy: reindent utils: sequenced_set: expose set and `contains` method abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set utils: sequenced_set: templatize VectorType utils: sanitize sequenced_set utils: sequenced_set: delete mutable get_vector method	2022-08-09 13:25:53 +03:00
Botond Dénes	1b669cefed	service/storage_proxy: add get_tombstone_limit() To be used by coordinator side code to determine the correct tombstone limit to pass to read-command (tombstone limit field added in the next commit). When this limit is non-zero, the replica will start cutting pages after the tombstone limit is surpassed. This getter works similarly to `get_max_result_size()`: if the cluster feature for empty replica pages is set, it will return the value configured via db::config::query_tombstone_limit. System queries always use a limit of 0 (unlimited tombstones).	2022-08-09 10:00:40 +03:00
Benny Halevy	db5c5ca59e	database: add get_non_local_strategy_keyspaces_erms To be used for getting a coherent set of all keyspaces with non-local replication strategy and their respective effective_replication_map. As an example, use it in this patch in storage_service::update_pending_ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	7ee6048255	database: add get_non_local_strategy_keyspaces For node operations, we currently call get_non_system_keyspaces but really want to work on all keyspace that have non-local replication strategy as they are replicated on other nodes. Reflect that in the replica::database function name. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	2b017ce285	schema, everywhere: define and use table_schema_version as a strong type Define table_schema_version as a distinct tagged_uuid class, So it can be differentiated from other uuid-class types, in particular table_id. Added reversed(table_schema_version) for convenience and uniformity since the same logic is currently open coded in several places. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:45 +03:00
Benny Halevy	257d74bb34	schema, everywhere: define and use table_id as a strong type Define table_id as a distinct utils::tagged_uuid modeled after raft tagged_id, so it can be differentiated from other uuid-class types, in particular from table_schema_version. Fixes #11207 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:41 +03:00
Benny Halevy	5316dbbe78	table: delete unused snapshot_manager and pending_snapshots Now that snapshot orchestration in snapshot_on_all_shards doesn't use snapshot_manager, get rid of the data structure. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	cca9068cfb	table: delete unused snapshot function Now that snapshot orchestration is done solely in snapshot_on_all_shards, the per-shard snapshot function can be deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	351a3a313d	table: snapshot_on_all_shards: orchestrate snapshot process Call take_snapshot on each shard and collect the returned snapshot_file_set. When all are done, move the vector<snapshot_file_set> to finalize_snapshot. All that without resorting to using the snapshot_manager nor calling table::snapshot. Both will be deleted in the following patches. Fixes #11132 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	39276cacc3	table: finalize_snapshot: take the file sets as a param and pass it to seal_snapshot, so that the latter won't need to lookup and access the snapshot_manager object. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	4dd56bbd6d	table: make seal_snapshot a static member Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	12716866a9	table: refactor finalize_snapshot out of snapshot Write schema.cql and the files manifest in finalize_snapshot. Currently call it from table::snapshot, but it will be called in a later patch by snapshot_on_all_shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	240f83546d	table: snapshot: keep per-shard file sets in snapshot_manager To simplify processing of the per-shard file names for generating the manifest. We only need to print them to the manifest at the end of the process, so there's no point in copying them around in the process, just move the foreign unique unordered_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	5100c1ba68	table: take_snapshot: return foreign unique ptr Currently the sstable file names are created and destroyed on each shard and are copied by the "coordinator" shard using submit_to, while the coroutine holds the source on its stack frame. To prepare for the next patches that refactor this code so that the coordinator shard will submit_to each shard to perform `take_snapshot` and return the set of sstrings in the future result, we need to wrap the result in a foreign_ptr so it gets freed on the shard that created it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	ff6508aa53	table: refactor take_snapshot out of snapshot Do the actual snapshot-taking code in a per-shard take_snapshot function, to be called from snapshot_on_all_shards in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	56f336d1aa	database: get rid of timestamp_func Pass an optional truncated_at time_point to truncate_table_on_all_shards instead of the over-complicated timestamp_func that returns the same time_point on all shards anyhow, and was only used for coordination across shards. Since now we synchronize the internal execution phase in truncate_table_on_all_shards, there is no longer need for this timestamp_func. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	b640c4fd17	database: truncate: snapshot table in all-shards layer With that the database layer does no longer need to invoke the private table::snapshot function, so it can be defriended from class table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	af0c71aa12	database: truncate: flush table and views in all-shards layer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	6e07e6b7ac	database: truncate: stop and disable compaction in all-shards layer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	e78dad1dfb	database: truncate: move call to set_low_replay_position_mark to all-shards layer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	a8bd3d97b6	database: truncate: enter per-shard table async_gate in all-shards layer Start moving the per-shard state establishment logic to truncate_table_on_all_shards, so that we would eventually do only the truncate logic per-se in the per-shard truncate function. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	4d4ca40c38	table: add snapshot_on_all_shards Called from the respective database entry points. Will be called also from the database drop / truncate path and will be used for central coordination of per-shard table::snapshot so we don't have to depend on the snapshot_manager mechanism that is fragile and currently causes abort if we fail to allocate it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	be56a73e78	database: add snapshot_table_on_all_shards We need to snapshot a single table in several paths. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	d96b56fee2	database: rename {flush,snapshot}_on_all and make static Follow the convention of drop_table_on_all_shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	a1eed1a6e9	database: drop_table_on_all_shards: truncate and stop table in upper layer Truncate the table on all shards, then stop it on all shards in the upper layer rather than in the per-shard drop_column_family() function, so we can further refactor truncate later, flushing and taking snapshot on all shards, before truncating. With that, rename drop_column_family to detach_column_family as now it only deregisters the column family from containers that refer to it (even via its uuid) and then its caller is responsible to take it from there. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	92cb7d448b	database: drop_table_on_all_shards: get all table shards before drop_column_family on each So the upper layer can flush, snapshot, and truncate the table on all shards, step by step. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	ca78a63873	database: truncate: get rid of the unused ks param Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	46e2a7c83b	database: add truncate_table_on_all_shards As a first step to decouple truncate from flush and snapshot. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
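
Commits 2b017ce285 and 257d74bb34 above describe table_id and table_schema_version as distinct tagged UUID types, so that one cannot be passed where the other is expected. The following is only a minimal sketch of that idea, not Scylla's actual utils::tagged_uuid: raw_uuid, tagged_id, the two tag structs and low_bits_of are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a UUID value: two 64-bit halves.
struct raw_uuid {
    std::uint64_t msb;
    std::uint64_t lsb;
    friend bool operator==(const raw_uuid& a, const raw_uuid& b) {
        return a.msb == b.msb && a.lsb == b.lsb;
    }
};

// Hypothetical strong-type wrapper: the Tag parameter carries no data,
// it only makes each instantiation a separate, incompatible type.
template <typename Tag>
class tagged_id {
    raw_uuid _id{};  // value-initialized, so a default id is all zeros
public:
    tagged_id() = default;
    // explicit: a bare raw_uuid never converts silently into an id
    explicit tagged_id(raw_uuid id) : _id(id) {}
    const raw_uuid& uuid() const { return _id; }
    friend bool operator==(const tagged_id& a, const tagged_id& b) {
        return a._id == b._id;
    }
    friend bool operator!=(const tagged_id& a, const tagged_id& b) {
        return !(a == b);
    }
};

// Empty tag types used only to tell the two kinds of id apart.
struct table_id_tag {};
struct table_schema_version_tag {};

using table_id = tagged_id<table_id_tag>;
using table_schema_version = tagged_id<table_schema_version_tag>;

// Mixing the two kinds of id is rejected at compile time.
static_assert(!std::is_convertible<table_id, table_schema_version>::value,
              "a table id must not convert to a schema version");
static_assert(!std::is_convertible<raw_uuid, table_id>::value,
              "a raw uuid must be wrapped explicitly");

// Hypothetical consumer that accepts only a table_id.
inline std::uint64_t low_bits_of(const table_id& id) {
    return id.uuid().lsb;
}
```

With these definitions, calling low_bits_of with a table_schema_version or with a bare raw_uuid fails to compile, which is the kind of mix-up the strong types are meant to catch.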


146 Commits