scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-01 21:55:50 +00:00

Author	SHA1	Message	Date
Avi Kivity	d450a145ce	Revert "Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak" This reverts commit `4b80130b0b`, reversing changes made to `a5519c7c1f`. It's suspected of causing dtest failures due to a bug in coroutine::parallel_for_each.	2023-10-29 18:32:06 +02:00
Kamil Braun	1c0ae2e7ef	Merge 'raft topology: assign tokens after join node response rpc' from Piotr Dulikowski Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier. However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet. Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier. Refs: scylladb/scylladb#15686 Fixes: scylladb/scylladb#15738 Closes scylladb/scylladb#15724 * github.com:scylladb/scylladb: test: test_topology_ops: continuously write during the test raft topology: assign tokens after join node response rpc storage_service: fix indentation after previous commit raft topology: loosen assumptions about transition nodes having tokens	2023-10-29 18:30:32 +02:00
Marcin Maliszkiewicz	020a9c931b	db: view: run local materialized view mutations on a separate smp service group When base write triggers mv write and it needs to be sent to another shard it used the same service group and we could end up with a deadlock. This fix affects also alternator's secondary indexes. Testing was done using a not-yet-committed framework for easy alternator performance testing: https://github.com/scylladb/scylladb/pull/13121. I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and then ran: ./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \ --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \ --duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000 Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after a couple of seconds scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb: p seastar::get_smp_service_groups_semaphore(2,0)._count $1 = 0 With the patch I wasn't able to observe the problem, even with 2x concurrency. I was able to make the process hang with 10x concurrency but I think it's hitting a different limit as there wasn't any depleted smp service group semaphore and it was happening also on non mv loads. Fixes https://github.com/scylladb/scylladb/issues/15844 Closes scylladb/scylladb#15845	2023-10-29 18:30:32 +02:00
Piotr Dulikowski	63aa9332aa	raft topology: assign tokens after join node response rpc Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier. However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet. Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier.	2023-10-25 11:50:17 +02:00
Piotr Dulikowski	2d161676c7	raft topology: loosen assumptions about transition nodes having tokens In later commits, tokens for a joining/replacing node will not be inserted when the node enters `bootstrapping`/`replacing` state but at some later step of the procedure. Loosen some of the assumptions in `storage_service::topology_state_load` and `system_keyspace::load_topology_state` appropriately.	2023-10-25 11:50:17 +02:00
Nadav Har'El	4b80130b0b	Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in: - `system_distributed_keyspace::start`, - `redis:create_keyspace_if_not_exists_impl`, - `table_helper::setup_keyspace` (for the `system_traces` keyspace). All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this: - calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum, - we must implement a retrying mechanism for every automatic `announce` if `group0_concurrent_modification` occurs to enable support for concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder. This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME. Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring. Fixes #15437 Closes scylladb/scylladb#15594 * github.com:scylladb/scylladb: table_helper: announce twice in setup_keyspace table_helper: refactor setup_table redis: create_keyspace_if_not_exists_impl: fix indentation redis: announce once in create_keyspace_if_not_exists_impl db: system_distributed_keyspace: fix indentation db: system_distributed_keyspace: announce once in start tablet_allocator: update on_before_create_column_family migration_listener: add parameter to on_before_create_column_family alternator: executor: use new prepare_new_column_family_announcement alternator: executor: introduce create_keyspace_metadata migration_manager: add new prepare_new_column_family_announcement	2023-10-24 15:42:48 +03:00
Kefu Chai	b36cef6f1a	sstable: remove _remote_prefix from s3_storage since we use the sstable.generation() for the remote prefix of the key of the object for storing the sstable component, there is no need to set remote_prefix beforehand. since `s3_storage::ensure_remote_prefix()` and `system_keyspace::sstables_registry_lookup_entry()` are not used anymore, they are removed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-10-23 10:08:22 +08:00
Kefu Chai	af8bc8ba63	sstable: switch to uuid identifier for naming S3 sstable objects before this change, we create a new UUID for a new sstable managed by the s3_storage, and we use the string representation of UUID defined by RFC4122 like "0aa490de-7a85-46e2-8f90-38b8f496d53b" for naming the objects stored on s3_storage. but this representation is not what we are using for storing sstables on local filesystem when the option of "uuid_sstable_identifiers_enabled" is enabled. instead, we are using a base36-based representation which is shorter. to be consistent with the naming of the sstables created for local filesystem, and more importantly, to simplify the interaction between the local copy of sstables and those stored on object storage, we should use the same string representation of the sstable identifier. so, in this change: 1. instead of creating a new UUID, just reuse the generation of the sstable for the object's key. 2. do not store the uuid in the sstable_registry system table, as we already have the generation of the sstable for the same purpose. 3. switch the sstable identifier representation from the one defined by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the base36-based one (implemented by fmt::formatter<sstables::generation_type>) 4. enable the `uuid_sstable_identifiers` cluster feature if it is enabled in the `test_env_config`, so that the sstable manager can enable the uuid-based identifier when creating a new uuid for sstable. 5. throw if the generation of sstable is not UUID-based when accessing / manipulating an sstable with S3 storage backend, as the S3 storage backend now relies on this option. otherwise we'd have sstables with key like s3://bucket/number/basename, which is just unable to serve as a unique id for sstable if the bucket is shared across multiple tables. Fixes #14175 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-10-23 10:08:22 +08:00
Kamil Braun	c1486fee40	Merge 'commitlog: drop truncation_records after replay' from Petr Gusev This is a follow-up for #15279 and it fixes two problems. First, we restore flushes on writes for the tables that were switched to the schema commitlog if `SCHEMA_COMMITLOG` feature is not yet enabled. Otherwise durability is not guaranteed. Second, we address the problem with truncation records, which could refer to the old commitlog if any of the switched tables were truncated in the past. If the node crashes later, and we replay schema commitlog, we may skip some mutations since their `replay_position`s will be smaller than the `replay_position`s stored for the old commitlog in the `truncated` table. It turned out that this problem exists even if we don't switch commitlogs for tables. If the node was rebooted the segment ids will start from some small number - they use `steady_clock` which is usually bound to boot time. This means that if the node crashed we may skip the mutations because their RPs will be smaller than the last truncation record RP. To address this problem we delete truncation records as soon as commitlog is replayed. We also include a test which demonstrates the problem. Fixes #15354 Closes scylladb/scylladb#15532 * github.com:scylladb/scylladb: add test_commitlog system.truncated: Remove replay_position data from truncated on start main.cc: flush only local memtables when replaying schema commitlog main.cc: drop redundant supervisor::notify system_keyspace: flush if schema commitlog is not available	2023-10-18 11:14:31 +02:00
Botond Dénes	7f81957437	Merge 'Initialize datadir for system and non-system keyspaces the same way' from Pavel Emelyanov When populating system keyspace the sstable_directory forgets to create upload/ subdir in the tables' datadir because of the way it's invoked from distributed loader. For non-system keyspaces directories are created in table::init_storage() which is self-contained and just creates the whole layout regardless of what. This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables. Test included. fixes: #15708 closes: scylladb/scylla-manager#3603 Closes scylladb/scylladb#15723 * github.com:scylladb/scylladb: test: Add test for datadir/ layout sstable_directory: Indentation fix after previous patch db,sstables: Move storage init for system keyspace to table creation	2023-10-18 12:12:19 +03:00
Avi Kivity	f42eb4d1ce	Merge 'Store and propagate GC timestamp markers from commitlog' from Calle Wilund Fixes #14870 (Originally suggested by @avikivity). Use commit log stored GC clock min positions to narrow compaction GC bounds. (Still requires augmented manual flush:es with extensive CL clearing to pass various dtest, but this does not affect "real" execution). Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment the first time. Because GC clock is wall clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds per segment. This is then (in a rather ugly way) propagated to tombstone_gc_state to narrow the allowed GC bounds for a CF, based on what is currently left in CL. Note: this is a rather unoptimized version - no caching or anything. But even so, should not be excessively expensive, esp. since various other code paths already cache the results. Closes scylladb/scylladb#15060 * github.com:scylladb/scylladb: main/cql_test_env: Augment compaction mgr tombstone_gc_state with CL GC info tombstone_gc_state: Add optional callback to augment GC bounds commitlog: Add keeping track of approximate lowest GC clock for CF entries database: Force new commitlog segment on user initiated flush commitlog: Add helper to force new active segment	2023-10-17 18:27:43 +03:00
Calle Wilund	6fbd210679	system.truncated: Remove replay_position data from truncated on start Once we've started clean, and all replaying is done, truncation logs commit log regarding replay positions are invalid. We should exorcise them as soon as possible. Note that we cannot remove truncation data completely though, since the time stamps stored are used by things like batch log to determine if it should use or discard old batch data.	2023-10-17 18:16:48 +04:00
Petr Gusev	c89ead55ff	system_keyspace: flush if schema commitlog is not available In PR #15279 we removed flushes when writing to a number of tables from the system keyspace. This was made possible by switching these tables to the schema commitlog. Schema commitlog is enabled only when the SCHEMA_COMMITLOG feature is supported by all nodes in the cluster. Before that these tables will use the regular commitlog, which is not durable because it uses db::commitlog::sync_mode::PERIODIC. This means that we may lose data if a node crashes during upgrade to the version with schema commitlog. In this commit we fix this problem by restoring flushes after writes to the tables if the schema commitlog is not enabled yet. The patch also contains a test that demonstrates the problem. We need flush_schema_tables_after_modification option since otherwise schema changes are not durable and node fails after restart.	2023-10-17 18:14:27 +04:00
Calle Wilund	560d3c17f0	commitlog: Add keeping track of approximate lowest GC clock for CF entries Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment first. Because GC clock is wall clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds per segment. Includes of course a function to get the all-segment lowest per CF.	2023-10-17 10:26:41 +00:00
Calle Wilund	810d06946f	commitlog: Add helper to force new active segment When called, if active segment holds data, close and replace with pristine one.	2023-10-17 10:26:40 +00:00
Tomasz Grabiec	0aef0f900b	Merge 'truncation records refactorings' from Petr Gusev This PR contains several refactorings related to truncation records handling in `system_keyspace`, `commitlog_replayer` and `table` classes: * drop map_reduce from `commitlog_replayer`, it's sufficient to load truncation records from the null shard; * add a check that `table::_truncated_at` is properly initialized before it's accessed; * move its initialization after `init_non_system_keyspaces` Closes scylladb/scylladb#15583 * github.com:scylladb/scylladb: system_keyspace: drop truncation_record system_keyspace: remove get_truncated_at method table: get_truncation_time: check _truncated_at is initialized database: add_column_family: initialize truncation_time for new tables database: add_column_family: rename readonly parameter to is_new system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace commitlog_replayer: refactor commitlog_replayer::impl::init system_keyspace: drop redundant typedef system_keyspace: drop redundant save_truncation_record overload table: rename cache_truncation_record -> set_truncation_time system_keyspace: get_truncated_position -> get_truncated_positions	2023-10-17 10:55:30 +02:00
Pavel Emelyanov	059d7c795e	db,sstables: Move storage init for system keyspace to table creation User and system keyspaces are created and populated slightly differently. System keyspace is created via system_keyspace::make() which eventually calls add_column_family(). Then it's populated via init_system_keyspace() which calls sstable_directory::prepare() which, in turn, optionally creates directories in datadir/ or checks the directory permissions if it exists. User keyspaces are created with the help of add_column_family_and_make_directory() call which calls the add_column_family() mentioned above _and_ calls table::init_storage() to create directories. When it's populated with init_non_system_keyspaces() it also calls sstable_directory::prepare() which notices that the directory exists and then checks the permissions. As a result, sstable_directory::prepare() initializes storage for system keyspace only and there's a BUG (#15708) that the upload/ subdir is not created. This patch makes the directories creation for _all_ keyspaces with the table::init_storage(). The change only touches system keyspace by moving the creation of directories from sstable_directory::prepare() into system_keyspace::make(). Indentation is deliberately left broken. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-10-16 16:19:25 +03:00
Patryk Jędrzejczak	98d067e77d	db: system_distributed_keyspace: fix indentation Broken in the previous commit.	2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak	5ebc0e8617	db: system_distributed_keyspace: announce once in start We refactor system_distributed_keyspace::start so that it takes at most one group 0 guard and calls migration_manager::announce at most once. We remove a catch expression together with the FIXME from get_updated_service_levels (add_new_columns_if_missing before the patch) because we cannot treat the service_levels update differently anymore.	2023-10-16 14:59:53 +02:00
Jan Ciolek	940e44f887	db/view: change log level of failed view updates to WARN When a remote view update doesn't succeed there's a log message saying "Error applying view update...". This message had log level ERROR, but it's not really a hard error. View updates can fail for a multitude of reasons, even during normal operation. A failing view update isn't fatal, it will be saved as a view hint and retried later. Let's change the log level to WARN. It's something that shouldn't happen too much, but it's not a disaster either. ERROR log level causes trouble in tests which assume that an ERROR level message means that the test has failed. Refs: https://github.com/scylladb/scylladb/issues/15046#issuecomment-1712748784 For local view updates the log level stays at "ERROR", local view updates shouldn't fail. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes scylladb/scylladb#15640	2023-10-11 18:19:23 +03:00
Avi Kivity	35849fc901	Revert "Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun" This reverts commit `3d4398d1b2`, reversing changes made to `45dfce6632`. The commit causes some schema changes to be lost due to incorrect timestamps in some mutations. More information is available in [1]. Reopens: scylladb/scylladb#7620 Reopens: scylladb/scylladb#13957 Fixes scylladb/scylladb#15530. [1] https://github.com/scylladb/scylladb/pull/15687	2023-10-11 00:32:05 +03:00
Dawid Medrek	6fdca0d3a8	db/hints/manager: Reword comments about state The current comments should be clearer to someone not familiar with the module. This commit also makes them abide by the limit of 120 characters per line.	2023-10-06 13:25:30 +02:00
Dawid Medrek	aa38ea3642	db/hints/manager: Unfriend space_watchdog space_watchdog is a friend of shard hint manager just to be able to execute one of its functions. This commit changes that by unfriending the class and exposing the function.	2023-10-06 13:25:30 +02:00
Dawid Medrek	6cd0153954	db/hints: Remove a redundant alias	2023-10-06 13:25:30 +02:00
Dawid Medrek	ddc385bce0	db/hints: Remove an unused namespace	2023-10-06 13:25:30 +02:00
Dawid Medrek	76d414012b	db/hints: Coroutinize change_host_filter()	2023-10-06 13:25:30 +02:00
Dawid Medrek	09eb30e6f1	db/hints: Coroutinize drain_for() This commit turns the function into a coroutine and makes the code less compact and more readable.	2023-10-06 13:25:30 +02:00
Dawid Medrek	907a572e24	db/hints: Clean up can_hint_for() This commit gets rid of unnecessary additional calls to functions and makes all lines abide by the limit of 120 characters.	2023-10-06 13:25:30 +02:00
Dawid Medrek	596e1f9859	db/hints: Clean up store_hint() This commit makes the function abide by the limit of 120 characters per line.	2023-10-06 13:25:30 +02:00
Dawid Medrek	8a43f94ca6	db/hints: Clean up too_many_in_flight_hints_for() This commit makes the return statement more readable. It also makes the comment abide by the limit of 120 characters per line.	2023-10-06 13:25:30 +02:00
Dawid Medrek	96a5906621	db/hints: Refactor get_ep_manager()	2023-10-06 13:25:30 +02:00
Dawid Medrek	8b591be3c3	db/hints: Coroutinize wait_for_sync_point() This commit coroutinizes the function and adds a comment explaining a non-trivial case.	2023-10-06 13:25:27 +02:00
Dawid Medrek	fee3aafd80	db/hints: Use std::span in calculate_current_sync_point std::span is a lot more flexible than std::vector as it allows for arbitrary contiguous ranges.	2023-10-06 12:36:05 +02:00
Dawid Medrek	64fd4d6323	db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints()	2023-10-06 12:26:55 +02:00
Dawid Medrek	58cd5c4167	db/hints: Clean up manager::forbid_hints()	2023-10-06 12:26:55 +02:00
Dawid Medrek	f8ed93f5bc	db/hints: Clean up manager::allow_hints()	2023-10-06 12:26:52 +02:00
Dawid Medrek	bfe32bcf89	db/hints: Coroutinize compute_hints_dir_device_id()	2023-10-06 12:18:30 +02:00
Dawid Medrek	8f28eb6522	db/hints: Clean up manager::stop() This commit gets rid of boilerplate in the function, leverages a range pipe and explicit types to make the code more readable, and changes the logs to make it clearer what happens.	2023-10-06 12:18:30 +02:00
Dawid Medrek	a384caece0	db/hints: Clean up manager::start() This commit coroutinizes the function and makes it less compact.	2023-10-06 12:18:30 +02:00
Dawid Medrek	2db97aaf81	db/hints/manager: Clean up the constructor fmt::to_string should be preferred to seastar::format. It's clearer and simpler. Besides that, this commit makes the code abide by the limit of 120 characters per line.	2023-10-06 12:18:30 +02:00
Dawid Medrek	6c10a86791	db/hints: Remove boilerplate drain_lock()	2023-10-06 12:18:30 +02:00
Dawid Medrek	f1f35ba819	db/hints: Let drain_for() return a future Currently, the function doesn't return anything. However, if the future doesn't need to be awaited, the caller can decide that. There is no reason to make that decision in the function itself.	2023-10-06 12:18:25 +02:00
Dawid Medrek	79e1412f14	db/hints: Remove ep_managers_end The methods are redundant and are effectively code boilerplate.	2023-10-06 12:15:04 +02:00
Dawid Medrek	cfbacb29bb	db/hints: Remove find_ep_manager The methods are redundant and are effectively code boilerplate.	2023-10-06 12:15:04 +02:00
Dawid Medrek	1c70a18fc7	db/hints: Use manager as API for hint_endpoint_manager This commit makes with_file_update_mutex() a method of hint_endpoint_manager and introduces db::hints::manager::with_file_update_mutex_for() for accessing it from the outside. This way, hint_endpoint_manager is hidden and no one needs to know about its existence.	2023-10-06 12:15:01 +02:00
Dawid Medrek	d068143b83	db/hints: Don't mark have_ep_manager()'s definition as inline Doing that doesn't allow for external linkage, so it's not accessible from other files.	2023-10-06 11:54:15 +02:00
Dawid Medrek	58249363bc	db/hints: Remove make_directory_initializer() The function is never used. It's not even implemented.	2023-10-06 11:54:15 +02:00
Dawid Medrek	f47a669f75	db/hints/manager: Order constructors This commit orders constructors of db::hints::manager for readability.	2023-10-06 11:54:15 +02:00
Dawid Medrek	4663f72990	db/hints: Move ~manager() and mark it as noexcept The destructor is trivial and there is no reason to keep it in the source file. We mark it as noexcept too.	2023-10-06 11:54:15 +02:00
Dawid Medrek	18a2831186	db/hints: Use reference for storage proxy This commit makes db::hints::manager store service::storage_proxy as a reference instead of a seastar::shared_ptr. The manager is owned by storage proxy, so it only lives as long as storage proxy does. Hence, it makes little sense to store the latter as a shared pointer; in fact, it's very confusing and may be error-prone. The field never changes, so it's safe to keep it as a reference (especially because copy and move constructors of db::hints::manager are both deleted). What's more, we ensure that the hint manager has access to storage proxy as soon as it's created. The same changes were applied to db::hints::resource_manager. The rationale is the same.	2023-10-06 11:54:15 +02:00


3422 Commits