scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-01 13:45:53 +00:00

Author	SHA1	Message	Date
Nadav Har'El	62f89d49e5	tablets, mv: fix on_internal_error on write to base table The situation before this patch is that when tablets are enabled for a keyspace, we can create a materialized view but later any write to the base table fails with an on_internal_error(), saying that: "Tried to obtain per-keyspace effective replication map of test but it's per-table." Indeed, with tablets, the replication is different for each table - it's not the same for the entire keyspace. So this patch changes the view update code to take the replication map from the specific base table, not the keyspace. This is good enough to get materialized-view reads and writes working in a simple single-node case, as the included test demonstrates (the test fails with on_internal_error() before this patch, and passes afterwards). But this fix is not perfect - the base-view pairing code really needs to consider not only the base table's replication map, but also the view table's replication map - as those can be different. We'll fix this remaining problem as a followup in a separate patch - it will require a substantially more elaborate test to reproduce the need for the different mapping and to verify that fix. Fixes #16209. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16211	2023-11-29 15:29:17 +01:00
Calle Wilund	3b70fde3cd	commitlog: Make named_files in delete_segments have updated size Fixes #16207 commitlog::delete_segments deletes (or recycles) segments replayed. The actual file size here is added to footprint so actual delete then can determine iff things should be recycled or removed. However, we build a pending delete list of named_files, and the files we added did not have size set. Bad. Actual deletion then treated files as zero-byte sized, i.e. footprint calculations borked. Simple fix is just filling in the size of the objects when adding. Added unit test for the problem. Closes scylladb/scylladb#16210	2023-11-29 09:58:47 +02:00
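The failure mode described in the commit above can be illustrated with a toy model. This is only a sketch under stated assumptions: `NamedFile`, `plan_deletion`, and the recycle-versus-remove rule (remove while the footprint exceeds a limit) are hypothetical stand-ins, not the commitlog's actual code.

```python
from dataclasses import dataclass


@dataclass
class NamedFile:
    name: str
    size: int = 0  # recorded on-disk size; stays 0 if nobody fills it in


def plan_deletion(files, footprint, limit):
    """Toy policy (an assumption): remove a file while the tracked footprint
    is above the limit, otherwise keep it around for recycling."""
    plan = []
    for f in files:
        if footprint > limit:
            footprint -= f.size  # a zero size leaves the footprint untouched
            plan.append((f.name, "remove"))
        else:
            plan.append((f.name, "recycle"))
    return plan, footprint


names = ["seg-1", "seg-2", "seg-3"]
sized = [NamedFile(n, 32) for n in names]   # sizes filled in when adding
unsized = [NamedFile(n) for n in names]     # sizes left at zero

good_plan, good_footprint = plan_deletion(sized, footprint=96, limit=64)
bad_plan, bad_footprint = plan_deletion(unsized, footprint=96, limit=64)
```

With sizes recorded, the footprint drops after the first removal and the remaining segments are recycled. With zero sizes, the footprint never shrinks in this model, so every segment is removed.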
Botond Dénes	3ed6925673	Merge 'Major compaction: flush commitlog by forcing new active segment and flushing all tables' from Benny Halevy Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database releases all references to commitlog segments and thereby maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for each keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Closes scylladb/scylladb#15820 * github.com:scylladb/scylladb: docs: nodetool: flush: enrich examples docs: nodetool: compact: fix example api: add /storage_service/compact api: add /storage_service/flush compaction_manager: flush_all_tables before major compaction database: add flush_all_tables api: compaction: add flush_memtables option test/nodetool: jmx: fix path to scripts/scylla-jmx scylla-nodetool, docs: improve optional params documentation	2023-11-29 08:48:40 +02:00
Kamil Braun	3582095b79	schema_tables: use smaller timestamp for base mutations included with view update When a view schema is changed, the schema change command also includes mutations for the corresponding base table; these mutations don't modify the base schema but are included in case if the receiver of view mutations somehow didn't receive base mutations yet (this may in theory happen outside Raft mode). There are situations where the schema change command contains both mutations that describe the current state of the base table -- included by a view update, as explained above -- and mutations that want to modify the base table. Such situation arises, for example, when we update a user-defined type which is referenced by both a view and its corresponding base table. This triggers a schema change of the view, which generates mutations to modify the view and includes mutations of the current base schema, and at the same time it triggers a schema change of the base, which generates mutations to modify the base. These two sets of mutations are conflicting with each other. One set wants to preserve the current state of the base table while the other wants to modify it. And the two sets of mutations are generated using the same timestamp, which means that conflict resolution between them is made on a per-mutation-cell basis, comparing the values in each cell and taking the "larger" one (meaning of "larger" depends on the type of each cell). Fortunately, this conflict is currently benign -- or at least there is no known situation where it causes problems. Unfortunately, it started causing problems when I attempted to implement group 0 schema versioning (PR scylladb/scylladb#15331), where instead of calculating table versions as hashes of schema mutations, we would send versions as part of schema change command. These versions would be stored inside the `system_schema.scylla_tables` table, `version` column, and sent as part of schema change mutations. 
And then the conflict showed. One set of mutations wanted to preserve the old value of `version` column while the other wanted to update it. It turned out that sometimes the old `version` prevailed, because the `version` column in `system_schema.scylla_tables` uses UUID-based comparison (not timeuuid-based comparison). This manifested as issue scylladb/scylladb#15530. To prevent this, the idea in this commit is simple: when generating mutations for the base table as part of corresponding view update, do not use the provided timestamp directly -- instead, decrement it by one. This way, if the schema change command contains mutations that want to modify the base table, these modifying mutations will win all conflicts based on the timestamp alone (they are using the same provided timestamp, but not decremented). One could argue that the choice of this timestamp is anyway arbitrary. The original purpose of including base mutations during view update was to ensure that a node which somehow missed the base mutations, gets them when applying the view. But in that case, the "most correct" solution should have been to use the original base mutations -- i.e. the ones that we have on disk -- instead of generating new mutations for the base with a refreshed timestamp. The base mutations that we have on disk have smaller timestamps already (since these mutations are from the past, when the base was last modified or created), so the conflict would also not happen in this case. But that solution would require doing a disk read, and we can avoid the read while still fixing the conflict by using an intermediate solution: regenerating the mutations but with `timestamp - 1`. Ref: scylladb/scylladb#15530 Closes scylladb/scylladb#16139	2023-11-28 21:51:18 +01:00
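The conflict and the `timestamp - 1` fix described in the commit above can be sketched as follows. This is a simplified model, not Scylla's reconciliation code: `Cell` and `reconcile` are hypothetical names, and plain byte-string comparison stands in for the type-specific comparison of cell values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    timestamp: int
    value: bytes


def reconcile(a, b):
    """The higher timestamp wins; on a tie, the 'larger' value wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value >= b.value else b


ts = 1000
old_version = b"\xff-old"  # happens to compare larger than the new value
new_version = b"\x01-new"

# Both mutation sets use the same timestamp: the value comparison decides,
# and the old version can prevail.
same_ts = reconcile(Cell(ts, old_version), Cell(ts, new_version))

# Base mutations included by the view update use ts - 1: the modifying
# mutation wins on the timestamp alone.
fixed = reconcile(Cell(ts - 1, old_version), Cell(ts, new_version))
```

In this model `same_ts` keeps the old value while `fixed` carries the new one, regardless of how the two values compare.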
Benny Halevy	66ba983fe0	compaction_manager: flush_all_tables before major compaction Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database releases all references to commitlog segments and thereby maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for each keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Botond Dénes	f46cdce9d3	Merge 'Make memtable flush tolerate misconfigured S3 storage' from Pavel Emelyanov Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done runtime, scylla re-reads this config upon HUP signal. Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well. fixes: #13745 Closes scylladb/scylladb#15635 * github.com:scylladb/scylladb: test: Add object_store test to validate config reloading works test: Add config update facility to test cluster test: Make S3_Server export config file as pathlib.Path config: Make object storage config updateable_value_source memtable: Extend list of checking codes sstables/storage/s3: Fix missing TOC status check s3/client: Map http exceptions into storage_io_error exceptions: Extend storage_io_error construction options	2023-11-28 09:33:37 +02:00
Botond Dénes	a472700309	Merge 'Minor fixes and refactors' from Kamil Braun - remove some code that is obsolete in newer Scylla versions, - fix some minor bugs. These bugs appear to be benign, there are no known issues caused by them, but fixing them is a good idea nevertheless, - refactor some code for better maintainability. Parts of this PR were extracted from https://github.com/scylladb/scylladb/pull/15331 (which was merged but later reverted), parts of it are new. Closes scylladb/scylladb#16162 * github.com:scylladb/scylladb: test/pylib: log_browsing: fix type hint migration_manager: take `abort_source&` in get_schema_for_read/write migration_manager: inline merge_schema_in_background migration_manager: remove unused merge_schema_from overload migration_manager: assume `canonical_mutation` support migration_manager: add `std::move` to avoid a copy schema_tables: refactor `scylla_tables(schema_features)` schema_tables: pass `reload` flag when calling `merge_schema` cross-shard system_keyspace: fix outdated comment	2023-11-24 17:34:21 +02:00
Kamil Braun	269a189526	schema_tables: refactor `scylla_tables(schema_features)` The `scylla_tables` function gives a different schema definition for the `system_schema.scylla_tables` table, depending on whether certain schema features are enabled or not. The way it was implemented, we had to write `θ(2^n)` amount of code and comments to handle `n` features. Refactor it so that the amount of code we have to write to handle `n` features is `θ(n)`.	2023-11-23 17:23:47 +01:00
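The idea behind going from `θ(2^n)` to `θ(n)` code can be sketched as a table of per-feature additions applied in a loop, instead of one hand-written definition per feature combination. The feature and column names below are hypothetical placeholders, and the base column list is only illustrative.

```python
from itertools import combinations

# Illustrative base columns; not the full definition of the real table.
BASE_COLUMNS = ["keyspace_name", "table_name", "version"]

# One entry per feature (hypothetical names): supporting a new feature means
# adding one line here, not doubling the number of schema variants written out.
FEATURE_COLUMNS = [
    ("feature_a", "column_a"),
    ("feature_b", "column_b"),
    ("feature_c", "column_c"),
]


def scylla_tables_columns(enabled):
    """Build the column list for a given set of enabled features."""
    columns = list(BASE_COLUMNS)
    for feature, column in FEATURE_COLUMNS:
        if feature in enabled:
            columns.append(column)
    return columns


# All 2^n variants still exist, but they are generated rather than written.
feature_names = [name for name, _ in FEATURE_COLUMNS]
variants = {
    tuple(scylla_tables_columns(set(subset)))
    for r in range(len(feature_names) + 1)
    for subset in combinations(feature_names, r)
}
```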
Kamil Braun	5223d32fab	schema_tables: pass `reload` flag when calling `merge_schema` cross-shard In `0c86abab4d` `merge_schema` obtained a new flag, `reload`. Unfortunately, the flag was assigned a default value, which I think is almost always a bad idea, and indeed it was in this case. When `merge_schema` is called on shard different than 0, it recursively calls itself on shard 0. That recursive call forgot to pass the `reload` flag. Fix this.	2023-11-23 14:06:40 +01:00
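The class of bug described in the commit above is easy to reproduce in miniature. The functions below are hypothetical stand-ins that only mimic the shape of the problem (a recursive call to "shard 0" that drops a defaulted flag); they are not the real `merge_schema`.

```python
def merge_schema_buggy(shard, log, reload=False):
    if shard != 0:
        # Bug: `reload` is not forwarded, so the default silently applies.
        return merge_schema_buggy(0, log)
    log.append(("merge", reload))


def merge_schema_fixed(shard, log, *, reload):
    # No default: every caller, including the recursive one, must pass it.
    if shard != 0:
        return merge_schema_fixed(0, log, reload=reload)
    log.append(("merge", reload))


buggy_log, fixed_log = [], []
merge_schema_buggy(3, buggy_log, reload=True)
merge_schema_fixed(3, fixed_log, reload=True)
```

Making the flag a required keyword argument turns a forgotten argument into an immediate `TypeError` instead of a silently wrong value.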
Kamil Braun	de3607810d	system_keyspace: fix outdated comment	2023-11-23 14:06:27 +01:00
Kefu Chai	55103f4a6b	hints: move formatter of db::hints::sync_point to test the operator<<() based formatter is only used in its test, so let's move it to where it is used. we can always bring it back later if it is required in other places. but better off implementing it as a fmt::formatter<> then. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16142	2023-11-23 11:22:31 +02:00
Kefu Chai	6749d963ed	config: define formatter for db::seed_provider_type before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for db::seed_provider_type. please note, we are still formatting vector<db::seed_provider_type> with the helper provided by seastar/core/sstring.hh, which uses operator<<() to print the elements in the vector being printed. so we have to keep the operator<< formatter before disabling the generic formatter for vector<T>. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16138	2023-11-23 11:04:35 +02:00
Pavel Emelyanov	210b01a5ce	config: Make object storage config updateable_value_source Now it's a plain updateable_value, but without the ..._source object the updateable_value is just a no-op value holder. In order for the observers to operate there must be the value source, updating it would update the attached updateable values _and_ notify the observers. In order for the config to be the u.v._source, config entries should be comparable to each other, thus the <=> operator for it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-11-21 16:47:50 +03:00
Pavel Emelyanov	3471f30b58	view_update_generator: Unplug from database later Patch `967ebacaa4` (view_update_generator: Move abort kicking to do_abort()) moved unplugging v.u.g from database from .stop() to .do_abort(). The latter call happens very early on stop -- once scylla receives SIGINT. However, database may still need v.u.g. plugged to flush views. This patch moves unplug to later, namely to .stop() method of v.u.g. which happens after database is drained and should no longer continue view updates. fixes: #16001 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#16091	2023-11-20 11:47:55 +02:00
Calle Wilund	6ffb482bf3	Commitlog replayer: Range-check skip call Fixes #15269 If segment being replayed is corrupted/truncated we can attempt skipping completely bogus byte amounts, which can cause assert (i.e. crash) in file_data_source_impl. This is not a crash-level error, so ensure we range check the distance in the reader. v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will ensure we log the fact that things are broken Closes scylladb/scylladb#15270	2023-11-19 17:44:55 +02:00
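A range-checked skip of the kind described in the commit above might look like the sketch below. `checked_skip` is a hypothetical helper operating on an in-memory stream, and reporting the excess as "corrupt bytes" is an assumption made for illustration, not the replayer's actual accounting.

```python
import io


def checked_skip(stream, distance, remaining):
    """Skip at most `remaining` bytes, however large `distance` claims to be.

    Returns (bytes actually skipped, excess that could not be skipped)."""
    allowed = min(distance, remaining)
    stream.seek(allowed, io.SEEK_CUR)
    return allowed, distance - allowed


segment = io.BytesIO(b"x" * 10)
segment.read(4)  # 6 bytes are left in this truncated "segment"

# A corrupted header asks to skip 100 bytes; clamp instead of crashing.
skipped, corrupt = checked_skip(segment, distance=100, remaining=6)
```

A nonzero excess can then be logged or accumulated so that the corruption is at least reported.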
Kefu Chai	15bfa09454	treewide: do not mark return value const if this has no effect this change is a cleanup. to mark a return value without value semantics has no effect. these `const` specifiers are useless. so let's drop them. and, if we compile the tree with `-Wignored-qualifiers`, the compiler would warn like: ``` /home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers] 245 \| const index_metadata_kind kind() const; \| ^~~~~ ``` so this change also silences the above warnings. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-11-17 17:46:19 +08:00
Kefu Chai	efd65aebb2	build: cmake: add check-header target to have feature parity with `configure.py`. we won't need this once we migrate to C++20 modules. but before that day comes, we need to stick with C++ headers. we generate a rule for each .hh file to create a corresponding .cc and then compile it, in order to verify the self-containedness of that header. so the number of rules is quite large. to avoid the unnecessary overhead, the check-header target is enabled only if the `Scylla_CHECK_HEADERS` option is enabled. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#15913	2023-11-13 10:27:06 +02:00
Kamil Braun	f094e23d84	system_keyspace: use system memory for `system.raft` table `system.raft` was using the "user memory pool", i.e. the `dirty_memory_manager` for this table was set to `database::_dirty_memory_manager` (instead of `database::_system_dirty_memory_manager`). This meant that if a write workload caused memory pressure on the user memory pool, internal `system.raft` writes would have to wait for memtables of user tables to get flushed before the write would proceed. This was observed in SCT longevity tests which ran a heavy workload on the cluster and concurrently, schema changes (which underneath use the `system.raft` table). Raft would often get stuck waiting many seconds for user memtables to get flushed. More details in issue #15622. Experiments showed that moving Raft to system memory fixed this particular issue, bringing the waits to reasonable levels. Currently `system.raft` stores only one group, group 0, which is internally used for cluster metadata operations (schema and topology changes) -- so it makes sense to keep using system memory. In the future we'd like to have other groups, for strongly consistent tables. These groups should use the user memory pool. It means we won't be able to use `system.raft` for them -- we'll just have to use a separate table. Fixes: scylladb/scylladb#15622 Closes scylladb/scylladb#15972	2023-11-08 11:21:14 +02:00
Botond Dénes	76ab66ca1f	Merge 'Support state change for S3-backed sstables' from Pavel Emelyanov The sstable currently can move between normal, staging and quarantine state runtime. For S3-backed sstables the state change means maintaining the state itself in the ownership table and updating it accordingly. There's also the upload facility that's implemented as state change too, but this PR doesn't support this part. fixes: #13017 Closes scylladb/scylladb#15829 * github.com:scylladb/scylladb: test: Make test_sstables_excluding_staging_correctness run over s3 too sstables,s3: Support state change (without generation change) system_keyspace: Add state field to system.sstables sstable_directory: Tune up sstables entries processing comment system_keyspace: Tune up status change trace message sstables: Add state string to state enum class convert	2023-11-07 10:45:41 +02:00
Benny Halevy	a1acf6854b	everywhere: reduce dependencies on i_partitioner.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:47:44 +02:00
Patryk Jędrzejczak	0357636f16	db/system_distributed_keyspace: fix indentation Broken in the previous commit.	2023-11-02 14:21:15 +01:00
Patryk Jędrzejczak	813c7a582c	db/system_distributed_keyspace: retry start on concurrent operation A concurrent group 0 operation in system_distributed_keyspace::start can happen during concurrent bootstrap in the Raft-based topology.	2023-11-02 14:21:15 +01:00
Kamil Braun	0846d324d7	Merge 'rollback topology operation on streaming failure' from Gleb This patch series adds error handling for streaming failure during topology operations instead of an infinite retry. If streaming fails the operation is rolled back: bootstrap/replace nodes move to left and decommissioned/remove nodes move back to normal state. * 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev: raft: make sure that all operation forwarded to a leader are completed before destroying raft server storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier tests: add tests for streaming failure in bootstrap/replace/remove/decomission test/pylib: do not stop node if decommission failed with an expected error storage_service: raft topology: fix typo in "decommission" everywhere storage_service: raft topology: add streaming error injection storage_service: raft topology: do not increase topology version during CDC repair storage_service: raft topology: rollback topology operation on streaming failure. storage_service: raft topology: load request parameters in left_token_ring state as well storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch storage_service: raft topology: make global_token_metadata_barrier node independent storage_service: raft topology: split get_excluded_nodes from exec_global_command storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true storage_service: raft topology: simplify streaming RPC failure handling	2023-11-02 10:15:45 +01:00
Kamil Braun	ae58e39743	Merge 'reduce announcements of the automatic schema changes' from Patryk Jędrzejczak There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in: - `system_distributed_keyspace::start`, - `redis:create_keyspace_if_not_exists_impl`, - `table_helper::setup_keyspace` (for the `system_traces` keyspace). All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this: - calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum, - we must implement a retrying mechanism for every automatic `announce` if `group0_concurrent_modification` occurs to enable support for concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder. This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME. Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring. 
Fixes scylladb/scylladb#15437 Closes scylladb/scylladb#15897 * github.com:scylladb/scylladb: table_helper: announce twice in setup_keyspace table_helper: refactor setup_table redis: create_keyspace_if_not_exists_impl: fix indentation redis: announce once in create_keyspace_if_not_exists_impl db: system_distributed_keyspace: fix indentation db: system_distributed_keyspace: announce once in start tablet_allocator: update on_before_create_column_family migration_listener: add parameter to on_before_create_column_family alternator: executor: use new prepare_new_column_family_announcement alternator: executor: introduce create_keyspace_metadata migration_manager: add new prepare_new_column_family_announcement	2023-11-02 09:32:35 +01:00
Piotr Smaroń	8c464b2ddb	guardrails: restrict replication strategy (RS) Replacing `restrict_replication_simplestrategy` config option with 2 config options: `replication_strategy_{warn,fail}_list`, which allow us to impose soft limits (issue a warning) and hard limits (not execute CQL) on replication strategy when creating/altering a keyspace. The reason to rather replace than extend `restrict_replication_simplestrategy` config option is that it was not used and we wanted to generalize it. Only soft guardrail is enabled by default and it is set to SimpleStrategy, which means that we'll generate a CQL warning whenever replication strategy is set to SimpleStrategy. For new cloud deployments we'll move SimpleStrategy from warn to the fail list. Guardrails violations will be tracked by metrics. Resolves #5224 Refs #8892 (the replication strategy part, not the RF part) Closes scylladb/scylladb#15399	2023-10-31 18:34:41 +03:00
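The warn-list/fail-list behaviour described in the commit above can be sketched as a small check. `check_replication_strategy` and `GuardrailViolation` are hypothetical names for illustration; only the two-list idea and the default of warning on SimpleStrategy come from the commit message.

```python
class GuardrailViolation(Exception):
    """Raised when a hard limit (fail list) is hit."""


def check_replication_strategy(strategy, warn_list=("SimpleStrategy",), fail_list=()):
    """Return a list of warnings, or raise if the strategy is on the fail list."""
    if strategy in fail_list:
        raise GuardrailViolation(f"{strategy} is not allowed by configuration")
    if strategy in warn_list:
        return [f"{strategy} is discouraged by configuration"]
    return []


# Default behaviour: soft guardrail only, set to SimpleStrategy.
default_warnings = check_replication_strategy("SimpleStrategy")
no_warnings = check_replication_strategy("NetworkTopologyStrategy")
```

Moving SimpleStrategy from the warn list to the fail list, as the commit mentions for new cloud deployments, corresponds to calling the check with `warn_list=()` and `fail_list=("SimpleStrategy",)`.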
Avi Kivity	ef7db6df99	Merge 'schema_tables: turn view schema fixing code into a sanity check' from Kamil Braun The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal with legacy materialized view schemas used for secondary indexes, schemas which were created before the notion of "computed columns" was introduced. Back then, secondary index schemas would use a regular "token" column. Later it became a computed column and old schemas would be migrated during rolling upgrade. The migration code was introduced in 2019 (`db8d4a0cc6`) and then fixed in 2020 (`d473bc9b06`). The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming that users don't try crazy things like upgrading from 2021.X to 2023.X (which we do not support), all clusters will have already executed the migration code once they upgrade to 2023.X, meaning we can get rid of it. The main motivation of this PR is to get rid of the `db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft mode this was the only call to `merge_schema` outside "group 0 code" and in fact it is unsafe -- it uses locally generated mutations with locally generated timestamp (`api::new_timestamp()`), so if we actually did it, we would permanently diverge the group 0 state machine across nodes (the schema pulling code is disabled in Raft mode). Fortunately, this should be dead code by now, as explained in the previous paragraph. The migration code is now turned into a sanity check, if the users try something crazy, they will get an error instead of silent data corruption. Closes scylladb/scylladb#15695 * github.com:scylladb/scylladb: view: remove unused `_backing_secondary_index` schema_tables: turn view schema fixing code into a sanity check schema_tables: make comment more precise feature_service: make COMPUTED_COLUMNS feature unconditionally true	2023-10-31 13:23:19 +02:00
Patryk Jędrzejczak	df199eec11	db: system_distributed_keyspace: fix indentation Broken in the previous commit.	2023-10-31 12:08:03 +01:00
Patryk Jędrzejczak	91ff8007b3	db: system_distributed_keyspace: announce once in start We refactor system_distributed_keyspace::start so that it takes at most one group 0 guard and calls migration_manager::announce at most once. We remove a catch expression together with the FIXME from get_updated_service_levels (add_new_columns_if_missing before the patch) because we cannot treat the service_levels update differently anymore.	2023-10-31 12:08:03 +01:00
Avi Kivity	d450a145ce	Revert "Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak" This reverts commit `4b80130b0b`, reversing changes made to `a5519c7c1f`. It's suspected of causing dtest failures due to a bug in coroutine::parallel_for_each.	2023-10-29 18:32:06 +02:00
Kamil Braun	1c0ae2e7ef	Merge 'raft topology: assign tokens after join node response rpc' from Piotr Dulikowski Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier. However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet. Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier. Refs: scylladb/scylladb#15686 Fixes: scylladb/scylladb#15738 Closes scylladb/scylladb#15724 * github.com:scylladb/scylladb: test: test_topology_ops: continuously write during the test raft topology: assign tokens after join node response rpc storage_service: fix indentation after previous commit raft topology: loosen assumptions about transition nodes having tokens	2023-10-29 18:30:32 +02:00
Marcin Maliszkiewicz	020a9c931b	db: view: run local materialized view mutations on a separate smp service group When a base write triggers an mv write and it needs to be sent to another shard, it used the same service group and we could end up with a deadlock. This fix also affects alternator's secondary indexes. Testing was done using a not (yet) committed framework for easy alternator performance testing: https://github.com/scylladb/scylladb/pull/13121. I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and then ran: ./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \ --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \ --duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000 Without the patch, when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests), after a couple of seconds scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb: p seastar::get_smp_service_groups_semaphore(2,0)._count $1 = 0 With the patch I wasn't able to observe the problem, even with 2x concurrency. I was able to make the process hang with 10x concurrency but I think it's hitting a different limit as there wasn't any depleted smp service group semaphore and it was happening also on non mv loads. Fixes https://github.com/scylladb/scylladb/issues/15844 Closes scylladb/scylladb#15845	2023-10-29 18:30:32 +02:00
Gleb Natapov	0a8c3e5c78	storage_service: raft topology: load request parameters in left_token_ring state as well Next patch will want to access request parameters in left_token_ring for failure recovery purposes.	2023-10-25 12:56:27 +03:00
Piotr Dulikowski	63aa9332aa	raft topology: assign tokens after join node response rpc Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier. However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet. Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier.	2023-10-25 11:50:17 +02:00
Piotr Dulikowski	2d161676c7	raft topology: loosen assumptions about transition nodes having tokens In later commits, tokens for a joining/replacing node will not be inserted when the node enters `bootstrapping`/`replacing` state but at some later step of the procedure. Loosen some of the assumptions in `storage_service::topology_state_load` and `system_keyspace::load_topology_state` appropriately.	2023-10-25 11:50:17 +02:00
Pavel Emelyanov	d827068d01	sstables,s3: Support state change (without generation change) Now when the system.sstables has the state field, it can be changed (UPDATEd). However, when changing the state AND generation, this still won't work, because generation is the clustering key of the table in question and cannot be just changed. This, nonetheless, is OK, as generation changes with state only when moving an sstable from upload dir into normal/staging and this is separate issue for S3 (#13018). For now changing state only is OK. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-10-24 19:12:37 +03:00
Pavel Emelyanov	ca5d3d217f	system_keyspace: Add state field to system.sstables The state is one of <empty>(normal)/staging/quarantine. Currently when sstable is moved to non-normal state the s3 backend state_change() call throws thus such sstables do not appear. Next patches are going to change that and the new field in the system.sstables is needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-10-24 19:12:37 +03:00
Pavel Emelyanov	e4162227ff	system_keyspace: Tune up status change trace message There will appear very similar one tracing the state change, so it's good to tell them from one another. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-10-24 19:12:37 +03:00
Nadav Har'El	4b80130b0b	Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in: - `system_distributed_keyspace::start`, - `redis:create_keyspace_if_not_exists_impl`, - `table_helper::setup_keyspace` (for the `system_traces` keyspace). All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this: - calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum, - we must implement a retrying mechanism for every automatic `announce` if `group0_concurrent_modification` occurs to enable support for concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder. This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME. Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring. 
Fixes #15437 Closes scylladb/scylladb#15594 * github.com:scylladb/scylladb: table_helper: announce twice in setup_keyspace table_helper: refactor setup_table redis: create_keyspace_if_not_exists_impl: fix indentation redis: announce once in create_keyspace_if_not_exists_impl db: system_distributed_keyspace: fix indentation db: system_distributed_keyspace: announce once in start tablet_allocator: update on_before_create_column_family migration_listener: add parameter to on_before_create_column_family alternator: executor: use new prepare_new_column_family_announcement alternator: executor: introduce create_keyspace_metadata migration_manager: add new prepare_new_column_family_announcement	2023-10-24 15:42:48 +03:00
Kamil Braun	db49ccccb0	view: remove unused `_backing_secondary_index` This boolean was only used for a sanity check which was replaced with a stronger sanity check in the previous commit that doesn't require the boolean.	2023-10-24 13:33:36 +02:00
Kamil Braun	3976808b12	schema_tables: turn view schema fixing code into a sanity check The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal with legacy materialized view schemas used for secondary indexes, schemas which were created before the notion of "computed columns" was introduced. Back then, secondary index schemas would use a regular "token" column. Later it became a computed column and old schemas would be migrated during rolling upgrade. The migration code was introduced in 2019 (`db8d4a0cc6`) and then fixed in 2020 (`d473bc9b06`). The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming that users don't try crazy things like upgrading from 2021.X to 2023.X (which we do not support), all clusters will have already executed the migration code once they upgrade to 2023.X, meaning we can get rid of it. The main motivation of this patch is to get rid of the `db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft mode this was the only call to `merge_schema` outside "group 0 code" and in fact it is unsafe -- it uses locally generated mutations with locally generated timestamp (`api::new_timestamp()`), so if we actually did it, we would permanently diverge the group 0 state machine across nodes (the schema pulling code is disabled in Raft mode). Fortunately, this should be dead code by now, as explained in the previous paragraph. The migration code is now turned into a sanity check, if the users try something crazy, they will get an error instead of silent data corruption.	2023-10-24 13:33:35 +02:00
Kamil Braun	f02ac9a9e7	schema_tables: make comment more precise `maybe_fix_legacy_secondary_index_mv_schema` function has this piece of code: ``` // If the first clustering key part of a view is a column with name not found in base schema, // it implies it might be backing an index created before computed columns were introduced, // and as such it must be recreated properly. if (!base_schema->columns_by_name().contains(first_view_ck.name())) { schema_builder builder{schema_ptr(v)}; builder.mark_column_computed(first_view_ck.name(), std::make_unique<legacy_token_column_computation>()); if (preserve_version) { builder.with_version(v->version()); } return view_ptr(builder.build()); } ``` The comment uses the phrase "it might be". However, the code inside the `if` assumes that it "must be": once we determined that the first column in this materialized view does not have a corresponding name in the base table, we set it to be computed using `legacy_token_column_computation`, so we assumed that the column was indeed storing the token. Doing that for a column which is not the token column would be a small disaster. Assuming that the code is correct, we can make the comment more precise. I checked the documentation and I don't see any other way how we could have such a column other than the token column which is internally created by Scylla when creating a secondary index (for example, it is forbidden to use an alias in select statement when creating materialized views, which I checked experimentally).	2023-10-24 13:30:13 +02:00
Kamil Braun	5397524875	feature_service: make COMPUTED_COLUMNS feature unconditionally true The feature is assumed to be true, it was introduced in 2019. It's still advertised in gossip, but it's assumed to always be present. The `schema_feature` enum class still contains `COMPUTED_COLUMNS`, and the `all_tables` function in schema_tables.cc still checks for the schema feature when deciding if `computed_columns()` table should be included. This is necessary because digest calculation tests contain many digests calculated with the feature disabled, if we wanted to make it unconditional in the schema_tables code we'd have to regenerate almost all digests in the tests. It is simpler to leave the possibility for the tests to disable the feature.	2023-10-24 13:30:13 +02:00
Kefu Chai	b36cef6f1a	sstable: remove _remote_prefix from s3_storage since we use the sstable.generation() for the remote prefix of the key of the object for storing the sstable component, there is no need to set remote_prefix beforehand. since `s3_storage::ensure_remote_prefix()` and `system_keyspace::sstables_registry_lookup_entry()` are not used anymore, they are removed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-10-23 10:08:22 +08:00
Kefu Chai	af8bc8ba63	sstable: switch to uuid identifier for naming S3 sstable objects before this change, we create a new UUID for a new sstable managed by the s3_storage, and we use the string representation of UUID defined by RFC4122 like "0aa490de-7a85-46e2-8f90-38b8f496d53b" for naming the objects stored on s3_storage. but this representation is not what we are using for storing sstables on local filesystem when the option of "uuid_sstable_identifiers_enabled" is enabled. instead, we are using a base36-based representation which is shorter. to be consistent with the naming of the sstables created for local filesystem, and more importantly, to simplify the interaction between the local copy of sstables and those stored on object storage, we should use the same string representation of the sstable identifier. so, in this change: 1. instead of creating a new UUID, just reuse the generation of the sstable for the object's key. 2. do not store the uuid in the sstable_registry system table. As we already have the generation of the sstable for the same purpose. 3. switch the sstable identifier representation from the one defined by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the base36-based one (implemented by fmt::formatter<sstables::generation_type>) 4. enable the `uuid_sstable_identifiers` cluster feature if it is enabled in the `test_env_config`, so that the sstable manager can enable the uuid-based uuid when creating a new uuid for sstable. 5. throw if the generation of sstable is not UUID-based when accessing / manipulating an sstable with S3 storage backend. as the S3 storage backend now relies on this option. as, otherwise we'd have sstables with key like s3://bucket/number/basename, which is just unable to serve as a unique id for sstable if the bucket is shared across multiple tables. Fixes #14175 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-10-23 10:08:22 +08:00
Kamil Braun	c1486fee40	Merge 'commitlog: drop truncation_records after replay' from Petr Gusev This is a follow-up for #15279 and it fixes two problems. First, we restore flushes on writes for the tables that were switched to the schema commitlog if `SCHEMA_COMMITLOG` feature is not yet enabled. Otherwise durability is not guaranteed. Second, we address the problem with truncation records, which could refer to the old commitlog if any of the switched tables were truncated in the past. If the node crashes later, and we replay schema commitlog, we may skip some mutations since their `replay_position`s will be smaller than the `replay_position`s stored for the old commitlog in the `truncated` table. It turned out that this problem exists even if we don't switch commitlogs for tables. If the node was rebooted the segment ids will start from some small number - they use `steady_clock` which is usually bound to boot time. This means that if the node crashed we may skip the mutations because their RPs will be smaller than the last truncation record RP. To address this problem we delete truncation records as soon as commitlog is replayed. We also include a test which demonstrates the problem. Fixes #15354 Closes scylladb/scylladb#15532 * github.com:scylladb/scylladb: add test_commitlog system.truncated: Remove replay_position data from truncated on start main.cc: flush only local memtables when replaying schema commitlog main.cc: drop redundant supervisor::notify system_keyspace: flush if schema commitlog is not available	2023-10-18 11:14:31 +02:00
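The truncation-record problem described in the merge above can be illustrated with a minimal sketch. It assumes a replay position is just a (segment id, offset) pair and that replay skips any mutation whose position is smaller than the one stored at truncation time; `replay_position`, `skipped_on_replay` and `truncation_example` are hypothetical names for illustration, not the actual ScyllaDB types or functions.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <tuple>

// Hypothetical stand-in for a commitlog replay position: (segment id, offset).
struct replay_position {
    uint64_t segment_id;
    uint32_t offset;
};

inline bool operator<(const replay_position& a, const replay_position& b) {
    return std::tie(a.segment_id, a.offset) < std::tie(b.segment_id, b.offset);
}

// Returns true if replay would skip a mutation written at `rp`, given the
// replay position saved by the last truncation (if one is still stored).
inline bool skipped_on_replay(const replay_position& rp,
                              const std::optional<replay_position>& truncated_at) {
    return truncated_at && rp < *truncated_at;
}

// Walks through the scenario: a stale record hides a fresh write after reboot.
inline void truncation_example() {
    // Before the reboot, a truncation was recorded in a high-numbered segment.
    std::optional<replay_position> truncated_at = replay_position{1000000, 0};
    // After the reboot, segment ids restart low, so a fresh write compares as older.
    replay_position fresh_write{42, 128};
    assert(skipped_on_replay(fresh_write, truncated_at));   // the write would be lost
    // Modelled fix: drop the stored replay position once replay has completed.
    truncated_at.reset();
    assert(!skipped_on_replay(fresh_write, truncated_at));
}
```

Calling `truncation_example()` exercises both the failing comparison and the state after the stored position is dropped.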
Botond Dénes	7f81957437	Merge 'Initialize datadir for system and non-system keyspaces the same way' from Pavel Emelyanov When populating system keyspace the sstable_directory forgets to create upload/ subdir in the tables' datadir because of the way it's invoked from distributed loader. For non-system keyspaces directories are created in table::init_storage() which is self-contained and just creates the whole layout regardless of what. This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables. Test included. fixes: #15708 closes: scylladb/scylla-manager#3603 Closes scylladb/scylladb#15723 * github.com:scylladb/scylladb: test: Add test for datadir/ layout sstable_directory: Indentation fix after previous patch db,sstables: Move storage init for system keyspace to table creation	2023-10-18 12:12:19 +03:00
Avi Kivity	f42eb4d1ce	Merge 'Store and propagate GC timestamp markers from commitlog' from Calle Wilund Fixes #14870 (Originally suggested by @avikivity). Use commit log stored GC clock min positions to narrow compaction GC bounds. (Still requires augmented manual flushes with extensive CL clearing to pass various dtest, but this does not affect "real" execution). Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment the first time. Because GC clock is wall clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds per segment. This is then (in a rather ugly way) propagated to tombstone_gc_state to narrow the allowed GC bounds for a CF, based on what is currently left in CL. Note: this is a rather unoptimized version - no caching or anything. But even so, should not be excessively expensive, esp. since various other code paths already cache the results. Closes scylladb/scylladb#15060 * github.com:scylladb/scylladb: main/cql_test_env: Augment compaction mgr tombstone_gc_state with CL GC info tombstone_gc_state: Add optional callback to augment GC bounds commitlog: Add keeping track of approximate lowest GC clock for CF entries database: Force new commitlog segment on user initiated flush commitlog: Add helper to force new active segment	2023-10-17 18:27:43 +03:00
Calle Wilund	6fbd210679	system.truncated: Remove replay_position data from truncated on start Once we've started clean and all replaying is done, the commit log replay positions stored in the truncation records are invalid. We should exorcise them as soon as possible. Note that we cannot remove truncation data completely though, since the time stamps stored are used by things like batch log to determine if it should use or discard old batch data.	2023-10-17 18:16:48 +04:00
Petr Gusev	c89ead55ff	system_keyspace: flush if schema commitlog is not available In PR #15279 we removed flushes when writing to a number of tables from the system keyspace. This was made possible by switching these tables to the schema commitlog. Schema commitlog is enabled only when the SCHEMA_COMMITLOG feature is supported by all nodes in the cluster. Before that these tables will use the regular commitlog, which is not durable because it uses db::commitlog::sync_mode::PERIODIC. This means that we may lose data if a node crashes during upgrade to the version with schema commitlog. In this commit we fix this problem by restoring flushes after writes to the tables if the schema commitlog is not enabled yet. The patch also contains a test that demonstrates the problem. We need flush_schema_tables_after_modification option since otherwise schema changes are not durable and node fails after restart.	2023-10-17 18:14:27 +04:00
Calle Wilund	560d3c17f0	commitlog: Add keeping track of approximate lowest GC clock for CF entries Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment first. Because GC clock is wall clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds per segment. Includes of course a function to get the all-segment lowest per CF.	2023-10-17 10:26:41 +00:00
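The bookkeeping described in the commit above (record the GC clock the first time a CF is added to a segment, then take the lowest value across segments) could be sketched roughly as follows. This is an illustration under assumptions, not the commitlog implementation: `gc_low_bounds`, its method names, the string CF identifier and the integer time type are all hypothetical, and releasing a segment is modelled as a plain erase.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

using gc_time = int64_t;      // hypothetical: wall-clock GC time in seconds
using cf_id = std::string;    // hypothetical: table (CF) identifier
using segment_id = uint64_t;  // hypothetical: commitlog segment identifier

class gc_low_bounds {
    // For each live segment: the GC clock value seen when a CF first appeared in it.
    std::map<segment_id, std::map<cf_id, gc_time>> _per_segment;
public:
    // Record `now` only the first time `cf` is added to `seg`.
    void on_add(segment_id seg, const cf_id& cf, gc_time now) {
        _per_segment[seg].emplace(cf, now);  // emplace keeps an existing entry
    }
    // Assumption: a deleted or recycled segment no longer constrains GC.
    void on_segment_released(segment_id seg) { _per_segment.erase(seg); }
    // Lowest recorded value for `cf` across all remaining segments, if any.
    std::optional<gc_time> lowest(const cf_id& cf) const {
        std::optional<gc_time> result;
        for (const auto& [seg, cfs] : _per_segment) {
            auto it = cfs.find(cf);
            if (it != cfs.end()) {
                result = result ? std::min(*result, it->second) : it->second;
            }
        }
        return result;
    }
};

// Shows first-add semantics and how the bound moves once a segment goes away.
inline void gc_bounds_example() {
    gc_low_bounds b;
    b.on_add(1, "ks.t", 100);
    b.on_add(1, "ks.t", 150);   // ignored: not the first add to segment 1
    b.on_add(2, "ks.t", 200);
    assert(b.lowest("ks.t") == 100);
    b.on_segment_released(1);
    assert(b.lowest("ks.t") == 200);
    assert(!b.lowest("ks.other"));
}
```

The sketch does no caching, matching the "rather unoptimized" note in the related merge message; a real implementation would likely avoid rescanning all segments on each query.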

3458 Commits