scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-02 21:17:01 +00:00

Author	SHA1	Message	Date
Yaniv Kaul	2b73793a39	Update db/view/view.cc	2023-12-03 10:07:45 +02:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Piotr Smaroń	5fd30578d7	config: introduce value_status::Deprecated Current mechanism to deprecate config options is implemented in a hacky way in `main.cpp` and doesn't account for existing `db::config/boost::po` API controlling lifetime of config options, hence it's being replaced in this PR by adding yet another `value_status` enumerator: `Deprecated`, so that deprecation of config options is controlled in one place in `config.cc`,i.e. when specifying config options. Motivation: https://docs.google.com/document/d/18urPG7qeb7z7WPpMYI2V_lCOkM5YGKsEU78SDJmt8bM/edit?usp=sharing With this change, if a `Deprecated` config option is specified as 1. a command line parameter, scylla will run and log: ``` WARN 2023-11-25 23:37:22,623 [shard 0:main] init - background-writer-scheduling-quota option ignored (deprecated) ``` (Previously it was only a message printed to standard output, not a scylla log of warn level). 2. an option in `scylla.yaml`, scylla will run and log: ``` WARN 2023-11-27 23:55:13,534 [shard 0:main] init - Option is deprecated : background_writer_scheduling_quota ``` Fixes #15887 Incorporates dropped https://github.com/scylladb/scylladb/pull/15928 Closes scylladb/scylladb#16184	2023-11-30 08:52:57 +03:00
Avi Kivity	8e9d3af431	Merge 'Commitlog: complete prerequisites and enforce hard limit by default' from Eliran Sinvani This miniset, completes the prerequisites for enabling commitlog hard limit on by default. Namely, start flushing and evacuating segments halfway to the limit in order to never hit it under normal circumstances. It is worth mentioning that hitting the limit is an exceptional condition which it's root cause need to be resolved, however, once we do hit the limit, the performance impact that is inflicted as a result of this enforcement is irrelevant. Tests: unit tests. LWT write test (#9331) A whitebox testing has been performed by @wmitros , the test aimed at putting as much pressure as possible on the commitlog segments by using a write pattern that rewrites the partitions in the memtable keeping it at ~85% occupancy so the dirty memory manager will not kick in. The test compared 3 configurations: 1. The default configuration 2. Hard limit on (without changing the flush threshold) 3. the changes in this PR applied. The last exhibited the "best" behavior in terms of metrics, the graphs were the flattest and less jaggy from the others. Closes scylladb/scylladb#10974 * github.com:scylladb/scylladb: commitlog: enforce commitlog size hard limit by default commitlog: set flush threshold to half of the limit size commitlog: unfold flush threshold assignment	2023-11-29 20:55:53 +02:00
Kamil Braun	8a14839a00	Merge 'handle more failures during topology operations' from Gleb This series adds handling for more failures during a topology operation (we already handle a failure during streaming). Here we add handling of tablet draining errors by aborting the operation and handling of errors after streaming where an operation cannot be aborted any longer. If the error happens when rollback is no longer possible we wait for ring delay and proceed to the next step. Each individual patch that adds the sleep has an explanation what the consequences of the patch are. * 'gleb/topology-coordinator-failures' of github.com:scylladb/scylla-dev: test: add test to check errro handling during tablet draining test: fix test_topology_streaming_failure test to not grep the whole file storage_service: add error injection into the tablet migration code storage_service: topology coordinator: rollback on handle_tablet_migration failure during tablet_draining stage storage_service: topology coordinator: do not retry the metadata barrier forever in write_both_read_new state storage_service: topology coordinator: do not retry the metadata barrier forever in left_token_ring state storage_service: topology coordinator: return a node that is being removed from get_excluded_nodes storage_service: topology_coordinator: use new rollback_to_normal state in the rollback procedure storage_service: topology coordinator: add rollback_to_normal node state storage_service: topology coordinator: put fence version into the raft state storage_service: topology coordinator: do fencing even if draining failed	2023-11-29 19:02:35 +01:00
Nadav Har'El	62f89d49e5	tablets, mv: fix on_internal_error on write to base table This situation before this patch is that when tablets are enabled for a keyspace, we can create a materialized view but later any write to the base table fails with an on_internal_error(), saying that: "Tried to obtain per-keyspace effective replication map of test but it's per-table." Indeed, with tablets, the replication is different for each table - it's not the same for the entire keyspace. So this patch changes the view update code to take the replication map from the specific base table, not the keyspace. This is good enough to get materialized-views reads and writes working in a simple single-node case, as the included test demonstrates (the test fails with on_internal_error() before this patch, and passes afterwards). But this fix is not perfect - the base-view pairing code really needs to consider not only the base table's replication map, but also the view table's replication map - as those can be different. We'll fix this remaining problem as a followup in a separate patch - it will require a substantially more elaborate test to reproduce the need for the different mapping and to verify that fix. Fixes #16209. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16211	2023-11-29 15:29:17 +01:00
Calle Wilund	3b70fde3cd	commitlog: Make named_files in delete_segments have updated size Fixes #16207 commitlog::delete_segments deletes (or recycles) segments replayed. The actual file size here is added to footprint so actual delete then can determine iff things should be recycled or removed. However, we build a pending delete list of named_files, and the files we added did not have size set. Bad. Actual deletion then treated files as zero-byte sized, i.e. footprint calculations borked. Simple fix is just filling in the size of the objects when addind. Added unit test for the problem. Closes scylladb/scylladb#16210	2023-11-29 09:58:47 +02:00
Botond Dénes	3ed6925673	Merge 'Major compaction: flush commitlog by forcing new active segment and flushing all tables' from Benny Halevy Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database release all references to commitlog segments and there it maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Closes scylladb/scylladb#15820 * github.com:scylladb/scylladb: docs: nodetool: flush: enrich examples docs: nodetool: compact: fix example api: add /storage_service/compact api: add /storage_service/flush compaction_manager: flush_all_tables before major compaction database: add flush_all_tables api: compaction: add flush_memtables option test/nodetool: jmx: fix path to scripts/scylla-jmx scylla-nodetool, docs: improve optional params documentation	2023-11-29 08:48:40 +02:00
Kamil Braun	3582095b79	schema_tables: use smaller timestamp for base mutations included with view update When a view schema is changed, the schema change command also includes mutations for the corresponding base table; these mutations don't modify the base schema but are included in case if the receiver of view mutations somehow didn't receive base mutations yet (this may in theory happen outside Raft mode). There are situations where the schema change command contains both mutations that describe the current state of the base table -- included by a view update, as explained above -- and mutations that want to modify the base table. Such situation arises, for example, when we update a user-defined type which is referenced by both a view and its corresponding base table. This triggers a schema change of the view, which generates mutations to modify the view and includes mutations of the current base schema, and at the same time it triggers a schema change of the base, which generates mutations to modify the base. These two sets of mutations are conflicting with each other. One set wants to preserve the current state of the base table while the other wants to modify it. And the two sets of mutations are generated using the same timestamp, which means that conflict resolution between them is made on a per-mutation-cell basis, comparing the values in each cell and taking the "larger" one (meaning of "larger" depends on the type of each cell). Fortunately, this conflict is currently benign -- or at least there is no known situation where it causes problems. Unfortunately, it started causing problems when I attempted to implement group 0 schema versioning (PR scylladb/scylladb#15331), where instead of calculating table versions as hashes of schema mutations, we would send versions as part of schema change command. These versions would be stored inside the `system_schema.scylla_tables` table, `version` column, and sent as part of schema change mutations. And then the conflict showed. One set of mutations wanted to preserve the old value of `version` column while the other wanted to update it. It turned out that sometimes the old `version` prevailed, because the `version` column in `system_schema.scylla_tables` uses UUID-based comparison (not timeuuid-based comparison). This manifested as issue scylladb/scylladb#15530. To prevent this, the idea in this commit is simple: when generating mutations for the base table as part of corresponding view update, do not use the provided timestamp directly -- instead, decrement it by one. This way, if the schema change command contains mutations that want to modify the base table, these modifying mutations will win all conflicts based on the timestamp alone (they are using the same provided timestamp, but not decremented). One could argue that the choice of this timestamp is anyway arbitrary. The original purpose of including base mutations during view update was to ensure that a node which somehow missed the base mutations, gets them when applying the view. But in that case, the "most correct" solution should have been to use the original base mutations -- i.e. the ones that we have on disk -- instead of generating new mutations for the base with a refreshed timestamp. The base mutations that we have on disk have smaller timestamps already (since these mutations are from the past, when the base was last modified or created), so the conflict would also not happen in this case. But that solution would require doing a disk read, and we can avoid the read while still fixing the conflict by using an intermediate solution: regenerating the mutations but with `timestamp - 1`. Ref: scylladb/scylladb#15530 Closes scylladb/scylladb#16139	2023-11-28 21:51:18 +01:00
Benny Halevy	66ba983fe0	compaction_manager: flush_all_tables before major compaction Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database release all references to commitlog segments and there it maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Botond Dénes	f46cdce9d3	Merge 'Make memtable flush tolerate misconfigured S3 storage' from Pavel Emelyanov Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done runtime, scylla re-reads this config upon HUP signal. Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well. fixes: #13745 Closes scylladb/scylladb#15635 * github.com:scylladb/scylladb: test: Add object_store test to validate config reloading works test: Add config update facility to test cluster test: Make S3_Server export config file as pathlib.Path config: Make object storage config updateable_value_source memtable: Extend list of checking codes sstables/storage/s3: Fix missing TOC status check s3/client: Map http exceptions into storage_io_error exceptions: Extend storage_io_error construction options	2023-11-28 09:33:37 +02:00
Botond Dénes	a472700309	Merge 'Minor fixes and refactors' from Kamil Braun - remove some code that is obsolete in newer Scylla versions, - fix some minor bugs. These bugs appear to be benign, there are no known issues caused by them, but fixing them is a good idea nevertheless, - refactor some code for better maintainability. Parts of this PR were extracted from https://github.com/scylladb/scylladb/pull/15331 (which was merged but later reverted), parts of it are new. Closes scylladb/scylladb#16162 * github.com:scylladb/scylladb: test/pylib: log_browsing: fix type hint migration_manager: take `abort_source&` in get_schema_for_read/write migration_manager: inline merge_schema_in_background migration_manager: remove unused merge_schema_from overload migration_manager: assume `canonical_mutation` support migration_manager: add `std::move` to avoid a copy schema_tables: refactor `scylla_tables(schema_features)` schema_tables: pass `reload` flag when calling `merge_schema` cross-shard system_keyspace: fix outdated comment	2023-11-24 17:34:21 +02:00
Kamil Braun	269a189526	schema_tables: refactor `scylla_tables(schema_features)` The `scylla_tables` function gives a different schema definition for the `system_schema.scylla_tables` table, depending on whether certain schema features are enabled or not. The way it was implemented, we had to write `θ(2^n)` amount of code and comments to handle `n` features. Refactor it so that the amount of code we have to write to handle `n` features is `θ(n)`.	2023-11-23 17:23:47 +01:00
Gleb Natapov	95dd0e453d	storage_service: topology coordinator: add rollback_to_normal node state When a topology coordinator rolls back from unsuccessful topology operation it advances the fence (which is now in the raft state) after moving to normal state. We do not want this to fail (only majority of nodes is needed for it to not to), but currently it may fail in case the coordinator moves to another node after changing the rollback node's state to normal, but before updating the fence. To solve that the rollback operation needs to go through a new rollback_to_normal state that will do the fencing before moving to normal. This patch introduces that state, but does not use it yet.	2023-11-23 15:27:28 +02:00
Kamil Braun	5223d32fab	schema_tables: pass `reload` flag when calling `merge_schema` cross-shard In `0c86abab4d` `merge_schema` obtained a new flag, `reload`. Unfortunately, the flag was assigned a default value, which I think is almost always a bad idea, and indeed it was in this case. When `merge_schema` is called on shard different than 0, it recursively calls itself on shard 0. That recursive call forgot to pass the `reload` flag. Fix this.	2023-11-23 14:06:40 +01:00
Kamil Braun	de3607810d	system_keyspace: fix outdated comment	2023-11-23 14:06:27 +01:00
Kefu Chai	55103f4a6b	hints: move formatter of db::hints::sync_point to test the operator<<() based formatter is only used in its test, so let's move it to where it is used. we can always bring it back later if it is required in other places. but better off implementing it as a fmt::formatter<> then. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16142	2023-11-23 11:22:31 +02:00
Kefu Chai	6749d963ed	config: define formatter for db::seed_provider_type before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for db::seed_provider_type. please note, we are still formatting vector<db::seed_provider_type> with the helper provided by seastar/core/sstring.hh, which uses operator<<() to print the elements in the vector being printed. so we have to keep the operator<< formatter before disabling the generic formatter for vector<T>. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16138	2023-11-23 11:04:35 +02:00
Eliran Sinvani	bfa839ce92	commitlog: enforce commitlog size hard limit by default Since the commitlog size hard limit is a failsafe mechanism, we don't expect to ever hit it. If we do hit the limit, it means that we have an exceptional condition in the system. Hence, the impact of enforcing the commitlog hard limit is irrelevant. Here we enforce the limit by default. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-11-22 08:48:28 +02:00
Eliran Sinvani	63d62a7db2	commitlog: set flush threshold to half of the limit size Once we enable commitlog hard limit by default, we would like to have some room in case flushing memtables takes some time to catch up. This threshold is half the limit. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-11-22 08:48:28 +02:00
Eliran Sinvani	d2a8651bce	commitlog: unfold flush threshold assignment This commit is only a cosmetic change. It is meant to make the flush threshold assignment more readable and comprehensible so future changes are easier to review. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-11-22 08:48:28 +02:00
Pavel Emelyanov	210b01a5ce	config: Make object storage config updateable_value_source Now its plain updateable_value, but without the ..._source object the updateable_value is just a no-op value holder. In order for the observers to operate there must be the value source, updating it would update the attached updateable values _and_ notify the observers. In order for the config to be the u.v._source, config entries should be comparable to each other, thus the <=> operator for it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-11-21 16:47:50 +03:00
Calle Wilund	6b66daabfc	commitlog: Remove entry CRC from file format Since CRC is already handled by disk blocks, we can remove some of the entry CRC:ing, both simplifying code and making at least that part of both write and read faster.	2023-11-21 08:50:57 +00:00
Calle Wilund	e29bf6f9e8	commitlog: Implement new format using CRC:ed sectors Breaks the file into individually tagged + crc:ed pages. Each page (sized as disk write alignment) gets a trailing 12-byte metadata, including CRC of the first page-12 bytes, and the ID of the segment being written. When reading, each page read is CRC:ed and checked to be part of the expected segment by comparing ID:s. If crc is broken, we have broken data. If crc is ok, but ID does not match, we have a prematurely terminated segment (truncated), which, depending on whether we use batch mode or not, implied data loss.	2023-11-21 08:50:54 +00:00
Calle Wilund	18e79d730e	commitlog: Add iterator adaptor for doing buffer splitting into sub-page ranges With somewhat less overhead than creating 100+ temporary_buffer proxies	2023-11-21 08:42:33 +00:00
Calle Wilund	862f4f2ed3	commitlog_replayer: differentiate between truncated file and corrupt entries Refs #11845 When replaying, differentiate between the two cases for failure we have: - A broken actual entry - i.e. entry header/data does not hold up to crc scrutiny - Truncated file - i.e. a chunk header is broken or unreadable. This can be due to either "corruption" (i.e. borked write, post-corruption, hw whatever), or simply an unterminated segment. The difference is that the former is recoverable, the latter is not. We now signal and report the two separately. The end result for a user is not much different, in either case they imply data loss and the need for repair. But there is some value in differentiating which of the two we encountered. Modifies and adds test cases.	2023-11-21 08:42:33 +00:00
Pavel Emelyanov	3471f30b58	view_update_generator: Unplug from database later Patch `967ebacaa4` (view_update_generator: Move abort kicking to do_abort()) moved unplugging v.u.g from database from .stop() to .do_abort(). The latter call happens very early on stop -- once scylla receives SIGINT. However, database may still need v.u.g. plugged to flush views. This patch moves unplug to later, namely to .stop() method of v.u.g. which happens after database is drained and should no longer continue view updates. fixes: #16001 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#16091	2023-11-20 11:47:55 +02:00
Calle Wilund	6ffb482bf3	Commitlog replayer: Range-check skip call Fixes #15269 If segment being replayed is corrupted/truncated we can attempt skipping completely bogues byte amounts, which can cause assert (i.e. crash) in file_data_source_impl. This is not a crash-level error, so ensure we range check the distance in the reader. v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will ensure we log the fact that things are broken Closes scylladb/scylladb#15270	2023-11-19 17:44:55 +02:00
Gleb Natapov	6edbf4b663	storage_service: topology coordinator: put fence version into the raft state Currently when the coordinator decides to move the fence it issues an RPC to each node and each node locally advances fence version. This is fine if there are no failures or failures are handled by retrying fencing, but if we want to allow topology changes to progress even in the presence of barrier failures it is easier to store the fence version in the raft state. The nodes that missed fence rpc may easily catch up to the latest fence version by simply executing a raft barrier.	2023-11-19 15:28:08 +02:00
Kefu Chai	15bfa09454	treewide: do not mark return value const if this has no effect this change is a cleanup. to mark a return value without value semantics has no effect. these `const` specifier useless. so let's drop them. and, if we compile the tree with `-Wignore-qualifiers`, the compiler would warn like: ``` /home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers] 245 \| const index_metadata_kind kind() const; \| ^~~~~ ``` so this change also silences the above warnings. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-11-17 17:46:19 +08:00
Kefu Chai	efd65aebb2	build: cmake: add check-header target to have feature parity with `configure.py`. we won't need this once we migrate to C++20 modules. but before that day comes, we need to stick with C++ headers. we generate a rule for each .hh files to create a corresponding .cc and then compile it, in order to verify the self-containness of that header. so the number of rule is quite large, to avoid the unnecessary overhead. the check-header target is enabled only if `Scylla_CHECK_HEADERS` option is enabled. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#15913	2023-11-13 10:27:06 +02:00
Kamil Braun	f094e23d84	system_keyspace: use system memory for `system.raft` table `system.raft` was using the "user memory pool", i.e. the `dirty_memory_manager` for this table was set to `database::_dirty_memory_manager` (instead of `database::_system_dirty_memory_manager`). This meant that if a write workload caused memory pressure on the user memory pool, internal `system.raft` writes would have to wait for memtables of user tables to get flushed before the write would proceed. This was observed in SCT longevity tests which ran a heavy workload on the cluster and concurrently, schema changes (which underneath use the `system.raft` table). Raft would often get stuck waiting many seconds for user memtables to get flushed. More details in issue #15622. Experiments showed that moving Raft to system memory fixed this particular issue, bringing the waits to reasonable levels. Currently `system.raft` stores only one group, group 0, which is internally used for cluster metadata operations (schema and topology changes) -- so it makes sense to keep use system memory. In the future we'd like to have other groups, for strongly consistent tables. These groups should use the user memory pool. It means we won't be able to use `system.raft` for them -- we'll just have to use a separate table. Fixes: scylladb/scylladb#15622 Closes scylladb/scylladb#15972	2023-11-08 11:21:14 +02:00
Botond Dénes	76ab66ca1f	Merge 'Support state change for S3-backed sstables' from Pavel Emelyanov The sstable currently can move between normal, staging and quarantine state runtime. For S3-backed sstables the state change means maintaining the state itself in the ownership table and updating it accordingly. There's also the upload facility that's implemented as state change too, but this PR doesn't support this part. fixes: #13017 Closes scylladb/scylladb#15829 * github.com:scylladb/scylladb: test: Make test_sstables_excluding_staging_correctness run over s3 too sstables,s3: Support state change (without generation change) system_keyspace: Add state field to system.sstables sstable_directory: Tune up sstables entries processing comment system_keyspace: Tune up status change trace message sstables: Add state string to state enum class convert	2023-11-07 10:45:41 +02:00
Benny Halevy	a1acf6854b	everywhere: reduce dependencies on i_partitioner.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:47:44 +02:00
Patryk Jędrzejczak	0357636f16	db/system_distributed_keyspace: fix indentation Broken in the previous commit.	2023-11-02 14:21:15 +01:00
Patryk Jędrzejczak	813c7a582c	db/system_distributed_keyspace: retry start on concurrent operation A concurrent group 0 operation in system_distributed_keyspace::start can happen during concurrent bootstrap in the Raft-based topology.	2023-11-02 14:21:15 +01:00
Kamil Braun	0846d324d7	Merge 'rollback topology operation on streaming failure' from Gleb This patch series adds error handling for streaming failure during topology operations instead of an infinite retry. If streaming fails the operation is rolled back: bootstrap/replace nodes move to left and decommissioned/remove nodes move back to normal state. * 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev: raft: make sure that all operation forwarded to a leader are completed before destroying raft server storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier tests: add tests for streaming failure in bootstrap/replace/remove/decomission test/pylib: do not stop node if decommission failed with an expected error storage_service: raft topology: fix typo in "decommission" everywhere storage_service: raft topology: add streaming error injection storage_service: raft topology: do not increase topology version during CDC repair storage_service: raft topology: rollback topology operation on streaming failure. storage_service: raft topology: load request parameters in left_token_ring state as well storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch storage_service: raft topology: make global_token_metadata_barrier node independent storage_service: raft topology: split get_excluded_nodes from exec_global_command storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true storage_service: raft topology: simplify streaming RPC failure handling	2023-11-02 10:15:45 +01:00
Kamil Braun	ae58e39743	Merge 'reduce announcements of the automatic schema changes' from Patryk Jędrzejczak There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in: - `system_distributed_keyspace::start`, - `redis:create_keyspace_if_not_exists_impl`, - `table_helper::setup_keyspace` (for the `system_traces` keyspace). All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this: - calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum, - we must implement a retrying mechanism for every automatic `announce` if `group0_concurrent_modification` occurs to enable support for concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder. This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME. Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring. Fixes scylladb/scylladb#15437 Closes scylladb/scylladb#15897 * github.com:scylladb/scylladb: table_helper: announce twice in setup_keyspace table_helper: refactor setup_table redis: create_keyspace_if_not_exists_impl: fix indentation redis: announce once in create_keyspace_if_not_exists_impl db: system_distributed_keyspace: fix indentation db: system_distributed_keyspace: announce once in start tablet_allocator: update on_before_create_column_family migration_listener: add parameter to on_before_create_column_family alternator: executor: use new prepare_new_column_family_announcement alternator: executor: introduce create_keyspace_metadata migration_manager: add new prepare_new_column_family_announcement	2023-11-02 09:32:35 +01:00
Piotr Smaroń	8c464b2ddb	guardrails: restrict replication strategy (RS) Replacing `restrict_replication_simplestrategy` config option with 2 config options: `replication_strategy_{warn,fail}_list`, which allow us to impose soft limits (issue a warning) and hard limits (not execute CQL) on replication strategy when creating/altering a keyspace. The reason to rather replace than extend `restrict_replication_simplestrategy` config option is that it was not used and we wanted to generalize it. Only soft guardrail is enabled by default and it is set to SimpleStrategy, which means that we'll generate a CQL warning whenever replication strategy is set to SimpleStrategy. For new cloud deployments we'll move SimpleStrategy from warn to the fail list. Guardrails violations will be tracked by metrics. Resolves #5224 Refs #8892 (the replication strategy part, not the RF part) Closes scylladb/scylladb#15399	2023-10-31 18:34:41 +03:00
Avi Kivity	ef7db6df99	Merge 'schema_tables: turn view schema fixing code into a sanity check' from Kamil Braun The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal with legacy materialized view schemas used for secondary indexes, schemas which were created before the notion of "computed columns" was introduced. Back then, secondary index schemas would use a regular "token" column. Later it became a computed column and old schemas would be migrated during rolling upgrade. The migration code was introduced in 2019 (`db8d4a0cc6`) and then fixed in 2020 (`d473bc9b06`). The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming that users don't try crazy things like upgrading from 2021.X to 2023.X (which we do not support), all clusters will have already executed the migration code once they upgrade to 2023.X, meaning we can get rid of it. The main motivation of this PR is to get rid of the `db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft mode this was the only call to `merge_schema` outside "group 0 code" and in fact it is unsafe -- it uses locally generated mutations with locally generated timestamp (`api::new_timestamp()`), so if we actually did it, we would permanently diverge the group 0 state machine across nodes (the schema pulling code is disabled in Raft mode). Fortunately, this should be dead code by now, as explained in the previous paragraph. The migration code is now turned into a sanity check, if the users try something crazy, they will get an error instead of silent data corruption. Closes scylladb/scylladb#15695 * github.com:scylladb/scylladb: view: remove unused `_backing_secondary_index` schema_tables: turn view schema fixing code into a sanity check schema_tables: make comment more precise feature_service: make COMPUTED_COLUMNS feature unconditionally true	2023-10-31 13:23:19 +02:00
Patryk Jędrzejczak	df199eec11	db: system_distributed_keyspace: fix indentation Broken in the previous commit.	2023-10-31 12:08:03 +01:00
Patryk Jędrzejczak	91ff8007b3	db: system_distributed_keyspace: announce once in start We refactor system_distributed_keyspace::start so that it takes at most one group 0 guard and calls migration_manager::announce at most once. We remove a catch expression together with the FIXME from get_updated_service_levels (add_new_columns_if_missing before the patch) because we cannot treat the service_levels update differently anymore.	2023-10-31 12:08:03 +01:00
Avi Kivity	d450a145ce	Revert "Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak" This reverts commit `4b80130b0b`, reversing changes made to `a5519c7c1f`. It's suspected of causing dtest failures due to a bug in coroutine::parallel_for_each.	2023-10-29 18:32:06 +02:00
Kamil Braun	1c0ae2e7ef	Merge 'raft topology: assign tokens after join node response rpc' from Piotr Dulikowski Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier. However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet. Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier. Refs: scylladb/scylladb#15686 Fixes: scylladb/scylladb#15738 Closes scylladb/scylladb#15724 * github.com:scylladb/scylladb: test: test_topology_ops: continuously write during the test raft topology: assign tokens after join node response rpc storage_service: fix indentation after previous commit raft topology: loosen assumptions about transition nodes having tokens	2023-10-29 18:30:32 +02:00
Marcin Maliszkiewicz	020a9c931b	db: view: run local materialized view mutations on a separate smp service group When base write triggers mv write and it needs to be send to another shard it used the same service group and we could end up with a deadlock. This fix affects also alternator's secondary indexes. Testing was done using (yet) not committed framework for easy alternator performance testing: https://github.com/scylladb/scylladb/pull/13121. I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and then ran: ./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \ --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \ --duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000 Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb: p seastar::get_smp_service_groups_semaphore(2,0)._count $1 = 0 With the patch I wasn't able to observe the problem, even with 2x concurrency. I was able to make the process hang with 10x concurrency but I think it's hitting different limit as there wasn't any depleted smp service group semaphore and it was happening also on non mv loads. Fixes https://github.com/scylladb/scylladb/issues/15844 Closes scylladb/scylladb#15845	2023-10-29 18:30:32 +02:00
Gleb Natapov	0a8c3e5c78	storage_service: raft topology: load request parameters in left_token_ring state as well Next patch will want to access request parameters in left_token_ring for failure recovery purposes.	2023-10-25 12:56:27 +03:00
Piotr Dulikowski	63aa9332aa	raft topology: assign tokens after join node response rpc Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier. However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet. Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier.	2023-10-25 11:50:17 +02:00
Piotr Dulikowski	2d161676c7	raft topology: loosen assumptions about transition nodes having tokens In later commits, tokens for a joining/replacing node will not be inserted when the node enters `bootstrapping`/`replacing` state but at some later step of the procedure. Loosen some of the assumptions in `storage_service::topology_state_load` and `system_keyspace::load_topology_state` appropriately.	2023-10-25 11:50:17 +02:00
Pavel Emelyanov	d827068d01	sstables,s3: Support state change (without generation change) Now when the system.sstables has the state field, it can be changed (UPDATEd). However, when changing the state AND generation, this still won't work, because generation is the clustering key of the table in question and cannot be just changed. This, nonetheless, is OK, as generation changes with state only when moving an sstable from upload dir into normal/staging and this is separate issue for S3 (#13018). For now changing state only is OK. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-10-24 19:12:37 +03:00
Pavel Emelyanov	ca5d3d217f	system_keyspace: Add state field to system.sstables The state is one of <empty>(normal)/staging/quarantine. Currently when sstable is moved to non-normal state the s3 backend state_change() call throws thus such sstables do not appear. Next patches are going to change that and the new field in the system.sstables is needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-10-24 19:12:37 +03:00

... 29 30 31 32 33 ...

4972 Commits