scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-08 07:53:20 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	effb9fb3cb	Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620). If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957). When performing a schema change in Raft RECOVERY mode we also extend schema mutations which forces nodes to revert to the old way of calculating schema versions when necessary. We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`. Fixes: #7620 Fixes: #13957 --- This is a reincarnation of PR scylladb/scylladb#15331. The previous PR was reverted due to a bug it unmasked; the bug has now been fixed (scylladb/scylladb#16139). Some refactors from the previous PR were already merged separately, so this one is a bit smaller. I have checked with @Lorak-mmk's reproducer (https://github.com/Lorak-mmk/udt_schema_change_reproducer -- many thanks for it!) that the originally exposed bug is no longer reproducing on this PR, and that it can still be reproduced if I revert the aforementioned fix on top of this PR. 
Closes scylladb/scylladb#16242 * github.com:scylladb/scylladb: docs: describe group 0 schema versioning in raft docs test: add test for group 0 schema versioning feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0 migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations schema_tables: use schema version from group 0 if present migration_manager: store `group0_schema_version` in `scylla_local` during schema changes system_keyspace: make `get/set_scylla_local_param` public feature_service: add `GROUP0_SCHEMA_VERSIONING` feature	2023-12-11 12:17:57 +01:00
Kamil Braun	522540da40	schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0 As explained in the previous commit, we use the new `committed_by_group0` flag attached to each row of a `scylla_tables` mutation to decide whether the `version` cell needs to be deleted or not. The rest of #13957 is solved by pre-existing code -- if the `version` column is present in the mutation, we don't calculate a hash for `schema::version()`, but take the value from the column: ``` table_schema_version schema_mutations::digest(db::schema_features sf) const { if (_scylla_tables) { auto rs = query::result_set(_scylla_tables); if (!rs.empty()) { auto&& row = rs.row(0); auto val = row.get<utils::UUID>("version"); if (val) { return table_schema_version(val); } } } ... ``` The issue will therefore be fixed once we enable `GROUP0_SCHEMA_VERSIONING`.	2023-12-08 17:46:31 +01:00
Kamil Braun	defcf9915c	migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations As described in #13957, when creating or altering a table in group 0 mode, we don't want each node to calculate `schema::version()`s independently using a hash algorithm. Instead, we want all nodes to use a single version for that table, committed by the group 0 command. There's even a column ready for this in `system.scylla_tables` -- `version`. This column is currently being set for system tables, but it's not being used for user tables. Similarly to what we did with global schema version in earlier commits, the obvious thing to do would be to include a live cell for the `version` column in the `system.scylla_tables` mutation when we perform the schema change in Raft mode, and to include a tombstone when performing it outside of Raft mode, for the RECOVERY case. But it's not that simple because as it turns out, we're already sending a `version` live cell (and also a tombstone, with timestamp decremented by 1) in all `system.scylla_tables` mutations. But then we delete that cell when doing schema merge (which begs the question why were we sending it in the first place? but I digress): ``` // We must force recalculation of schema version after the merge, since the resulting // schema may be a mix of the old and new schemas. delete_schema_version(mutation); ``` The above function removes the `version` cell from the mutation. So we need another way of distinguishing the cases of schema change originating from group 0 vs outside group 0 (e.g. RECOVERY). The method I chose is to extend `system.scylla_tables` with a boolean column, `committed_by_group0`, and extend schema mutations to set this column. In the next commit we'll decide whether or not the `version` cell should be deleted based on the value of this new column.	2023-12-08 17:46:31 +01:00
Kamil Braun	87b2c8a041	schema_tables: use schema version from group 0 if present As promised in the previous commit, if we persisted a schema version through a group 0 command, use it after a schema merge instead of calculating a digest. Ref: #7620 The above issue will be fixed once we enable the `GROUP0_SCHEMA_VERSIONING` feature.	2023-12-08 17:46:31 +01:00
Kamil Braun	3db8ac80cb	migration_manager: store `group0_schema_version` in `scylla_local` during schema changes We extend schema mutations with an additional mutation to the `system.scylla_local` table which: - in Raft mode, stores a UUID under the `group0_schema_version` key. - outside Raft mode, stores a tombstone under that key. As we will see in later commits, nodes will use this after applying schema mutations. If the key is absent or has a tombstone, they'll calculate the global schema digest on their own -- using the old way. If the key is present, they'll take the schema version from there. The Raft-mode schema version is equal to the group 0 state ID of this schema command. The tombstone is necessary for the case of performing a schema change in RECOVERY mode. It will force a revert to the old digest-based way. Note that extending schema mutations with a `system.scylla_local` mutation is possible thanks to earlier commits which moved `system.scylla_local` to schema commitlog, so all mutations in the schema mutations vector still go to the same commitlog domain. Also, since we introduce a replicated tombstone to `system.scylla_local`, we need to set GC grace to nonzero. We set it to `schema_gc_grace`, which makes sense given the use case.	2023-12-08 17:45:41 +01:00
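The version-selection rule described in the commit above can be sketched as follows. This is a minimal illustration only, not ScyllaDB's code: `local_param`, `calculate_digest_the_old_way` and `pick_schema_version` are hypothetical names, and the UUID is modelled as a plain string.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the state of the `group0_schema_version` key in
// `system.scylla_local`; these are not ScyllaDB's real types.
struct local_param {
    bool tombstone = false;           // key deleted by a non-Raft (e.g. RECOVERY) schema change
    std::optional<std::string> uuid;  // live value written by a group 0 schema command
};

// Hypothetical placeholder for the old, digest-based calculation.
inline std::string calculate_digest_the_old_way() {
    return "digest-computed-locally";
}

// Pick the global schema version after applying schema mutations:
// an absent key or a tombstone falls back to the digest, a live value is used as-is.
inline std::string pick_schema_version(const std::optional<local_param>& param) {
    if (!param || param->tombstone || !param->uuid) {
        return calculate_digest_the_old_way();
    }
    return *param->uuid;
}
```

Under these assumptions, a node that applied a RECOVERY-mode change sees the tombstone and computes the digest itself, while a node that applied a group 0 command reuses the persisted value.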
Avi Kivity	9c0f05efa1	Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later. This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted. The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained. The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was. This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas. 
Closes scylladb/scylladb#15847 * github.com:scylladb/scylladb: test: tablets: Add test for failed streaming being fenced away error_injection: Introduce poll_for_message() error_injection: Make is_enabled() public api: Add API to kill connection to a particular host range_streamer: Do not block topology change barriers around streaming range_streamer, tablets: Do not keep token metadata around streaming tablets: Fail gracefully when migrating tablet has no pending replica storage_service, api: Add API to disable tablet balancing storage_service, api: Add API to migrate a tablet storage_service, raft topology: Run streaming under session topology guard storage_service, tablets: Use session to guard tablet streaming tablets: Add per-tablet session id field to tablet metadata service: range_streamer: Propagate topology_guard to receivers streaming: Always close the rpc::sink storage_service: Introduce concept of a topology_guard storage_service: Introduce session concept tablets: Fix topology_metadata_guard holding on to the old erm docs: Document the topology_guard mechanism	2023-12-07 16:29:02 +02:00
Avi Kivity	ed2a9b8750	Merge 'Commitlog: Fix reading/writing position calculations and allocation size checks' from Calle Wilund Fixes #16298 The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494 was in fact broken. It calculated (like previously) a "position" based on diff between underlying buffer size and ostream size() (i.e. avail), then adjusted this according to sector overhead rules. However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted. The two cannot be compared as such, which means the "positions" we get here are borked. Luckily for us (sarcasm), the position calculation in replayer made a similar error, in that it adjusts up current position by one sector overhead too much, leading to us more or less getting the same, erroneous results in both ends. However, when/iff one needs to adjust the segment file format further, one might very quickly realize that this does not work well if, say, one needs to be able to safely read some extra bytes before first chunk in a segment. Conversely, trying to adjust this also exposes a latent potential error in the skip mechanism, manifesting here. Issue fixed by keeping track of the initial ostream capacity for segment buffer, and using this for position calculation, and in the case of replayer, moving file pos adjustment from read_data() to a subroutine (shared with skipping) that better handles data stream position vs. file position adjustment. In implementation terms, we first inc the "data stream" pos (i.e. pos in data without overhead), then adjust for overhead. Also fix replayer::skip, so that we handle the buffer/pos relation correctly now. Added test for initial entry position, as well as data replay consistency for single entry_writer paths. Fixes #16301 The calculation on whether data may be added is based on position vs. size of incoming data. 
However, it did not take sector overhead into account, which led us to writing past the allowed segment end, which in turn also leads to metrics overflows. Closes scylladb/scylladb#16302 * github.com:scylladb/scylladb: commitlog: Fix allocation size check to take sector overhead into account. commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart	2023-12-07 12:27:54 +02:00
Calle Wilund	dba39b47bd	commitlog: Fix allocation size check to take sector overhead into account. Fixes #16301 The calculation on whether data may be added is based on position vs. size of incoming data. However, it did not take sector overhead into account, which led us to writing past the allowed segment end, which in turn also leads to metrics overflows.	2023-12-07 07:36:27 +00:00
Calle Wilund	0d35c96ef4	commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart Fixes #16298 The adjusted buffer position calculation in buffer_position(), introduced in #15494 was in fact broken. It calculated (like previously) a "position" based on diff between underlying buffer size and ostream size() (i.e. avail), then adjusted this according to sector overhead rules. However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted. The two cannot be compared as such, which means the "positions" we get here are borked. Luckily for us (sarcasm), the position calculation in replayer made a similar error, in that it adjusts up current position by one sector overhead too much, leading to us more or less getting the same, erroneous results in both ends. However, when/iff one needs to adjust the segment file format further, one might very quickly realize that this does not work well if, say, one needs to be able to safely read some extra bytes before first chunk in a segment. Conversely, trying to adjust this also exposes a latent potential error in the skip mechanism, manifesting here. Issue fixed by keeping track of the initial ostream capacity for segment buffer, and using this for position calculation, and in the case of replayer, moving file pos adjustment from read_data() to a subroutine (shared with skipping) that better handles data stream position vs. file position adjustment. In implementation terms, we first inc the "data stream" pos (i.e. pos in data without overhead), then adjust for overhead. Also fix replayer::skip, so that we handle the buffer/pos relation correctly now. Added test for initial entry position, as well as data replay consistency for single entry_writer paths.	2023-12-07 07:36:27 +00:00
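The two commitlog fixes above both come down to keeping "data stream" positions (no overhead) separate from file positions (with per-sector overhead). A minimal sketch of that arithmetic, assuming a hypothetical layout where every fixed-size sector carries a fixed number of metadata bytes; the constants and function names are illustrative and are not taken from commitlog.cc:

```cpp
#include <cstddef>

// Hypothetical layout constants, chosen only for illustration.
constexpr std::size_t sector_size = 4096;    // on-disk sector size in bytes
constexpr std::size_t sector_overhead = 8;   // metadata bytes stored in each sector
constexpr std::size_t sector_payload = sector_size - sector_overhead;

// First advance the position in the data stream (no overhead), then adjust
// for overhead: every completely filled sector adds `sector_overhead` bytes.
constexpr std::size_t file_position(std::size_t data_pos) {
    return (data_pos / sector_payload) * sector_size + data_pos % sector_payload;
}

// Allocation check that takes sector overhead into account: the entry fits
// only if its end, expressed as a file position, stays within the segment.
constexpr bool fits(std::size_t data_pos, std::size_t entry_size, std::size_t max_file_size) {
    return file_position(data_pos + entry_size) <= max_file_size;
}

static_assert(file_position(sector_payload) == sector_size,
              "a full sector of payload occupies a full sector on disk");
```

With these numbers, an 8192-byte entry written at position 0 into an 8192-byte segment passes a naive `data_pos + entry_size <= max_file_size` check, but `fits` rejects it because the entry spills 16 bytes into a third sector.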
Tomasz Grabiec	d1c1b59236	storage_service, api: Add API to disable tablet balancing Load balancing needs to be disabled before making a series of manual migrations so that we don't fight with the load balancer. Also will be used in tests to ensure tablets stick to expected locations.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	31c995332c	storage_service, raft topology: Run streaming under session topology guard Prevents stale streaming operation from running beyond topology operation they were started in. After the session field is cleared, or changed to something else, the old topology_guard used by streaming is interrupted and fenced and the next barrier will join with any remaining work.	2023-12-06 18:36:17 +01:00
Nadav Har'El	300e549267	tablets, mv: disable self-pairing when tablets are used A write to a base table can generate one or more writes to a materialized view. The write to RF base replicas needs to cause writes to RF view replicas. Our MV implementation, based on Cassandra's implementation, does this via "pairing": Each one of the base replicas involved in this write sends each view update to exactly one view replica. The function get_view_natural_endpoint() tells a base replica which of the view replicas it should send the update to. The standard pairing is based on the ring order: The first owner of the base token sends to the first owner of the view token, the second to the second, and so on. However, the existing code also uses an optimization we call self-pairing: If a single node is both a base replica and a view replica, the pairing is modified so this node sends the update to itself. This patch disables the self-pairing optimization in keyspaces that use tablets: The self-pairing optimization can cause the pairing to change after token ranges are moved between nodes, so it can break base-view consistency in some edge cases, leading to "ghost rows". With tablets, these range movements become even more frequent - they can happen even if the cluster doesn't grow. This is why we want to solve this problem for tablets. For backward compatibility and to avoid sudden inconsistencies emerging during upgrades, we decided to continue using the self-pairing optimization for keyspaces that are not using tablets (i.e., using vnodes). Currently, we don't introduce a "CREATE MATERIALIZED VIEW" option to override these defaults - i.e., we don't provide a way to disable self-pairing with vnodes or to enable it with tablets. We could introduce such a schema flag later, if we ever want to (and I'm not sure we want to). It's important to note that in some cases, this change has implications on when view updates become synchronous, in the tablets case. 
For example: * If we have 3 nodes and RF=3, with the self-pairing optimization each node is paired with itself, the view update is local, and is implicitly synchronous (without requiring a "synchronous_updates" flag). * In the same setup with tablets, without the self-pairing optimization (due to this patch), this is not guaranteed. Some view updates may not be synchronous, i.e., the base write will not wait for the view write. If the user really wants synchronous updates, they should be requested explicitly, with the "synchronous_updates" view option. Fixes #16260. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16272	2023-12-06 17:11:17 +02:00
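The pairing rule described in the commit above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the real get_view_natural_endpoint(): replicas are plain strings already listed in ring order, `paired_view_replica` and `node_id` are hypothetical names, and self-pairing is modelled as "a shared node pairs with itself, the remaining nodes pair by position".

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using node_id = std::string;  // hypothetical stand-in for a host identifier

// Returns the view replica that base replica `me` sends its view update to,
// or std::nullopt if `me` is not a base replica (or has no partner).
// `base` and `view` list the replicas of the base and view token in ring order.
inline std::optional<node_id> paired_view_replica(const node_id& me,
                                                  std::vector<node_id> base,
                                                  std::vector<node_id> view,
                                                  bool self_pairing) {
    auto contains = [](const std::vector<node_id>& v, const node_id& n) {
        return std::find(v.begin(), v.end(), n) != v.end();
    };
    if (self_pairing) {
        // Nodes that are both base and view replicas pair with themselves...
        std::vector<node_id> shared;
        for (const auto& n : base) {
            if (contains(view, n)) {
                shared.push_back(n);
            }
        }
        if (contains(shared, me)) {
            return me;
        }
        // ...and are taken out of the ordered pairing of the remaining nodes.
        auto is_shared = [&](const node_id& n) { return contains(shared, n); };
        base.erase(std::remove_if(base.begin(), base.end(), is_shared), base.end());
        view.erase(std::remove_if(view.begin(), view.end(), is_shared), view.end());
    }
    // Standard pairing: the i-th base replica sends to the i-th view replica.
    const auto it = std::find(base.begin(), base.end(), me);
    if (it == base.end()) {
        return std::nullopt;
    }
    const auto idx = static_cast<std::size_t>(it - base.begin());
    if (idx >= view.size()) {
        return std::nullopt;
    }
    return view[idx];
}
```

In this model, with base replicas {A, B} and view replicas {B, C}, self-pairing sends B to itself and A to C, while plain ring-order pairing sends A to B and B to C. The partner of A therefore depends on whether another node happens to hold both roles, which is the kind of pairing change the commit describes.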
Botond Dénes	d2a88cd8de	Merge 'Typos: fix typos in code' from Yaniv Kaul Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors. Refs: https://github.com/scylladb/scylladb/issues/16255 Closes scylladb/scylladb#16289 * github.com:scylladb/scylladb: Update unified/build_unified.sh Update main.cc Update dist/common/scripts/scylla-housekeeping Typos: fix typos in code	2023-12-06 07:36:41 +02:00
Yaniv Kaul	ae2ab6000a	Typos: fix typos in code Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors. Refs: https://github.com/scylladb/scylladb/issues/16255	2023-12-05 15:18:11 +02:00
Kamil Braun	1763c65662	system_keyspace: make `get/set_scylla_local_param` public We'll use it outside `system_keyspace` code in later commit.	2023-12-05 13:03:29 +01:00
Kamil Braun	07984215a3	feature_service: add `GROUP0_SCHEMA_VERSIONING` feature This feature, when enabled, will modify how schema versions are calculated and stored. - In group 0 mode, schema versions are persisted by the group 0 command that performs the schema change, then reused by each node instead of being calculated as a digest (hash) by each node independently. - In RECOVERY mode or before Raft upgrade procedure finishes, when we perform a schema change, we revert to the old digest-based way, taking into account the possibility of having performed group0-mode schema changes (that used persistent versions). As we will see in future commits, this will be done by storing additional flags and tombstones in system tables. By "schema versions" we mean both the UUIDs returned from `schema::version()` and the "global" schema version (the one we gossip as `application_state::SCHEMA`). For now, in this commit, the feature is always disabled. Once all necessary code is setup in following commits, we will enable it together with Raft.	2023-12-05 13:03:28 +01:00
Benny Halevy	6c00c9a45d	raft: use locator::topology/messaging rather than fb_utilities Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 13:26:46 +02:00
Benny Halevy	63b556123b	db/view: use locator::topology rather than fb_utilities Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 08:55:46 +02:00
Benny Halevy	64145388c9	db/system_keyspace: use topology via db rather than fb_utilities So not to rely on fb_utilities. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 08:42:49 +02:00
Benny Halevy	4bb4d673c3	db/system_keyspace: save_local_info: get broadcast addresses from caller So not to rely on fb_utilities. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 08:42:49 +02:00
Benny Halevy	6e79d647e6	db/hints/manager: use locator::topology rather than fb_utilities Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 08:42:49 +02:00
Benny Halevy	4c20b84680	db/consistency_level: use locator::topology rather than fb_utilities Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 08:42:49 +02:00
Patryk Jędrzejczak	c8ee7d4499	db: make schema commitlog feature mandatory Using consistent cluster management and not using schema commitlog ends with a bad configuration throw during bootstrap. Soon, we will make consistent cluster management mandatory. This forces us to also make schema commitlog mandatory, which we do in this patch. A booting node decides to use schema commitlog if at least one of the two statements below is true: - the node has `force_schema_commitlog=true` config, - the node knows that the cluster supports the `SCHEMA_COMMITLOG` cluster feature. The `SCHEMA_COMMITLOG` cluster feature has been added in version 5.1. This patch is supposed to be a part of version 6.0. We don't support a direct upgrade from 5.1 to 6.0 because it skips two versions - 5.2 and 5.4. So, in a supported upgrade we can assume that the version which we upgrade from has schema commitlog. This means that we don't need to check the `SCHEMA_COMMITLOG` feature during an upgrade. The reasoning above also applies to Scylla Enterprise. Version 2024.2 will be based on 6.0. Probably, we will only support an upgrade to 2024.2 from 2024.1, which is based on 5.4. But even if we support an upgrade from 2023.x, this patch won't break anything because 2023.1 is based on 5.2, which has schema commitlog. Upgrades from 2022.x definitely won't be supported. When we populate a new cluster, we can use the `force_schema_commitlog=true` config to use schema commitlog unconditionally. Then, the cluster feature check is irrelevant. This check could fail because we initiate schema commitlog before we learn about the features. The `force_schema_commitlog=true` config is especially useful when we want to use consistent cluster management. Failing feature checks would lead to crashes during initial bootstraps. Moreover, there is no point in creating a new cluster with `consistent_cluster_management=true` and `force_schema_commitlog=false`. 
It would just cause some initial bootstraps to fail, and after successful restarts, the result would be the same as if we used `force_schema_commitlog=true` from the start. In conclusion, we can unconditionally use schema commitlog without any checks in 6.0 because we can always safely upgrade a cluster and start a new cluster. Apart from making schema commitlog mandatory, this patch adds two changes that are its consequences: - making the unneeded `force_schema_commitlog` config unused, - deprecating the `SCHEMA_COMMITLOG` feature, which is always assumed to be true. Closes scylladb/scylladb#16254	2023-12-04 21:02:16 +02:00
Calle Wilund	75a8be5b87	commitlog.hh: Fix numeric constant for file format version 3 to be actual '3' Fixes #16277 When the PR for 'tagged pages' was submitted for RFC, it was assumed that PR #12849 (compression) would be committed first. The latter introduced v3 format, and the format in #15494 (tagged pages) was assumed to have to be bumped to 4. This ended up not being the case, and I missed that the code went in with file format tag numeric value being '4' (and constant named v3). While not detrimental, it is confusing, and should be changed asap (before anything depends on files with the tag applied). Closes scylladb/scylladb#16278	2023-12-04 21:01:44 +02:00
Nadav Har'El	4505a86f46	tablets, mv: fix base-view pairing to consider base replication map In the view update code, the function get_view_natural_endpoint() determines which view replica this base replica should send an update to. It currently gets the view table's replication map (i.e., the map from view tokens to lists of replicas holding the token), but assumes that this is also the base table's replication map. This assumption was true with vnodes, but is no longer true with tablets - the base table's replication map can be completely different from the view table's. By looking at the wrong mapping, get_view_natural_endpoint() can believe that this node isn't really a base-replica and drop the view update. Alternatively, it can think it is a base replica - but use the wrong base-view pairing and create base-view inconsistencies. This patch solves this bug - get_view_natural_endpoint() now gets two separate replication maps - the base's and the view's. The callers need to remember what the base table was (in some cases they didn't care at the point of the call), and pass it to the function call. This patch also includes a simple test that reproduces the bug, and confirms it is fixed: The test has a 6-node cluster using tablets and a base table with RF=1, and writes one row to it. Before this patch, the code usually gets confused, thinking the base replica isn't a replica and loses the view update. With this patch, the view update works. Fixes #16227. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16228	2023-12-04 16:38:54 +02:00
Avi Kivity	60af2f3cb2	Merge 'New commitlog file format using tagged pages' from Calle Wilund Prototype implementation of format suggested/requested by @avikivity: Divides segments into disk-write-alignment sized pages, each tagged with segment ID + CRC of data content. When read, we both verify sector integrity (CRC) to detect corruption, as well as matching ID read with expected one. If the latter mismatches we have a prematurely terminated segment (read truncation), which, depending on whether the CL is written in batch or periodic mode, as well as explicit sync, can mean data loss. Note: all-zero pages are treated as kosher, both to align with newly allocated segments, as well as fully terminated (zero-page) ones. Note: This is a preview/RFC - the rest of the file format is not modified. At least parts of entry CRC could probably be removed, but I have not done so yet (needs some thinking). Note: Some slight abstraction breaks in impl. and probably less than maximal efficiency. v2: * Removed entry CRC:s in file format. * Added docs on format v3 * Added one more test for recycling-truncation v3: * Fixed typos in size calc and docs * Changed sect metadata order * Explicit iter type Closes scylladb/scylladb#15494 * github.com:scylladb/scylladb: commitlog_test: Add test for replaying large-ish mutation commitlog_test: Add additional test for segment truncation docs: Add docs on commitlog format 3 commitlog: Remove entry CRC from file format commitlog: Implement new format using CRC:ed sectors commitlog: Add iterator adaptor for doing buffer splitting into sub-page ranges fragmented_temporary_buffer: Add const iterator access to underlying buffers commitlog_replayer: differentiate between truncated file and corrupt entries	2023-12-04 13:31:13 +01:00
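The read-side checks described for tagged pages (all-zero pages accepted, CRC mismatch meaning corruption, segment ID mismatch meaning a truncated segment) can be sketched as below. The page layout is an assumption made for illustration only: payload first, followed by a 64-bit segment ID and a 32-bit CRC over the payload, both in host byte order. The names `make_page`, `classify_page` and `crc32_of` are hypothetical; the actual on-disk format is described in the commitlog format 3 docs added by this series.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class page_status { ok, zero, corrupt, truncated };

// Assumed trailer: 8-byte segment ID followed by 4-byte CRC.
constexpr std::size_t trailer_size = sizeof(std::uint64_t) + sizeof(std::uint32_t);

// Plain bitwise CRC-32 (reflected polynomial 0xEDB88320), used here only to
// keep the sketch dependency-free.
inline std::uint32_t crc32_of(const std::uint8_t* data, std::size_t n) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Build one page: zero-padded payload, then segment ID, then CRC of the payload area.
inline std::vector<std::uint8_t> make_page(const std::vector<std::uint8_t>& payload,
                                           std::uint64_t segment_id, std::size_t page_size) {
    assert(page_size > trailer_size && payload.size() <= page_size - trailer_size);
    std::vector<std::uint8_t> page(page_size, 0);
    std::copy(payload.begin(), payload.end(), page.begin());
    const std::size_t data_len = page_size - trailer_size;
    const std::uint32_t crc = crc32_of(page.data(), data_len);
    std::memcpy(page.data() + data_len, &segment_id, sizeof segment_id);
    std::memcpy(page.data() + data_len + sizeof segment_id, &crc, sizeof crc);
    return page;
}

// Classify a page read back from a segment with the given expected ID.
inline page_status classify_page(const std::vector<std::uint8_t>& page, std::uint64_t expected_id) {
    assert(page.size() > trailer_size);
    // All-zero pages are accepted: newly allocated or zero-terminated segments.
    if (std::all_of(page.begin(), page.end(), [](std::uint8_t b) { return b == 0; })) {
        return page_status::zero;
    }
    const std::size_t data_len = page.size() - trailer_size;
    std::uint64_t id = 0;
    std::uint32_t crc = 0;
    std::memcpy(&id, page.data() + data_len, sizeof id);
    std::memcpy(&crc, page.data() + data_len + sizeof id, sizeof crc);
    if (crc != crc32_of(page.data(), data_len)) {
        return page_status::corrupt;    // sector integrity check failed
    }
    if (id != expected_id) {
        return page_status::truncated;  // intact page left over from another segment
    }
    return page_status::ok;
}
```

Checking the CRC before the ID means a stale but intact page (for example, one left in a recycled file) is reported as truncation rather than corruption, which matches the distinction the replayer commit in this series draws between truncated files and corrupt entries.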
Yaniv Kaul	2b73793a39	Update db/view/view.cc	2023-12-03 10:07:45 +02:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Piotr Smaroń	5fd30578d7	config: introduce value_status::Deprecated Current mechanism to deprecate config options is implemented in a hacky way in `main.cpp` and doesn't account for existing `db::config/boost::po` API controlling lifetime of config options, hence it's being replaced in this PR by adding yet another `value_status` enumerator: `Deprecated`, so that deprecation of config options is controlled in one place in `config.cc`,i.e. when specifying config options. Motivation: https://docs.google.com/document/d/18urPG7qeb7z7WPpMYI2V_lCOkM5YGKsEU78SDJmt8bM/edit?usp=sharing With this change, if a `Deprecated` config option is specified as 1. a command line parameter, scylla will run and log: ``` WARN 2023-11-25 23:37:22,623 [shard 0:main] init - background-writer-scheduling-quota option ignored (deprecated) ``` (Previously it was only a message printed to standard output, not a scylla log of warn level). 2. an option in `scylla.yaml`, scylla will run and log: ``` WARN 2023-11-27 23:55:13,534 [shard 0:main] init - Option is deprecated : background_writer_scheduling_quota ``` Fixes #15887 Incorporates dropped https://github.com/scylladb/scylladb/pull/15928 Closes scylladb/scylladb#16184	2023-11-30 08:52:57 +03:00
Avi Kivity	8e9d3af431	Merge 'Commitlog: complete prerequisites and enforce hard limit by default' from Eliran Sinvani This miniset completes the prerequisites for enabling the commitlog hard limit by default. Namely, start flushing and evacuating segments halfway to the limit in order to never hit it under normal circumstances. It is worth mentioning that hitting the limit is an exceptional condition whose root cause needs to be resolved; however, once we do hit the limit, the performance impact inflicted by this enforcement is irrelevant. Tests: unit tests. LWT write test (#9331) Whitebox testing was performed by @wmitros; the test aimed at putting as much pressure as possible on the commitlog segments by using a write pattern that rewrites the partitions in the memtable keeping it at ~85% occupancy so the dirty memory manager will not kick in. The test compared 3 configurations: 1. The default configuration 2. Hard limit on (without changing the flush threshold) 3. the changes in this PR applied. The last exhibited the "best" behavior in terms of metrics: the graphs were the flattest and less jagged than the others. Closes scylladb/scylladb#10974 * github.com:scylladb/scylladb: commitlog: enforce commitlog size hard limit by default commitlog: set flush threshold to half of the limit size commitlog: unfold flush threshold assignment	2023-11-29 20:55:53 +02:00
Kamil Braun	8a14839a00	Merge 'handle more failures during topology operations' from Gleb This series adds handling for more failures during a topology operation (we already handle a failure during streaming). Here we add handling of tablet draining errors by aborting the operation and handling of errors after streaming where an operation cannot be aborted any longer. If the error happens when rollback is no longer possible we wait for ring delay and proceed to the next step. Each individual patch that adds the sleep has an explanation of what the consequences of the patch are. * 'gleb/topology-coordinator-failures' of github.com:scylladb/scylla-dev: test: add test to check error handling during tablet draining test: fix test_topology_streaming_failure test to not grep the whole file storage_service: add error injection into the tablet migration code storage_service: topology coordinator: rollback on handle_tablet_migration failure during tablet_draining stage storage_service: topology coordinator: do not retry the metadata barrier forever in write_both_read_new state storage_service: topology coordinator: do not retry the metadata barrier forever in left_token_ring state storage_service: topology coordinator: return a node that is being removed from get_excluded_nodes storage_service: topology_coordinator: use new rollback_to_normal state in the rollback procedure storage_service: topology coordinator: add rollback_to_normal node state storage_service: topology coordinator: put fence version into the raft state storage_service: topology coordinator: do fencing even if draining failed	2023-11-29 19:02:35 +01:00
Nadav Har'El	62f89d49e5	tablets, mv: fix on_internal_error on write to base table This situation before this patch is that when tablets are enabled for a keyspace, we can create a materialized view but later any write to the base table fails with an on_internal_error(), saying that: "Tried to obtain per-keyspace effective replication map of test but it's per-table." Indeed, with tablets, the replication is different for each table - it's not the same for the entire keyspace. So this patch changes the view update code to take the replication map from the specific base table, not the keyspace. This is good enough to get materialized-views reads and writes working in a simple single-node case, as the included test demonstrates (the test fails with on_internal_error() before this patch, and passes afterwards). But this fix is not perfect - the base-view pairing code really needs to consider not only the base table's replication map, but also the view table's replication map - as those can be different. We'll fix this remaining problem as a followup in a separate patch - it will require a substantially more elaborate test to reproduce the need for the different mapping and to verify that fix. Fixes #16209. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16211	2023-11-29 15:29:17 +01:00
Calle Wilund	3b70fde3cd	commitlog: Make named_files in delete_segments have updated size Fixes #16207 commitlog::delete_segments deletes (or recycles) segments replayed. The actual file size here is added to footprint so actual delete then can determine iff things should be recycled or removed. However, we build a pending delete list of named_files, and the files we added did not have size set. Bad. Actual deletion then treated files as zero-byte sized, i.e. footprint calculations borked. Simple fix is just filling in the size of the objects when adding. Added unit test for the problem. Closes scylladb/scylladb#16210	2023-11-29 09:58:47 +02:00
Botond Dénes	3ed6925673	Merge 'Major compaction: flush commitlog by forcing new active segment and flushing all tables' from Benny Halevy Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database releases all references to commitlog segments and thereby maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for each keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Closes scylladb/scylladb#15820 * github.com:scylladb/scylladb: docs: nodetool: flush: enrich examples docs: nodetool: compact: fix example api: add /storage_service/compact api: add /storage_service/flush compaction_manager: flush_all_tables before major compaction database: add flush_all_tables api: compaction: add flush_memtables option test/nodetool: jmx: fix path to scripts/scylla-jmx scylla-nodetool, docs: improve optional params documentation	2023-11-29 08:48:40 +02:00
Kamil Braun	3582095b79	schema_tables: use smaller timestamp for base mutations included with view update When a view schema is changed, the schema change command also includes mutations for the corresponding base table; these mutations don't modify the base schema but are included in case the receiver of view mutations somehow didn't receive base mutations yet (this may in theory happen outside Raft mode). There are situations where the schema change command contains both mutations that describe the current state of the base table -- included by a view update, as explained above -- and mutations that want to modify the base table. Such a situation arises, for example, when we update a user-defined type which is referenced by both a view and its corresponding base table. This triggers a schema change of the view, which generates mutations to modify the view and includes mutations of the current base schema, and at the same time it triggers a schema change of the base, which generates mutations to modify the base. These two sets of mutations are conflicting with each other. One set wants to preserve the current state of the base table while the other wants to modify it. And the two sets of mutations are generated using the same timestamp, which means that conflict resolution between them is made on a per-mutation-cell basis, comparing the values in each cell and taking the "larger" one (meaning of "larger" depends on the type of each cell). Fortunately, this conflict is currently benign -- or at least there is no known situation where it causes problems. Unfortunately, it started causing problems when I attempted to implement group 0 schema versioning (PR scylladb/scylladb#15331), where instead of calculating table versions as hashes of schema mutations, we would send versions as part of schema change command. These versions would be stored inside the `system_schema.scylla_tables` table, `version` column, and sent as part of schema change mutations. 
And then the conflict showed. One set of mutations wanted to preserve the old value of `version` column while the other wanted to update it. It turned out that sometimes the old `version` prevailed, because the `version` column in `system_schema.scylla_tables` uses UUID-based comparison (not timeuuid-based comparison). This manifested as issue scylladb/scylladb#15530. To prevent this, the idea in this commit is simple: when generating mutations for the base table as part of corresponding view update, do not use the provided timestamp directly -- instead, decrement it by one. This way, if the schema change command contains mutations that want to modify the base table, these modifying mutations will win all conflicts based on the timestamp alone (they are using the same provided timestamp, but not decremented). One could argue that the choice of this timestamp is anyway arbitrary. The original purpose of including base mutations during view update was to ensure that a node which somehow missed the base mutations, gets them when applying the view. But in that case, the "most correct" solution should have been to use the original base mutations -- i.e. the ones that we have on disk -- instead of generating new mutations for the base with a refreshed timestamp. The base mutations that we have on disk have smaller timestamps already (since these mutations are from the past, when the base was last modified or created), so the conflict would also not happen in this case. But that solution would require doing a disk read, and we can avoid the read while still fixing the conflict by using an intermediate solution: regenerating the mutations but with `timestamp - 1`. Ref: scylladb/scylladb#15530 Closes scylladb/scylladb#16139	2023-11-28 21:51:18 +01:00
Benny Halevy	66ba983fe0	compaction_manager: flush_all_tables before major compaction Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database releases all references to commitlog segments and thereby maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for each keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Botond Dénes	f46cdce9d3	Merge 'Make memtable flush tolerate misconfigured S3 storage' from Pavel Emelyanov Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done runtime, scylla re-reads this config upon HUP signal. Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well. fixes: #13745 Closes scylladb/scylladb#15635 * github.com:scylladb/scylladb: test: Add object_store test to validate config reloading works test: Add config update facility to test cluster test: Make S3_Server export config file as pathlib.Path config: Make object storage config updateable_value_source memtable: Extend list of checking codes sstables/storage/s3: Fix missing TOC status check s3/client: Map http exceptions into storage_io_error exceptions: Extend storage_io_error construction options	2023-11-28 09:33:37 +02:00
Botond Dénes	a472700309	Merge 'Minor fixes and refactors' from Kamil Braun - remove some code that is obsolete in newer Scylla versions, - fix some minor bugs. These bugs appear to be benign, there are no known issues caused by them, but fixing them is a good idea nevertheless, - refactor some code for better maintainability. Parts of this PR were extracted from https://github.com/scylladb/scylladb/pull/15331 (which was merged but later reverted), parts of it are new. Closes scylladb/scylladb#16162 * github.com:scylladb/scylladb: test/pylib: log_browsing: fix type hint migration_manager: take `abort_source&` in get_schema_for_read/write migration_manager: inline merge_schema_in_background migration_manager: remove unused merge_schema_from overload migration_manager: assume `canonical_mutation` support migration_manager: add `std::move` to avoid a copy schema_tables: refactor `scylla_tables(schema_features)` schema_tables: pass `reload` flag when calling `merge_schema` cross-shard system_keyspace: fix outdated comment	2023-11-24 17:34:21 +02:00
Kamil Braun	269a189526	schema_tables: refactor `scylla_tables(schema_features)` The `scylla_tables` function gives a different schema definition for the `system_schema.scylla_tables` table, depending on whether certain schema features are enabled or not. The way it was implemented, we had to write `θ(2^n)` amount of code and comments to handle `n` features. Refactor it so that the amount of code we have to write to handle `n` features is `θ(n)`.	2023-11-23 17:23:47 +01:00
Gleb Natapov	95dd0e453d	storage_service: topology coordinator: add rollback_to_normal node state When a topology coordinator rolls back from an unsuccessful topology operation it advances the fence (which is now in the raft state) after moving to normal state. We do not want this to fail (only a majority of nodes is needed for it not to), but currently it may fail in case the coordinator moves to another node after changing the rollback node's state to normal, but before updating the fence. To solve that the rollback operation needs to go through a new rollback_to_normal state that will do the fencing before moving to normal. This patch introduces that state, but does not use it yet.	2023-11-23 15:27:28 +02:00
Kamil Braun	5223d32fab	schema_tables: pass `reload` flag when calling `merge_schema` cross-shard In `0c86abab4d` `merge_schema` obtained a new flag, `reload`. Unfortunately, the flag was assigned a default value, which I think is almost always a bad idea, and indeed it was in this case. When `merge_schema` is called on shard different than 0, it recursively calls itself on shard 0. That recursive call forgot to pass the `reload` flag. Fix this.	2023-11-23 14:06:40 +01:00
Kamil Braun	de3607810d	system_keyspace: fix outdated comment	2023-11-23 14:06:27 +01:00
Kefu Chai	55103f4a6b	hints: move formatter of db::hints::sync_point to test the operator<<() based formatter is only used in its test, so let's move it to where it is used. we can always bring it back later if it is required in other places. but better off implementing it as a fmt::formatter<> then. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16142	2023-11-23 11:22:31 +02:00
Kefu Chai	6749d963ed	config: define formatter for db::seed_provider_type before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for db::seed_provider_type. please note, we are still formatting vector<db::seed_provider_type> with the helper provided by seastar/core/sstring.hh, which uses operator<<() to print the elements in the vector being printed. so we have to keep the operator<< formatter before disabling the generic formatter for vector<T>. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16138	2023-11-23 11:04:35 +02:00
Eliran Sinvani	bfa839ce92	commitlog: enforce commitlog size hard limit by default Since the commitlog size hard limit is a failsafe mechanism, we don't expect to ever hit it. If we do hit the limit, it means that we have an exceptional condition in the system. Hence, the impact of enforcing the commitlog hard limit is irrelevant. Here we enforce the limit by default. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-11-22 08:48:28 +02:00
Eliran Sinvani	63d62a7db2	commitlog: set flush threshold to half of the limit size Once we enable commitlog hard limit by default, we would like to have some room in case flushing memtables takes some time to catch up. This threshold is half the limit. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-11-22 08:48:28 +02:00
Eliran Sinvani	d2a8651bce	commitlog: unfold flush threshold assignment This commit is only a cosmetic change. It is meant to make the flush threshold assignment more readable and comprehensible so future changes are easier to review. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-11-22 08:48:28 +02:00
Pavel Emelyanov	210b01a5ce	config: Make object storage config updateable_value_source Now it's a plain updateable_value, but without the ..._source object the updateable_value is just a no-op value holder. In order for the observers to operate there must be the value source, updating it would update the attached updateable values _and_ notify the observers. In order for the config to be the u.v._source, config entries should be comparable to each other, thus the <=> operator for it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-11-21 16:47:50 +03:00
Calle Wilund	6b66daabfc	commitlog: Remove entry CRC from file format Since CRC is already handled by disk blocks, we can remove some of the entry CRC:ing, both simplifying code and making at least that part of both write and read faster.	2023-11-21 08:50:57 +00:00
Calle Wilund	e29bf6f9e8	commitlog: Implement new format using CRC:ed sectors Breaks the file into individually tagged + crc:ed pages. Each page (sized as disk write alignment) gets a trailing 12-byte metadata, including CRC of the first page-12 bytes, and the ID of the segment being written. When reading, each page read is CRC:ed and checked to be part of the expected segment by comparing ID:s. If crc is broken, we have broken data. If crc is ok, but ID does not match, we have a prematurely terminated segment (truncated), which, depending on whether we use batch mode or not, implies data loss.	2023-11-21 08:50:54 +00:00
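
The page check described in the last commit above (`e29bf6f9e8`) can be sketched roughly as follows. This is a minimal illustration, not the Scylla implementation: the 4096-byte page size, the trailer layout (8-byte segment ID followed by a 4-byte CRC, little-endian), the use of `zlib.crc32`, and the names `seal_page` and `check_page` are all assumptions made for the example. The commit message only states that the trailer is 12 bytes and holds a CRC of the first page-12 bytes plus the segment ID.

```python
import struct
import zlib

PAGE_SIZE = 4096                    # assumed disk write alignment
TRAILER = struct.Struct("<QI")      # assumed layout: 8-byte segment ID + 4-byte CRC (12 bytes)
PAYLOAD_SIZE = PAGE_SIZE - TRAILER.size


def seal_page(payload: bytes, segment_id: int) -> bytes:
    """Pad the payload to page-12 bytes and append the 12-byte trailer."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(PAYLOAD_SIZE, b"\0")
    # The CRC covers only the first page-12 bytes, not the trailer itself.
    return body + TRAILER.pack(segment_id, zlib.crc32(body))


def check_page(page: bytes, expected_segment_id: int) -> str:
    """Classify a page read back from disk as 'ok', 'corrupt' or 'truncated'."""
    body, trailer = page[:PAYLOAD_SIZE], page[PAYLOAD_SIZE:]
    segment_id, crc = TRAILER.unpack(trailer)
    if zlib.crc32(body) != crc:
        return "corrupt"            # CRC mismatch: broken data
    if segment_id != expected_segment_id:
        return "truncated"          # valid page left over from another segment
    return "ok"
```

A reader would call `check_page` on each page in order and stop at the first page that is not `"ok"`; a `"truncated"` result marks the point where the expected segment ended early.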


3498 Commits