scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 18:40:38 +00:00

Author	SHA1	Message	Date
Kamil Braun	acfb6bf3ed	topology_state_machine: introduce `global_topology_request` `topology` currently contains the `requests` map, which is suitable for node-specific requests such as "this node wants to join" or "this node must be removed". But for requests for operations that affect the cluster as a whole, a separate request type and field is more appropriate. Introduce one. The enum currently contains the option `new_cdc_generation` for requests to create a new CDC generation in the cluster. We will implement the whole procedure in later commits.	2023-05-08 16:46:14 +02:00
Kamil Braun	93dcdcd4eb	raft topology: make `new_cdc_generation_data_uuid` topology-global - make it a static column in `system.topology` - move it from node-specific `ring_slice` to cluster-global `topology` We will use it in scenarios where no node is transitioning. Also make it `std::optional` in topology for consistency with other fields (previously, the 'no value' state for this field was represented using default-constructed `utils::UUID`).	2023-05-08 16:46:14 +02:00
Botond Dénes	48b9f31a08	Merge 'db, sstable: use generation_type instead of its value when appropriate' from Kefu Chai in this series, we try to use `generation_type` as a proxy to hide the consumers from its underlying type. this paves the road to the UUID based generation identifier. as by then, we cannot assume the type of the `value()` without asking `generation_type` first. better off leaving all the formatting and conversions to the `generation_type`. also, this series changes the "generation" column of sstable registry table to "uuid", and convert the value of it to the original generation_type when necessary, this paves the road to a world with UUID based generation id. Closes #13652 * github.com:scylladb/scylladb: db: use uuid for the generation column in sstable registry table db, sstable: add operator data_value() for generation_type db, sstable: print generation instead of its value	2023-05-03 09:04:54 +03:00
Kefu Chai	74e9e6dd1a	db: use uuid for the generation column in sstable registry table * change the "generation" column of sstable registry table from bigint to uuid * from helper to convert UUID back to the original generation in the long run, we encourage users to use uuid based generation identifier. but in the transition period, both bigint based and uuid based identifiers are used for the generation. so to cater to both needs, we use a hackish way to store the integer into UUID. to differentiate the was-integer UUID from the genuine UUID, we check the UUID's most_significant_bits. because we only support serializing UUID v1, so if the timestamp in the UUID is zero, we assume the UUID was generated from an integer when converting it back to a generation identifier. also, please note, the only use case of using generation as a column is the sstable_registry table, but since its schema is fixed, we cannot store both a bigint and a UUID as the value of its `generation` column, the simpler way forward is to use a single type for the generation. to be more efficient and to preserve the type of the generation, instead of using types like ascii string or bytes, we will always store the generation as a UUID in this table, if the generation's identifier is an int64_t, the value of the integer will be used as the least significant bits of the UUID. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-05-02 19:23:22 +08:00
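The integer-in-UUID scheme described in 74e9e6dd1a can be illustrated with a minimal Python sketch. The function names `generation_to_uuid` and `uuid_to_generation` are hypothetical and do not appear in the ScyllaDB sources. The restriction to non-negative integers is also an assumption made to keep the example short.

```python
import uuid
from typing import Union

# Bits of the most significant half that hold the UUID v1 timestamp
# (time_low, time_mid, and the low 12 bits of time_hi_and_version).
# The 4-bit version nibble is masked out.
_TIMESTAMP_MASK = 0xFFFFFFFFFFFF0FFF


def generation_to_uuid(generation: int) -> uuid.UUID:
    """Store an integer generation in the least significant 64 bits of a UUID.

    Assumption for this sketch: only non-negative values below 2**63.
    """
    if not 0 <= generation < 2**63:
        raise ValueError("generation out of range for this sketch")
    # The most significant 64 bits stay zero, so the timestamp is zero.
    return uuid.UUID(int=generation)


def uuid_to_generation(value: uuid.UUID) -> Union[int, uuid.UUID]:
    """Return the integer if the UUID was built from one, else the UUID itself."""
    msb = value.int >> 64
    if msb & _TIMESTAMP_MASK == 0:
        # Zero timestamp: treat it as a "was-integer" UUID.
        return value.int & (2**64 - 1)
    return value


packed = generation_to_uuid(42)
print(packed, uuid_to_generation(packed))
```

A UUID produced by `uuid.uuid1()` carries a non-zero timestamp, so it passes through `uuid_to_generation` unchanged.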
Tomasz Grabiec	aba5667760	Merge 'raft topology: refactor the coordinator to allow non-node specific topology transitions' from Kamil Braun We change the meaning and name of `replication_state`: previously it was meant to describe the "state of tokens" of a specific node; now it describes the topology as a whole - the current step in the 'topology saga'. It was moved from `ring_slice` into `topology`, renamed into `transition_state`, and the topology coordinator code was modified to switch on it first instead of node state - because there may be no single transitioning node, but the topology itself may be transitioning. This PR was extracted from #13683, it contains only the part which refactors the infrastructure to prepare for non-node specific topology transitions. Closes #13690 * github.com:scylladb/scylladb: raft topology: rename `update_replica_state` -> `update_topology_state` raft topology: remove `transition_state::normal` raft topology: switch on `transition_state` first raft topology: `handle_ring_transition`: rename `res` to `exec_command_res` raft topology: parse replaced node in `exec_global_command` raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on` storage_service: extract raft topology coordinator fiber to separate class raft topology: rename `replication_state` to `transition_state` raft topology: make `replication_state` a topology-global state	2023-04-30 10:55:24 +02:00
Kefu Chai	ba8402067f	db, sstable: add operator data_value() for generation_type so we can apply `execute_cql()` on `generation_type` directly without extracting its value using `generation.value()`. this paves the road to adding UUID based generation id to `generation_type`. as by then, we will have both UUID based and integer based `generation_type`, so `generation_type::value()` will not be able to represent its value anymore. and this method will be replaced by `operator data_value()` in this use case. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-28 20:39:12 +08:00
Kefu Chai	ae9aa9c4bd	db, sstable: print generation instead of its value this change prepares for the change to use `variant<UUID, int64_t>` as the value of `generation_type`. as after this change, the "value" of a generation would be a UUID or an integer, and we don't want to expose the variant in generation's public interface. so the `value()` method would be changed or removed by then. this change takes advantage of the fact that the formatter of `generation_type` always prints its value. also, it's better to reuse `generation_type` formatter when appropriate. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-28 20:39:12 +08:00
Kamil Braun	22ab5982e7	raft topology: remove `transition_state::normal` What this state really represented is that there is currently no transition. So remove it and make `transition_state` optional instead.	2023-04-27 15:18:32 +02:00
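The idea in 22ab5982e7 of replacing a `normal` enumerator with an optional value can be sketched as follows. This is an illustrative Python model, not the C++ code from the commit. The class names are hypothetical, and the list of states is limited to the ones named elsewhere in this log rather than being exhaustive.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TransitionState(Enum):
    # Only states mentioned in this log; the real enum may contain others.
    COMMIT_CDC_GENERATION = auto()
    WRITE_BOTH_READ_OLD = auto()
    WRITE_BOTH_READ_NEW = auto()


@dataclass
class Topology:
    # None means "no transition in progress", which is what the removed
    # `normal` state used to represent.
    transition_state: Optional[TransitionState] = None

    def is_transitioning(self) -> bool:
        return self.transition_state is not None


idle = Topology()
busy = Topology(TransitionState.WRITE_BOTH_READ_OLD)
print(idle.is_transitioning(), busy.is_transitioning())
```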
Kamil Braun	defa63dc20	raft topology: rename `replication_state` to `transition_state` The new name is more generic - it describes the current step of a 'topology saga' (a sequence of steps used to implement a larger topology operation such as bootstrap).	2023-04-27 11:39:38 +02:00
Kamil Braun	af1ea2bb16	raft topology: make `replication_state` a topology-global state Previously it was part of `ring_slice`, belonging to a specific node. This commit moves it into `topology`, making it a cluster-global property. The `replication_state` column in `system.topology` is now `static`. This will allow us to easily introduce topology transition states that do not refer to any specific node. `commit_cdc_generation` will be such a state, allowing us to commit a new CDC generation even though all nodes are normal (none are transitioning). One could argue that the other states are conceptually already cluster-global: for example, `write_both_read_new` doesn't affect only the tokens of a bootstrapping (or decommissioning etc.) node; it affects replica sets of other tokens as well (with RFs greater than 1).	2023-04-27 11:39:38 +02:00
Kamil Braun	30cc07b40d	Merge 'Introduce tablets' from Tomasz Grabiec This PR introduces an experimental feature called "tablets". Tablets are a way to distribute data in the cluster, which is an alternative to the current vnode-based replication. Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. With tablets, the aim is to start from a different side. Divide resources of replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets). This will allow us to balance the load in a more flexible manner, by moving individual tablets around. Also, unlike with vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind raft groups to tablets. Those goals are not yet achieved with this PR, but it lays the ground for this. Things achieved in this PR: - You can start a cluster and create a keyspace whose tables will use tablet-based replication. This is done by setting `initial_tablets` option: ``` CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3, 'initial_tablets': 8}; ``` All tables created in such a keyspace will be tablet-based. Tablet-based replication is a trait, not a separate replication strategy. Tablets don't change the spirit of replication strategy, it just alters the way in which data ownership is managed. In theory, we could use it for other strategies as well like EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy is augmented to support tablets. 
- You can create and drop tablet-based tables (no DDL language changes) - DML / DQL work with tablet-based tables Replicas for tablet-based tables are chosen from tablet metadata instead of token metadata Things which are not yet implemented: - handling of views, indexes, CDC created on tablet-based tables - sharding is done using the old method, it ignores the shard allocated in tablet metadata - node operations (topology changes, repair, rebuild) are not handling tablet-based tables - not integrated with compaction groups - tablet allocator piggy-backs on tokens to choose replicas. Eventually we want to allocate based on current load, not statically Closes #13387 * github.com:scylladb/scylladb: test: topology: Introduce test_tablets.py raft: Introduce 'raft_server_force_snapshot' error injection locator: network_topology_strategy: Support tablet replication service: Introduce tablet_allocator locator: Introduce tablet_aware_replication_strategy locator: Extract maybe_remove_node_being_replaced() dht: token_metadata: Introduce get_my_id() migration_manager: Send tablet metadata as part of schema pull storage_service: Load tablet metadata when reloading topology state storage_service: Load tablet metadata on boot and from group0 changes db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() migration_notifier: Introduce before_drop_keyspace() migration_manager: Make prepare_keyspace_drop_announcement() return a future<> test: perf: Introduce perf-tablets test: Introduce tablets_test test: lib: Do not override table id in create_table() utils, tablets: Introduce external_memory_usage() db: tablets: Add printers db: tablets: Add persistence layer dht: Use last_token_of_compaction_group() in split_token_range_msb() locator: Introduce tablet_metadata dht: Introduce first_token() dht: Introduce next_token() storage_proxy: Improve trace-level logging locator: token_metadata: Fix confusing comment on ring_range() 
dht, storage_proxy: Abstract token space splitting Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() db: Introduce get_non_local_vnode_based_strategy_keyspaces() service: storage_proxy: Avoid copying keyspace name in write handler locator: Introduce per-table replication strategy treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type locator: Introduce effective_replication_map locator: Rename effective_replication_map to vnode_effective_replication_map locator: effective_replication_map: Abstract get_pending_endpoints() db: Propagate feature_service to abstract_replication_strategy::validate_options() db: config: Introduce experimental "TABLETS" feature db: Log replication strategy for debugging purposes db: Log full exception on error in do_parse_schema_tables() db: keyspace: Remove non-const replication strategy getter config: Reformat	2023-04-27 09:40:18 +02:00
Tomasz Grabiec	ce94a2a5b0	Merge 'Fixes and tests for raft-based topology changes' from Kamil Braun Fix two issues with the replace operation introduced by recent PRs. Add a test which performs a sequence of basic topology operations (bootstrap, decommission, removenode, replace) in a new suite that enables the `raft` experimental feature (so that the new topology change coordinator code is used). Fixes: #13651 Closes #13655 * github.com:scylladb/scylladb: test: new suite for testing raft-based topology test: remove topology_custom/test_custom.py raft topology: don't require new CDC generation UUID to always be present raft topology: include shard_count/ignore_msb during replace	2023-04-26 11:38:07 +02:00
Kamil Braun	3f0498ca53	raft topology: don't require new CDC generation UUID to always be present During node replace we don't introduce a new CDC generation, only during regular bootstrap. Instead of checking that `new_cdc_generation_uuid` must be present whenever there's a topology transition, only check it when we're in `commit_cdc_generation` state.	2023-04-24 14:41:33 +02:00
Tomasz Grabiec	9d786c1ebc	db: tablets: Add persistence layer	2023-04-24 10:49:37 +02:00
Botond Dénes	2d8d8043be	Merge 'Coroutinize system_keyspace::get_compaction_history' from Pavel Emelyanov Closes #13620 * github.com:scylladb/scylladb: system_keyspace: Fix indentation after previous patch system_keyspace: Coroutinize get_compaction_history()	2023-04-24 09:48:01 +03:00
Benny Halevy	2d20ee7d61	gms: version_generator: define version_type and generation_type strong types Derived from utils::tagged_integer, using different tags, the types are incompatible with each other and require explicit typecasting to- and from- their value type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:47:17 +03:00
Benny Halevy	d1817e9e1b	utils: move generation-number to gms Although get_generation_number implementation is completely generic, it is used exclusively to seed the gossip generation number. Following patches will define a strong gms::generation_id type and this function should return it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:37:32 +03:00
Tomasz Grabiec	bd0b299322	Merge 'Manage CDC generations when bootstrapping nodes using Raft Group 0 topology coordinator' from Kamil Braun Introduce a new table `CDC_GENERATIONS_V3` (`system.cdc_generations_v3`). The table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The difference is that V2 lives in `system_distributed_keyspace` and writes to it are distributed using regular `storage_proxy` replication mechanisms based on the token ring. The V3 table lives in `system_keyspace` and any mutations written to it will go through group 0. Extend the `TOPOLOGY` schema with new columns: - `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping node's `ring_slice`, it stores UUID of a newly introduced CDC generation which is used as partition key for the `CDC_GENERATIONS_V3` table to access this new generation's data. It's a regular column, meaning that every row (corresponding to a node) will have its own. - `current_cdc_generation_uuid` and `current_cdc_generation_timestamp` together form the ID of the newest CDC generation in the cluster. (the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is when the CDC generation starts operating). Those are static columns since there's a single newest CDC generation. When topology coordinator handles a request for node to join, calculate a new CDC generation using the bootstrapping node's tokens, translate it to mutation format, and insert this mutation to the CDC_GENERATIONS_V3 table through group 0 at the same time we assign tokens to the node in Raft topology. The partition key for this data is stored in the bootstrapping node's `ring_slice`. After inserting new CDC generation data, we need to pick a timestamp for this generation and commit it, telling all nodes in the cluster to start using the generation for CDC log writes once their clocks cross that timestamp. We introduce a separate step to the bootstrap saga, before `write_both_read_old`, called `commit_cdc_generation`.
In this step, the coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping node's `ring_slice` - which serves as the key to the table where the CDC generation data is stored - and combines it with a timestamp which it generates a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by default 1 minute). This gives us a CDC generation ID which we commit into the topology state as the `current_cdc_generation_id` while switching the saga to the next step, `write_both_read_old`. Once a new CDC generation is committed to the cluster by the topology coordinator, we also need to publish it to the user-facing description tables so CDC applications know which streams to read from. This uses regular distributed table writes underneath (tables living in the `system_distributed` keyspace) so it requires `token_metadata` to be nonempty. We need a hack for the case of bootstrapping the first node in the cluster - turning the tokens into normal tokens earlier in the procedure in `token_metadata`, but this is fine for the single-node case since no streaming is happening. When a node notices that a new CDC generation was introduced in `storage_service::topology_state_load`, it updates its internal data structures that are used when coordinating writes to CDC log tables. We include the current CDC generation data in topology snapshot transfers. Some fixes and refactors included. 
Closes #13385 * github.com:scylladb/scylladb: docs: cdc: describe generation changes using group 0 topology coordinator cdc: generation_service: add a FIXME cdc: generation_service: add legacy_ prefix for gossiper-based functions storage_service: include current CDC generation data in topology snapshots db: system_keyspace: introduce `query_mutations` with range/slice storage_service: hold group 0 apply mutex when reading topology snapshot service: raft_group0_client: introduce `hold_read_apply_mutex` storage_service: use CDC generations introduced by Raft topology raft topology: publish new CDC generation to the user description tables raft topology: commit a new CDC generation on node bootstrap raft topology: create new CDC generation data during node bootstrap service: topology_state_machine: make topology::find const db: system_keyspace: small refactor of `load_topology_state` cdc: generation: extract pure parts of `make_new_generation` outside db: system_keyspace: add storage for CDC generations managed by group 0 service: topology_state_machine: better error checking for state name (de)serialization service: raft: plumbing `cdc::generation_service&` cdc: generation: `get_cdc_generation_mutations`: take timestamp as parameter cdc: generation: make `topology_description_generator::get_sharding_info` a parameter sys_dist_ks: make `get_cdc_generation_mutations` public sys_dist_ks: move find_schema outside `get_cdc_generation_mutations` sys_dist_ks: move mutation size threshold calculation outside `get_cdc_generation_mutations` service/raft: group0_state_machine: signal topology state machine in `load_snapshot`	2023-04-21 18:11:27 +02:00
Pavel Emelyanov	2aabaada9e	system_keyspace: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-21 17:32:57 +03:00
Pavel Emelyanov	6290849f11	system_keyspace: Coroutinize get_compaction_history() In order not to copy the rvalue consumer arg -- instantly convert it into value. No other tricks. Indentation is deliberately left broken. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-21 17:32:02 +03:00
Kamil Braun	f9d8118c8d	db: system_keyspace: use microsecond resolution for group0_history range tombstone in `make_group0_history_state_id_mutation`, when adding a new entry to the group 0 history table, if the parameter `gc_older_than` is engaged, we create a range tombstone in the mutation which deletes entries older than the new one by `gc_older_than`. In particular if `gc_older_than = 0`, we want to delete all older entries. There was a subtle bug there: we were using millisecond resolution when generating the tombstone, while the provided state IDs used microsecond resolution. On a super fast machine it could happen that we managed to perform two schema changes in a single millisecond; this happened sometimes in `group0_test.test_group0_history_clearing_old_entries` on our new CI/promotion machines, causing the test to fail because the tombstone didn't clear the entry corresponding to the previous schema change when performing the next schema change (since they happened in the same millisecond). Use microsecond resolution to fix that. The consecutive state IDs used in group 0 mutations are guaranteed to be strictly monotonic at microsecond resolution (see `generate_group0_state_id` in service/raft/raft_group0_client.cc). Fixes #13594	2023-04-21 10:33:05 +02:00
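The resolution mismatch fixed in f9d8118c8d can be reproduced with a simplified model. The Python sketch below is not the ScyllaDB implementation. It assumes the tombstone bound is obtained by truncating the cutoff to the chosen resolution, and the helper name `surviving_entries` is hypothetical.

```python
from typing import List


def surviving_entries(entry_ts_us: List[int], new_ts_us: int,
                      gc_older_than_us: int, resolution_us: int) -> List[int]:
    """Return the history entries that a range tombstone leaves alive.

    All timestamps are in microseconds. The tombstone deletes entries
    strictly older than the cutoff, and the cutoff is truncated down to
    `resolution_us` (1000 models milliseconds, 1 models microseconds).
    """
    cutoff = ((new_ts_us - gc_older_than_us) // resolution_us) * resolution_us
    return [ts for ts in entry_ts_us if ts >= cutoff]


# Two schema changes within the same millisecond.
history = [1_000_200, 1_000_700]

# Millisecond resolution: the older entry is not cleared (the bug).
print(surviving_entries(history, 1_000_700, 0, 1000))
# Microsecond resolution: only the newest entry remains (the fix).
print(surviving_entries(history, 1_000_700, 0, 1))
```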
Kamil Braun	3d96bc5dba	db: system_keyspace: introduce `query_mutations` with range/slice There is a `query_mutations` function which loads the entire contents of a given table into memory. There was no function for e.g. loading just a single partition in the form of mutations. Introduce one.	2023-04-20 16:36:41 +02:00
Kamil Braun	5f2b297f99	raft topology: publish new CDC generation to the user description tables Once a new CDC generation is committed to the cluster by the topology coordinator, we also need to publish it to the user-facing description tables so CDC applications know which streams to read from. This uses regular distributed table writes underneath (tables living in the `system_distributed` keyspace) so it requires `token_metadata` to be nonempty. We need a hack for the case of bootstrapping the first node in the cluster - turning the tokens into normal tokens earlier in the procedure in `token_metadata`, but this is fine for the single-node case since no streaming is happening.	2023-04-20 16:36:41 +02:00
Kamil Braun	58baf998c1	raft topology: commit a new CDC generation on node bootstrap After inserting new CDC generation data (see previous commit), we need to pick a timestamp for this generation and commit it, telling all nodes in the cluster to start using the generation for CDC log writes once their clocks cross that timestamp. We introduce a separate step to the bootstrap saga, before `write_both_read_old`, called `commit_cdc_generation`. In this step, the coordinator takes the `new_cdc_generation_data_uuid` stored in a bootstrapping node's `ring_slice` - which serves as the key to the table where the CDC generation data is stored - and combines it with a timestamp which it generates a bit into the future (as in old gossiper-based code, we use 2 * ring_delay, by default 1 minute). This gives us a CDC generation ID which we commit into the topology state as the `current_cdc_generation_id` while switching the saga to the next step, `write_both_read_old`. `system_keyspace::load_topology_state` is extended to load `current_cdc_generation_id`. For now, nodes don't react to `current_cdc_generation_id`. In later commit we'll extend `storage_service::topology_state_load` to start using the current CDC generation for CDC log table writes. The solution with specifying a timestamp into the future is the same as it is for gossip-based topology changes and it has the same consistency problem - if some node is temporarily partitioned away from the quorum, it might not learn about the new CDC generation before its clock crosses the generation's timestamp, causing it to temporarily send writes to the wrong CDC streams (until it learns about the new timestamp). I left a FIXME which describes an alternative solution which wasn't viable for gossiper-based topology changes, but it is viable when we have a fault-tolerant topology coordinator.	2023-04-20 16:36:41 +02:00
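The `commit_cdc_generation` step in 58baf998c1 pairs the stored data UUID with a timestamp slightly in the future. A minimal Python sketch of that calculation follows. The names `CdcGenerationId` and `make_generation_id` are hypothetical, and the 30-second `ring_delay` default is inferred from the "2 * ring_delay, by default 1 minute" remark in the message.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CdcGenerationId:
    data_uuid: uuid.UUID   # key of the generation data in the V3 table
    timestamp: datetime    # moment the generation starts operating


def make_generation_id(new_data_uuid: uuid.UUID, now: datetime,
                       ring_delay: timedelta = timedelta(seconds=30)
                       ) -> CdcGenerationId:
    """Combine the data UUID with a timestamp 2 * ring_delay in the future."""
    return CdcGenerationId(new_data_uuid, now + 2 * ring_delay)


now = datetime(2023, 4, 20, 12, 0, 0, tzinfo=timezone.utc)
gen_id = make_generation_id(uuid.uuid4(), now)
print(gen_id.timestamp.isoformat())
```

A node that learns about the generation only after its clock passes `gen_id.timestamp` would hit the consistency problem that the commit message describes.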
Kamil Braun	5942237a79	raft topology: create new CDC generation data during node bootstrap Calculate a new CDC generation using the bootstrapping node's tokens, translate it to mutation format, and insert this mutation to the CDC_GENERATIONS_V3 table through group 0 at the same time we assign tokens to the node in Raft topology. The partition key for this data is stored in the bootstrapping node's `ring_slice`. The data is inserted, but it's not used for anything yet, we'll do it in later commits. Two FIXMEs are left for follow-ups: - in `get_sharding_info` we shouldn't have to use the token owner's IP, but get the host ID directly from token metadata (#12279), - splitting the CDC generation data write into multiple commands. The comment elaborates.	2023-04-20 16:35:37 +02:00
Kamil Braun	22094f1509	db: system_keyspace: small refactor of `load_topology_state` The variables necessary for constructing a `ring_slice` are now living in a local block of code. This makes it easier to see which data is part of the `ring_slice` and will make it easier to add more data to `ring_slice` in following commits. Also add some more sanity checking.	2023-04-20 15:40:23 +02:00
Kamil Braun	2233d8f54d	db: system_keyspace: add storage for CDC generations managed by group 0 The `CDC_GENERATIONS_V3` table schema is a copy-paste of the `CDC_GENERATIONS_V2` schema. The difference is that V2 lives in `system_distributed_keyspace` and writes to it are distributed using regular `storage_proxy` replication mechanisms based on the token ring. The V3 table lives in `system_keyspace` and any mutations written to it will go through group 0. Also extend the `TOPOLOGY` schema with new columns: - `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping node's `ring_slice`, it stores UUID of a newly introduced CDC generation which is used as partition key for the `CDC_GENERATIONS_V3` table to access this new generation's data. It's a regular column, meaning that every row (corresponding to a node) will have its own. - `current_cdc_generation_uuid` and `current_cdc_generation_timestamp` together form the ID of the newest CDC generation in the cluster. (the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is when the CDC generation starts operating). Those are static columns since there's a single newest CDC generation.	2023-04-20 15:38:58 +02:00
Pavel Emelyanov	9628d07adb	Put storage_service.hh on a diet By removing unneeded headers inclusions. At the cost of few more forward declarations and a couple of extra includes in other .cc files. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13552	2023-04-18 14:53:17 +03:00
Kamil Braun	f9051dccaa	raft topology: store `shard_count` and `ignore_msb` in topology Add new columns to the `system.topology` table: `shard_count` and `ignore_msb`. When a node bootstraps or restarts and observes that the values stored in `topology` are different than the local values, it updates them. This is done in the `update_topology_with_local_metadata` function (the 'metadata' here being the two values). Additional flag persisted in `system.scylla_local` is used to safely avoid performing read barriers when the values didn't change on node restart. A comment in `update_topology_with_local_metadata` explains why this flag is needed. An example use case where `shard_count` and `ignore_msb` are needed is creating CDC generations. Fixes: #13508	2023-04-17 10:45:30 +02:00
Pavel Emelyanov	08e9046d07	system_keyspace: Add ownership table The schema is CREATE TABLE system.sstables ( location text, generation bigint, format text, status text, uuid uuid, version text, PRIMARY KEY (location, generation) ) A sample entry looks like: location \| generation \| format \| status \| uuid \| version ---------------------------------------------------------------------+------------+--------+--------+--------------------------------------+--------- /data/object_storage_ks/test_table-d096a1e0ad3811ed85b539b6b0998182 \| 2 \| big \| sealed \| d0a743b0-ad38-11ed-85b5-39b6b0998182 \| me The uuid field points to the "folder" on the storage where the sstable components are. Like this: s3 `- test_bucket `- f7548f00-a64d-11ed-865a-0c1fbc116bb3 `- Data.db - Index.db - Filter.db - ... It's not very nice that the whole /var/lib/... path is in fact used as location, it needs the PR #12707 to fix this place. Also, the "status" part is not yet fully functional, it only supports three options: - creating -- the same as TemporaryTOC file exists on disk - sealed -- default state - deleting -- the analogy for the deletion log on disk The latter needs support from the distributed_loader, which is not yet there. In fact, distributed_loader also needs to be patched to actually select entries from this table on load. Also it needs the mentioned PR #12707 to support staging and quarantine sstables. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-10 16:44:28 +03:00
Pavel Emelyanov	18333b4225	system_keyspace.hh: Remove unneeded headers Now this header can replace lots of used types with plain forward declarations Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-06 12:37:00 +03:00
Pavel Emelyanov	1af373cf0a	system_keyspace: Move topology_mutation_builder to storage_service The latter is the only user of the class. This keeps system keyspace code free from unrelated logic and from raft::server_id type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-06 12:36:02 +03:00
Pavel Emelyanov	45de375126	system_keyspace: Move group0_upgrade_state conversions to group0 code In order to keep system keyspace free from group0 logic and from the service::group0_upgrade_state type Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-06 12:35:07 +03:00
Kamil Braun	cd282cf0ab	Merge 'Raft, use schema commit log' from Gusev Petr We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog. Fixes: #12642 Closes #13134 * github.com:scylladb/scylladb: test_raft_upgrade: add a test for schema commit log feature scylla_cluster.py: add start flag to server_add ServerInfo: drop host_id scylla_cluster.py: add config to server_add scylla_cluster.py: add expected_error to server_start scylla_cluster.py: ScyllaServer.start, refactor error reporting scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed raft: check if schema commitlog is initialized Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set. For the upgrade procedure the user should wait until the schema commitlog feature is enabled before enabling consistent_cluster_management. raft: move raft initialization after init_system_keyspace database: rename before_schema_keyspace_init->maybe_init_schema_commitlog raft: use schema commitlog for raft tables init_system_keyspace: refactoring towards explicit load phases	2023-03-27 13:27:30 +02:00
Tomasz Grabiec	c54a3d9c10	Merge 'Clean enabled features manipulations in system keyspace' from Pavel Emelyanov There was an attempt to cut feature-service -> system-keyspace dependency (#13172) which turned out to require more changes. Here's a preparation squeezing from this future work. This set - leaves only batch-enabling API in feature service - keeps the need for async context in feature service - narrows down system keyspace features API to only load and store records - relaxes features updating logic in sys.ks. - cosmetic Closes #13264 * github.com:scylladb/scylladb: feature_service: Indentation fix after previous patch feature_service: Move async context into enable() system_keyspace: Refactor local features load/save helpers feature_service: Mark supported_feature_set() const feature_service: Remove single feature enabling method boot: Enable features in batch gossiper: Enable features in batch	2023-03-24 13:12:49 +01:00
Petr Gusev	769732d095	database: rename before_schema_keyspace_init->maybe_init_schema_commitlog We are going to move the raft tables from the first load phase to the second. This means the second init_system_keyspace call will load raft tables along with the schema, making the name of this function imprecise.	2023-03-24 15:54:52 +04:00
Petr Gusev	273e70e1f9	raft: use schema commitlog for raft tables Fixes: #12642	2023-03-24 15:54:52 +04:00
Petr Gusev	5a5d664a5a	init_system_keyspace: refactoring towards explicit load phases We aim (#12642) to use the schema commit log for raft tables. Now they are loaded at the first call to init_system_keyspace in main.cc, but the schema commitlog is only initialized shortly before the second call. This is important, since the schema commitlog initialization (database::before_schema_keyspace_init) needs to access schema commitlog feature, which is loaded from system.scylla_local and therefore is only available after the first init_system_keyspace call. So the idea is to defer the loading of the raft tables until the second call to init_system_keyspace, just as it works for schema tables. For this we need a tool to mark which tables should be loaded in the first or second phase. To do this, in this patch we introduce system_table_load_phase enum. It's set in the schema_static_props for schema tables. It replaces the system_keyspace::table_selector in the signature of init_system_keyspace. The call site for populate_keyspace in init_system_keyspace was changed, table_selector.contains_keyspace was replaced with db.local().has_keyspace. This check prevents calling populate_keyspace(system_schema) on phase1, but allows for populate_keyspace(system) on phase2 (to init raft tables). On this second call some tables from system keyspace (e.g. system.local) may have already been populated on phase1. This check protects from double-populating them, since every populated cf is marked as ready_for_writes.	2023-03-24 15:54:46 +04:00
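
The two-phase loading described in 5a5d664a5a can be sketched roughly as follows. This is a minimal illustration, not the Scylla implementation: `table_entry` and `populate` are hypothetical names, and only `system_table_load_phase` and the `ready_for_writes` marker are taken from the commit message. It shows why a second pass over the same keyspace does not double-populate tables that were loaded in phase 1.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins modelled on the commit message, not the actual Scylla types.
enum class system_table_load_phase { phase1, phase2 };

struct table_entry {
    std::string name;
    system_table_load_phase phase;   // when this table is due to be loaded
    bool ready_for_writes = false;   // set once the table has been populated
};

// Populate the tables due in `phase` and return the names loaded by this call.
// Phase 2 walks every table again, so already-populated tables must be skipped.
std::vector<std::string> populate(std::vector<table_entry>& tables,
                                  system_table_load_phase phase) {
    std::vector<std::string> loaded;
    for (auto& t : tables) {
        const bool due = phase == system_table_load_phase::phase2
                      || t.phase == system_table_load_phase::phase1;
        if (!due || t.ready_for_writes) {
            continue;  // not due yet, or populated by an earlier call
        }
        t.ready_for_writes = true;
        loaded.push_back(t.name);
    }
    return loaded;
}
```

In this sketch a phase 2 table such as a raft table is skipped by the first call and picked up by the second, while a phase 1 table such as `system.local` is loaded once and then ignored.
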
Gleb Natapov	5e232ebee5	system_keyspace: add a table to persist topology change state machine's state Add local table to store topology change state machine's state there. Also add a function that loads the state to memory.	2023-03-21 16:06:43 +02:00
Pavel Emelyanov	8600cb2db0	feature_service: Move async context into enable() Callers don't need to know that enabling features has this requirement Indentation is deliberately left broken (until next patch) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-21 11:59:34 +03:00
Pavel Emelyanov	ae6e29a919	system_keyspace: Refactor local features load/save helpers Introduce load_local_enabled_features() and save_local_enabled_features() that get and put std::set<sstring> with feature names (and perform set to string and back conversions on their own). They look natural next to existing sys.ks. methods to get/set local-supported features and peer features. Using the new API, the more generic functions to preserve individual features and load them on startup can become much shorter and cleaner. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-21 11:54:02 +03:00
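
The "set to string and back" conversion mentioned in ae6e29a919 might look something like the sketch below. The function names and the comma separator are assumptions made for illustration, and `std::string` stands in for `sstring`; the commit message does not state the stored format.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: serialize feature names into one string.
// The comma separator is an assumption, not taken from the commit.
std::string features_to_string(const std::set<std::string>& features) {
    std::string out;
    for (const auto& f : features) {
        if (!out.empty()) {
            out += ',';
        }
        out += f;
    }
    return out;
}

// Hypothetical helper: parse the string back into a set, ignoring empty items.
std::set<std::string> features_from_string(const std::string& s) {
    std::set<std::string> out;
    std::size_t start = 0;
    while (start <= s.size()) {
        std::size_t end = s.find(',', start);
        if (end == std::string::npos) {
            end = s.size();
        }
        if (end > start) {
            out.insert(s.substr(start, end - start));
        }
        start = end + 1;
    }
    return out;
}
```

Keeping both conversions next to each other means callers only ever handle a set of names, which is the simplification the commit message describes.
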
Pavel Emelyanov	b27d2c9399	boot: Enable features in batch On boot main calls enable_features_on_startup() which at the end scans through the list of features and enables them. Same as in previous patch -- it makes sense to use batch enabling here. Note that although the loop that collects features is not as trivial as in the previous patch (gossiper case), it still operates with local copies of feature sets, so delaying the feature's enabling doesn't affect other features' need to be enabled too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-21 11:12:25 +03:00
Kefu Chai	c37f4e5252	treewide: use fmt::join() when appropriate now that fmtlib provides fmt::join(). see https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view there is no need to reinvent the wheel. so in this change, the homebrew join() is replaced with fmt::join(). as fmt::join() returns a join_view, this could improve the performance under certain circumstances where the fully materialized string is not needed. please note, the goal of this change is to use fmt::join(), and this change does not intend to improve the performance of existing implementation based on "operator<<" unless the new implementation is much more complicated. we will address the unnecessarily materialized strings in a follow-up commit. some noteworthy things related to this change: * unlike the existing `join()`, `fmt::join()` returns a view. so we have to materialize the view if what we expect is a `sstring` * `fmt::format()` does not accept a view, so we cannot pass the return value of `fmt::join()` to `fmt::format()` * fmtlib does not format a typed pointer, i.e., it does not format, for instance, a `const std::string*`. but operator<<() always prints a typed pointer. so if we want to format a typed pointer, we either need to cast the pointer to `void*` or use `fmt::ptr()`. * fmtlib is not able to pick up the overload of `operator<<(std::ostream& os, const column_definition* cd)`, so we have to use a wrapper class of `maybe_column_definition` for printing a pointer to `column_definition`. since the overload is only used by the two overloads of `statement_restrictions::add_single_column_parition_key_restriction()`, the operator<< for `const column_definition*` is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-16 20:34:18 +08:00
Petr Gusev	3ef201d67a	schema.hh: use schema_static_props for wait_for_sync_to_commitlog This patch continues the refactoring, now we move wait_for_sync_to_commitlog property from schema_builder to schema_static_props. The patch replaces schema_builder::set_wait_for_sync_to_commitlog and is_extra_durable with two register_static_configurator, one in system_keyspace and another in system_distributed_keyspace. They correspond to the two parts of the original disjunction in schema_tables::is_extra_durable.	2023-03-14 19:26:05 +04:00
Petr Gusev	349bc1a9b6	schema.hh: introduce schema_static_props, use it for null_sharder Our goal (#12642) is to mark raft tables to use schema commitlog. There are two similar cases in code right now - with_null_sharder and set_wait_for_sync_to_commitlog schema_builder methods. The problem is that if we need to mark some new schema with one of these methods we need to do this twice - first in a method describing the schema (e.g. system_keyspace::raft()) and second in the function create_table_from_mutations, which is not obvious and easy to forget. create_table_from_mutations is called when schema object is reconstructed from mutations, with_null_sharder and set_wait_for_sync_to_commitlog must be called from it since the schema properties they describe are not included in the mutation representation of the schema. This patch proposes to distinguish between the schema properties that get into mutations and those that do not. The former are described with schema_builder, while for the latter we introduce schema_static_props struct and the schema_builder::register_static_configurator method. This way we can formulate a rule once in the code about which schemas should have a null sharder, and it will be enforced in all cases.	2023-03-14 18:29:34 +04:00
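
The idea behind `schema_static_props` and `register_static_configurator` (349bc1a9b6, extended in 3ef201d67a) can be illustrated with the sketch below. It is a simplified model under stated assumptions: the registry class, the `props_for` function, and the two example rules are hypothetical, and the real signatures in `schema_builder` may differ. The point it shows is that a rule is written once and applied every time a schema is built, whether from a hard-coded description or from mutations.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Properties that are not part of the mutation representation of a schema.
struct schema_static_props {
    bool use_null_sharder = false;
    bool wait_for_sync_to_commitlog = false;
};

// A configurator inspects the keyspace/table name and adjusts the properties.
using static_configurator =
    std::function<void(const std::string& ks, const std::string& table, schema_static_props&)>;

// Hypothetical registry; the commit attaches this role to schema_builder.
class static_props_registry {
    std::vector<static_configurator> _configurators;
public:
    void register_static_configurator(static_configurator c) {
        _configurators.push_back(std::move(c));
    }
    // Called on every schema construction path, so the rules cannot be forgotten.
    schema_static_props props_for(const std::string& ks, const std::string& table) const {
        schema_static_props props;
        for (const auto& c : _configurators) {
            c(ks, table, props);
        }
        return props;
    }
};

// Example rules, chosen for illustration only.
static_props_registry make_example_registry() {
    static_props_registry r;
    r.register_static_configurator(
        [](const std::string& ks, const std::string&, schema_static_props& p) {
            if (ks == "system_schema") {
                p.use_null_sharder = true;
            }
        });
    r.register_static_configurator(
        [](const std::string& ks, const std::string& table, schema_static_props& p) {
            if (ks == "system" && table == "raft") {
                p.wait_for_sync_to_commitlog = true;
            }
        });
    return r;
}
```

Because both construction paths would call the same lookup, marking a new table requires touching one rule rather than two separate call sites.
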
Botond Dénes	d1619eb38a	Merge 'Remove qctx from helpers that retrieve truncation record' from Pavel Emelyanov There are two places that do it -- commitlog and batchlog replayers. Both can have local system-keyspace reference and use system-keyspace local query-processor for it. The peering save_truncation_record() is not that simple and is not patched by this PR Closes #13087 * github.com:scylladb/scylladb: system_keyspace: Unstatic get_truncation_record() system_keyspace: Unstatic get_truncated_at() batchlog_manager: Add system_keyspace dependency main: Swap batchlog manager and system keyspace starts system_keyspace: Unstatic get_truncated_position() system_keyspace: Remove unused method commitlog: Create commitlog_replayer with system keyspace test: Make cql_test_env::get_system_keyspace() return sharded commiltlog: Line-up field definitions	2023-03-07 10:19:55 +02:00
Kefu Chai	020483aa59	Update seastar submodule and main this change also includes change to main, to make this commit compile. see below: * seastar 9b6e181e42...9cbc1fe889 (46): > Merge 'Make io-tester jobs share sched classes' from Pavel Emelyanov > io_tester.md: Update the `rps` configuration option description > io_tester: Add option to limit total number of requests sent > Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov > io_tester: Add option to share classes between jobs > rpc: Abort connection if send_entry() fails > Merge 'build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS' from Kefu Chai > build: cooking.sh: use the same BUILD_SHARED_LIBS when building ingredients > build: cooking.sh: use the same generator when building ingredients > core/memory: handle `strerror_r` returning static string > Merge 'build, rpc: lz4 related cleanups' from Kefu Chai > build, rpc: do not support lz4 < 1.7.3 > build: set the correct version when finding lz4 > build: include CheckSymbolExists > rpc: do not include lz4.h in header > build: set CMP0135 for Cooking.cmake > docs: drop building-.md > Merge 'seastar-addr2line: cleanups' from Kefu Chai > seastar-addr2line: refactor tests using unittest > seastar-addr2line: extract do_test() and main() > seastar-addr2line: do not import unused modules > scheduling: add a `rename` callback to scheduling_group_key_config > reactor: syscall thread: wakeup up reactor with finer granularity > build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS > build: extract dpdk_extra_cflags out > core/sstring: remove a temporary variable > Merge 'treewide: include what we use, and add a checkheaders target' from Kefu Chai > perftune.py: auto-select the same number of IRQ cores on each NUMA > prometheus: remove unused headers > core/sstring: define <=> operator for sstring > Merge 'core: s/reserve_additional_memory/reserve_additional_memory_per_shard/' from Kefu Chai > include: do not include <concepts> directly 
> coding_style: note on self-contained header requirement > circileci: build checkheaders in addition to default target > build: add checkheaders target > net/toeplitz: s/u_int/unsigned/ > net/tcp-stack: add forward declaration for seastar::socket > core, net, util: include used headers main: set reserved memory for wasm on per-shard basis this change is a follow-up of `f05d612da8` and `4a0134a097`. this change depends on the related change in Seastar to reserve additional memory on a per-shard basis. per Wojciech Mitros's comment: > it should have probably been 50MB per shard in other words, as we always execute the same set of udf on all shards. and since one cannot predict the number of shards, but she could have a rough estimation on the size of memory a regular (set of) udf could use. so a per-shard setting makes more sense. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-06 18:41:34 +02:00
Pavel Emelyanov	1be9b0df50	system_keyspace: Unstatic get_truncation_record() Now that both callers of this method are non-static, it can be made non-static too. While at it make two more changes: 1. move the thing to private 2. remove explicit cql3::query_processor::cache_internal::yes argument, the system_keyspace::execute_cql() applies it on its own Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-06 13:28:40 +03:00
Pavel Emelyanov	2501ba3887	system_keyspace: Remove unused method The get_truncated_position() overload that filters records by shard is nowadays unused. Drop one Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-06 13:28:40 +03:00
Kefu Chai	df63e2ba27	types: move types.{cc,hh} into types they are part of the CQL type system, and are "closer" to types. let's move them into "types" directory. the build systems are updated accordingly. the source files referencing `types.hh` were updated using the following command: ``` find . -name "*.{cc,hh}" -exec sed -i 's/\"types.hh\"/\"types\/types.hh\"/' {} + ``` the source files under sstables include "types.hh", which is indeed the one located under "sstables", so include "sstables/types.hh" instead, so it's more explicit. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12926	2023-02-19 21:05:45 +02:00


535 Commits