Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor can still perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes #11202
These two are just getting in the way when touching inter-component
dependencies around the messaging service. Without them, messaging-service
start/stop looks like that of any other service out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11535
Task manager for observing and managing long-running, asynchronous tasks in Scylla,
with a user-facing interface. It will allow listing tasks, getting detailed
task status and progress, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.
At first it will support repair and compaction tasks, and possibly more in the future.
Currently:
Sharded `task_manager` is started in `main.cc` and then passed
to `http_context` for the purposes of the user interface.
Task manager's tasks are implemented in two layers: the abstract one
and the implementation one. The latter is a pure virtual class which needs
to be overridden by each module. The abstract layer provides the methods that
are shared by all modules and access to module-specific methods.
Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.
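The two-layer split can be sketched roughly as follows. This is only an illustration in Python (the real implementation is C++), and all names here are hypothetical:

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Abstract layer: state and methods shared by all modules."""
    def __init__(self, module, task_id):
        self.module = module
        self.id = task_id
        self.state = "running"
        module.register(self)

    def status(self):
        # Shared by all modules; `specifics` comes from the implementation layer.
        return {"id": self.id, "module": self.module.name,
                "state": self.state, "details": self.specifics()}

    def abort(self):
        self.state = "aborted"

    @abstractmethod
    def specifics(self):
        """Implementation layer: overridden by each module (repair, compaction, ...)."""

class Module:
    """Modules create and manage their tasks through a module object,
    keeping module-specific data separate from other modules."""
    def __init__(self, name):
        self.name = name
        self.tasks = {}

    def register(self, task):
        self.tasks[task.id] = task

class RepairTask(Task):
    """Example implementation-layer class for a hypothetical repair module."""
    def __init__(self, module, task_id, keyspace):
        super().__init__(module, task_id)
        self.keyspace = keyspace

    def specifics(self):
        return {"keyspace": self.keyspace}
```

The abstract base owns registration and status reporting, while each module only supplies its own `specifics`.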
Users can access the task manager REST API to track asynchronous tasks.
The available operations are:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish
To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.
Fixes: #9809
Closes #11216
* github.com:scylladb/scylladb:
test: task manager api test
task_manager: test api layer implementation
task_manager: add test specific classes
task_manager: test api layer
task_manager: api layer implementation
task_manager: api layer
task_manager: keep task_manager reference in http_context
start sharded task manager
task_manager: create task manager object
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.
The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.
In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.
In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing
Closes #11164
* github.com:scylladb/scylladb:
raft: broadcast_tables: add broadcast_kv_store test
raft: broadcast_tables: add returning query result
raft: broadcast_tables: add execution of intermediate language
raft: broadcast_tables: add compilation of cql to intermediate language
raft: broadcast_tables: add definition of intermediate language
db: system_keyspace: add broadcast_kv_store table
db: config: add BROADCAST_TABLES feature flag
The implementation of a test API that helps test the task manager
API. It provides methods to simulate the operations that can happen
on modules and their tasks. Through the API the user can register
and unregister the test module and the tasks belonging to it,
and finish the tasks with success or a custom error.
The implementation of a task manager api layer. It provides
methods to list the modules registered in task_manager, list
tasks belonging to the given module, abort, wait for, or retrieve
the status of the given task.
We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement`
and `strongly_consistent_select_statement`. Statements operating on
system.broadcast_kv_store will be compiled to these new subclasses if
BROADCAST_TABLES flag is enabled.
If the query is executed on a shard other than 0, it's bounced to shard 0.
There's one corner case in nodes sorting by snitch. The simple snitch
code overloads the call and doesn't sort anything. The same behavior
should be preserved by (future) topology implementation, but it doesn't
know the snitch name. To address that the patch adds a boolean switch on
topology that's turned off by the main code when it sees the snitch is
the "simple" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.
However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.
In this change, the `storage_service_config` class is removed, and all
places referencing it are updated accordingly.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes #11415
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).
The listener starts the `upgrade_to_group0` procedure.
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
`system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
for schema operations (only those for now).
The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.
`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.
If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available so they can be contacted to correctly establish a
single group 0 Raft cluster), or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability, the cluster admin
is not left without help.
We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.
To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
  (the keys are `raft_group0_id` and `raft_server_id`).
Then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.
Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consists of schema tables, and the upgrade
procedure makes sure to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will be established correctly.
(TODO: create a tracking issue? something needs to remind us of this
whenever we extend group 0 with new stuff...)
Now, whether a 'group 0 operation' (today this means a schema change) is
performed using the old or the new method doesn't depend on the local RAFT
feature being enabled, but on the state of the upgrade procedure.
In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.
The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.
To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.
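The locking scheme described above can be modeled with a small sketch. This is Python for illustration only (the real code uses Seastar's rwlock in C++), and the names are hypothetical:

```python
import threading

class RWLock:
    """Minimal readers-writer lock for the sketch."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

class Group0Client:
    def __init__(self):
        self._upgrade_lock = RWLock()
        self._state = "use_pre_raft_procedures"

    def get_group0_upgrade_state(self):
        # Take the lock in read mode; keep holding it only if the old
        # procedures are in use, so the upgrade procedure (which takes
        # the lock in write mode to change state) must wait for every
        # in-flight old-style operation to finish.
        self._upgrade_lock.acquire_read()
        if self._state == "use_pre_raft_procedures":
            return self._state, self._upgrade_lock  # holder released with the guard
        self._upgrade_lock.release_read()
        return self._state, None

    def switch_to_synchronize(self):
        self._upgrade_lock.acquire_write()  # waits until no old-style op holds the lock
        self._state = "synchronize"
        self._upgrade_lock.release_write()
```

Because it is a readers-writer lock, many concurrent operations can hold it in `use_pre_raft_procedures`, while the state switch gets exclusive access.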
We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).
Some additional comments were written.
Refs #11237
Don't store segments found on the init scan in all shard instances;
instead retrieve them (based on low time-pos for the current gen) when
required. This changes very little, but at least we don't store
pointless string lists in shards 1 to X, even though we can potentially
ask for the list twice. More to the point, it goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered a candidate for recycling and included in the footprint.
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.
The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.
The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.
This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.
Closes #11088
* github.com:scylladb/scylladb:
service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
service: storage_proxy: `remote`: make `_gossiper` a const reference
gms: gossiper: mark some member functions const
db: consistency_level: `filter_for_query`: take `const gossiper&`
replica: table: `get_hit_rate`: take `const gossiper&`
gms: gossiper: move `endpoint_filter` to `storage_proxy` module
service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
service: storage_proxy: establish private section in `remote`
service: storage_proxy: remove `migration_manager` pointer
service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
service: storage_proxy: remove `_gossiper` field
alternator: ttl: pass `gossiper&` to `expiration_service`
service: storage_proxy: move `truncate_blocking` implementation to `remote`
service: storage_proxy: introduce `is_alive` helper
service: storage_proxy: remove `_messaging` reference
service: storage_proxy: move `connection_dropped` to `remote`
service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
service: storage_proxy: move `handle_write` to `remote`
service: storage_proxy: move `handle_paxos_prune` to `remote`
service: storage_proxy: move `handle_paxos_accept` to `remote`
service: storage_proxy: move `handle_paxos_prepare` to `remote`
service: storage_proxy: move `handle_truncate` to `remote`
service: storage_proxy: move `handle_read_digest` to `remote`
service: storage_proxy: move `handle_read_mutation_data` to `remote`
service: storage_proxy: move `handle_read_data` to `remote`
service: storage_proxy: move `handle_mutation_failed` to `remote`
service: storage_proxy: move `handle_mutation_done` to `remote`
service: storage_proxy: move `handle_paxos_learn` to `remote`
service: storage_proxy: move `receive_mutation_handler` to `remote`
service: storage_proxy: move `handle_counter_mutation` to `remote`
service: storage_proxy: remove `get_local_shared_storage_proxy`
service: storage_proxy: (de)register RPC handlers in `remote`
service: storage_proxy: introduce `remote`
`migration_manager` lifetime is longer than the lifetime of "storage
proxy's messaging service part" - that is, `init_messaging_service` is
called after `migration_manager` is started, and `uninit_messaging_service`
is called before `migration_manager` is stopped. Thus we don't need to
hold an owning pointer to `migration_manager` here.
Later, when `init_messaging_service` actually constructs `remote`,
this will be a reference, not a pointer.
Also observe that `_mm` in `remote` is only used in handlers, and
handlers are unregistered before `_mm` is nullified, which ensures that
handlers are not running when `_mm` is nullified. (This argument shows
why the code made sense regardless of our switch from shared_ptr to raw
ptr).
Even in an environment where Scylla fails to initialize,
"scylla --version" should be able to run without error.
To do so, we need to parse and execute these options before
initializing Scylla/Seastar classes.
Fixes #11117
Closes #11179
Start compaction_manager as a sharded service
and pass a reference to it to the database rather
than having the database construct its own compaction_manager.
This is part of the wider scope effort to decouple compaction from replica database and table.
Closes #11099
* github.com:scylladb/scylladb:
compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges
compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges
main: start compaction_manager as a sharded service
compaction_manager: keep config as member
backlog_controller: keep scheduling_group by value
backlog_controller: scheduling_group: keep io_priority_class by value
backlog_controller: scheduling_group: define default member initializers
backlog_controller: get rid of _interval member
Calling WebAssembly UDFs requires a wasmtime instance. Creating such an instance is expensive,
but instances can be reused for subsequent calls of the same UDF on various inputs.
This patch introduces a way of reusing wasmtime instances: a wasm instance cache.
The cache stores a wasmtime instance for each UDF and scheduling group. The instances are
evicted using an LRU strategy, and their size is based on the size of their wasm memories.
The instances stored in the cache are also dropped when the UDF itself is dropped. For that reason,
the first patch modifies the current implementation of UDF dropping, so that the instance dropping can be added
later. The patch also removes the need to compile the UDF again when dropping it.
The second patch contains the implementation and use of the new cache. The cache is implemented
in `lang/wasm_instance_cache.hh`, and the main entry points for using it are the `run_script` methods from `wasm.hh`.
The third patch adds tests to `test_wasm.py` that check the correctness and performance of the new
cache. The tests confirm the instance reuse, size limits, instance eviction after timeout and after dropping the UDF.
Closes #10306
* github.com:scylladb/scylladb:
wasm: test instances reuse
wasm: reuse UDF instances
schema_tables: simplify merge_functions and avoid extra compilation
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Tags are a useful mechanism that could be used outside of alternator
namespace. My motivation to move tags_extension and other utilities to
db/tags/ was that I wanted to use them to mark "synchronous mode" views.
I have extracted `get_tags_of_table`, `find_tag` and `update_tags`
method to db/tags/utils.cc and moved alternator/tags_extension.hh to
db/tags/.
The signature of `get_tags_of_table` was changed from `const
std::map<sstring, sstring>&` to `const std::map<sstring, sstring>*`.
The original behavior of this function was to throw an
`alternator::api_error` exception. This was undesirable, as it
introduced a dependency on the alternator module. I chose to change it
to return a potentially null value, and added a wrapper function to the
alternator module - `get_tags_of_table_or_throw` - to keep the previous
throwing behavior.
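The refactoring pattern - a core helper that returns a null value instead of throwing, plus a throwing wrapper on the alternator side - can be sketched like this. Python for illustration; the table representation here is a hypothetical stand-in:

```python
from typing import Optional

def get_tags_of_table(table) -> Optional[dict]:
    """Core helper (db/tags side): returns None instead of raising,
    so it carries no dependency on alternator's error types."""
    return table.get("tags")  # hypothetical schema representation

class ApiError(Exception):
    """Stand-in for alternator::api_error."""

def get_tags_of_table_or_throw(table) -> dict:
    """Alternator-side wrapper keeping the previous throwing behavior."""
    tags = get_tags_of_table(table)
    if tags is None:
        raise ApiError("table has no tags extension")
    return tags
```

Callers outside alternator use the null-returning form; alternator code keeps its old exception semantics via the wrapper.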
When executing a wasm UDF, most of the time is spent on
setting up the instance. To minimize its cost, we reuse
the instance using wasm::instance_cache.
This patch adds a wasm instance cache that stores
a wasmtime instance for each UDF and scheduling group.
The instances are evicted using an LRU strategy. The
cache may keep some entries for a UDF after evicting
its instance, but they are removed when the corresponding
UDF is dropped, which greatly limits their number.
The size of stored instances is estimated using the size
of their WASM memories. In order to be able to read the
size of memory, we require that the memory is exported
by the client.
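A toy model of such a cache - keyed by (UDF, scheduling group), sized by memory, LRU-evicted, with whole-UDF removal on drop - might look like this. All names are illustrative, not the actual Scylla API:

```python
from collections import OrderedDict

class InstanceCache:
    def __init__(self, max_size):
        self._max_size = max_size
        self._size = 0
        self._entries = OrderedDict()  # (udf, group) -> (instance, mem_size)

    def get(self, udf, group):
        key = (udf, group)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most-recently-used
        return self._entries[key][0]

    def put(self, udf, group, instance, mem_size):
        key = (udf, group)
        if key in self._entries:
            self._size -= self._entries.pop(key)[1]
        self._entries[key] = (instance, mem_size)
        self._size += mem_size
        while self._size > self._max_size and self._entries:
            _, (_, size) = self._entries.popitem(last=False)  # evict LRU entry
            self._size -= size

    def drop_udf(self, udf):
        # When a UDF is dropped, all of its cached entries go with it.
        for key in [k for k in self._entries if k[0] == udf]:
            self._size -= self._entries.pop(key)[1]
```

Sizing by reported wasm memory (rather than entry count) is what motivates the requirement that the client export its memory.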
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The stream_manager will bookkeep the streaming bandwidth option; to
subscribe to its changes it needs the config reference. It would be
better if it were stream_manager::config, but currently subscription to
db::config::<stuff> updates is not very shard-friendly, so we need to
carry the config reference itself around.
Similar trouble is there for compaction_manager. The option is passed
through its own config, but the config is created on each shard by
database code. Stream manager config would be created once by main code
on shard 0.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Recently we noticed a regression where with certain versions of the fmt
library,
SELECT value FROM system.config WHERE name = 'experimental_features'
returns string numbers, like "5", instead of feature names like "raft".
It turns out that the fmt library keeps changing its overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have two conflicting ways to print it:
1. We have an explicit operator<<.
2. We have an *implicit* conversion to the type held by T.
We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertible to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.
The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.
I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit conversion would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).
Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.
This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.
Fixes#11003.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11004
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
One complication with this change is that replay_position markers are
commitlog-domain specific and cannot cross domains. They are recorded
in various places which survive node restart: sstables are annotated
with the maximum replay position, and they are present inside
truncation records. The former annotation is used by "truncate"
operation to drop sstables. To prevent old replay positions from being
interpreted in the context in the new schema commitlog domain, the
change refuses to boot if there are truncation records, and also
prohibits truncation of schema tables.
The boot sequence needs to know whether the cluster feature associated
with this change was enabled on all nodes. Features are stored in
system.scylla_local. Because we need to read it before initializing
schema tables, the initialization of tables now has to be split into
two phases. The first phase initializes all system tables except
schema tables, and later we initialize schema tables, after reading
stored cluster features.
The commitlog domain is switched only when all nodes are upgraded, and
only after the node is restarted. This is so that we don't have to add
risky code to deal with hot-switching of the commitlog domain. Cold
switching is safer. This means that after upgrade there is a need for
yet another rolling restart round.
Fixes #8272
Fixes #8309
Fixes #1459
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.
They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.
Closes #10864
* github.com:scylladb/scylla:
service/raft: raft_group_registry: add assertions when fetching servers for groups
service/raft: raft_group_registry: remove `_raft_support_listener`
service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
service/raft: messaging: extract raft_addr/inet_addr conversion functions
service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
test/boost: memtable_test: perform schema operations on shard 0
test/boost: cdc_test: remove test_cdc_across_shards
message: rename `send_message_abortable` to `send_message_cancellable`
message: change parameter order in `send_message_oneway_timeout`
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass), having to restart
the service every time they make a change related to permissions or the
prepared_statements cache (e.g. adding a user and changing their permissions)
can become pretty annoying.
This patch series makes permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable, so that restarting the
service is no longer necessary in these cases.
It also adds an API for flushing the cache to make it easier for users who
don't want to modify their permissions_cache config.
branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable
CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/
dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache
* https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable:
loading_cache_test: Test loading_cache::reset and loading_cache::update_config
api: Add API for resetting authorization cache
authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
utils/loading_cache.hh: Add update_config method
utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
utils/loading_cache.hh: Add reset method
For cases where we have very high values set to permissions_cache validity and
update interval (E.g.: 1 day), whenever a change to permissions is made it's
necessary to update scylla config and decrease these values, since waiting for
all this time to pass wouldn't be viable.
This patch adds an API for resetting the authorization cache so that changing
the config won't be mandatory for these cases.
Usage:
$ curl -X POST http://localhost:10000/authorization_cache/reset
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass), having to restart
the service every time they make a change related to permissions or the
prepared_statements cache (e.g. adding a user) can become pretty annoying.
This patch makes permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable, so that restarting the
service is no longer necessary in these cases.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch makes authorized_prepared_statements_cache accept a config struct,
similarly to permissions_cache. This will make it easier to make this cache
live updateable on the next patch.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch renames the permissions_cache_config struct to loading_cache_config
and moves it to utils/loading_cache.hh. This will make it easier to handle
config updates to the authorization caches on the next patches
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
The API uses the http server to serve two directories: the api_ui_dir,
where the swagger-ui directory is found, and the api_doc_dir, where the
swagger definition files are found.
Internally, the API uses the httpd::directory_handler, which appends the
files it gets from the path to the base directory name.
A user can override the default configuration and set a directory name
that does not end with a slash. This will result in files not being
found.
This patch checks whether that trailing slash is missing and, if it is,
adds it to the API configuration.
Fixes #10700
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#10877
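The fix described above amounts to normalizing the user-supplied directory name before directory_handler appends file names to it. A minimal sketch (the function name is hypothetical; the actual patch lives in the API configuration code):

```cpp
#include <string>

// Hedged sketch: make sure a configured directory path ends with '/',
// so that appending a file name yields a valid path.
std::string ensure_trailing_slash(std::string dir) {
    if (!dir.empty() && dir.back() != '/') {
        dir += '/';
    }
    return dir;
}
```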
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.
This PR introduces a per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:
```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
'max_writes_per_second': 100,
'max_reads_per_second': 200
};
```
Reads and writes which are detected to go over that quota are rejected with a new RATE_LIMIT_ERROR CQL error code - the existing error codes didn't fit the rate limit error well, so a new one is added. This code is implemented as part of a CQL protocol extension and is returned only to clients that requested the extension; otherwise, the existing CONFIG_ERROR is used instead.
Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.
The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.
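The replica-side accounting can be illustrated with a simple fixed-window counter keyed by partition. This is a hedged sketch under assumed names (`per_partition_rate_limiter`, `try_account`), not Scylla's actual db/rate_limiter implementation:

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hedged sketch: a fixed-window counter per partition key hash.
// Operations beyond max_per_second within the current one-second
// window are rejected (and would surface as RATE_LIMIT_ERROR).
class per_partition_rate_limiter {
    struct bucket { uint64_t window = 0; uint32_t count = 0; };
    std::unordered_map<uint64_t, bucket> _buckets;
public:
    // Returns true if the operation is admitted into the quota.
    bool try_account(uint64_t partition_hash, uint32_t max_per_second,
                     std::chrono::steady_clock::time_point now) {
        uint64_t window = std::chrono::duration_cast<std::chrono::seconds>(
            now.time_since_epoch()).count();
        auto& b = _buckets[partition_hash];
        if (b.window != window) {
            // New one-second window: reset this partition's counter.
            b.window = window;
            b.count = 0;
        }
        if (b.count >= max_per_second) {
            return false; // quota exceeded for this partition
        }
        ++b.count;
        return true;
    }
};
```

On the coordinator-is-replica fast path described above, the operation would be accounted through such a structure before any inter-node messages are sent.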
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:
- Write mode:
```
8f690fdd47 (PR base):
129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op)
This PR:
125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op)
```
- Read mode:
```
8f690fdd47 (PR base):
150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op)
This PR:
151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op)
```
Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded; apart from a small number of operations which failed while each node was being taken down, all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected
Fixes: #4703
Closes #9810
* github.com:scylladb/scylla:
storage_proxy: metrics for per-partition rate limiting of reads
storage_proxy: metrics for per-partition rate limiting of writes
database: add stats for per partition rate limiting
tests: add per_partition_rate_limit_test
config: add add_per_partition_rate_limit_extension function for testing
cf_prop_defs: guard per-partition rate limit with a feature
query-request: add allow_limit flag
storage_proxy: add allow rate limit flag to get_read_executor
storage_proxy: resultize return type of get_read_executor
storage_proxy: add per partition rate limit info to read RPC
storage_proxy: add per partition rate limit info to query_result_local(_digest)
storage_proxy: add allow rate limit flag to mutate/mutate_result
storage_proxy: add allow rate limit flag to mutate_internal
storage_proxy: add allow rate limit flag to mutate_begin
storage_proxy: choose the right per partition rate limit info in write handler
storage_proxy: resultize return types of write handler creation path
storage_proxy: add per partition rate limit to mutation_holders
storage_proxy: add per partition rate limit info to write RPC
storage_proxy: add per partition rate limit info to mutate_locally
database: apply per-partition rate limiting for reads/writes
database: move and rename: classify_query -> classify_request
schema: add per_partition_rate_limit schema extension
db: add rate_limiter
storage_proxy: propagate rate_limit_exception through read RPC
gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
storage_proxy: pass rate_limit_exception through write RPC
replica: add rate_limit_exception and a simple serialization framework
docs: design doc for per-partition rate limiting
transport: add rate_limit_error
It did nothing.
It will be re-added in `raft_group0`, where it will do something - stay
tuned.
With this we can remove the `feature_service` reference from
`raft_group_registry`.
`raft_group0` was constructed at the beginning of `join_cluster`, which
required passing references to 3 additional services to `join_cluster`
used only for that purpose (group 0 client, raft group registry, and
query processor).
Now we initialize `raft_group0` in main - like all other services - and
pass a reference to `join_cluster` so `storage_service` can store a
pointer to group 0.
We initialize `raft_group0` before we start listening for RPCs in
`messaging_service`. In a later commit we'll move the initialization
of group 0 related verbs to the constructor of `raft_group0` from
`storage_service`, so they will be initialized before we start
listening for RPCs.
Adds the new `per_partition_rate_limit` schema extension. It has two
parameters: `max_writes_per_second` and `max_reads_per_second`.
In future commits they will control how many operations of a given
type are allowed for each partition in the given table.
Reduce the storage_service's dependency on the raft group manager. The
group manager is needed only during bootstrap and in an rpc handler, so
pass it to those functions directly.
The group0_client uses group0 internally, and it cannot be destroyed
until group0 is stopped, to guarantee that there are no ongoing calls
into it from the group0_client.
Storage service needs it to calculate schema version on join. The proxy
at this point can be passed as an argument to the joining helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service in question is only needed at join_cluster time, so there is
no need to keep it in the dependencies list. This also solves a dependency
problem -- the distributed keyspace is sharded::start-ed after it's
passed to storage_service initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>