scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 18:10:39 +00:00

Author	SHA1	Message	Date
Avi Kivity	c96fc1d585	Merge "Introduce row level repair" from Asias " === How the partition level repair works - The repair master decides which ranges to work on. - The repair master splits the ranges into sub ranges, each containing around 100 partitions. - The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the 100 partitions. - If the checksum matches, the data in this sub range is synced. - If the checksum mismatches, repair master fetches the data from all the peers and sends back the merged data to peers. === Major problems with partition level repair - A mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. A single partition can be very large. Not to mention the size of 100 partitions. - Checksum (find the mismatch) and streaming (fix the mismatch) will read the same data twice === Row level repair Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch === How the row level repair works - To solve the problem of reading data twice Read the data only once for both checksum and synchronization between nodes. We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory. Find the mismatch and send the mismatched rows between peers. We need to find a sync boundary among the nodes which contains only N bytes of rows. - To solve the problem of sending unnecessary data. We need to find the mismatched rows between nodes and only send the delta. This is the set reconciliation problem, which is a common problem in distributed systems. For example: Node1 has set1 = {row1, row2, row3} Node2 has set2 = { row2, row3} Node3 has set3 = {row1, row2, row4} To repair: Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2. Node1 sends row3 (set1 + set2 + set3 - set3) to Node3. === How to implement repair with set reconciliation - Step A: Negotiate sync boundary class repair_sync_boundary { dht::decorated_key pk; position_in_partition position; } Reads rows from disk into row buffers until the size is larger than N bytes. Return the repair_sync_boundary of the last mutation_fragment we read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary. - Step B: Get missing rows from peer nodes so that repair master contains all the rows Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, data is synced, goto Step A. If not, request the full hashes from peers. At this point, the repair master knows exactly what rows are missing. Request the missing rows from peer nodes. Now, local node contains all the rows. - Step C: Send missing rows to the peer nodes Since local node also knows what peer nodes own, it sends the missing rows to the peer nodes. === What the RPC API looks like - repair_range_start() Step A: - request_sync_boundary() Step B: - request_combined_row_hashes() - request_full_row_hashes() - request_row_diff() Step C: - send_row_diff() - repair_range_stop() === Performance evaluation We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We created a keyspace with a replication factor of 3 and inserted 1 billion rows into each of the 3 nodes. Each node has 241 GiB of data. We tested 3 cases below. 1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows. Time to repair: old = 87 min new = 70 min (rebuild took 50 minutes) improvement = 19.54% 2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair: old = 43 min new = 24 min improvement = 44.18% 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. Time to repair: old: 211 min new: 44 min improvement: 79.15% Bytes sent on wire for repair: old: tx = 162 GiB, rx = 90 GiB new: tx = 1.15 GiB, rx = 0.57 GiB improvement: tx = 99.29%, rx = 99.36% It is worth noting that row level repair sends and receives exactly the number of rows needed in theory. In this test case, repair master needs to receive 2 million rows and send 4 million rows. Here are the details: Each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So repair master receives 1 million rows from repair slave 1 and 1 million rows from repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 1 to repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 2 to repair slave 1. In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000 rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000 Fixes: #3033 Tests: dtests/repair_additional_test.py " * 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits) repair: Enable row level repair repair: Add row_level_repair repair: Add docs for row level repair repair: Add repair_init_messaging_service_handler repair: Add repair_meta repair: Add repair_writer repair: Add repair_reader repair: Add repair_row repair: Add fragment_hasher repair: Add decorated_key_with_hash repair: Add get_random_seed repair: Add get_common_diff_detect_algorithm repair: Add shard_config repair: Add suportted_diff_detect_algorithms repair: Add repair_stats to repair_info repair: Introduce repair_stats flat_mutation_reader: Add make_generating_reader storage_service: Introduce ROW_LEVEL_REPAIR feature messaging_service: Add RPC verbs for row level repair repair: Export the repair logger ...	2018-12-25 13:13:00 +02:00
Duarte Nunes	6df32bfb0c	main: Start and stop the view_update_backlog_broker Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Duarte Nunes	776fdd4d1a	service/storage_proxy: Expose local view update backlog The local view update backlog is the max backlog out of the relative memory backlog size and the relative hints backlog size. We leverage the db::view::node_update_backlog class so we can send the max backlog out of the node's shards. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Asias He	b9e0db801d	repair: Enable row level repair Finally, enable new row level repair if the cluster supports it. If not, fallback to the old partition level repair. Fixes #3033	2018-12-12 16:49:01 +08:00
Avi Kivity	89be47e291	batchlog_manager: remove dependency on db::config Extract configuration into a new struct batchlog_manager_config and have the callers populate it using db::config. This reduces dependencies on global objects.	2018-12-09 20:11:38 +02:00
Tomasz Grabiec	6012a63660	Merge "Fix window during init where waiting for a feature can be ignored" from Avi storage_service keeps a bunch of "feature" variables, indicating cluster-wide supported features, and has the ability to wait until the entire cluster supports a given feature. The propagation of features depends on gossip, but gossip is initialized after storage_service, so the current code late-initializes the features. However, that means that whoever waits on a feature between storage_service initialization and gossip initialization loses their wait entry. In #3952, we have proof that this in fact happens. Fix this by removing the circular dependency. We now store features in a new service, feature_service, that is started before both gossip and storage_service. Gossip updates feature_service while storage_service reads from it. Fixes #3953. * https://github.com/avikivity/3953/v4.1: storage_service: deinline enable_all_features() gossiper: keep features registered tests/gossip: switch to seastar::thread storage_service: deinline init/deinit functions gossiper: split feature storage into a new feature_service gossiper: maybe enable features after start_gossiping() storage_service: fix gap when feature::when_enabled() doesn't work	2018-12-06 15:42:26 +01:00
Avi Kivity	4e553b692e	gossiper: split feature storage into a new feature_service Feature lifetime is tied to storage_service lifetime, but features are now managed by gossip. To avoid circular dependency, add a new feature_service service to manage feature lifetime. To work around the problem, the current code re-initializes features after gossip is initialized. This patch does not fix this problem; it only makes it possible to solve it by untying features from gossip.	2018-12-06 16:31:04 +02:00
Glauber Costa	fee4d2eb9b	compaction_manager: delay initialization of the compaction manager. If the compaction manager is started, compactions may start (this is regardless of whether or not we trigger them). The problem with that is that they start at a time in which we are flushing the commitlog and the initialization procedure waits for the commitlog to be fully flushed and the resulting memtables flushed before we move on. Because there are no incoming writes, the amount of shares in memtable flushes decreases as memory used decreases and that can cause the startup procedure to take a long time. We have recently started to bump the shares manually for manual flushes. While that guarantees that we will not drive the shares to zero, I will make the argument that we can do better by making sure that those things are, at this point, running alone: user experience is affected by startup times and the bump we give to user-triggered operations will only do so much. Even if we increase the shares a lot, flushes will still be fighting for resources with compactions and startup will take longer than it could. By making sure that flushes are at this point running alone we improve the user experience by making sure the startup is as fast as it can be. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-12-04 13:48:42 -05:00
Piotr Sarna	6ab8235369	main: fix deinitialization order for view update generator View update generator should be stopped only after drain_on_shutdown() is performed on storage service. Message-Id: <4d2bda4c73422a2ebf46d6dcd06c95d960839889.1543230849.git.sarna@scylladb.com>	2018-11-26 11:21:37 +00:00
Avi Kivity	775b7e41f4	Update seastar submodule * seastar d59fcef...b924495 (2): > build: Fix protobuf generation rules > Merge "Restructure files" from Jesse Includes fixup patch from Jesse: " Update Seastar `#include`s to reflect restructure All Seastar header files are now prefixed with "seastar" and the configure script reflects the new locations of files. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com> "	2018-11-21 00:01:44 +02:00
Piotr Sarna	16c042039c	main: add registering staging sstables read from disk Staging sstables read from disk are registered to the view update generator right after initializing non system keyspaces. Fixes #3275	2018-11-13 15:04:43 +01:00
Piotr Sarna	dc74887ff3	streaming: add system distributed keyspace ref to streaming Streaming code needs system distributed keyspace to check if streamed sstables should be staging, so a proper reference is added.	2018-11-13 15:01:53 +01:00
Piotr Sarna	7ef5e1b685	streaming: add view update generator reference to streaming Streaming code may need view update generator service to generate and send view updates, so a proper reference is added.	2018-11-13 15:01:53 +01:00
Piotr Sarna	eb0c507a45	main: add generating missed mv updates from staging sstables If any sstables are found in the staging directory, it means that they missed generating view updates, so it's performed now.	2018-11-13 15:01:53 +01:00
Avi Kivity	a71ab365e3	toplevel: convert sprint() to format() sprint() recently became more strict, throwing on sprint("%s", 5). Replace with the more modern format(). Mechanically converted with https://github.com/avikivity/unsprint.	2018-11-01 13:16:17 +00:00
Vlad Zolotarov	aca0882a3f	hinted handoff: enable storing hints before starting messaging_service When messaging_service is started we may immediately receive a mutation from another node (e.g. in the MV update context). If hinted handoff is not ready to store hints at that point we may fail some of the MV updates. We are going to resolve this by start()ing hints::managers before we start messaging_service and blocking hints replaying until all relevant objects are initialized. Refs #3828 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-10-18 16:49:58 -04:00
Avi Kivity	e7ae4beef0	main: run prometheus and API servers under streaming group Both the Prometheus and the API servers are used for maintenance operations, similarly to streaming. Run them under the streaming scheduling group to prevent them from impacting normal operations, and rename the streaming scheduling group to reflect the more generic role. This helps to prevent spikes from Prometheus or API requests from interfering with the normal workload. Using an existing group is preferable to creating a new group because in the worst case, all the non-main-workload groups compete with the main workload. Consolidating them allows us to give them significant shares in total without increasing competition in the worst case. The group's label is unchanged to preserve compatibility with dashboards. A nice side effect is that repair, which is initiated by API calls, gets placed into the maintenance group naturally. Compaction tasks which are run by compaction manager are not changed. Message-Id: <20180714160723.23655-1-avi@scylladb.com>	2018-07-30 15:07:33 +01:00
Avi Kivity	8c993e0728	messaging: tag RPC services with scheduling groups Assign a scheduling_group for each RPC service. Assignment is done by connection (get_rpc_client_idx()) - all verbs on the same connection are assigned the same group. While this may seem arbitrary, it avoids priority inversion; if two verbs on the same connection have different scheduling groups, the verb with the low shares may cause a backlog and stall the connection, including following requests from verbs that ought to have higher shares. The scheduling_group parameters are encapsulated in different classes as they are passed around to avoid adding dependencies. Message-Id: <20180708140433.6426-1-avi@scylladb.com>	2018-07-13 13:57:08 +02:00
Vlad Zolotarov	c65a110839	main: remove the "experimental" tag from the hinted handoff feature Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-07-06 19:19:40 -04:00
Vlad Zolotarov	83ba6d84a1	db::hints::manager: implement rebalance() method Rebalance hints segments that need to be sent among all present shards. Ensure that after rebalancing the difference between the number of segments of any two shards is not greater than 1. Try to minimize the amount of "file rename" operations in order to achieve the needed result. Note: "Resharding" is a particular case of rebalancing. Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-07-06 19:18:46 -04:00
Avi Kivity	f55a2fe3a7	main: improve reporting of dns resolution errors A report that C-Ares returned some errors tells the user nothing. Improve the error message by including the name of the configuration variable and its value. Message-Id: <20180705084959.10872-1-avi@scylladb.com>	2018-07-05 10:24:41 +01:00
Tomasz Grabiec	074be4d4e8	memtable, cache: Run mutation_cleaner worker in its own scheduling group The worker is responsible for merging MVCC snapshots, which is similar to merging sstables, but in memory. The new scheduling group will be therefore called "memory compaction". We should run it in a separate scheduling group instead of main/memtables, so that it doesn't disrupt writes and other system activities. It's also nice for monitoring how much CPU time we spend on this.	2018-06-27 21:51:04 +02:00
Avi Kivity	ea39e3e9d4	main: start client protocol servers under the statement scheduling group This will isolate client protocol and coordinator-side processing from the rest of the system.	2018-06-18 18:30:21 +03:00
Gleb Natapov	da20d86423	Configure authorized_prepared_statment_cache memory limit during object creation	2018-06-11 15:34:14 +03:00
Gleb Natapov	b38ced0fcd	Configure logalloc memory size during initialization	2018-06-11 15:34:14 +03:00
Gleb Natapov	646e400918	Provide available memory size to messaging_service object during creation	2018-06-11 15:34:13 +03:00
Gleb Natapov	ac88935baa	Provide available memory size to storage_proxy object during creation	2018-06-11 15:34:13 +03:00
Gleb Natapov	f41575a156	Provide available memory size to database object during creation	2018-06-11 15:34:13 +03:00
Gleb Natapov	461f20e7b1	Configure prepared_statements_cache memory limit from outside Pass desirable memory limit during construction instead of querying memory size explicitly.	2018-06-11 15:34:13 +03:00
Piotr Sarna	f12fdcffdb	storage_proxy: restore optional hinted handoff Since hinted handoff for materialized views is now a separate entity, regular hinted handoff can go back to being optional.	2018-06-04 09:46:06 +02:00
Piotr Sarna	a791dce0ae	db, config: add view_pending_updates directory Hints for materialized view updates need to be kept somewhere, because their dedicated hints manager has to have a root directory. view_pending_updates directory resides in /data and is used for that purpose.	2018-06-04 09:46:06 +02:00
Piotr Sarna	b7ac2da238	main: initialize hints manager unconditionally This commit makes sure that hints manager is always initialized, including creating hints directories and starting it. It needs to be fixed because hints manager is internally used to store failed materialized view replicas. Fixes #3451 Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>	2018-05-22 22:21:50 +01:00
Glauber Costa	3d2c4c1cf8	main: change I/O scheduler verification code Before we accept running while not in developer mode, we verify that the I/O Scheduler is properly configured. Up until now, that meant verifying that --max-io-requests is properly set and that the number of I/O Queues is enough to leave at least 4 requests per I/O Queue. Systems that move to newer versions of Scylla may continue doing that, so we need to be backwards compatible and keep testing for that. However, newer systems will not set that option, but pass a YAML property file (or string) instead. So we need to make sure that either one of those is set. If the property file is set, I am deciding here not to test for number of I/O queues. scylla_io_setup will usually configure that anyway, plus we plan on soon moving to all-shards-dispatch making that less important. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180509163737.5907-1-glauber@scylladb.com>	2018-05-13 19:22:54 +03:00
Vlad Zolotarov	48c96d09d6	db::hints::manager: drain hints when the node is decommissioned/removed When a node is decommissioned/removed, it will drain all its hints, and all remote nodes that have hints to it will drain their hints to this node. What does "drain" mean? - The node that "drains" hints to a specific destination will ignore failures and will continue sending hints till the end of the current segment, erase it and move to the next one till there are no more segments left. After all hints are drained the corresponding hints directory is removed. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-05-08 22:29:21 +01:00
Duarte Nunes	bf5045c7eb	db/view: Require configuration option to enable view building View building, enabled by default, can contain or expose issues that prevent the node from starting. In those cases, it is necessary to disable view building such that the node can be submitted to maintenance operations. Fixes #3329 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-04-03 13:16:28 +01:00
Glauber Costa	a9ef72537f	parse and ignore background writer controller Unused options are not exposed as command line options and will prevent Scylla from booting when present, although they can still be passed over YAML, for Cassandra compatibility. That has never been a problem, but we have been adding options to i3 (and others) that are now deprecated, but were previously marked as Used. Systems with those options may have issues upgrading. While this problem is common to all Unused options, the likelihood for any other unused option to appear in the command line is near zero, except for those two - since we put them there ourselves. There are two ways to handle this issue: 1) Mark them as Used, and just ignore them. 2) Add them explicitly to boost program options, and then ignore them. The second option is preferred here, because we can add them as hidden options in program_options, meaning they won't show up in the help. We can then just print a discreet message saying that those options are, from now on, ignored. v2: mark set as const (Botond) v3: rebase on top of master, indentation suggested by Duarte. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180329145517.8462-1-glauber@scylladb.com>	2018-03-29 17:57:30 +03:00
Duarte Nunes	a21efeffa0	db/view/view_builder: React to schema changes The view_builder now uses the migration_manager to subscribe to schema change events, and update its bookkeeping accordingly. We prefer this to having the database call into the view_builder, as that would create a cyclic dependency. We serialize changes to the views of a particular base table, such that schema changes do not interfere with the upcoming view building code. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:11 +01:00
Duarte Nunes	901faabaa2	db/view: Introduce view_builder This patch introduces the view_builder class, a sharded service responsible for building all defined materialized views. This process entails walking over the existing data in a given base table, and using it to calculate and insert the respective entries for one or more views. This patch introduces only the bootstrap functionality, which is responsible for loading the data stored in the system tables and filling the in-memory data structures with the relevant information, to be used in subsequent patches for the actual view building. The interaction with the system tables is as follows. Interaction with the tables in system_keyspace: - When we start building a view, we add an entry to the scylla_views_builds_in_progress system table. If the node restarts at this point, we'll consider these newly inserted views as having made no progress, and we'll treat them as new views; - When we finish a build step, we update the progress of the views that we built during this step by writing the next token to the scylla_views_builds_in_progress table. If the node restarts here, we'll start building the views at the token in the next_token column. - When we finish building a view, we mark it as completed in the built views system table, and remove it from the in-progress system table. Under failure, the following can happen: * When we fail to mark the view as built, we'll redo the last step upon node reboot; * When we fail to delete the in-progress record, upon reboot we'll remove this record. A view is marked as completed only when all shards have finished their share of the work, that is, if a view is not built, then all shards will still have an entry in the in-progress system table; - A view that a shard finished building, but not all other shards, remains in the in-progress system table, with first_token == next_token. 
Interaction with the distributed system table (view_build_status): - When we start building a view, we mark the view build as being in-progress; - When we finish building a view, we mark the view as being built. Upon failure, we ensure that if the view is in the in-progress system table, then it may not have been written to this table. We don't load the built views from this table when starting. When starting, the following happens: * If the view is in the system.built_views table and not the in-progress system table, then it will be in view_build_status; * If the view is in the system.built_views table and not in this one, it will still be in the in-progress system table - we detect this and mark it as built in this table too, keeping the invariant; * If the view is in this table but not in system.built_views, then it will also be in the in-progress system table - we don't detect this and will redo the missing step, for simplicity. View building is necessarily a sharded process. That means that on restart, if the number of shards has changed, we need to calculate the most conservative token range that has been built, and build the remainder. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
Duarte Nunes	ff15068a41	service/storage_service: Allow querying the view build status This patch adds support for the nodetool viewbuildstatus command, which shows the progress of a materialized view build across the cluster. A view can be absent from the result, successfully built, or currently being built. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-03-27 01:20:10 +01:00
Avi Kivity	16a7650873	Merge "More extensions: commitlog + system tables" from Calle " Additional extension points. * Allows wrapping commitlog file io (including hinted handoff). * Allows system schema modification on boot, allowing extensions to inject extensions into hardcoded schemas. Note: to make commitlog file extensions work, we need to both enforce we can be notified on segment delete, and thus need to fix the old issue of hard ::unlink call in segment destructor. Segment delete is therefore moved to a batch routine, run at intervals/flush. Replay segments and hints are also deleted via the commitlog object, ensuring an extension is notified (metadata). Configurable listeners are now allowed to inject configuration objects into the main config. I.e. a local object can, either by becoming a "configurable" or manually, add references to self-describing values that will be parsed from the scylla.yaml file, effectively extending it. All these wonderful abstractions courtesy of encryption of course. But super generalized! " * 'calle/commitlog_ext' of github.com:scylladb/seastar-dev: db::extensions: Allow extensions to modify (system) schemas db::commitlog: Add commitlog/hints file io extension db::commitlog: Do segment delete async + force replay delete go via CL main/init: Change configurable callbacks and calls to allow adding opts util::config_file: Add "add" config item overload	2018-03-26 16:18:22 +03:00
Calle Wilund	2bc98aebaf	db::commitlog: Do segment delete async + force replay delete go via CL Refs #2858 Push segment files to be deleted to a pending list, and process at intervals or flush-requests (or shutdown). Note that we do _not_ indiscriminately do deletes in non-anchored tasks, because we need to guarantee that finished segments are fully deleted and gone on CL shutdown, not to be mistaken for replayables. Also make sure we delete segments replayed via commitlog call, so IFF we add metadata processing for CL, we can clear it out.	2018-03-26 11:58:27 +00:00
Glauber Costa	9188059427	database: group statements in their own scheduling group When we introduced the CPU scheduler, we also introduced a group for commitlog - but never used it. There is also doubtful value in separating reads from writes, since they are often part of the same workload. To accommodate that, let's rename the query group to "statement" (query is not incorrect, just confusing), and move the write path, currently ungrouped, inside it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-20 16:58:36 -04:00
Calle Wilund	eb10d32ff9	main/init: Change configurable callbacks and calls to allow adding opts Refs #2526 Allows sub-configs to dynamically add yaml/command line options to the main config object, i.e. extend the scylla.yaml	2018-03-19 12:24:04 +00:00
Benoît Canet	1d0cc7cf20	messaging_service: Start messaging service earlier The messaging service was fully started only after a bootstrapping node finished joining, leading to #2034. Fixes #2034 Message-Id: <20180313084500.27265-1-amnon@scylladb.com>	2018-03-13 10:59:53 +02:00
Avi Kivity	d973445a94	Merge "sstable/schema extensions" from Calle " Adds extension points to schema/sstables to enable hooking in stuff, like, say, something that modifies how sstable disk io works. (Cough, cough, encryption) Extensions are processed as property keywords in CQL. To add an extension, a "module" must register it into the extensions object on boot time. To avoid globals (and yet don't), extensions are reachable from config (and thus from db). Table/view tables already contain an extension element, so we utilize this to persist config. schema_tables tables/views from mutations now require a "context" object (currently only extensions, but abstracted for easier further changes). Because of how schemas currently operate, there is a super lame workaround to allow "schema_registry" access to config and by extension extensions. DB, upon instantiation, calls a thread local global "init" in schema_registry and registers the config. It, in turn, can then call table_from_mutations as required. Includes the (modified) patch to encapsulate compression into objects, mainly because it is nice to encapsulate, and isolate a little.
" * 'calle/extensions-v5' of github.com:scylladb/seastar-dev: extensions: Small unit test sstables: Process extensions on file open sstables::types: Add optional extensions attribute to scylla metadata sstables::disk_types: Add hash and comparator(sstring) to disk_string schema_tables: Load/save extensions table cql: Add schema extensions processing to properties schema_tables: Require context object in schema load path schema_tables: Add opaque context object config_file_impl: Remove ostream operators main/init: Formalize configurables + add extensions to init call db::config: Add extensions as a config sub-object db::extensions: Configuration object to store various extensions cql3::statements::property_definitions: Use std::variant instead of any sstables: Add extension type for wrapping file io schema: Add opaque type to represent extensions sstables::compress/compress: Make compression a virtual object	2018-02-26 17:15:29 +02:00
Avi Kivity	432268f582	Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael "The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables, and regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. The manager was needed for orchestrating deletion of shared sstables across shards. It brings extra complexity that's no longer needed, and it was also overloading shard 0, but the latter could have been fixed. Tests: - unit: release mode - dtest: resharding_test.py" * 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla: Remove SSTable's atomic deletion manager Stop using SSTable's atomic deletion manager database: split column_family::rebuild_sstable_list	2018-02-08 19:10:16 +02:00
Tomasz Grabiec	cce1a2bce8	Merge "Use the CPU scheduler" from Glauber & Avi In this patchset I am resubmitting Avi's enablement of the CPU scheduler on his behalf. I've done a ton of testing in the series and there are some improvements / changes that I had previously sent as a separate series. What you see here is the result of merging that work. After this patchset is applied, workloads are smoother and we are able to uphold the pre-defined shares among the various actors. We also finally have everything we need to merge the CPU and I/O controllers. After that is done the code is now much simpler. But also, as a bonus, controllers that were previously available for I/O only (compactions) are enabled for CPU as well. * git@github.com:glommer/scylla.git cpusched-v7: Avi Kivity (4): database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler memtable, database: make memtable::clear_gently() inherit scheduling_group config: mark background_writer_scheduling_quota as Unused database: place data_query execution stage into scheduling_group Glauber Costa (9): database, main: set up scheduling_groups for our main tasks row_cache: actually use the scheduling group for update_cache allow update_cache and clear_gently to use the entire task quota. database: remove cpu_flush_quota metric controllers: retire auto_adjust_flush_quota controllers: allow memtable I/O controller to have shares statically set controllers: update control points for memtable I/O controller controllers: allow a static priority to override the controller output controllers: unify the I/O and CPU controllers	2018-02-08 15:58:40 +01:00
Raphael S. Carvalho	1472cfcc19	Stop using SSTable's atomic deletion manager The motivation is that it's no longer needed after the new resharding algorithm, which is solely responsible for working with shared sstables, and regular compaction will not work with those! So resharding will schedule deletion of shared sstables once it's certain that shards that own them have the new unshared sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-02-07 22:27:17 -02:00
Glauber Costa	956af9f099	database, main: set up scheduling_groups for our main tasks Set up scheduling groups for streaming, compaction, memtable flush, query, and commitlog. The background writer scheduling group is retired; it is split into the memtable flush and compaction groups. Comments from Glauber: This patch is based on a patch from Avi with the same subject, but the differences are significant enough that I reset authorship. In particular: 1) A bug/regression is fixed with the boundary calculations for the memtable controller sampling function. 2) A leftover is removed, where after flushing a memtable we would go back to the main group before going to the cache group again 3) As per Tomek's suggestion, now the submission of compactions themselves is run in the compaction scheduling group. Having that working is what changes this patch the most: we now store the scheduling group in the compaction manager and let the compaction manager itself enforce the scheduling group. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	641aaba12c	database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler thread_scheduling_groups are converted to plain scheduling_group. Due to differences in initialization (scheduling_group initialization defers), we create the scheduling_groups in main.cc and propagate them to users via a new class database_config. The sstable writer loses its thread_scheduling_group parameter and instead inherits scheduling from its caller. Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas, the flush controller was adjusted to return values within the higher ranges.	2018-02-07 17:19:29 -05:00

296 Commits