scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 16:33:35 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	4b51e0bf30	row_cache: Move cache_tracker to a separate header It will be needed by the sstable layer to get to the LRU and the LSA region. Split to avoid inclusion of whole row_cache.hh	2021-07-02 10:25:58 +02:00
Calle Wilund	a40b6a2f54	commitlog: Use disk file alignment info (with lower value if possible) Previously, the disk block alignment of segments was hardcoded (due to really old code). Now we use the value as declared in the actual file opened. If we are using a previously written file (i.e. o_dsync), we can even use the sometimes smaller "read" alignment. Also allow config to completely override this with a disk alignment config option (not exposed to global config yet, but can be). v2: * Use overwrite alignment if doing only overwrite * Ensure to adjust actual alignment if/when doing file wrapping v3: * Kill alignment config param. Useless and unsafe. Closes #8935	2021-06-29 16:00:49 +03:00
Piotr Jastrzebski	1bdcef6890	features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled These features have been around for over 2 years and every reasonable deployment should have them enabled. The only case when those features could be not enabled is when the user has used enable_sstables_mc_format config flag to disable MC sstable format. This case has been eliminated by removing enable_sstables_mc_format config flag. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Nadav Har'El	4d7f55a29f	cql: add configurable restriction of DateTieredCompactionStrategy DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time (users should use TimeWindowCompactionStrategy, TWCS, instead). This patch adds a new configuration option - restrict_dtcs - which can be used to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE statements. This is part of a "safe mode" effort to allow an installation to restrict operations which are un-recommended or dangerous. The new restrict_dtcs option has three values: "true", "false", and "warn": For the time being, "false" is still the default, and means DTCS is not restricted and can still be used freely. We can easily change this default in a followup patch. Setting a value of "true" means that DTCS is restricted - trying to create a table or alter a table with it will fail with an error. Setting a value of "warn" will allow the create or alter operation, but will warn the user - both with a warning message which will immediately appear in cqlsh (for example), and with a log message. Fixes #8914. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210624122411.435361-1-nyh@scylladb.com>	2021-06-24 20:59:27 +03:00
Kamil Braun	a3f3563828	storage_service: check for existing normal token owners before bootstrapping The bootstrap procedure starts by "waiting for range setup", which means waiting for a time interval specified by the `ring_delay` parameter (30s by default) so the node can receive the tokens of other nodes before introducing its own tokens. However it may sometimes happen that the node doesn't receive the tokens. There are no explicit checks for this. But the code may crash in weird ways if the tokens-received assumption is false, and we are lucky if it does crash (instead of, for example, allowing the node to incorrectly bootstrap, causing data loss in the process). Introduce an explicit check-and-throw-if-false: a bootstrapping node now checks that there's at least one NORMAL token in the token ring, which means that it had to have contacted at least one existing node in the cluster, which means that it received the gossip application states of all nodes from that node; in particular the tokens of all nodes. Also add an assert in CDC code which relies on that assumption (and would cause weird division-by-zero errors if the assumption was false; better to crash on assert than this). Ref #8889. Closes #8896	2021-06-24 13:19:08 +03:00
Michael Livshin	9b9efb2b42	disable caching of the system.large_* tables The cache of system.large_{partition,rows,cells} accumulates range tombstones (https://github.com/scylladb/scylla/issues/7750), and those range tombstones can be evicted only together with their partition (https://github.com/scylladb/scylla/issues/3288). Making the system.large_* tables uncached should work around the problem until #3288 is fixed. Fixes #8874 Refs #7750 Refs #3288 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>	2021-06-24 12:26:45 +03:00
Avi Kivity	14252c8b71	Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695) (v3)' from Calle Wilund Fixes #8270 If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall. We need to take a few things into account: 1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here. 2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task. 3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit. 4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to issue segment prunes from end_flush() so we wake up actual file deletion/recycling 5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate 6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above. 7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way. Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need flush everything). New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test. 
Closes #8764 * github.com:scylladb/scylla: commitlog_test: Add test case for usage/disk size threshold mismatch commitlog_test: Improve test assertion commitlog: Add waitable future for background sync/flush commitlog: abort queues on shutdown commitlog: break out "abort" calls into member functions commitlog: Do explicit discard+delete in shutdown commitlog: Recycle or not should not depend on shutdown state commitlog: Issue discard_unused_segments on segment::flush end IFF deletable commitlog: Flush all segments if we only have one. commitlog: Always force flush if segment allocation is waiting commitlog: Include segment wasted (slack) size in footprint check commitlog: Adjust (lower) usage threshold	2021-06-24 12:03:26 +03:00
Piotr Dulikowski	de1679b1b9	hints: make hints concurrency configurable and reduce the default Previously, hinted handoff had a hardcoded concurrency limit - at most 128 hints could be sent from a single shard at once. This commit makes this limit configurable by adding a new configuration option: `max_hinted_handoff_concurrency_per_shard`. This option can be updated in runtime. Additionally, the default concurrency per shard is made lower and is now 8. The motivation for reducing the concurrency was to mitigate the negative impact hints may have on performance of the receiving node due to them not being properly isolated with respect to I/O. Tests: - unit(dev) - dtest(hintedhandoff_additional_test.py) Refs: #8624 Closes #8646	2021-06-22 15:58:56 +02:00
Calle Wilund	d6113912cd	commitlog: Add waitable future for background sync/flush Commitlog timer issues un-waited syncs on all segments. If such a sync takes too long we can end up keeping a segment alive across a shutdown, causing the file to be left on disk, even if actually clean. This adds a future in segment_manager that is "chained" with all active syncs (hopefully just one), and ensures we wait for this to complete in shutdown, before pruning and deleting segments	2021-06-21 06:01:19 +00:00
Pavel Emelyanov	96131349e8	schema_tables: Remove unused sharded<proxy> argument Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	64bb16af8a	view_update_generator: Remove unused struct sstable_with_table Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Avi Kivity	b099e7c254	Merge "Untie hints managers and storage service" from Pavel E " The storage service is carried along storage proxy, hints resource manager and hints managers (two of them) just to subscribe the hints managers on lifecycle events (and stop the subscription on shutdown) emitted from storage service. This dependency chain can be greatly simplified, since the storage proxy is already subscribed on lifecycle events and can kick managers directly from its hooks. tests: unit(dev), dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev) " * 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla: hints: Drop storage service from managers hints: Do not subscribe managers on lifecycle events directly	2021-06-17 17:12:31 +03:00
Pavel Emelyanov	92a4278cd1	hints: Drop storage service from managers The storage service pointer is only used to (un)subscribe to (from) lifecycle events. Now the subscription is gone, so can the storage service pointer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-17 15:09:36 +03:00
Pavel Emelyanov	acdc568ecf	hints: Do not subscribe managers on lifecycle events directly Managers sit on storage proxy which is already subscribed on lifecycle events, so it can "notify" hints managers directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-17 15:06:26 +03:00
Nadav Har'El	45c2442f49	Merge 'Avoid large allocs in mv update code' from Piotr Sarna This series addresses #8852 by: * migrating to chunked_vector in view update generation code to avoid large allocations * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust. Tests: unit(release) Refs #8852 Closes #8856 * github.com:scylladb/scylla: db,view: limit the number of simultaneous view update futures db,view: use chunked_vector for view updates	2021-06-17 14:01:38 +03:00
Avi Kivity	4d70f3baee	storage_proxy: change unordered_set<inet_address> to small_vector in write path The write paths in storage_proxy pass replica sets as std::unordered_set<gms::inet_address>. This is a complex type, with N+1 allocations for N members, so we change it to a small_vector (via inet_address_vector_replica_set) which requires just one allocation, and even zero when up to three replicas are used. This change is more nuanced than the corresponding change to the read path `abe3d7d7` ("Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity"), for two reasons: - there is a quadratic algorithm in abstract_write_response_handler::response(): it searches for a replica and erases it. Since this happens for every replica, it happens N^2/2 times. - replica sets for writes always include all datacenters, while reads usually involve just one datacenter. So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2 = 105 compares. We could remove this by sending the index of the replica in the replica set to the replica and ask it to include the index in the response, but I think that this is unnecessary. Those 105 compares need to be only 105/15 = 7 times cheaper than the corresponding unordered_set operation, which they surely will. Handling a response after a cross-datacenter round trip surely involves L3 cache misses, and a small_vector reduces these to a minimum compared to an unordered_set with its bucket table, linked list walking and management, and table rehashing. Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000 --task-quota-ms show two allocations removed (as expected) and a nice reduction in instructions executed. before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op) after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op) Closes #8847	2021-06-17 13:46:40 +03:00
Avi Kivity	98cdeaf0f2	schema_tables: make the_merge_lock thread_local the_merge_lock is global, which is fine now because it is only used in shard 0. However, if we run multiple nodes in a single process, there will be multiple shard 0:s, and the_merge_lock will be accessed from multiple threads. This won't work. To fix, make it thread_local. It would be better to make it a member of some controlling object, but there isn't one. Closes #8858	2021-06-17 13:41:11 +03:00
Avi Kivity	00ff3c1366	Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy The option is provided by nodetool snapshot https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/ ``` nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)] [(-pp | --print-port)] [(-pw <password> | --password <password>)] [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)] [(-u <username> | --username <username>)] snapshot [(-cf <table> | --column-family <table> | --table <table>)] [(-kc <kclist> | --kc.list <kclist>)] [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>] -sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data) ``` But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167) and not supported at the api level. This patch adds support for the option in advance from the api service level down via snapshot_ctl to the table class and snapshot implementation. In addition, a corresponding unit test was added to verify that taking a snapshot with `skip_flush` does not flush the memtable (at the table::snapshot level). Refs #8725 Closes #8726 * github.com:scylladb/scylla: test: database_test: add snapshot_skip_flush_works api: storage_service/snapshots: support skip-flush option snapshot: support skip_flush option table: snapshot: add skip_flush option api: storage_service/snapshots: add sf (skip_flush) option	2021-06-17 13:32:23 +03:00
Nadav Har'El	b6b4df9a47	heat-weighted load balancing: improve handling of near-perfect cache Consider two nodes with almost-100% cache hit ratio, but not exactly 100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we want to equalize the miss rate in both nodes. So we send the first node twice the number of requests we send to the second. But unless the disks are extremely limited, this doesn't make sense: As a numeric example, consider that we send 2000 requests to the first node and 1000 to the second, just so the number of misses will be the same - 2 (0.1% and 0.2% misses, respectively). At such low miss numbers, the assumption that the disk reads are the slowest part of the operation is wrong, so trying to equalize only this part is wrong. So above some threshold hit rate, we should treat all hit rates as equivalent. In the code we already had such a threshold - max_hit_rate, but it was set to the incredibly high 0.999. We saw in actual user runs (see issue #8815) that this threshold was too high - one node received twice the amount of requests that another did - although both had near-100% cache hit rates. So in this patch we lower the max_hit_rate to 0.95. This will have two consequences: 1. Two nodes with hit rates above 0.95 will be considered to have the same hit rate, so they will get equal amount of work - even if one has hit rate 0.98 and the other 0.99. 2. A cold node with hit rate 0.0 will get 5% of the work of a node with the perfect hit rate limited to 0.95. This will allow the cold node to slowly warm up its cache. Before this patch, if the hot node happened to have a hit rate of 0.999 (the previous maximum), the cold node would get just 0.1% of the work and remain almost idle and fill its cache extremely slowly - which is a waste. Fixes #8815. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210616180732.125295-1-nyh@scylladb.com>	2021-06-17 11:02:08 +02:00
Piotr Sarna	1fb831c8c1	db,view: limit the number of simultaneous view update futures Previously the view update code generated a continuation for each view update and stored them all in a vector. In certain cases the number of updates can grow really large (to millions and beyond), so it's better to only store a limited amount of these futures at a time.	2021-06-17 10:20:52 +02:00
Piotr Sarna	a7f7716ecf	db,view: use chunked_vector for view updates The number of view updates can grow large, especially in corner cases like removing large base partitions. Chunked vector prevents large allocations.	2021-06-17 10:15:17 +02:00
Tomasz Grabiec	6bdf8c4c46	Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja Miscellaneous preparatory patches for group 0 discovery. * scylla-dev/raft-group-0-part-2-v4: raft: (service) servers map is gid -> server, not sid -> server system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID raft: (server) wait for configuration transition to complete raft: (server) implement raft::server::get_configuration() raft: (service) don't throw from schema state machine raft: (service) permit some scylla.raft cells to be empty raft: (service) properly handle failure to add a server raft: implement is_transient_error()	2021-06-17 00:15:40 +02:00
Calle Wilund	14559b5a86	commitlog: abort queues on shutdown In case we only have a single segment active when shutting down, the replenisher can be blocked even though we manually flush-deleted. Add a signal type and abort queues using this to wake up waiter and force them to check shutdown status.	2021-06-16 15:35:56 +00:00
Calle Wilund	227b573cdf	commitlog: break out "abort" calls into member functions	2021-06-16 15:35:56 +00:00
Calle Wilund	5cd9691f00	commitlog: Do explicit discard+delete in shutdown When we are shutting down, before trying to close the gate, we should issue a discard to ensure waking up the replenish task	2021-06-16 15:35:56 +00:00
Calle Wilund	03b8baaa8d	commitlog: Recycle or not should not depend on shutdown state If we are using recycling, we should always use recycle in delete_segments, otherwise we can cause deadlock with replenish task, since it will be waiting for segment, then shutdown is set, and we are called, and can't fulfil the alloc -> deadlock	2021-06-16 15:35:56 +00:00
Calle Wilund	5ebf5835b0	commitlog: Issue discard_unused_segments on segment::flush end IFF deletable If a segment, when finishing a flush call, is deletable, we should issue a manual call to discard function (which moves deletable segments off segment list) asap, since we otherwise are dependent on more calls from flush handlers (memtable flush). And since we could have blocked segment allocation, this can cause deadlocks, at least in tests.	2021-06-16 15:35:56 +00:00
Calle Wilund	cbddcf46aa	commitlog: Flush all segments if we only have one. Handle test cases with borked config so we don't deadlock in cases where we only have one segment in a commitlog	2021-06-16 15:35:56 +00:00
Calle Wilund	a0f559a44c	commitlog: Always force flush if segment allocation is waiting Refs #8270 If segment allocation is blocked, we should bypass all thresholds and issue a flush of as much as possible.	2021-06-16 15:35:56 +00:00
Calle Wilund	bcf4d07f0b	commitlog: Include segment wasted (slack) size in footprint check Refs #8270 Since segment allocation looks at actual disk footprint, not active, the threshold check in timer task should include slack space so we don't mistake sparse usage for space left.	2021-06-16 15:35:56 +00:00
Calle Wilund	1187f5c181	commitlog: Adjust (lower) usage threshold Refs #8270 Try to ensure we issue a flush as soon as we are allocating in the last allowable segment, instead of "half through". This will make flushing a little more eager, but should reduce latencies created by waiting for segment delete/recycle on heavy usage.	2021-06-16 15:35:56 +00:00
Konstantin Osipov	9c93d77e74	system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID Fix a bug in definitions of system.raft, system.raft_snapshots, group_id is TIMEUUID, not long.	2021-06-16 16:52:43 +03:00
Piotr Sarna	f832a30388	db,view,table: futurize calculating affected ranges In order to avoid stalls on large inputs, calculating affected ranges is now able to yield.	2021-06-16 09:51:31 +02:00
Piotr Sarna	3592d9b36e	db,view: use chunked vector for view affected ranges There were large allocation reports from vectors used for calculating affected ranges. In order to reduce the pressure on the allocator, chunked vector is used for storing intermediate results.	2021-06-15 10:30:27 +02:00
Piotr Sarna	8a049c9116	view: fix use-after-move when handling view update failures The code was susceptible to use-after-move if both local and remote updates were going to be sent. The whole routine for sending view updates is now rewritten to avoid use-after-move. Refs #8830 Tests: unit(release), dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)	2021-06-14 09:36:10 +02:00
Piotr Sarna	7cdbb7951a	db,view: explicitly move the mutation to its helper function The `apply_to_remote_endpoints` helper function used to take its `mut` parameter by reference, but then moved the value from it, which is confusing and prone to errors. Since the value is moved-from, let's pass it to the helper function as rvalue ref explicitly.	2021-06-14 09:34:40 +02:00
Piotr Sarna	88d4a66e90	db,view: pass base token by value to mutate_MV The base token is passed cross-continuations, so the current way of passing it by const reference probably only works because the token copying is cheap enough to optimize the reference out. Fix by explicitly taking the token by value.	2021-06-14 09:30:38 +02:00
Nadav Har'El	8a4ac6914a	config: add configuration option restrict_replication_simplestrategy This patch adds a configuration option to choose whether the SimpleStrategy replication strategy is restricted. It is a tri_mode_restriction, allowing to restrict this strategy (true), to allow it (false), or to just warn when it is used (warn). After this patch, the option exists but doesn't yet do anything. It will be used in the following two patches to restrict the CREATE KEYSPACE and ALTER KEYSPACE operations, respectively. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:45:16 +03:00
Nadav Har'El	a3d6f502ad	config: add "tri_mode_restriction" type of configurable value This patch adds a new type of configurable value for our command-line and YAML parsers - a "tri_mode_restriction" - which can be set to three values: "true", "false", or "warn". We will use this value type for many (but not all) of the restriction options that we plan to start adding in the following patches. Restriction options will allow users to ask Scylla to restrict (true), to not restrict (false) or to warn about (warn) certain dangerous or undesirable operations. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:44:20 +03:00
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Nadav Har'El	48ff641f67	Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy Shutdown must never fail, otherwise it may cause hangs as seen in https://github.com/scylladb/scylla/issues/8577. This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files. In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging. Fixes #8577 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8578 * github.com:scylladb/scylla: commitlog: segment_manager::shutdown: abort on errors commitlog: allocate_segment_ex: make_checked_file	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	e0749d6264	treewide: some random header cleanups Eliminate not used includes and replace some more includes with forward declarations where appropriate. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Benny Halevy	9cf858b5fc	snapshot: support skip_flush option skip_flush is disabled by default. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 17:20:21 +03:00
Piotr Sarna	389a0a52c9	treewide: revamp workload type for service levels This patch is not backward compatible with its original, but it's considered fine, since the original workload types were not yet part of any release. The changes include: - instead of using 'unspecified' for declaring that there's no workload type for a particular service level, NULL is used for that purpose; NULL is the standard way of representing lack of data - introducing a delete marker, which accompanies NULL and makes it possible to distinguish between wanting to forcibly reset a workload type to unspecified and not wanting to change the previous value - updating the tests accordingly These changes come in as a single patch, because they're intertwined with each other and the tests for workload types are already in place; an attempt to split them proved to be more complicated than it's worth. Tests: unit(release) Closes #8763	2021-05-31 18:18:33 +03:00
Piotr Jastrzebski	76d7c761d1	schema: Stop using deprecated constructor This is another boring patch. One of schema constructors has been deprecated for many years now but was used in several places anyway. Usage of this constructor could lead to data corruption when using MX sstables because this constructor does not set schema version. MX reading/writing code depends on schema version. This patch replaces all the places the deprecated constructor is used with schema_builder equivalent. The schema_builder sets the schema version correctly. Fixes #8507 Test: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>	2021-05-30 11:58:27 +03:00
Pavel Emelyanov	1ce0682821	view: Get database from storage_proxy The db::view code already uses proxy rather actively, so instead of depending on the storage service to be at hand it's better to make db::view require the proxy. For now -- via global instance. There's one dependency on storage service left after this patch -- to get the tokens. This piece is to be fixed later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:09:32 +03:00
Avi Kivity	0acf5bfca6	build: enable -Wreturn-std-move Clang warns when "return std::move(x)" is needed to elide a copy, but the call to std::move() is missing. We disabled the warning during the migration to clang. This patch re-enables the warning and fixes the places it points out, usually by adding std::move() and in one place by converting the returned variable from a reference to a local, so normal copy elision can take place. Closes #8739	2021-05-27 21:16:26 +03:00
Avi Kivity	d3e5b37059	Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund" This reverts commit `e9c940dbbc`, reversing changes made to `6144656b25`. Since it was merged commitlog_test consistently times out in debug mode.	2021-05-27 21:16:26 +03:00
Avi Kivity	5f8484897b	Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged. --- Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in a new keyspace that uses EverywhereStrategy so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. The generation's timestamp and the UUID mentioned in the last point form a "generation identifier" for this new generation. For passing these new identifiers around, we introduce the cdc::generation_id_v2 type. Fixes #7961. 
--- For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order. dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/ unit tests (dev) passed locally Closes #8643 * github.com:scylladb/scylla: docs: update cdc.md with info about the new internal table sys_dist_ks: don't create old CDC generations table on service initialization sys_dist_ks: rename all_tables() to ensured_tables() cdc: when creating new generations, use format v2 if possible main: pass feature_service to cdc::generation_service gms: introduce CDC_GENERATIONS_V2 feature cdc: introduce retrieve_generation_data test: cdc: include new generations table in permissions test sys_dist_ks: increase timeout for create_cdc_desc sys_dist_ks: new table for exchanging CDC generations tree-wide: introduce cdc::generation_id_v2	2021-05-27 17:13:44 +03:00
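
The CDC merge above (5f8484897b) says a generation's timestamp is chosen by taking the node's local time and adding 2 * ring_delay. A minimal sketch of that arithmetic follows, assuming nothing about ScyllaDB's real code: the function name `choose_generation_timestamp` and the `clock_type` alias are hypothetical, and only the formula itself comes from the commit message.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;
using clock_type = std::chrono::system_clock;

// Hypothetical helper, not ScyllaDB's API: the generation timestamp is the
// node's local time plus twice ring_delay, so that other nodes have time to
// learn about the generation before their clocks pass the timestamp.
clock_type::time_point choose_generation_timestamp(clock_type::time_point now,
                                                   std::chrono::milliseconds ring_delay) {
    assert(ring_delay.count() >= 0);  // a negative delay makes no sense here
    return now + 2 * ring_delay;
}
```

With the 30s default for `ring_delay` mentioned in a3f3563828, the chosen timestamp lies 60 seconds ahead of the node's clock. Under the new scheme described in the merge, this step would run only after the generation data has been written, so a slow insert no longer eats into that margin.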

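The heat-weighted load balancing commit (b6b4df9a47) argues from the ratio of work two nodes receive once hit rates are clamped to max_hit_rate. The sketch below reproduces only that arithmetic, assuming the balancer sends requests in inverse proportion to each node's clamped miss rate, as the commit message describes; `clamped_miss_rate` and `work_ratio` are hypothetical names, not functions from the ScyllaDB tree.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: miss rate after clamping the hit rate to max_hit_rate,
// so the result never drops below (1 - max_hit_rate).
double clamped_miss_rate(double hit_rate, double max_hit_rate) {
    assert(max_hit_rate > 0.0 && max_hit_rate < 1.0);
    return 1.0 - std::min(hit_rate, max_hit_rate);
}

// Hypothetical helper: work sent to node a relative to node b when the number
// of cache misses is equalized, i.e. inversely proportional to miss rates.
double work_ratio(double hit_a, double hit_b, double max_hit_rate) {
    return clamped_miss_rate(hit_b, max_hit_rate) /
           clamped_miss_rate(hit_a, max_hit_rate);
}
```

Under these assumptions, `work_ratio(0.999, 0.998, 0.999)` is about 2, matching the two-to-one split the commit message complains about, while `work_ratio(0.99, 0.98, 0.95)` is exactly 1 because both hit rates clamp to 0.95. For a cold node, `work_ratio(0.0, 1.0, 0.95)` is about 0.05 and `work_ratio(0.0, 1.0, 0.999)` is about 0.001, the 5% and 0.1% figures quoted in the message.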

2111 Commits