scylladb

Author	SHA1	Message	Date
Anna Stuchlik	2ccd51844c	doc: remove wrong image upgrade info (5.2-to-2023.1) This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted). Refs https://github.com/scylladb/scylladb/issues/15733 Closes scylladb/scylladb#21876 Fixes https://github.com/scylladb/scylla-enterprise/issues/5041 Fixes https://github.com/scylladb/scylladb/issues/21898 (cherry picked from commit `98860905d8`)	2024-12-12 15:28:20 +02:00
Lakshmi Narayanan Sreethar	705ec24977	db/config.cc: increment components_memory_reclaim_threshold config default Incremented the components_memory_reclaim_threshold config's default value to 0.2 as the previous value was too strict and caused unnecessary eviction in otherwise healthy clusters. Fixes #18607 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `3d7d1fa72a`) Closes #19011	2024-06-04 07:13:28 +03:00
Botond Dénes	e89eb41e70	Merge '[Backport 5.2] : Reload reclaimed bloom filters when memory is available ' from Lakshmi Narayanan Sreethar PR https://github.com/scylladb/scylladb/pull/17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components). The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory. Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.2. Closes #18666 * github.com:scylladb/scylladb: sstable_datafile_test: add testcase to test reclaim during reload sstable_datafile_test: add test to verify auto reload of reclaimed components sstables_manager: reload previously reclaimed components when memory is available sstables_manager: start a fiber to reload components sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables sstable_datafile_test: add test to verify reclaimed components reload sstables: support reloading reclaimed components sstables_manager: add new intrusive set to track the reclaimed sstables sstable: add link and comparator class to support new instrusive set sstable: renamed intrusive list link type sstable: track memory reclaimed from components per sstable sstable: rename local variable in sstable::total_reclaimable_memory_size	2024-05-30 11:11:39 +03:00
Kefu Chai	45814c7f14	docs: fix typos in upgrade document s/Montioring/Monitoring/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `f1f3f009e7`) Closes #18910	2024-05-30 11:10:49 +03:00
Botond Dénes	331e0c4ca7	Merge '[Backport 5.2] mutation_fragment_stream_validating_filter: respect validating_level::none' from ScyllaDB Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to. Fixes: #18662 (cherry picked from commit `f6511ca1b0`) (cherry picked from commit `e7b07692b6`) (cherry picked from commit `78afb3644c`) Refs #18667 Closes #18723 * github.com:scylladb/scylladb: test/boost/mutation_fragment_test.cc: add test for validator validation levels mutation: mutation_fragment_stream_validating_filter: fix validation_level::none mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter	2024-05-27 08:52:06 +03:00
Alexey Novikov	32be38dae5	make timestamp string format cassandra compatible When we convert a timestamp into a string, it must look like: '2017-12-27T11:57:42.500Z'. This concerns any conversion except the JSON timestamp format. A JSON string has a space as the time separator and must look like: '2017-12-27 11:57:42.500Z'. Both formats always contain milliseconds and a timezone specification. Fixes #14518 Fixes #7997 Closes #14726 Fixes #16575 (cherry picked from commit `ff721ec3e3`) Closes #18852	2024-05-26 16:30:06 +03:00
Botond Dénes	3dacf6a4b1	test/boost/mutation_fragment_test.cc: add test for validator validation levels To make sure that the validator doesn't validate what the validation level doesn't include. (cherry picked from commit `78afb3644c`)	2024-05-24 03:36:28 -04:00
Botond Dénes	3d360c7caf	mutation: mutation_fragment_stream_validating_filter: fix validation_level::none Despite its name, this validation level still did some validation. Fix this, by short-circuiting the catch-all operator(), preventing any validation when the user asked for none. (cherry picked from commit `e7b07692b6`)	2024-05-24 03:34:05 -04:00
Botond Dénes	f7a3091734	mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter When set to false, no exceptions will be raised from the validator on validation error. Instead, it will just return false from the respective validator methods. This makes testing simpler, since asserting exceptions is clunky. When true (default), the previous behaviour will remain: any validation error will invoke on_internal_error(), resulting in either std::abort() or an exception. Backporting notes: * Added const mutation_fragment_stream_validating_filter& param to on_validation_error() * Made full_name() public (cherry picked from commit `f6511ca1b0`)	2024-05-24 03:33:10 -04:00
Botond Dénes	6f0d32a42f	Merge '[Backport 5.2] utils: chunked_vector: fill ctor: make exception safe' from ScyllaDB Currently, if the fill ctor throws an exception, the destructor won't be called, as the object is not fully constructed yet. Call the default ctor first (which doesn't throw) to make sure the destructor will be called on exception. Fixes scylladb/scylladb#18635 - [x] Although the fix is for a rare bug, it has very low risk and so it's worth backporting to all live versions (cherry picked from commit `64c51cf32c`) (cherry picked from commit `88b3173d03`) (cherry picked from commit `4bbb66f805`) Refs #18636 Closes #18680 * github.com:scylladb/scylladb: chunked_vector_test: add more exception safety tests chunked_vector_test: exception_safe_class: count also moved objects utils: chunked_vector: fill ctor: make exception safe	2024-05-21 16:30:23 +03:00
Benny Halevy	d947f1e275	chunked_vector_test: add more exception safety tests For insertion, with and without reservation, and for fill and copy constructors. Reproduces https://github.com/scylladb/scylladb/issues/18635 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:33:42 +03:00
Benny Halevy	d727382cc1	chunked_vector_test: exception_safe_class: count also moved objects We have to account for moved objects as well as copied objects so they will be balanced with the respective `del_live_object` calls called by the destructor. However, since chunked_vector requires the value_type to be nothrow_move_constructible, just count the additional live object, but do not modify _countdown or, respectively, throw an exception, as this should be considered only for the default and copy constructors. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:33:42 +03:00
Benny Halevy	15a090f711	utils: chunked_vector: fill ctor: make exception safe Currently, if the fill ctor throws an exception, the destructor won't be called, as the object is not fully constructed yet. Call the default ctor first (which doesn't throw) to make sure the destructor will be called on exception. Fixes scylladb/scylladb#18635 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:33:42 +03:00
Yaron Kaikov	2fc86cc241	release: prepare for 5.2.19	2024-05-19 16:25:28 +03:00
Lakshmi Narayanan Sreethar	d4c523e9ef	sstable_datafile_test: add testcase to test reclaim during reload Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `4d22c4b68b`)	2024-05-14 19:20:06 +05:30
Lakshmi Narayanan Sreethar	b505ce4897	sstable_datafile_test: add test to verify auto reload of reclaimed components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a080daaa94`)	2024-05-14 19:20:06 +05:30
Lakshmi Narayanan Sreethar	80861e3bce	sstables_manager: reload previously reclaimed components when memory is available When an SSTable is dropped, the associated bloom filter gets discarded from memory, bringing down the total memory consumption of bloom filters. Any bloom filter that was previously reclaimed from memory due to the total usage crossing the threshold, can now be reloaded back into memory if the total usage can still stay below the threshold. Added support to reload such reclaimed filters back into memory when memory becomes available. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `0b061194a7`)	2024-05-14 19:20:02 +05:30
Lakshmi Narayanan Sreethar	9004c9ee38	sstables_manager: start a fiber to reload components Start a fiber that gets notified whenever an sstable gets deleted. The fiber doesn't do anything yet but the following patch will add support to reload reclaimed components if there is sufficient memory. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `f758d7b114`)	2024-05-14 19:19:22 +05:30
Lakshmi Narayanan Sreethar	72494af137	sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables The testcase uses an sstable whose mutation key and the generation are owned by different shards. Due to this, when process_sstable_dir is called, the sstable gets loaded into a different shard than the one that was intended. This also means that the sstable and the sstable manager end up in different shards. The following patch will introduce a condition variable in sstables manager which will be signalled from the sstables. If the sstable and the sstable manager are in different shards, the signalling will cause the testcase to fail in debug mode with this error : "Promise task was set on shard x but made ready on shard y". So, fix it by supplying appropriate generation number owned by the same shard which owns the mutation key as well. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `24064064e9`)	2024-05-14 19:19:22 +05:30
Lakshmi Narayanan Sreethar	c15e72695d	sstable_datafile_test: add test to verify reclaimed components reload Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `69b2a127b0`)	2024-05-14 19:19:18 +05:30
Lakshmi Narayanan Sreethar	83dd78fb9d	sstables: support reloading reclaimed components Added support to reload components from which memory was previously reclaimed as the total memory of reclaimable components crossed a threshold. The implementation is kept simple as only the bloom filters are considered reclaimable for now. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `54bb03cff8`)	2024-05-14 19:17:03 +05:30
Lakshmi Narayanan Sreethar	62338d3ad0	compaction: improve partition estimates for garbage collected sstables When a compaction strategy uses garbage collected sstables to track expired tombstones, do not use complete partition estimates for them, instead, use a fraction of it based on the droppable tombstone ratio estimate. Fixes #18283 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#18465 (cherry picked from commit `d39adf6438`) Closes #18659	2024-05-14 15:42:12 +03:00
Lakshmi Narayanan Sreethar	1bd6584478	sstables_manager: add new intrusive set to track the reclaimed sstables The new set holds the sstables from where the memory has been reclaimed and is sorted in ascending order of the total memory reclaimed. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `2340ab63c6`)	2024-05-14 01:46:36 +05:30
Lakshmi Narayanan Sreethar	19f3e42583	sstable: add link and comparator class to support new instrusive set Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `140d8871e1`)	2024-05-14 01:46:17 +05:30
Lakshmi Narayanan Sreethar	bb9ceae2c3	sstable: renamed intrusive list link type Renamed the intrusive list link type to differentiate it from the set link type that will be added in an upcoming patch. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `3ef2f79d14`)	2024-05-14 01:45:27 +05:30
Lakshmi Narayanan Sreethar	fa154a8d00	sstable: track memory reclaimed from components per sstable Added a member variable _total_memory_reclaimed to the sstable class that tracks the total memory reclaimed from a sstable. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `02d272fdb3`)	2024-05-14 01:45:20 +05:30
Lakshmi Narayanan Sreethar	a9101f14f6	sstable: rename local variable in sstable::total_reclaimable_memory_size Renamed local variable in sstable::total_reclaimable_memory_size in preparation for the next patch which adds a new member variable _total_memory_reclaimed to the sstable class. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a53af1f878`)	2024-05-14 01:45:13 +05:30
Kamil Braun	b68c06cc3a	direct_failure_detector: increase ping timeout and make it tunable The direct failure detector design is simplistic. It sends pings sequentially and times out listeners that reached the threshold (i.e. didn't hear from a given endpoint for too long) in-between pings. Given the sequential nature, the previous ping must finish so the next ping can start. We time out pings that take too long. The timeout was hardcoded and set to 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. 3 consecutive timed-out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s). Increase the ping timeout to 600ms which should be enough even for pinging the opposite side of Earth, and make it tunable. Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, one timed-out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed-out pings which makes it more robust in presence of transient network hiccups. In the future we'll most likely want to decrease the Raft listener threshold again, if we use Raft for data path -- so leader elections start quickly after leader failures. (Faster than 2s). To do that we'll have to improve the design of the direct failure detector. Ref: scylladb/scylladb#16410 Fixes: scylladb/scylladb#16607 --- I tested the change manually using `tc qdisc ... netem delay`, setting network delay on local setup to ~300ms with jitter. Without the change, the result is as observed in scylladb/scylladb#16410: interleaving ``` raft_group_registry - marking Raft server ... as dead for Raft groups raft_group_registry - marking Raft server ... as alive for Raft groups ``` happening once every few seconds. The "marking as dead" happens whenever we get 3 consecutive failed pings, which happens with a certain (high) probability depending on the latency jitter. 
Then as soon as we get a successful ping, we mark server back as alive. With the change, the phenomenon no longer appears. (cherry picked from commit `8df6d10e88`) Closes #18558	2024-05-08 15:46:59 +02:00
Pavel Emelyanov	1cb959fc84	Update seastar submodule (iotune iodepth underflow fix) * seastar b9fd21d8...5ab9a7cf (1): > iotune: ignore shards with id above max_iodepth Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-06 19:25:41 +03:00
Pavel Emelyanov	5f5acc813a	view-builder: Print correct exception in built ste exception handler Inside a .handle_exception() continuation, std::current_exception() doesn't work; the handler's lambda receives the exception as its std::exception ex argument instead. Fixes #18423 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18349 (cherry picked from commit `4ac30e5337`)	2024-05-01 10:20:26 +03:00
Anna Stuchlik	f71a687baf	doc: run repair after changing RF of system_auth This commit adds the requirement to run repair after changing the replication factor of the system_auth keyspace in the procedure of adding a new node to a cluster. Refs: https://github.com/scylladb/scylla-enterprise/issues/4129 Closes scylladb/scylladb#18466 (cherry picked from commit `d85d37921a`)	2024-04-30 19:18:15 +03:00
Asias He	b2858e4028	streaming: Fix use after move in fire_stream_event The event is used in a loop. Found by clang-tidy: ``` streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move] listener->handle_stream_event(std::move(event)); ^ streaming/stream_result_future.cc:80:39: note: move occurred here listener->handle_stream_event(std::move(event)); ^ streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move listener->handle_stream_event(std::move(event)); ^ ``` Fixes #18332 (cherry picked from commit `4fd4e6acf3`) Closes #18430	2024-04-30 15:07:43 +02:00
Lakshmi Narayanan Sreethar	0fc0474ccc	sstables: reclaim_memory_from_components: do not update _recognised_components When reclaiming memory from bloom filters, do not remove them from _recognised_components, as that leads to the on-disk filter component being left back on disk when the SSTable is deleted. Fixes #18398 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#18400 (cherry picked from commit `6af2659b57`) Closes #18437	2024-04-29 10:02:52 +03:00
Kefu Chai	119dbb0d43	thrift: avoid use-after-move in `make_non_overlapping_ranges()` In handler.cc, `make_non_overlapping_ranges()` references a moved-from instance of `ColumnSlice` when it formats the error message of an exception after something unexpected happens. The move constructor of `ColumnSlice` is default-generated, so the members' move constructors are used to construct the new instance. This could lead to undefined behavior when dereferencing the moved-from instance. In this change, in order to avoid the use-after-move, let's keep a copy of the referenced member variables and reference them when formatting the error message in the exception. This use-after-move issue was introduced in `822a315dfa`, which implemented the `get_multi_slice` verb and this piece in the first place. Since both 5.2 and 5.4 include this commit, we should backport this change to them. Refs `822a315dfa` Fixes #18356 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1ad3744edc`) Closes #18373	2024-04-25 11:36:00 +03:00
Anna Mikhlin	dae9bef75f	release: prepare for 5.2.18	2024-04-19 13:30:48 +03:00
Asias He	065f7178ab	repair: Improve estimated_partitions to reduce memory usage Currently, we use the sum of the estimated_partitions from each participant node as the estimated_partitions for sstable produced by repair. This way, the estimated_partitions is the biggest possible number of partitions repair would write. Since repair will write only the difference between repair participant nodes, using the biggest possible estimation will overestimate the partitions written by repair, most of the time. The problem is that overestimating the partitions makes the bloom filter consume more memory. This has been observed to cause OOM in the field. This patch changes the estimation to use a fraction of the average partitions per node instead of the sum. It is still not a perfect estimation but it already improves memory usage significantly. Fixes #18140 Closes scylladb/scylladb#18141 (cherry picked from commit `642f9a1966`)	2024-04-18 16:37:05 +03:00
Botond Dénes	f17e480237	Merge '[Backport 5.2] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required. The feature can be manually verified by doing the following : 1. run a single-node single-shard 1GB cluster 2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter) 3. populate with tiny partitions 4. watch the bloom filter metrics get capped at 100MB The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature. Fixes https://github.com/scylladb/scylladb/issues/17747 Backported from #17771 to 5.2. Closes #18247 * github.com:scylladb/scylladb: test_bloom_filter.py: disable reclaiming memory from components sstable_datafile_test: add tests to verify auto reclamation of components test/lib: allow overriding available memory via test_env_config sstables_manager: support reclaiming memory from components sstables_manager: store available memory size sstables_manager: add variable to track component memory usage db/config: add a new variable to limit memory used by table components sstable_datafile_test: add testcase to verify reclamation from sstables sstables: support reclaiming memory from components	2024-04-17 14:34:19 +03:00
Lakshmi Narayanan Sreethar	dd9ab15bb5	test_bloom_filter.py: disable reclaiming memory from components Disabled reclaiming memory from sstable components in the testcase as it interferes with the false positive calculation. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d86505e399`)	2024-04-16 15:50:22 +05:30
Lakshmi Narayanan Sreethar	96db5ae5e3	sstable_datafile_test: add tests to verify auto reclamation of components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d261f0fbea`)	2024-04-16 15:49:58 +05:30
Lakshmi Narayanan Sreethar	beea229deb	test/lib: allow overriding available memory via test_env_config Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `169629dd40`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	89367c4310	sstables_manager: support reclaiming memory from components Reclaim memory from the SSTable that has the most reclaimable memory if the total reclaimable memory has crossed the threshold. Only the bloom filter memory is considered reclaimable for now. Fixes #17747 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a36965c474`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	32de41ecb4	sstables_manager: store available memory size The available memory size is required to calculate the reclaim memory threshold, so store that within the sstables manager. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `2ca4b0a7a2`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	0841c0084c	sstables_manager: add variable to track component memory usage sstables_manager::_total_reclaimable_memory variable tracks the total memory that is reclaimable from all the SSTables managed by it. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `f05bb4ba36`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	786c08aa59	db/config: add a new variable to limit memory used by table components A new configuration variable, components_memory_reclaim_threshold, has been added to configure the maximum allowed percentage of available memory for all SSTable components in a shard. If the total memory usage exceeds this threshold, it will be reclaimed from the components to bring it back under the limit. Currently, only the memory used by the bloom filters will be restricted. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e8026197d2`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	31251b37dd	sstable_datafile_test: add testcase to verify reclamation from sstables Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e0b6186d16`)	2024-04-16 15:30:30 +05:30
Lakshmi Narayanan Sreethar	1b390ceb24	sstables: support reclaiming memory from components Added support to track total memory from components that are reclaimable and to reclaim memory from them if and when required. Right now only the bloom filters are considered as reclaimable components but this can be extended to any component in the future. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `4f0aee62d1`)	2024-04-16 13:03:45 +05:30
Tzach Livyatan	0bfe016beb	Update Driver root page The right term is Amazon DynamoDB not AWS DynamoDB See https://aws.amazon.com/dynamodb/ Closes scylladb/scylladb#18214 (cherry picked from commit `289793d964`)	2024-04-16 09:55:41 +03:00
Botond Dénes	280956f507	Merge '[Backport 5.2] repair: fix memory counting in repair' from Aleksandra Martyniuk Repair memory limit includes only the size of frozen mutation fragments in repair row. The size of other members of repair row may grow uncontrollably and cause out of memory. Modify what's counted to repair memory limit. Fixes: https://github.com/scylladb/scylladb/issues/16710. (cherry picked from commit `a4dc6553ab`) (cherry picked from commit `51c09a84cc`) Refs https://github.com/scylladb/scylladb/pull/17785 Closes #18237 * github.com:scylladb/scylladb: test: add test for repair_row::size() repair: fix memory accounting in repair_row	2024-04-16 07:07:15 +03:00
Aleksandra Martyniuk	97671eb935	test: add test for repair_row::size() Add test which checks whether repair_row::size() considers external memory. (cherry picked from commit `51c09a84cc`)	2024-04-09 13:29:33 +02:00
Aleksandra Martyniuk	8144134545	repair: fix memory accounting in repair_row In repair, only the size of frozen mutation fragments of repair row is counted to the memory limit. So, huge keys of repair rows may lead to OOM. Include other repair_row's members' memory size in repair memory limit. (cherry picked from commit `a4dc6553ab`)	2024-04-06 22:44:51 +00:00
Ferenc Szili	2bb5fe7311	logging: Don't log PK/CK in large partition/row/cell warning Currently, Scylla logs a warning when it writes a cell, row or partition which is larger than certain configured sizes. These warnings contain the partition key and, in the case of rows and cells, also the clustering key, which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectively. This change removes the partition and clustering keys from the log messages, but still inserts them into the system tables. The logged data will look like this: Large cells: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db Large rows: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db Large partitions: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db Fixes #18041 Closes scylladb/scylladb#18166 (cherry picked from commit `f1cc6252fd`)	2024-04-05 16:03:08 +03:00
Kefu Chai	4595f51d5c	utils/logalloc: do not allocate memory in reclaim_timer::report() before this change, `reclaim_timer::report()` calls ```c++ fmt::format(", at {}", current_backtrace()) ``` which allocates a `std::string` on heap, so it can fail and throw. in that case, `std::terminate()` is called. but at that moment, the reason why `reclaim_timer::report()` gets called is that we fail to reclaim memory for the caller. so we are more likely to run into this issue. anyway, we should not allocate memory in this path. in this change, a dedicated printer is created so that we don't format to a temporary `std::string`, and instead write directly to the buffer of logger. this avoids the memory allocation. Fixes #18099 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18100 (cherry picked from commit `fcf7ca5675`)	2024-04-02 16:38:17 +03:00
Wojciech Mitros	c0c34d2af0	mv: keep semaphore units alive until the end of a remote view update When a view update has both a local and remote target endpoint, it extends the lifetime of its memory tracking semaphore units only until the end of the local update, while the resources are actually used until the remote update finishes. This patch changes the semaphore transferring so that in case of both local and remote endpoints, both view updates share the units, causing them to be released only after the update that takes longer finishes. Fixes #17890 (cherry picked from commit `9789a3dc7c`) Closes #18104	2024-04-02 10:09:01 +02:00
Pavel Emelyanov	c34a503ef3	Update seastar submodule (iotune error path crash fix) * seastar eb093f8a...b9fd21d8 (1): > iotune: Don't close file that wasn't opened Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-28 10:53:51 +03:00
Beni Peled	447a3beb47	release: prepare for 5.2.17	2024-03-27 14:35:37 +02:00
Botond Dénes	aca50c46b7	tools/toolchain: update python driver Backports scylladb/scylladb#17604 and scylladb/scylladb#17956. Fixes scylladb/scylladb#16709 Fixes scylladb/scylladb#17353 Closes #17661	2024-03-27 08:48:25 +02:00
Wojciech Mitros	44bcaca929	mv: adjust memory tracking of single view updates within a batch Currently, when dividing memory tracked for a batch of updates we do not take into account the overhead that we have for processing every update. This patch adds the overhead for single updates and joins the memory calculation path for batches and their parts so that both use the same overhead. Fixes #17854 (cherry picked from commit `efcb718`) Closes #17999	2024-03-26 09:38:17 +02:00
Botond Dénes	2e2bf79092	Merge '[Backport 5.2] tests: utils: error injection: print time duration instead of count' from ScyllaDB before this change, we always cast the wait duration to milliseconds, even if it could be using a higher resolution. actually `std::chrono::steady_clock` is using `nanosecond` for its duration, so if we inject a deadline using `steady_clock`, we could be awakened earlier due to the narrowing of the duration type caused by the duration_cast. in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing the precision. the tests are updated to print the time duration instead of count to provide information with a higher resolution. Fixes #15902 (cherry picked from commit `8a5689e7a7`) (cherry picked from commit `1d33a68dd7`) Closes #17911 * github.com:scylladb/scylladb: tests: utils: error injection: print time duration instead of count error_injection: do not cast to milliseconds when injecting timeout	2024-03-25 17:41:23 +02:00
Pavel Emelyanov	616199f79c	Update seastar submodule (dupliex IO queue activation fix) * seastar ad0f2d5d...eb093f8a (1): > fair_queue: Do not pop unplugged class immediately Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-25 12:58:24 +03:00
Wojciech Mitros	5dfb6c9ead	mv: adjust the overhead estimation for view updates In order to avoid running out of memory, we can't underestimate the memory used when processing a view update. Particularly, we need to handle the remote view updates well, because we may create many of them at the same time in contrast to local updates which are processed synchronously. After investigating a coredump generated in a crash caused by running out of memory due to these remote view updates, we found that the current estimation is much lower than what we observed in practice; we identified overhead of up to 2288 bytes for each remote view update. The overhead consists of: - 512 bytes - a write_response_handler - less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream - 448 bytes - the apply_to_remote_endpoints coroutine started in mutate_MV() - 192 bytes - a continuation to the coroutine above - 320 bytes - the coroutine in result_parallel_for_each started in mutate_begin() - 112 bytes - a continuation to the coroutine above - 192 bytes - 5 unspecified allocations of 32, 32, 32, 48 and 48 bytes This patch changes the previous overhead estimate of 256 bytes to 2288 bytes, which should take into account all allocations in the current version of the code. It's worth noting that changes in the related pieces of code may result in a different overhead. The allocations seem to be mostly captures for the background tasks. Coroutines seem to allocate extra, however testing shows that replacing a coroutine with continuations may result in generating a few smaller futures/continuations with a larger total size. Besides that, considering that we're waiting for a response for each remote view update, we need the relatively large write_response_handler, which also includes the mutation in case we needed to reuse it. 
The change should not majorly affect workloads with many local updates because we don't keep many of them at the same time anyway, and an added benefit of correct memory utilization estimation is avoiding evictions of other memory that would be otherwise necessary to handle the excessive memory used by view updates. Fixes #17364 (cherry picked from commit `5ab3586135`) Closes #17858	2024-03-20 13:52:23 +02:00
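As a quick arithmetic check of the estimate in the commit above, the listed allocations can be summed at compile time. This is only a sketch: the array and function names are hypothetical and do not come from the ScyllaDB sources, and the `bytes_ostream` entry uses the 512-byte upper bound stated in the message.

```cpp
#include <array>
#include <cstddef>

// Hypothetical names; sizes in bytes as listed in the commit message
// for a single remote view update.
constexpr std::array<std::size_t, 7> remote_view_update_allocations = {
    512, // write_response_handler
    512, // upper bound for the extra mutation allocation in bytes_ostream
    448, // apply_to_remote_endpoints coroutine started in mutate_MV()
    192, // continuation to the coroutine above
    320, // coroutine in result_parallel_for_each started in mutate_begin()
    112, // continuation to the coroutine above
    192, // five small allocations: 32 + 32 + 32 + 48 + 48
};

// Sum of all listed allocations.
constexpr std::size_t total_remote_view_update_overhead() {
    std::size_t sum = 0;
    for (std::size_t size : remote_view_update_allocations) {
        sum += size;
    }
    return sum;
}

// The total matches the new estimate quoted in the commit message.
static_assert(total_remote_view_update_overhead() == 2288);
static_assert(32 + 32 + 32 + 48 + 48 == 192);
```

The sum reproduces the 2288-byte figure that replaces the earlier 256-byte estimate.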
Kefu Chai	6209f5d6d4	tests: utils: error injection: print time duration instead of count instead of casting / comparing the count of duration unit, let's just compare the durations, so that boost.test is able to print the duration in a more informative and user friendly way (line wrapped) test/boost/error_injection_test.cc(167): fatal error: in "test_inject_future_disabled": critical check wait_time > sleep_msec has failed [23839ns <= 10ms] Refs #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1d33a68dd7`)	2024-03-20 09:40:16 +00:00
Kefu Chai	ac288684c6	error_injection: do not cast to milliseconds when injecting timeout before this change, we always cast the wait duration to millisecond, even if it could be using a higher resolution. actually `std::chrono::steady_clock` is using `nanosecond` for its duration, so if we inject a deadline using `steady_clock`, we could be awakened earlier due to the narrowing of the duration type caused by the duration_cast. in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing the precision. Fixes #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `8a5689e7a7`)	2024-03-20 09:40:16 +00:00
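The narrowing described in the commit above can be illustrated with plain `std::chrono`. This is a minimal sketch, not the error-injection code itself, and the helper name `narrow_to_ms` is hypothetical: `duration_cast` truncates toward zero, so a nanosecond-resolution wait loses its sub-millisecond part and ends early.

```cpp
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical helper: what casting a wait duration to milliseconds does.
constexpr std::chrono::milliseconds narrow_to_ms(std::chrono::nanoseconds d) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(d);
}

// A wait of 10 ms plus 999,999 ns, expressed at nanosecond resolution.
constexpr std::chrono::nanoseconds requested = 10ms + 999'999ns;

// The cast drops the fractional millisecond, so the narrowed wait is shorter.
static_assert(narrow_to_ms(requested) == 10ms);
static_assert(narrow_to_ms(requested) < requested);
```

Keeping the duration in its original type, as the commit does, avoids the truncation.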
Raphael S. Carvalho	fc1d126f31	replica: Fix major compaction semantics by performing off-strategy first Major compaction semantics is that all data of a table will be compacted together, so user can expect e.g. a recently introduced tombstone to be compacted with the data it shadows. Today, it can happen that all data in maintenance set won't be included for major, until they're promoted into main set by off-strategy. So user might be left wondering why major is not having the expected effect. To fix this, let's perform off-strategy first, so data in maintenance set will be made available by major. A similar approach is done for data in memtable, so flush is performed before major starts. The only exception will be data in staging, which cannot be compacted until view building is done with it, to avoid inconsistency in view replicas. The serialization in compaction manager of reshape jobs guarantees correctness if there's an ongoing off-strategy on behalf of the table. Fixes #11915. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#15792 (cherry picked from commit `ea6c281b9f`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #17901	2024-03-20 08:48:17 +02:00
Anna Stuchlik	70edebd8d7	doc: fix the image upgrade page This commit updates the Upgrade ScyllaDB Image page. - It removes the incorrect information that updating underlying OS packages is mandatory. - It adds information about the extended procedure for non-official images. (cherry picked from commit `fc90112b97`) Closes #17885	2024-03-19 16:47:06 +02:00
Petr Gusev	dffc0fb720	repair_meta: get_estimated_partitions fix The shard_range parameter was unused. Fixes: #17863 (cherry picked from commit `b9f527bfa8`)	2024-03-18 14:27:45 +02:00
Kamil Braun	0bb338c521	test: remove test_writes_to_recent_previous_cdc_generations The test in its original form relies on the `error_injections_at_startup` feature, which 5.2 doesn't have, so I adapted the test to enable error injections after bootstrapping nodes in the backport (`9c44bbce67`). That is however incorrect, it's important for the injection to be enabled while the nodes are booting, otherwise the test will be flaky, as we observed. Details in scylladb/scylladb#17749. Remove the test from 5.2 branch. Fixes scylladb/scylladb#17749 Closes #17750	2024-03-15 10:22:22 +02:00
Tomasz Grabiec	cefa19eb93	Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun This is a backport of `0c376043eb` and follow-up fix `57b14580f0` to 5.2. We haven't identified any specific issues in test or field in 5.2/2023.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways. Closes #17640 * github.com:scylladb/scylladb: migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer raft_group0_client: assert that hold_read_apply_mutex is called on shard 0 migration_manager: fix indentation after the previous patch. messaging_service: process migration_request rpc on shard 0 migration_manager: take group0 lock during raft snapshot taking	2024-03-14 23:41:02 +01:00
Nadav Har'El	08077ff3e8	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. 
The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172 (cherry picked from commit `21e7deafeb`)	2024-03-13 15:08:32 +02:00
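The rule adopted by the fix above (use the higher of the two timestamps of the base-regular-column GSI key columns) can be sketched as follows. The type alias and function name are hypothetical, chosen for illustration only.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical alias for a write timestamp.
using timestamp_type = std::int64_t;

// Timestamp for a view row whose two key columns 'a' and 'b' are both
// regular columns in the base table: take the newer of the two.
constexpr timestamp_type view_row_timestamp(timestamp_type ts_a, timestamp_type ts_b) {
    return std::max(ts_a, ts_b);
}

// 'a' was written at 100 and only 'b' was later modified at 250:
// the view update carries b's newer timestamp, not a's older one.
static_assert(view_row_timestamp(100, 250) == 250);
static_assert(view_row_timestamp(250, 100) == 250);
```

With the old behavior (always a's timestamp), repeated updates to b would all carry timestamp 100, so a deletion with a later timestamp would win over every one of them.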
Kamil Braun	405387c663	test: unflake test_topology_remove_garbage_group0 The test is booting nodes, and then immediately starts shutting down nodes and removing them from the cluster. The shutting down and removing may happen before driver manages to connect to all nodes in the cluster. In particular, the driver didn't yet connect to the last bootstrapped node. Or it can even happen that the driver has connected, but the control connection is established to the first node, and the driver fetched topology from the first node when the first node didn't yet consider the last node to be normal. So the driver decides to close connection to the last node like this: ``` 22:34:03.159 DEBUG> [control connection] Removing host not found in peers metadata: <Host: 127.42.90.14:9042 datacenter1> ``` Eventually, at the end of the test, only the last node remains, all other nodes have been removed or stopped. But the driver does not have a connection to that last node. Fix this problem by ensuring that: - all nodes see each other as NORMAL, - the driver has connected to all nodes at the beginning of the test, before we start shutting down and removing nodes. Fixes scylladb/scylladb#16373 (cherry picked from commit `a68701ed4f`) Closes #17703	2024-03-12 13:43:21 +01:00
Kamil Braun	b567364af1	migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer Jumping to shard 0 during group 0 snapshot transfer is required because we take group 0 lock, only available on shard 0. But outside of Raft mode it only pessimizes performance unnecessarily, so don't do it.	2024-03-12 11:19:31 +01:00
Botond Dénes	3897d44893	repair: resolve start-up deadlock Repairs have to obtain a permit to the reader concurrency semaphore on each shard they have a presence on. This is prone to deadlocks: node1 node2 repair1_master (takes permit) repair1_follower (waits on permit) repair2_master (waits for permit) repair2_follower (takes permit) In lieu of strong central coordination, we solved this by making permits evictable: repair2 can evict repair1's permit so it can obtain one and make progress. This is not efficient as evicting a permit usually means discarding already done work, but it prevents the deadlocks. We recently discovered that there is a window when deadlocks can still happen. The permit is made evictable when the disk reader is created. This reader is an evictable one, which effectively makes the permit evictable. But the permit is obtained when the repair control structure -- repair meta -- is created. Between creating the repair meta and reading the first row from disk, the deadlock is still possible. And we know that what is possible, will happen (and did happen). Fix by making the permit evictable as soon as the repair meta is created. This is very clunky and we should have a better API for this (refs #17644), but for now we go with this simple patch, to make it easy to backport. Refs: #17644 Fixes: #17591 Closes #17646 (cherry picked from commit `c6e108a`) Backport notes: The fix above does not apply to 5.2, because on 5.2 the reader is created immediately when the repair-meta is created. So we don't need the game with a fake inactive read, we can just pause the already created reader in the repair-reader constructor. Closes #17730	2024-03-12 08:24:26 +02:00
Michał Chojnowski	54048e5613	sstables: fix a use-after-free in key_view::explode() key_view::explode() contains a blatant use-after-free: unless the input is already linearized, it returns a view to a local temporary buffer. This is rare, because partition keys are usually not large enough to be fragmented. But for a sufficiently large key, this bug causes a corrupted partition_key down the line. Fixes #17625 (cherry picked from commit `7a7b8972e5`) Closes #17725	2024-03-11 16:17:32 +02:00
Lakshmi Narayanan Sreethar	e8736ae431	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658 (cherry picked from commit `76f0d5e35b`) Closes #17694	2024-03-08 10:56:12 +02:00
Gleb Natapov	42d25f1911	raft_group0_client: assert that hold_read_apply_mutex is called on shard 0 group0 operations are valid on shard 0 only. Assert that. (cherry picked from commit `9847e272f9`)	2024-03-05 16:51:13 +01:00
Gleb Natapov	619f75d1de	migration_manager: fix indentation after the previous patch. (cherry picked from commit `77907b97f1`)	2024-03-05 16:50:24 +01:00
Gleb Natapov	6dd31dcade	messaging_service: process migration_request rpc on shard 0 Commit `0c376043eb` added access to group0 semaphore which can be done on shard0 only. Unlike all other group0 rpcs (which are already always forwarded to shard0), migration_request is not, since it is an rpc that was reused from non-raft days. The patch adds the missing jump to shard0 before executing the rpc. (cherry picked from commit `4a3c79625f`)	2024-03-05 16:49:23 +01:00
Gleb Natapov	dd65bf151b	migration_manager: take group0 lock during raft snapshot taking Group0 state machine access atomicity is guaranteed by a mutex in group0 client. Code that reads or writes the state needs to hold the lock. To transfer schema part of the snapshot we used existing "migration request" verb which did not follow the rule. Fix the code to take group0 lock before accessing schema in case the verb is called as part of group0 snapshot transfer. Fixes scylladb/scylladb#16821 (cherry picked from commit `0c376043eb`) Backport note: introduced missing `raft_group0_client::hold_read_apply_mutex`	2024-03-05 16:40:00 +01:00
Yaron Kaikov	f4a7804596	release: prepare for 5.2.16	2024-03-03 14:33:44 +02:00
Botond Dénes	ba373f83e4	Merge '[Backport 5.2] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk RPC calls lose information about the type of returned exception. Thus, if a table is dropped on receiver node, but it still exists on a sender node and sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks, if the exception was caused by table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime. Fixes: https://github.com/scylladb/scylladb/issues/17028. Fixes: https://github.com/scylladb/scylladb/issues/15370. Fixes: https://github.com/scylladb/scylladb/issues/15598. Closes #17528 * github.com:scylladb/scylladb: repair: handle no_such_column_family from remote node gracefully test: test drop table on receiver side during streaming streaming: fix indentation streaming: handle no_such_column_family from remote node gracefully repair: add methods to skip dropped table	2024-02-28 16:33:01 +02:00
Kamil Braun	d82c757323	Merge 'misc_services: fix data race from bad usage of get_next_version' from Piotr Dulikowski The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc. The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0. The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0. There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A: 1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted, 2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state, 3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state, 4. 
Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`, 5. Shard 1 resumes and stores version `v` in the global counter, 6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`. 7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4. Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by: 1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual, 2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value, 3. running a test which performs schema changes in a loop. The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway. Refs: #17493 Closes scylladb/scylladb#17499 * github.com:scylladb/scylladb: version_generator: check that get_next_version is called on shard 0 misc_services: fix data race from bad usage of get_next_version (cherry picked from commit `fd32e2ee10`)	2024-02-28 14:28:03 +01:00
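The seven-step interleaving suspected in the message above can be replayed deterministically. The sketch below is a single-threaded model of that sequence, not the gossiper code; all names are hypothetical, and the two shards are represented by explicit load and store steps on one counter.

```cpp
#include <cstdint>

// Hypothetical result of replaying the suspected interleaving.
struct race_outcome {
    std::int64_t max_version_seen_by_b; // max version node B observed in step 4
    std::int64_t final_update_version;  // version used for the update in step 6
    bool final_update_gossiped;         // would step 6's update ever reach B?
};

constexpr race_outcome replay_suspected_race(std::int64_t v) {
    std::int64_t counter = v - 1;            // global, unsynchronized counter

    std::int64_t shard1_register = counter;  // 1. shard 1 loads v - 1, is preempted
    counter = counter + 1;                   // 2. shard 0 issues version v
    counter = counter + 1;                   // 3. shard 0 issues version v + 1
    std::int64_t max_seen_by_b = counter;    // 4. gossip: B records max version v + 1
    counter = shard1_register + 1;           // 5. shard 1 resumes, stores v
    counter = counter + 1;                   // 6. shard 0 issues v + 1 again
    std::int64_t final_version = counter;

    // 7. gossip only sends states with a version larger than the max B has seen.
    return {max_seen_by_b, final_version, final_version > max_seen_by_b};
}

static_assert(replay_suspected_race(10).max_version_seen_by_b == 11);
static_assert(replay_suspected_race(10).final_update_version == 11);
static_assert(!replay_suspected_race(10).final_update_gossiped);
```

In this model the update from step 6 reuses version `v + 1`, so the "larger than previously observed" filter never forwards it, which is the outcome the message describes for node B.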
Aleksandra Martyniuk	78aeb990a6	repair: handle no_such_column_family from remote node gracefully If no_such_column_family is thrown on remote node, then repair operation fails as the type of exception cannot be determined. Use repair::with_table_drop_silenced in repair to continue operation if a table was dropped. (cherry picked from commit `cf36015591`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	23493bb342	test: test drop table on receiver side during streaming (cherry picked from commit `2ea5d9b623`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	d19afd7059	streaming: fix indentation (cherry picked from commit `b08f539427`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	4e200aa250	streaming: handle no_such_column_family from remote node gracefully If no_such_column_family is thrown on remote node, then streaming operation fails as the type of exception cannot be determined. Use repair::with_table_drop_silenced in streaming to continue operation if a table was dropped. (cherry picked from commit `219e1eda09`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	afca1142cd	repair: add methods to skip dropped table Schema propagation is async so one node can see the table while on the other node it is already dropped. So, if the nodes stream the table data, the latter node throws no_such_column_family. The exception is propagated to the other node, but its type is lost, so the operation fails on the other node. Add method which waits until all raft changes are applied and then checks whether given table exists. Add the function which uses the above to determine, whether the function failed because of dropped table (eg. on the remote node so the exact exception type is unknown). If so, the exception isn't rethrown. (cherry picked from commit `5202bb9d3c`)	2024-02-28 11:45:54 +01:00
Botond Dénes	ce1a422c9c	Merge '[Backport 5.2] sstables: close index_reader in has_partition_key' from Aleksandra Martyniuk If index_reader isn't closed before it is destroyed, then ongoing sstables reads won't be awaited and assertion will be triggered. Close index_reader in has_partition_key before destroying it. Fixes: https://github.com/scylladb/scylladb/issues/17232. Closes #17532 * github.com:scylladb/scylladb: test: add test to check if reader is closed sstables: close index_reader in has_partition_key	2024-02-27 16:12:17 +02:00
Aleksandra Martyniuk	296be93714	test: add test to check if reader is closed Add test to check if reader is closed in sstable::has_partition_key. (cherry picked from commit `4530be9e5b`)	2024-02-26 16:17:12 +01:00
Aleksandra Martyniuk	6feb802d54	sstables: close index_reader in has_partition_key If index_reader isn't closed before it is destroyed, then ongoing sstables reads won't be awaited and assertion will be triggered. Close index_reader in has_partition_key before destroying it. (cherry picked from commit `5227336a32`)	2024-02-26 16:17:12 +01:00
Avi Kivity	9c44bbce67	Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak Before this PR, writes to the previous CDC generations would always be rejected. After this PR, they will be accepted if the write's timestamp is greater than `now - generation_leeway`. This change was proposed around 3 years ago. The motivation was to improve user experience. If a client generates timestamps by itself and its clock is desynchronized with the clock of the node the client is connected to, there could be a period during generation switching when writes fail. We didn't consider this problem critical because the client could simply retry a failed write with a higher timestamp. Eventually, it would succeed. This approach is safe because these failed writes cannot have any side effects. However, it can be inconvenient. Writing to previous generations was proposed to improve it. The idea was rejected 3 years ago. Recently, it turned out that there is a case when the client cannot retry a write with the increased timestamp. It happens when a table uses CDC and LWT, which makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves writing to the CDC log table. If it fails, we get stuck. It's a major bug with an unknown perfect solution. Allowing writes to previous generations for `generation_leeway` is a probabilistic fix that should solve the problem in practice. Apart from this change, this PR adds tests for it and updates the documentation. This PR is sufficient to enable writes to the previous generations only in the gossiper-based topology. The Raft-based topology needs some adjustments in loading and cleaning CDC generations. These changes won't interfere with the changes introduced in this PR, so they are left for a follow-up. 
Fixes scylladb/scylladb#7251 Fixes scylladb/scylladb#15260 Closes scylladb/scylladb#17134 * github.com:scylladb/scylladb: docs: using-scylla: cdc: remove info about failing writes to old generations docs: dev: cdc: document writing to previous CDC generations test: add test_writes_to_previous_cdc_generations cdc: generation: allow increasing generation_leeway through error injection cdc: metadata: allow sending writes to the previous generations (cherry picked from commit `9bb4482ad0`) Backport note: replaced `servers_add` with `server_add` loop in tests replaced `error_injections_at_startup` (not implemented in 5.2) with `enable_injection` post-boot	2024-02-22 15:05:19 +01:00
Nadav Har'El	6a6115cd86	mv: fix missing view deletions in some cases of range tombstones For efficiency, if a base-table update generates many view updates that go to the same partition, they are collected as one mutation. If this mutation grows too big it can lead to memory exhaustion, so since commit `7d214800d0` we split the output mutation to mutations no longer than 100 rows (max_rows_for_view_updates) each. This patch fixes a bug where this split was done incorrectly when the update involved range tombstones, a bug which was discovered by a user in a real use case (#17117). Range tombstones are read in two parts, a beginning and an end, and the code could split the processing between these two parts, with the result that some of the range tombstones in the update could be missed - and the view could miss some deletions that happened in the base table. This patch fixes the code in two places to avoid breaking up the processing between range tombstones: 1. The counter "_op_count" that decides where to break the output mutation should only be incremented when adding rows to this output mutation. The existing code strangely incremented it on every read (!?) which resulted in the counter being incremented on every input fragment, and in particular could reach the limit 100 between two range tombstone pieces. 2. Moreover, the length of output was checked in the wrong place... The existing code could get to 100 rows, not check at that point, read the next input - half a range tombstone - and only then check that we reached 100 rows and stop. The fix is to calculate the number of rows in the right place - exactly when it's needed, not before the step. The first change needs more justification: The old code, that incremented _op_count on every input fragment and not just output fragments did not fit the stated goal of its introduction - to avoid large allocations. In one test it resulted in breaking up the output mutation to chunks of 25 rows instead of the intended 100 rows. 
But, maybe there was another goal, to stop the iteration after 100 input rows and avoid the possibility of stalls if there are no output rows? It turns out the answer is no - we don't need this _op_count increment to avoid stalls: The function build_some() uses `co_await on_results()` to run one step of processing one input fragment - and `co_await` always checks for preemption. I verified that indeed no stalls happen by using the existing test test_long_skipped_view_update_delete_with_timestamp. It generates a very long base update where all the view updates go to the same partition, but all but the last few updates don't generate any view updates. I confirmed that the fixed code loops over all these input rows without increasing _op_count and without generating any view update yet, but it does NOT stall. This patch also includes two tests reproducing this bug and confirming it's fixed, and also two additional tests for breaking up long deletions that I wanted to make sure don't fail after this patch (they don't). By the way, this fix would have also fixed issue #12297 - which we fixed a year ago in a different way. That issue happened when the code went through 100 input rows without generating any output rows, and incorrectly concluded that there's no view update to send. With this fix, the code no longer stops generating the view update just because it saw 100 input rows - it would have waited until it generated 100 output rows in the view update (or the input is really done). Fixes #17117 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17164 (cherry picked from commit `14315fcbc3`)	2024-02-22 15:36:58 +02:00
Avi Kivity	e0e46fbc50	Regenerate frozen toolchain For gnutls 3.8.3. Since Fedora 37 is end-of-life, pick the package from Fedora 38. libunistring needs to be updated to satisfy the dependency solver. Fixes #17285. Closes scylladb/scylladb#17287 Signed-off-by: Avi Kivity <avi@scylladb.com> Closes #17411	2024-02-20 12:34:46 +02:00
Wojciech Mitros	27ab3b1744	rust: update dependencies The currently used version of "rustix" dependency had a minor security vulnerability. This patch updates the corresponding crate. The update was performed using "cargo update" on "rustix" package and version "0.36.17" relevant package and the corresponding version. Refs #15772 Closes #17408	2024-02-19 22:12:50 +02:00
Michał Jadwiszczak	0d22471222	schema::describe: print 'synchronous_updates' only if it was specified While describing materialized view, print `synchronous_updates` option only if the tag is present in schema's extensions map. Previously if the key wasn't present, the default (false) value was printed. Fixes: #14924 Closes #14928 (cherry picked from commit `b92d47362f`)	2024-02-19 09:10:34 +02:00
Botond Dénes	422a731e85	query: do not kill unpaged queries when they reach the tombstone-limit The reason we introduced the tombstone-limit (query_tombstone_page_limit), was to allow paged queries to return incomplete/empty pages in the face of large tombstone spans. This works by cutting the page after the tombstone-limit amount of tombstones were processed. If the read is unpaged, it is killed instead. This was a mistake. First, it doesn't really make sense, the reason we introduced the tombstone limit, was to allow paged queries to process large tombstone-spans without timing out. It does not help unpaged queries. Furthermore, the tombstone-limit can kill internal queries done on behalf of user queries, because all our internal queries are unpaged. This can cause denial of service. So in this patch we disable the tombstone-limit for unpaged queries altogether, they are allowed to continue even after having processed the configured limit of tombstones. Fixes: #17241 Closes scylladb/scylladb#17242 (cherry picked from commit `f068d1a6fa`)	2024-02-15 12:50:30 +02:00
Yaron Kaikov	1fa8327504	release: prepare for 5.2.15	2024-02-11 14:17:31 +02:00
Pavel Emelyanov	f3c215aaa1	Update seastar submodule * seastar 29badd99...ad0f2d5d (1): > Merge "Slowdown IO scheduler based on dispatched/completed ratio" into branch-5.2 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-09 12:22:58 +03:00
Botond Dénes	94af1df2cf	Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El This PR fixes a bug where certain calls to the `mintimeuuid()` CQL function with large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs. The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function. Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw. Fixes #17035 Closes scylladb/scylladb#17073 * github.com:scylladb/scylladb: utils: replace assert() by on_internal_error() utils: add on_internal_error with common logger utils: add a timeuuid minimum, like we had maximum test/cql-pytest: tests for "date" type (cherry picked from commit `2a4b991772`)	2024-02-07 14:19:32 +02:00
Botond Dénes	9291eafd4a	Merge '[Backport 5.2] Raft snapshot fixes' from Kamil Braun Backports required to fix scylladb/scylladb#16683 in 5.2: - when creating first group 0 server, create a snapshot with non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0 - add an API to trigger Raft snapshot - use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot. Closes #17087 * github.com:scylladb/scylladb: test_raft_snapshot_request: fix flakiness (again) test_raft_snapshot_request: fix flakiness Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun raft: server: add workaround for scylladb/scylladb#12972 raft: Store snapshot update and truncate log atomically service: raft: force initial snapshot transfer in new cluster raft_sys_table_storage: give initial snapshot a non zero value	2024-02-07 11:55:20 +02:00
Michał Chojnowski	4546d0789f	row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted Commit `e81fc1f095` accidentally broke the control flow of row_cache::do_update(). Before that commit, the body of the loop was wrapped in a lambda. Thus, to break out of the loop, `return` was used. The bad commit removed the lambda, but didn't update the `return` accordingly. Thus, since the commit, the statement doesn't just break out of the loop as intended, but also skips the code after the loop, which updates `_prev_snapshot_pos` to reflect the work done by the loop. As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted, `do_update()` fails to update `_prev_snapshot_pos`. It remains in a stale state, until `do_update()` runs again and either finishes or is preempted outside of `updater`. If we read a partition processed by `do_update()` but not covered by `_prev_snapshot_pos`, we will read stale data (from the previous snapshot), which will be remembered in the cache as the current data. This results in outdated data being returned by the replica. (And perhaps in something worse if range tombstones are involved. I didn't investigate this possibility in depth). Note: for queries with CL>1, occurrences of this bug are likely to be hidden by reconciliation, because the reconciled query will only see stale data if the queried partition is affected by the bug on all queried replicas at the time of the query. Fixes #16759 Closes scylladb/scylladb#17138 (cherry picked from commit `ed98102c45`)	2024-02-04 14:46:57 +02:00
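The control-flow mistake described in this commit (a `return` that used to exit a lambda now exiting the whole function) can be shown with a small analogue. This is a sketch only: `Cache`, `do_update_buggy`, `do_update_fixed` and `preempt_after` are illustrative names, not the real `row_cache` code.

```python
class Cache:
    def __init__(self):
        # Last position known to have been processed by the update loop.
        self.prev_snapshot_pos = None

    def do_update_buggy(self, positions, preempt_after):
        done = None
        for i, pos in enumerate(positions):
            if i == preempt_after:
                return  # BUG: leaves the function, skipping the bookkeeping below
            done = pos
        self.prev_snapshot_pos = done

    def do_update_fixed(self, positions, preempt_after):
        done = None
        for i, pos in enumerate(positions):
            if i == preempt_after:
                break  # only leaves the loop
            done = pos
        # Runs even when the loop was preempted.
        self.prev_snapshot_pos = done
```

In the buggy variant the position stays stale after preemption even though two entries were processed; the fixed variant records the last processed entry.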
Kamil Braun	4e257c5c74	test_raft_snapshot_request: fix flakiness (again) At the end of the test, we wait until a restarted node receives a snapshot from the leader, and then verify that the log has been truncated. To check the snapshot, the test used the `system.raft_snapshots` table, while the log is stored in `system.raft`. Unfortunately, the two tables are not updated atomically when Raft persists a snapshot (scylladb/scylladb#9603). We first update `system.raft_snapshots`, then `system.raft` (see `raft_sys_table_storage::store_snapshot_descriptor`). So after the wait finishes, there's no guarantee the log has been truncated yet -- there's a race between the test's last check and Scylla doing that last delete. But we can check the snapshot using `system.raft` instead of `system.raft_snapshots`, as `system.raft` has the latest ID. And since `1640f83fdc`, storing that ID and truncating the log in `system.raft` happens atomically. Closes scylladb/scylladb#17106 (cherry picked from commit `c911bf1a33`)	2024-02-02 11:31:19 +01:00
Kamil Braun	08021dc906	test_raft_snapshot_request: fix flakiness Add workaround for scylladb/python-driver#295. Also an assert made at the end of the test was false, it is fixed with appropriate comment added. (cherry picked from commit `74bf60a8ca`)	2024-02-02 11:31:19 +01:00
Botond Dénes	db586145aa	Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun The persisted snapshot index may be 0 if the snapshot was created in an older version of Scylla, which means snapshot transfer won't be triggered to a bootstrapping node. Commands present in the log may not cover all schema changes --- group 0 might have been created through the upgrade procedure, on a cluster with existing schema. So a deployment with index=0 snapshot is broken and we need to fix it. We can use the new `raft::server::trigger_snapshot` API for that. Also add a test. Fixes scylladb/scylladb#16683 Closes scylladb/scylladb#17072 * github.com:scylladb/scylladb: test: add test for fixing a broken group 0 snapshot raft_group0: trigger snapshot if existing snapshot index is 0 (cherry picked from commit `181f68f248`) Backport note: test_raft_fix_broken_snapshot had to be removed because the "error injections enabled at startup" feature does not yet exist in 5.2.	2024-02-01 15:39:14 +01:00
Botond Dénes	ce0ed29ad6	Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb/scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (`ff386e7a44`) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) 
Closes scylladb/scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call (cherry picked from commit `d202d32f81`) Backport notes: - `has_output()` has a smaller condition in the backported version (because the condition was smaller in `poll_output()`) - `process_fsm_output` has a smaller body (because `io_fiber` had a smaller body) in the backported version - the HTTP API is only started if `raft_group_registry` is started	2024-02-01 15:38:51 +01:00
Kamil Braun	cbe8e05ef6	raft: server: add workaround for scylladb/scylladb#12972 When a node joins the cluster, it closes connections after learning topology information from other nodes, in order to reopen them with correct encryption, compression etc. In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot transfer. This was fixed in later versions by putting some order into the bootstrap process with `50e8ec77c6` but the fix was not backported due to many prerequisites and complexity. Raft automatically recovers from interrupted snapshot transfer by retrying it eventually, and everything works. However, an ERROR is reported due to that one failed snapshot transfer, and dtests don't like ERRORs -- they report the test case as failed if an ERROR happened in any node's logs even if the test passed otherwise. Here we apply a simple workaround to please dtests -- in this particular scenario, turn the ERROR into a WARN.	2024-02-01 14:29:56 +01:00
Michael Huang	84004ab83c	raft: Store snapshot update and truncate log atomically In case the snapshot update fails, we don't truncate commit log. Fixes scylladb/scylladb#9603 Closes scylladb/scylladb#15540 (cherry picked from commit `1640f83fdc`)	2024-02-01 13:10:05 +01:00
Kamil Braun	753e2d3c57	service: raft: force initial snapshot transfer in new cluster When we upgrade a cluster to use Raft, or perform manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In `91f609d065` we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state. To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering snapshot transfer and forcing them to pull group 0 state. It's unnecessary to do this for cluster which bootstraps with Raft enabled right away but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that. Extend Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - node would never join, waiting for schema). Fixes: #14066 Closes #14336 (cherry picked from commit `ff386e7a44`) Backport note: contrary to the claims above, it turns out that it is actually necessary to create snapshots in clusters which bootstrap with Raft, because of tombstones in current schema state expire hence applying schema mutations from old Raft log entries is not really idempotent. Snapshot transfer, which transfers group 0 history and state_ids, prevents old entries from applying schema mutations over latest schema state. Ref: scylladb/scylladb#16683	2024-01-31 17:00:10 +01:00
Gleb Natapov	42cf25bcbb	raft_sys_table_storage: give initial snapshot a non zero value We create a snapshot (config only, but still), but do not assign it any id. Because of that it is not loaded on start. We do want it to be loaded though since the state of group0 will not be re-created from the log on restart because the entries will have outdated id and will be skipped. As a result, the in-memory state machine state will not be restored. This is not a problem now since schema state is restored outside of raft code. Message-Id: <20230316112801.1004602-5-gleb@scylladb.com> (cherry picked from commit `a690070722`)	2024-01-31 16:50:42 +01:00
Aleksandra Martyniuk	f85375ff99	api: ignore future in task_manager_json::wait_task Before returning task status, wait_task waits for it to finish with done() method and calls get() on a resulting future. If requested task fails, an exception will be thrown and user will get internal server error instead of failed task status. Result of done() method is ignored. Fixes: #14914. (cherry picked from commit `ae67f5d47e`) Closes #16438	2024-01-30 10:54:33 +02:00
Aleksandra Martyniuk	35a0a459db	compaction: ignore future explicitly discard_result ignores only successful futures. Thus, if perform_compaction<regular_compaction_task_executor> call fails, a failure is considered abandoned, causing tests to fail. Explicitly ignore failed future. Fixes: #14971. Closes #15000 (cherry picked from commit `7a28cc60ec`) Closes #16441	2024-01-30 10:53:09 +02:00
Kamil Braun	784695e3ac	system_keyspace: use system memory for `system.raft` table `system.raft` was using the "user memory pool", i.e. the `dirty_memory_manager` for this table was set to `database::_dirty_memory_manager` (instead of `database::_system_dirty_memory_manager`). This meant that if a write workload caused memory pressure on the user memory pool, internal `system.raft` writes would have to wait for memtables of user tables to get flushed before the write would proceed. This was observed in SCT longevity tests which ran a heavy workload on the cluster and concurrently, schema changes (which underneath use the `system.raft` table). Raft would often get stuck waiting many seconds for user memtables to get flushed. More details in issue #15622. Experiments showed that moving Raft to system memory fixed this particular issue, bringing the waits to reasonable levels. Currently `system.raft` stores only one group, group 0, which is internally used for cluster metadata operations (schema and topology changes) -- so it makes sense to keep use system memory. In the future we'd like to have other groups, for strongly consistent tables. These groups should use the user memory pool. It means we won't be able to use `system.raft` for them -- we'll just have to use a separate table. Fixes: scylladb/scylladb#15622 Closes scylladb/scylladb#15972 (cherry picked from commit `f094e23d84`)	2024-01-25 17:59:49 +01:00
Avi Kivity	351d6d6531	Merge 'Invalidate prepared statements for views when their schema changes.' from Eliran Sinvani When a base table is altered, so are the views that might refer to the added column (which includes "SELECT " views and also views that might need to use this column for row lifetime (virtual columns)). However, the query processor implementation for views change notification was an empty function. Since views are tables, the query processor needs to at least treat them as such (and maybe in the future, do also some MV specific stuff). This commit adds a call to `on_update_column_family` from within `on_update_view`. The side effect true to this date is that prepared statements for views which changed due to a base table change will be invalidated. Fixes https://github.com/scylladb/scylladb/issues/16392 This series also adds a test which fails without this fix and passes when the fix is applied. Closes scylladb/scylladb#16897 github.com:scylladb/scylladb: Add test for mv prepared statements invalidation on base alter query processor: treat view changes at least as table changes (cherry picked from commit `5810396ba1`)	2024-01-23 21:31:47 +02:00
Takuya ASADA	5a05ccc2f8	scylla_raid_setup: fall back to other paths when UUID not available On some environments, such as VMware instances, /dev/disk/by-uuid/<UUID> is not available, so scylla_raid_setup will fail while mounting the volume. To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available paths to mount the disk and fall back to other paths like by-partuuid, by-id, by-path or just using the real device path like /dev/md0. To get the device path, and also to dump device status when UUID is not available, this will introduce the UdevInfo class, which communicates with udev using pyudev. Related #11359 Closes scylladb/scylladb#13803 (cherry picked from commit `58d94a54a3`) [syuu: regenerate tools/toolchain/image for new python3-pyudev package] Closes #16938	2024-01-23 16:05:28 +02:00
Botond Dénes	a1603bcb40	readers/multishard: evictable_reader::fast_forward_to(): close reader on exception When the reader is currently paused, it is resumed, fast-forwarded, then paused again. The fast forwarding part can throw and this will lead to destroying the reader without it being closed first. Add a try-catch surrounding this part in the code. Also mark `maybe_pause()` and `do_pause()` as noexcept, to make it clear why that part doesn't need to be in the try-catch. Fixes: #16606 Closes scylladb/scylladb#16630 (cherry picked from commit `204d3284fa`)	2024-01-16 16:57:28 +02:00
Michał Jadwiszczak	29da20b9e0	schema: add scylla specific options to schema description Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates` options to schema description. Fixes: #12389 Fixes: scylladb/scylla-enterprise#2979 Closes #16786	2024-01-16 09:56:08 +02:00
Botond Dénes	7c4ec8cf4b	Update tools/java submodule * tools/java 843096943e...a1eed2f381 (1): > Update JNA dependency to 5.14.0 Fixes: https://github.com/scylladb/scylla-tools-java/issues/371	2024-01-15 15:51:32 +02:00
Aleksandra Martyniuk	5def443cf0	tasks: keep task's children in list If std::vector is resized its iterators and references may get invalidated. While task_manager::task::impl::_children's iterators are avoided throughout the code, references to its elements are being used. Since children vector does not need random access to its elements, change its type to std::list<foreign_task_ptr>, which iterators and references aren't invalidated on element insertion. Fixes: #16380. Closes scylladb/scylladb#16381 (cherry picked from commit `9b9ea1193c`) Closes #16777	2024-01-15 15:38:00 +02:00
Anna Mikhlin	c0604a31fa	release: prepare for 5.2.14	2024-01-14 16:34:38 +02:00
Pavel Emelyanov	96bb602c62	Update seastar submodule (token bucket duration underflow) * seastar 43a1ce58...29badd99 (1): > shared_token_bucket: Fix duration_for() underflow Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-01-12 15:15:56 +03:00
Botond Dénes	d96440e8b6	Merge '[Backport 5.2] Validate compaction strategy options in prepare' from Aleksandra Martyniuk Table properties validation is performed on statement execution. Thus, when one attempts to create a table with invalid options, an incorrect command gets committed in Raft. But then its application fails, leading to a raft machine being stopped. Check table properties when create and alter statements are prepared. Fixes: https://github.com/scylladb/scylladb/issues/14710. Closes #16750 * github.com:scylladb/scylladb: cql3: statements: delete execute override cql3: statements: call check_restricted_table_properties in prepare cql3: statements: pass data_dictionary::database to check_restricted_table_properties	2024-01-12 10:56:54 +02:00
Aleksandra Martyniuk	ea41a811d6	cql3: statements: delete execute override Delete overridden create_table_statement::execute as it only calls its direct parent's (schema_altering_statement) execute method anyway. (cherry picked from commit `6c7eb7096e`)	2024-01-11 16:43:17 +01:00
Aleksandra Martyniuk	8b77fbc904	cql3: statements: call check_restricted_table_properties in prepare Table properties validation is performed on statement execution. Thus, when one attempts to create a table with invalid options, an incorrect command gets committed in Raft. But then its application fails, leading to a raft machine being stopped. Check table properties when create and alter statements are prepared. The error is no longer returned as an exceptional future, but it is thrown. Adjust the tests accordingly. (cherry picked from commit `60fdc44bce`)	2024-01-11 16:10:26 +01:00
Aleksandra Martyniuk	3ab3a2cc1b	cql3: statements: pass data_dictionary::database to check_restricted_table_properties Pass data_dictionary::database to check_restricted_table_properties as an argument instead of query_processor as the method will be called from a context which does not have access to query processor. (cherry picked from commit `ec98b182c8`)	2024-01-11 16:10:26 +01:00
Botond Dénes	7e9107cc97	Update tools/java submodule * tools/java 79fa02d8a3...843096943e (1): > build.xml: update io.airlift to 0.9 Fixes: scylladb/scylla-tools-java#374	2024-01-11 11:03:29 +02:00
Botond Dénes	abb7ae4309	Update ./tools/jmx submodule * tools/jmx f21550e...50909d6 (1): > scylla-apiclient: drop hk2-locator dependency Fixes: scylladb/scylla-jmx#231	2024-01-10 14:22:14 +02:00
Botond Dénes	2820c63734	Update tools/java submodule * tools/java d7ec9bf45f...79fa02d8a3 (2): > build.xml: update scylla-driver-core to 3.11.5.1 > treewide: update "guava" package Fixes: scylla-tools-java#365 Fixes: scylla-tools-java#343 Closes #16693	2024-01-10 08:19:43 +02:00
Nadav Har'El	ac0056f4bc	Merge 'Fix partition estimation with TWCS tables during streaming' from Raphael "Raph" Carvalho TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows. Turns out we had two problems in this area that leads to suboptimal bloom filters. 1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed. 2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count. For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong. Fixes https://github.com/scylladb/scylladb/issues/15704. Closes scylladb/scylladb#15938 * github.com:scylladb/scylladb: streaming: Improve partition estimation with TWCS streaming: Don't adjust partition estimate if segregation is postponed (cherry picked from commit `64d1d5cf62`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #16672	2024-01-08 09:06:43 +02:00
Calle Wilund	aaa25e1a78	Commitlog replayer: Range-check skip call Fixes #15269 If the segment being replayed is corrupted/truncated we can attempt skipping completely bogus byte amounts, which can cause an assert (i.e. crash) in file_data_source_impl. This is not a crash-level error, so ensure we range check the distance in the reader. v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will ensure we log the fact that things are broken Closes scylladb/scylladb#15270 (cherry picked from commit `6ffb482bf3`)	2024-01-05 09:19:45 +02:00
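A range-checked skip of the kind described here might look like the sketch below. It is an illustration under assumptions, not the commitlog replayer's code: `checked_skip` and its parameters are hypothetical, and counting the excess bytes as corrupt is an assumed choice (the commit itself notes that the amount it adds is "wrong").

```python
def checked_skip(position: int, file_size: int, requested: int, corrupt_size: int):
    """Clamp a skip to the bytes actually available in the segment.

    Returns (new_position, new_corrupt_size).
    """
    available = max(file_size - position, 0)
    if requested > available:
        # Assumption: record the excess so the corruption is logged later,
        # instead of asserting in the underlying data source.
        corrupt_size += requested - available
        requested = available
    return position + requested, corrupt_size
```

A bogus 50-byte skip with only 10 bytes left stops at the end of the file and reports 40 corrupt bytes; a valid skip passes through unchanged.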
Beni Peled	c57a0a7a46	release: prepare for 5.2.13	2024-01-03 17:48:59 +02:00
Botond Dénes	740ba3ac2a	tools/schema_loader: read_schema_table_mutation(): close the reader The reader used to read the sstables was not closed. This could sometimes trigger an abort(), because the reader was destroyed, without it being closed first. Why only sometimes? This is due to two factors: * read_mutation_from_flat_mutation_reader() - the method used to extract a mutation from the reader, uses consume(), which does not trigger `set_close_is_required()` (#16520). Due to this, the top-level combined reader did not complain when destroyed without close. * The combined reader closes underlying readers who have no more data for the current range. If the circumstances are just right, all underlying readers are closed, before the combined reader is destroyed. Looks like this is what happens most of the time. This bug was discovered in SCT testing. After fixing #16520, all invocations of `scylla-sstable`, which use this code would trigger the abort, without this patch. So no further testing is required. Fixes: #16519 Closes scylladb/scylladb#16521 (cherry picked from commit `da033343b7`)	2023-12-31 18:13:10 +02:00
Gleb Natapov	76c3dda640	storage_service: register schema version observer before joining group0 and starting gossiper The schema version is updated by group0, so if group0 starts before schema version observer is registered some updates may be missed. Since the observer is used to update node's gossiper state the gossiper may contain wrong schema version. Fix by registering the observer before starting group0 and even before starting gossiper to avoid a theoretical case that something may pull schema after start of gossiping and before the observer is registered. Fixes: #15078 Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com> (cherry picked from commit `d1654ccdda`)	2023-12-20 11:14:27 +01:00
Kamil Braun	287546923e	Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak Fixes #9405 `sync_point` API provided with incorrect sync point id might allocate crazy amount of memory and fail with `std::bad_alloc`. To fix this, we can check if the encoded sync point has been modified before decoding. We can achieve this by calculating a checksum before encoding, appending it to the encoded sync point, and comparing it with a checksum calculated in `db::hints::decode` before decoding. Closes #14534 * github.com:scylladb/scylladb: db: hints: add checksum to sync point encoding db: hints: add the version_size constant (cherry picked from commit `eb6202ef9c`) The only difference from the original merge commit is the include path of `xx_hasher.hh`. On branch 5.2, this file is in the root directory, not `utils`. Closes #16458	2023-12-19 17:39:50 +02:00
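The checksum-before-decode idea can be sketched as follows. This is not the `db::hints` encoding: the commit refers to `xx_hasher.hh`, whereas this sketch swaps in CRC32 from the Python standard library, and the base64 framing and the `encode`/`decode` names are assumptions made for illustration.

```python
import base64
import struct
import zlib

CHECKSUM_SIZE = 4  # bytes used by the CRC32 stand-in


def encode(payload: bytes) -> str:
    # Append a checksum of the payload before producing the textual form.
    checksum = struct.pack(">I", zlib.crc32(payload))
    return base64.b64encode(payload + checksum).decode("ascii")


def decode(text: str) -> bytes:
    raw = base64.b64decode(text, validate=True)
    if len(raw) < CHECKSUM_SIZE:
        raise ValueError("sync point too short")
    payload, stored = raw[:-CHECKSUM_SIZE], raw[-CHECKSUM_SIZE:]
    # Reject modified input before interpreting any length fields in it.
    if struct.pack(">I", zlib.crc32(payload)) != stored:
        raise ValueError("sync point checksum mismatch")
    return payload
```

Verifying the checksum first means a corrupted or hand-edited identifier is rejected before any size field inside it can drive a large allocation.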
Botond Dénes	c0dab523f9	Update tools/java submodule * tools/java e2aad6e3a0...d7ec9bf45f (1): > Merge "build: take care of old libthrift" from Piotr Grabowski Fixes: scylladb/scylla-tools-java#352 Closes #16464	2023-12-19 17:37:27 +02:00
Michael Huang	5499f7b5a8	cdc: use chunked_vector for topology_description entries Lists can grow very big. Let's use a chunked vector to prevent large contiguous allocations. Fixes: #15302. Closes scylladb/scylladb#15428 (cherry picked from commit `62a8a31be7`)	2023-12-19 13:43:23 +01:00
Piotr Grabowski	7055ac45d1	test: use more frequent reconnection policy The default reconnection policy in Python Driver is an exponential backoff (with jitter) policy, which starts at 1 second reconnection interval and ramps up to 600 seconds. This is a problem in tests (refs #15104), especially in tests that restart or replace nodes. In such a scenario, a node can be unavailable for an extended period of time and the driver will try to reconnect to it multiple times, eventually reaching very long reconnection interval values, exceeding the timeout of a test. Fix the issue by using an exponential reconnection policy with a maximum interval of 4 seconds. A smaller value was not chosen, as each retry clutters the logs with reconnection exception stack trace. Fixes #15104 Closes #15112 (cherry picked from commit `17e3e367ca`)	2023-12-19 13:43:23 +01:00
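The effect of capping the backoff can be seen by comparing delay schedules. The sketch below is a simplified model that ignores jitter and does not call the Python driver; `backoff_schedule` is a hypothetical helper, and only the 1 second start, the 600 second default cap and the 4 second test cap come from the commit message.

```python
def backoff_schedule(base_delay: float, max_delay: float, attempts: int) -> list:
    """Delays (in seconds) of a jitter-free exponential backoff capped at max_delay."""
    return [min(base_delay * 2 ** i, max_delay) for i in range(attempts)]


default_like = backoff_schedule(1.0, 600.0, 12)  # ramps up to the 600 s cap
test_like = backoff_schedule(1.0, 4.0, 12)       # never waits more than 4 s
```

Under this model the 600 second cap is reached on the eleventh attempt, so a node that stays down for a while can leave the driver waiting longer than a whole test timeout, whereas the 4 second cap keeps every retry short.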
Gleb Natapov	4ff29d1637	raft: drop assert in server_impl::apply_snapshot for a condition that may happen server_impl::apply_snapshot() assumes that it cannot receive a snapshot from the same host until the previous one is handled and usually this is true since a leader will not send another snapshot until it gets a response to the previous one. But it may happen that the snapshot sending RPC fails after the snapshot was sent, but before the reply is received, because of a connection disconnect. In this case the leader may send another snapshot and there is no guarantee that the previous one was already handled, so the assumption may break. Drop the assert that verifies the assumption and return an error in this case instead. Fixes: #15222 Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com> (cherry picked from commit `55f047f33f`)	2023-12-19 13:43:23 +01:00
Alexey Novikov	6bcf9e6631	When adding a duration field to a UDT, check whether this UDT is used in some clustering key Having values of the duration type is not allowed for clustering columns, because duration can't be ordered. This is correctly validated when creating a table but is not validated when we alter the type. Fixes #12913 Closes scylladb/scylladb#16022 (cherry picked from commit `bd73536b33`)	2023-12-19 06:58:41 -05:00
Takuya ASADA	74dd8f08e3	dist: fix local-fs.target dependency systemd man page says: systemd-fstab-generator(3) automatically adds dependencies of type Before= to all mount units that refer to local mount points for this target unit. So "Before=local-fs.target" is the correct dependency for local mount points, but we currently specify "After=local-fs.target", it should be fixed. Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target", since .mount units are not related to multi-user but depend on local filesystems. Fixes #8761 Closes scylladb/scylladb#15647 (cherry picked from commit `a23278308f`)	2023-12-19 13:15:00 +02:00
Botond Dénes	68507ed4d9	Merge '[Backport 5.2] Shard of shard repair task impl' from Aleksandra Martyniuk Shard id is logged twice in repair (once explicitly, once added by logger). Redundant occurrence is deleted. shard_repair_task_impl::id (which contains global repair shard) is renamed to avoid further confusion. Fixes: https://github.com/scylladb/scylladb/issues/12955 Closes #16439 * github.com:scylladb/scylladb: repair: rename shard_repair_task_impl::id repair: delete redundant shard id from logs	2023-12-19 10:28:57 +02:00
Botond Dénes	46a29e9a02	Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed. The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected. Fixes #6389. Closes #13150 * github.com:scylladb/scylladb: test/alternator: test concurrent TagResource / UntagResource db/tags: drop unsafe update_tags() utility function alternator: isolate concurrent modification to tags db/tags: add safe modify_tags() utility functions migration_manager: expose access to storage_proxy (cherry picked from commit `dba1d36aa6`) Closes #16453	2023-12-19 10:19:31 +02:00
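The isolation fix, doing the read and the write of the tags under one lock, can be modelled in a few lines. This is a sketch only: `TagStore` is a hypothetical in-memory stand-in, and only the `modify_tags()` name comes from the commit message; the real function operates on table schema, not a dict.

```python
import threading


class TagStore:
    def __init__(self):
        self._tags = {}
        self._lock = threading.Lock()

    def modify_tags(self, modifier):
        # Read, modify and write back while holding one lock, so concurrent
        # callers are serialized and no update is lost.
        with self._lock:
            tags = dict(self._tags)
            modifier(tags)
            self._tags = tags

    def snapshot(self):
        with self._lock:
            return dict(self._tags)


store = TagStore()
threads = [
    threading.Thread(target=store.modify_tags,
                     args=(lambda tags, i=i: tags.update({f"k{i}": "v"}),))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the read and the write were two separate steps, two callers could both read the same old tags and the later write would silently drop the earlier caller's tag.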
Botond Dénes	23fd6939eb	Merge '[Backport to 5.2] gossiper: mark_alive: use deferred_action to unmark pending' from Benny Halevy Backport the following patches to 5.2: - gossiper: mark_alive: enter background_msg gate (#14791) - gossiper: mark_alive: use deferred_action to unmark pending (#14839) Closes #16452 * github.com:scylladb/scylladb: gossiper: mark_alive: use deferred_action to unmark pending gossiper: mark_alive: enter background_msg gate	2023-12-19 09:06:37 +02:00
Botond Dénes	1cf499cfea	Update tools/java submodule * tools/java 80701efa8d...e2aad6e3a0 (2): > build: update logback dependency > build: update `netty` dependency Fixes: https://github.com/scylladb/scylla-tools-java/issues/363 Fixes: https://github.com/scylladb/scylla-tools-java/issues/364 Closes #16444	2023-12-18 18:19:20 +02:00
Nadav Har'El	91e05dc646	cql: fix SELECT toJson() or SELECT JSON of time column The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column of type "time" forgot to put the time string in quotes. The result was invalid JSON. This patch is a one-liner fixing this bug. This patch also removes the "xfail" marker from one xfailing test for this issue which now starts to pass. We also add a second test for this issue - the existing test was for "SELECT TOJSON(t)", and the second test shows that "SELECT JSON t" had exactly the same bug - and both are fixed by the same patch. We also had a test translated from Cassandra which exposed this bug, but that test continues to fail because of other bugs, so we just need to update the xfail string. The patch also fixes one C++ test, test/boost/json_cql_query_test.cc, which enshrined the wrong behavior - JSON output that isn't even valid JSON - and had to be fixed. Unlike the Python tests, the C++ test can't be run against Cassandra, and doesn't even run a JSON parser on the output, which explains how it came to enshrine wrong output instead of helping to discover the bug. Fixes #7988 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16121 (cherry picked from commit `8d040325ab`)	2023-12-18 18:19:20 +02:00
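Why the unquoted time value is invalid JSON can be checked with any JSON parser. The sketch below is illustrative: `render_row` is a hypothetical helper, and the sample time string is made up for the example, not taken from the tests in the commit.

```python
import json


def render_row(time_text: str, quote: bool) -> str:
    """Build a one-column JSON object; `quote` selects fixed vs. buggy output."""
    value = json.dumps(time_text) if quote else time_text
    return '{"t": ' + value + '}'


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


buggy = render_row("08:12:54.123456789", quote=False)
fixed = render_row("08:12:54.123456789", quote=True)
```

Running a parser over the output, as the Python tests do, catches the unquoted form immediately; a test that only compares strings can enshrine it.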
Benny Halevy	a2009c4a8c	gossiper: mark_alive: use deferred_action to unmark pending Make sure _pending_mark_alive_endpoints is unmarked in any case, including exceptions. Fixes #14839 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14840 (cherry picked from commit `1e7e2eeaee`)	2023-12-18 14:44:22 +02:00
Benny Halevy	999a6bfaae	gossiper: mark_alive: enter background_msg gate The function dispatches a background operation that must be waited on in stop(). Fixes scylladb/scylladb#14791 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `868e436901`)	2023-12-18 14:42:52 +02:00
Kefu Chai	faef786c88	reloc: strip.sh: always generate symbol list with posix format we compare the symbols lists of stripped ELF file ($orig.stripped) and that of the one including debugging symbols ($orig.debug) to get an ELF file which includes only the necessary bits as the debuginfo ($orig.minidebug). but we generate the symbol list of stripped ELF file using the sysv format, while generating the one from the unstripped file using posix format. the former always pads the symbol names with spaces so that their length is at least the same as the section name after we split the fields with "\|". that's why the diff includes the stuff we don't expect. and hence, we have tons of warnings like: ``` objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line ``` when using objcopy to filter the ELF file to keep only the symbols we are interested in. so, in this change * use the same format when dumping the symbols from unstripped ELF file * include the symbols in the text area -- the code, by checking "T" and "t" in the dumped symbols. this was achieved by matching the lines with "FUNC" before this change. * include the symbols in .init data section -- the global variables which are initialized at compile time. they could be also interesting when debugging an application. Fixes #15513 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#15514 (cherry picked from commit `50c937439b`)	2023-12-18 13:58:14 +02:00
Michał Chojnowski	7e9bdef8bb	row_cache: when the constructor fails, clear `_partitions` in the right allocator If the constructor of row_cache throws, `_partitions` is cleared in the wrong allocator, possibly causing allocator corruption. Fix that. Fixes #15632 Closes scylladb/scylladb#15633 (cherry picked from commit `330d221deb`)	2023-12-18 13:55:16 +02:00
Michael Huang	af38b255c8	cql3: Fix invalid JSON parsing for JSON objects with ASCII keys For JSON objects represented as map<ascii, int>, don't treat ASCII keys as a nested JSON string. We were doing that prior to the patch, which led to parsing errors. Included the error offset where JSON parsing failed for rjson::parse related functions to help identify parsing errors better. Fixes: #7949 Signed-off-by: Michael Huang <michaelhly@gmail.com> Closes scylladb/scylladb#15499 (cherry picked from commit `75109e9519`)	2023-12-18 13:45:57 +02:00
Kefu Chai	c4b699525a	sstables: throw at seeing invalid chunk_len before this change, when running into a zero chunk_len, scylla crashes with `assert(chunk_size != 0)`. but we can do better than printing a backtrace like: ``` scylla: sstables/compress.cc:158: void sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed. ``` so, in this change, a `malformed_sstable_exception` is thrown in place of an `assert()`, which is supposed to verify the programming invariants, not to identify a corrupted data file. Fixes #15265 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15264 (cherry picked from commit `1ed894170c`)	2023-12-18 13:29:02 +02:00
Nadav Har'El	3a24b8c435	sstables: stop warning when auto-snapshot leaves non-empty directory When a table is dropped, we delete its sstables, and finally try to delete the table's top-level directory with the rmdir system call. When the auto-snapshot feature is enabled (this is still Scylla's default), the snapshot will remain in that directory so it won't be empty and cannot be removed. Today, this results in a long, ugly and scary warning in the log: ``` WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored. ``` It is bad to log as a warning something which is completely normal - it happens every time a table is dropped with the perfectly valid (and even default) auto-snapshot mode. We should only log a warning if the deletion failed because of some unexpected reason. And in fact, this is exactly what the code tried to do - it does not log a warning if the rmdir failed with EEXIST. It even had a comment saying why it was doing this. But the problem is that in Linux, deleting a non-empty directory does not return EEXIST, it returns ENOTEMPTY... Posix actually allows both. So we need to check both, and this is the only change in this patch. To confirm that this patch works, edit test/cql-pytest/run.py and change auto-snapshot from 0 to 1, run test/alternator/run (for example) and see many "Directory not empty" warnings as above. With this patch, none of these warnings appear. Fixes #13538 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14557 (cherry picked from commit `edfb89ef65`)	2023-12-18 13:26:40 +02:00
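The "check both error codes" idea can be sketched in Python as follows. This is only an analogue of the approach described in the message, not the actual C++ patch; `remove_dir_quietly` is a hypothetical name.

```python
import errno
import os
import tempfile


def remove_dir_quietly(path):
    """Try to remove a directory.

    Returns True if it was removed, False if it was left in place because it
    is not empty (the expected case when a snapshot remains). Any other
    failure is re-raised so that it can still be reported.
    """
    try:
        os.rmdir(path)
        return True
    except OSError as e:
        # POSIX allows either code for a non-empty directory; Linux uses ENOTEMPTY.
        if e.errno in (errno.EEXIST, errno.ENOTEMPTY):
            return False
        raise


# Usage: a table directory that still contains a "snapshots" subdirectory.
table_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(table_dir, "snapshots"))
print(remove_dir_quietly(table_dir))  # directory is not empty, so it stays
```

Checking only `errno.EEXIST` here would let the `ENOTEMPTY` case fall through to the "unexpected failure" path, which is the behavior the commit describes as the source of the spurious warning.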
Kefu Chai	9e9a488da3	streaming: cast the progress to a float before formatting it before this change, we format a `long` using `{:f}`. fmtlib would throw an exception when actually formatting it. so, let's make the percentage a float before formatting it. Fixes #14587 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14588 (cherry picked from commit `1eb76d93b7`)	2023-12-18 13:16:58 +02:00
Aleksandra Martyniuk	614d15b9f6	repair: rename shard_repair_task_impl::id shard_repair_task_impl::id stores global repair id. To avoid confusion with the task id, the field is renamed to global_repair_id. (cherry picked from commit `d889a599e8`)	2023-12-18 12:08:00 +01:00
Aleksandra Martyniuk	fc2799096f	repair: delete redundant shard id from logs In repair, the shard id is logged twice. Delete the repeated occurrence. (cherry picked from commit `f7c88edec5`)	2023-12-18 12:03:26 +01:00
Petr Gusev	b9178bd853	hints: send_one_hint: extend the scope of file_send_gate holder The problem was that the holder in with_gate call was released too early. This happened before the possible call to on_hint_send_failure in then_wrapped. As a result, the effects of on_hint_send_failure (segment_replay_failed flag) were not visible in send_one_file after ctx_ptr->file_send_gate.close(), so we could decide that the segment was sent in full and delete it even if sending of some hints led to errors. Fixes #15110 (cherry picked from commit `9fd3df13a2`)	2023-12-18 13:03:23 +02:00
Kefu Chai	12aacea997	compound_compat: do not format an sstring with {:d} before this change, we format a sstring with "{:d}", fmtlib would throw `fmt::format_error` at runtime when formatting it. this is not expected. so, in this change, we just print the int8_t using `seastar::format()` in a single pass. and with the format specifier of `#02x` instead of adding the "0x" prefix manually. Fixes #14577 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14578 (cherry picked from commit `27d6ff36df`)	2023-12-18 12:47:16 +02:00
Kefu Chai	df30f66bfa	tools/scylla-sstable: dump column_desc as an object before this change, `scylla sstable dump-statistics` prints the "regular_columns" as a list of strings, like: ``` "regular_columns": [ "name", "clustering_order", "type_name", "org.apache.cassandra.db.marshal.UTF8Type", "name", "column_name_bytes", "type_name", "org.apache.cassandra.db.marshal.BytesType", "name", "kind", "type_name", "org.apache.cassandra.db.marshal.UTF8Type", "name", "position", "type_name", "org.apache.cassandra.db.marshal.Int32Type", "name", "type", "type_name", "org.apache.cassandra.db.marshal.UTF8Type" ] ``` but according to https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics, > $SERIALIZATION_HEADER_METADATA := { > "min_timestamp_base": Uint64, > "min_local_deletion_time_base": Uint64, > "min_ttl_base": Uint64, > "pk_type_name": String, > "clustering_key_types_names": [String, ...], > "static_columns": [$COLUMN_DESC, ...], > "regular_columns": [$COLUMN_DESC, ...], > } > > $COLUMN_DESC := { > "name": String, > "type_name": String > } "regular_columns" is supposed to be a list of "$COLUMN_DESC". the same applies to "static_columns". this schema makes sense, as each column should be considered as a single object which is composed of two properties. but we dump them like a list. so, in this change, we guard each visit() call of `json_dumper()` with `StartObject()` and `EndObject()` pair, so that each column is printed as an object. 
after the change, "regular_columns" are printed like: ``` "regular_columns": [ { "name": "clustering_order", "type_name": "org.apache.cassandra.db.marshal.UTF8Type" }, { "name": "column_name_bytes", "type_name": "org.apache.cassandra.db.marshal.BytesType" }, { "name": "kind", "type_name": "org.apache.cassandra.db.marshal.UTF8Type" }, { "name": "position", "type_name": "org.apache.cassandra.db.marshal.Int32Type" }, { "name": "type", "type_name": "org.apache.cassandra.db.marshal.UTF8Type" } ] ``` Fixes #15036 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15037 (cherry picked from commit `c82f1d2f57`)	2023-12-18 12:26:36 +02:00
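For readers who still have output in the old flat form, the relationship between the two shapes can be sketched in Python. `group_column_descs` is a hypothetical helper written for this illustration and is not part of `scylla sstable`.

```python
def group_column_descs(flat):
    """Convert ["name", N, "type_name", T, ...] into [{"name": N, "type_name": T}, ...]."""
    if len(flat) % 4 != 0:
        raise ValueError("expected groups of four strings")
    columns = []
    for i in range(0, len(flat), 4):
        key_a, name, key_b, type_name = flat[i:i + 4]
        if (key_a, key_b) != ("name", "type_name"):
            raise ValueError(f"unexpected keys at index {i}")
        columns.append({"name": name, "type_name": type_name})
    return columns


# The first two columns from the old-style output quoted above.
old_style = [
    "name", "clustering_order",
    "type_name", "org.apache.cassandra.db.marshal.UTF8Type",
    "name", "column_name_bytes",
    "type_name", "org.apache.cassandra.db.marshal.BytesType",
]
print(group_column_descs(old_style))
```

The grouped form matches the `$COLUMN_DESC` schema quoted in the message, where each column is a single object with two properties.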
Michał Sala	2427bda737	forward_service: introduce shutdown checks This commit introduces a new boolean flag, `shutdown`, to the forward_service, along with a corresponding shutdown method. It also adds checks throughout the forward_service to verify the value of the shutdown flag before retrying or invoking functions that might use the messaging service under the hood. The flag is set before messaging service shutdown, by invoking forward_service::shutdown in main. By checking the flag before each call that potentially involves the messaging service, we can ensure that the messaging service is still operational. If the flag is false, indicating that the messaging service is still active, we can proceed with the call. In the event that the messaging service is shutdown during the call, appropriate exceptions should be thrown somewhere down in called functions, avoiding potential hangs. This fix should resolve the issue where forward_service retries could block the shutdown. Fixes #12604 Closes #13922 (cherry picked from commit `e0855b1de2`)	2023-12-18 12:25:25 +02:00
Petr Gusev	27adf340ef	storage_proxy: mutation:: make frozen_mutation [[ref]] We had a redundant copy in receive_mutation_handler forward_fn callback. This frozen_mutation is dynamically allocated and can be arbitrarily large. Fixes: #12504 (cherry picked from commit `5adbb6cde2`)	2023-12-18 12:20:40 +02:00
Botond Dénes	5c33c9d6a6	Merge 'thrift: return address in listen_addresses() only after server is ready' from Marcin Maliszkiewicz This is used for readiness API: /storage_service/rpc_server and the fix prevents from returning 'true' prematurely. Some improvement for readiness was added in `a51529dd15` but thrift implementation wasn't fully done. Fixes https://github.com/scylladb/scylladb/issues/12376 Closes #13319 * github.com:scylladb/scylladb: thrift: return address in listen_addresses() only after server is ready thrift: simplify do_start_server() with seastar:async (cherry picked from commit `9a024f72c4`)	2023-12-18 12:20:40 +02:00
Kamil Braun	9aaaa66981	Merge 'cql3: fix a few misformatted printouts of column names in error messages' from Nadav Har'El Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name. Fixes #13657 Closes #13658 * github.com:scylladb/scylladb: cql3: fix printing of column_specification::name in some error messages cql3: fix printing of column_definition::name in some error messages (cherry picked from commit `a29b8cd02b`)	2023-12-18 09:55:37 +02:00
Avi Kivity	b21ec82894	Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map. get_endpoints is used here by gossiper and storage_service for iterations that may preempt instead of iterating directly over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so as to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators. Fixes #13899 Closes #13900 * github.com:scylladb/scylladb: storage_service: do not preempt while traversing endpoint_state_map gossiper: do not preempt while traversing endpoint_state_map (cherry picked from commit `d2d53fc1db`) Closes #16431	2023-12-18 09:35:42 +02:00
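The snapshot-then-iterate pattern from this series can be sketched with Python's asyncio. This is an analogue only: the real change is in C++ and concerns iterator invalidation on rehash, whereas Python raises a `RuntimeError` if a dict changes size during iteration. `for_each_endpoint` and the addresses are hypothetical.

```python
import asyncio


async def for_each_endpoint(endpoint_state_map, visit):
    """Visit every endpoint, tolerating map changes while visit() yields."""
    # Copy the keys by value before the first await, so the loop never holds
    # an iterator into the live map across a yield point.
    endpoints = list(endpoint_state_map.keys())
    visited = []
    for ep in endpoints:
        if ep not in endpoint_state_map:
            continue  # the endpoint was removed while we were yielding
        await visit(ep)
        visited.append(ep)
    return visited


async def demo():
    states = {"10.0.0.1": "NORMAL", "10.0.0.2": "NORMAL"}

    async def visit(ep):
        await asyncio.sleep(0)           # yield to the event loop
        states["10.0.0.3"] = "JOINING"   # the map changes during the loop

    return await for_each_endpoint(states, visit)


print(asyncio.run(demo()))
```

The trade-off is that endpoints added during the traversal are not visited in this pass, which is usually acceptable for periodic loops.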
Yaron Kaikov	5052890ae8	release: prepare for 5.2.12	2023-12-17 14:28:03 +02:00
Kefu Chai	0da3453f95	db: schema_tables: capture reference to temporary value by value `clustering_key_columns()` returns a range view, and `front()` returns the reference to its first element. so we cannot assume the availability of this reference after the expression is evaluated. to address this issue, let's capture the returned range by value, and keep the first element by reference. this also silences warning from GCC-13: ``` /home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference] 3654 \| const column_definition& first_view_ck = v->clustering_key_columns().front(); \| ^~~~~~~~~~~~~ /home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’ 3654 \| const column_definition& first_view_ck = v->clustering_key_columns().front(); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~ ``` Fixes #13720 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13721 (cherry picked from commit `135b4fd434`)	2023-12-15 13:55:57 +02:00
Benny Halevy	6d7b2bc02f	sstables: compressed_file_data_source_impl: get: throw malformed_sstable_exception on premature eof Currently, the reader might dereference a null pointer if the input stream reaches eof prematurely, and read_exactly returns an empty temporary_buffer. Detect this condition before dereferencing the buffer and throw sstables::malformed_sstable_exception. Fixes #13599 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #13600 (cherry picked from commit `77b70dbdb7`)	2023-12-15 13:54:42 +02:00
Wojciech Mitros	119c8279dd	rust: update wasmtime dependency The previous version of wasmtime had a vulnerability that possibly allowed causing undefined behavior when calling UDFs. We're directly updating to wasmtime 8.0.1, because the update only requires a slight code modification and the Wasm UDF feature is still experimental. As a result, we'll benefit from a number of new optimizations. Fixes #13807 Closes #13804 (cherry picked from commit `6bc16047ba`)	2023-12-15 13:54:42 +02:00
Michał Chojnowski	3af6dfe4ac	database: fix reads_memory_consumption for system semaphore The metric shows the opposite of what its name suggests. It shows available memory rather than consumed memory. Fix that. Fixes #13810 Closes #13811 (cherry picked from commit `0813fa1da0`)	2023-12-15 13:54:42 +02:00
Eliran Sinvani	0230798db3	use_statement: Convert an exception to a future exception The use statement execution code can throw if the keyspace doesn't exist, this can be a problem for code that will use execute in a fiber since the exception will break the fiber even if `then_wrapped` is used. Fixes #14449 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes scylladb/scylladb#14394 (cherry picked from commit `c5956957f3`)	2023-12-15 13:54:42 +02:00
Botond Dénes	64503a7137	Merge 'mutation_query: properly send range tombstones in reverse queries' from Michał Chojnowski reconcilable_result_builder passes range tombstone changes to _rt_assembler using table schema, not query schema. This means that a tombstone with bounds (a; b), where a < b in query schema but a > b in table schema, will not be emitted from mutation_query. This is a very serious bug, because it means that such tombstones in reverse queries are not reconciled with data from other replicas. If any queried replica has a row, but not the range tombstone which deleted the row, the reconciled result will contain the deleted row. In particular, range deletes performed while a replica is down will not later be visible to reverse queries which select this replica, regardless of the consistency level. As far as I can see, this doesn't result in any persistent data loss. Only in that some data might appear resurrected to reverse queries, until the relevant range tombstone is fully repaired. This series fixes the bug and adds a minimal reproducer test. Fixes #10598 Closes scylladb/scylladb#16003 * github.com:scylladb/scylladb: mutation_query_test: test that range tombstones are sent in reverse queries mutation_query: properly send range tombstones in reverse queries (cherry picked from commit `65e42e4166`)	2023-12-14 12:53:07 +02:00
Yaron Kaikov	b013877629	build_docker.sh: Upgrade package during creation and remove sshd service When scanning our latest docker image using `trivy` (command: `trivy image docker.io/scylladb/scylla-nightly:latest`), it shows we have OS packages which are out of date. Also removing `openssh-server` and `openssh-client` since we don't use it for our docker images Fixes: https://github.com/scylladb/scylladb/issues/16222 Closes scylladb/scylladb#16224 (cherry picked from commit `7ce6962141`) Closes #16360	2023-12-11 10:57:16 +02:00
Botond Dénes	33d2da94ab	reader_concurrency_semaphore: execution_loop(): trigger admission check when _ready_list is empty The execution loop consumes permits from the _ready_list and executes them. The _ready_list usually contains a single permit. When the _ready_list is not empty, new permits are queued until it becomes empty. The execution loop relies on admission checks triggered by the read releasing resources, to bring in any queued read into the _ready_list, while it is executing the current read. But in some cases the current read might not free any resources and thus fail to trigger an admission check and the currently queued permits will sit in the queue until another source triggers an admission check. I don't yet know how this situation can occur, if at all, but it is reproducible with a simple unit test, so it is best to cover this corner-case in the off-chance it happens in the wild. Add an explicit admission check to the execution loop, after the _ready_list is exhausted, to make sure any waiters that can be admitted with an empty _ready_list are admitted immediately and execution continues. Fixes: #13540 Closes #13541 (cherry picked from commit `b790f14456`)	2023-12-07 16:04:55 +02:00
Paweł Zakrzewski	dac69be4a4	auth: fix error message when consistency level is not met Propagate `exceptions::unavailable_exception` error message to the client such as cqlsh. Fixes #2339 (cherry picked from commit `400aa2e932`)	2023-12-07 14:49:47 +02:00
Botond Dénes	763e583cf2	Merge 'row_cache: abort on exteral_updater::execute errors' from Benny Halevy Currently the cache updaters aren't exception safe yet they are intended to be. Instead of allowing exceptions from `external_updater::execute` to escape `row_cache::update`, abort using `on_fatal_internal_error`. Future changes should harden all `execute` implementations to effectively make them `noexcept`, then the pure virtual definition can be made `noexcept` to cement that. Fixes scylladb/scylladb#15576 Closes scylladb/scylladb#15577 * github.com:scylladb/scylladb: row_cache: abort on exteral_updater::execute errors row_cache: do_update: simplify _prev_snapshot_pos setup (cherry picked from commit `4a0f16474f`) Closes scylladb/scylladb#16256	2023-12-07 09:16:42 +02:00
Nadav Har'El	b331b4a4bb	Backport fixes for nodetool commands with Alternator GSI in the database Fixes #16153 * java e716e1bd1d...80701efa8d (1): > NodeProbe: allow addressing table name with colon in it /home/nyh/scylla/tools$ git submodule summary jmx \| cat * jmx bc4f8ea...f21550e (3): > ColumnFamilyStore: only quote table names if necessary > APIBuilder: allow quoted scope names > ColumnFamilyStore: don't fail if there is a table with ":" in its name Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #16296	2023-12-06 10:48:49 +02:00
Anna Stuchlik	d9448a298f	doc: fix rollback in the 4.6-to-5.0 upgrade guide This commit fixes the rollback procedure in the 4.6-to-5.0 upgrade guide: - The "Restore system tables" step is removed. - The "Restore the configuration file" command is fixed. - The "Gracefully shutdown ScyllaDB" command is fixed. In addition, there are the following updates to be in sync with the tests: - The "Backup the configuration file" step is extended to include a command to backup the packages. - The Rollback procedure is extended to restore the backup packages. - The Reinstallation section is fixed for RHEL. Refs https://github.com/scylladb/scylladb/issues/11907 This commit must be backported to branch-5.4, branch-5.2, and branch-5.1 Closes scylladb/scylladb#16155 (cherry picked from commit `1e80bdb440`)	2023-12-05 15:10:21 +02:00
Anna Stuchlik	a82fd96b6a	doc: fix rollback in the 5.0-to-5.1 upgrade guide This commit fixes the rollback procedure in the 5.0-to-5.1 upgrade guide: - The "Restore system tables" step is removed. - The "Restore the configuration file" command is fixed. - The "Gracefully shutdown ScyllaDB" command is fixed. In addition, there are the following updates to be in sync with the tests: - The "Backup the configuration file" step is extended to include a command to backup the packages. - The Rollback procedure is extended to restore the backup packages. - The Reinstallation section is fixed for RHEL. Also, I've removed the rollback section for images, as it's not correct or relevant. Refs https://github.com/scylladb/scylladb/issues/11907 This commit must be backported to branch-5.4, branch-5.2, and branch-5.1 Closes scylladb/scylladb#16154 (cherry picked from commit `7ad0b92559`)	2023-12-05 15:08:25 +02:00
Anna Stuchlik	ae79fb9ce0	doc: fix rollback in the 5.1-to-5.2 upgrade guide This commit fixes the rollback procedure in the 5.1-to-5.2 upgrade guide: - The "Restore system tables" step is removed. - The "Restore the configuration file" command is fixed. - The "Gracefully shutdown ScyllaDB" command is fixed. In addition, there are the following updates to be in sync with the tests: - The "Backup the configuration file" step is extended to include a command to backup the packages. - The Rollback procedure is extended to restore the backup packages. - The Reinstallation section is fixed for RHEL. Also, I've removed the rollback section for images, as it's not correct or relevant. Refs https://github.com/scylladb/scylladb/issues/11907 This commit must be backported to branch-5.4 and branch-5.2. Closes scylladb/scylladb#16152 (cherry picked from commit `91cddb606f`)	2023-12-05 14:58:21 +02:00
Pavel Emelyanov	d83f4b9240	Update seastar submodule * seastar eda297fc...43a1ce58 (2): > io_queue: Add iogroup label to metrics > io_queue: Remove ioshard metrics label refs: scylladb/seastar#1591 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-12-05 10:46:07 +03:00
Raphael S. Carvalho	1b8c078cab	test: Fix sporadic failures of database_test database_test is failing sporadically and the cause was traced back to commit `e3e7c3c7e5`. The commit forces a subset of tests in database_test, to run once for each of predefined x_log2_compaction_group settings. That causes two problems: 1) test becomes 240% slower in dev mode. 2) queries on system.auth are timing out, and the reason is a small table being spread across hundreds of compaction groups in each shard. so to satisfy a range scan, there will be multiple hops, making the overhead huge. additionally, the compaction group aware sstable set is not merged yet. so even point queries will unnecessarily scan through all the groups. Fixes #13660. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #13851 (cherry picked from commit `a7ceb987f5`)	2023-11-30 17:31:07 +02:00
Benny Halevy	1592a84b80	task_manager: module: make_task: enter gate when the task is created Passing the gate_closed_exception to the task promise in start() ends up with abandoned exception since no-one is waiting for it. Instead, enter the gate when the task is made so it will fail make_task if the gate is already closed. Fixes scylladb/scylladb#15211 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `f9a7635390`)	2023-11-30 17:16:57 +02:00
Michał Chojnowski	bfeadae1bd	position_in_partition: make operator= exception-safe The copy assignment operator of _ck can throw after _type and _bound_weight have already been changed. This leaves position_in_partition in an inconsistent state, potentially leading to various weird symptoms. The problem was witnessed by test_exception_safety_of_reads. Specifically: in cache_flat_mutation_reader::add_to_buffer, which requires the assignment to _lower_bound to be exception-safe. The easy fix is to perform the only potentially-throwing step first. Fixes #15822 Closes scylladb/scylladb#15864 (cherry picked from commit `93ea3d41d8`)	2023-11-30 15:01:22 +02:00
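The "perform the only potentially-throwing step first" fix can be sketched in Python. The class and field names below are hypothetical stand-ins that mirror the description, not the actual `position_in_partition` code.

```python
import copy


class Position:
    """Hypothetical stand-in with three fields, one of which is costly to copy."""

    def __init__(self, type_, bound_weight, ck):
        self.type_ = type_
        self.bound_weight = bound_weight
        self.ck = ck

    def assign(self, other):
        # Copying the key is the only step that can fail, so do it first.
        new_ck = copy.deepcopy(other.ck)
        # From here on nothing can raise; the object changes all-or-nothing.
        self.type_ = other.type_
        self.bound_weight = other.bound_weight
        self.ck = new_ck


class FailingKey:
    """A key whose copy always fails, simulating an allocation failure."""

    def __deepcopy__(self, memo):
        raise MemoryError("simulated allocation failure")


target = Position("clustered", 0, ["a"])
source = Position("range_tombstone", 1, FailingKey())
try:
    target.assign(source)
except MemoryError:
    pass
print(target.type_, target.bound_weight, target.ck)  # still the original values
```

If the key were copied last, the failed assignment would leave `type_` and `bound_weight` already overwritten, which is the inconsistent state the commit message describes.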
Avi Kivity	2c219a65f8	Update seastar submodule (spins on epoll) * seastar 45f4102428...eda297fcb5 (1): > epoll: Avoid spinning on aborted connections Fixes #12774 Fixes #7753 Fixes #13337	2023-11-30 14:09:22 +02:00
Piotr Grabowski	7054b1ab1e	install-dependencies.sh: update node_exporter to 1.7.0 Update node_exporter to 1.7.0. The previous version (1.6.1) was flagged by security scanners (such as Trivy) with HIGH-severity CVE-2023-39325. 1.7.0 release fixed that problem. [Botond: regenerate frozen toolchain] Fixes #16085 Closes scylladb/scylladb#16086 Closes scylladb/scylladb#16090 (cherry picked from commit `321459ec51`) [avi: regenerate frozen toolchain]	2023-11-27 18:17:38 +00:00
Anna Mikhlin	a65838ee9c	re-spin: 5.2.11	2023-11-26 16:17:58 +02:00
Botond Dénes	68faf18ad9	Update ./tools/jmx and ./tools/java submodules * tools/jmx 88d9bdc...bc4f8ea (1): > Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski * tools/java f8f556d802...e716e1bd1d (1): > Merge 'build: update several dependencies' from Piotr Grabowski Update build dependencies which were flagged by security scanners. Refs: scylladb/scylla-jmx#220 Refs: scylladb/scylla-tools-java#351 Closes #16150	2023-11-23 15:29:00 +02:00
Beni Peled	44d1b55253	release: prepare for 5.2.11	2023-11-22 14:22:13 +02:00
Tomasz Grabiec	bfd8401477	api, storage_service: Recalculate table digests on relocal_schema api call Currently, the API call recalculates only per-node schema version. To workaround issues like #4485 we want to recalculate per-table digests. One way to do that is to restart the node, but that's slow and has impact on availability. Use like this: curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema Fixes #15380 Closes #15381 (cherry picked from commit `c27d212f4b`)	2023-11-21 01:29:28 +01:00
Botond Dénes	e31f2224f5	migration_manager: also reload schema on enabling digest_insensitive_to_expiry Currently, when said feature is enabled, we recalculate the schema digest. But this feature also influences how table versions are calculated, so it has to trigger a recalculation of all table versions, so that we can guarantee correct versions. Before, this used to happen by happy accident. Another feature -- table_digest_insensitive_to_expiry -- used to take care of this, by triggering a table version recalculation. However this feature only takes effect if digest_insensitive_to_expiry is also enabled. This used to be the case incidentally, by the time the reload triggered by table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was already enabled. But this was not guaranteed whatsoever and as we've recently seen, any change to the feature list, which changes the order in which features are enabled, can cause this intricate balance to break. This patch makes digest_insensitive_to_expiry also kick off a schema reload, to eliminate our dependence on (unguaranteed) feature order, and to guarantee that table schemas have a correct version after all features are enabled. In fact, all schema feature notification handlers now kick off a full schema reload, to ensure bugs like this don't creep in, in the future. Fixes: #16004 Closes scylladb/scylladb#16013 (cherry picked from commit `22381441b0`)	2023-11-21 01:29:28 +01:00
Kamil Braun	4101c8beab	schema_tables: remove default value for `reload` in `merge_schema` To avoid bugs like the one fixed in the previous commit. (cherry picked from commit `4376854473`)	2023-11-21 01:29:28 +01:00
Kamil Braun	c994ed2057	schema_tables: pass `reload` flag when calling `merge_schema` cross-shard In `0c86abab4d` `merge_schema` obtained a new flag, `reload`. Unfortunately, the flag was assigned a default value, which I think is almost always a bad idea, and indeed it was in this case. When `merge_schema` is called on a shard other than 0, it recursively calls itself on shard 0. That recursive call forgot to pass the `reload` flag. Fix this. (cherry picked from commit `48164e1d09`)	2023-11-21 01:29:28 +01:00
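The hazard of a defaulted flag on a self-forwarding function, and the effect of removing the default, can be sketched in Python. The function below is a hypothetical stand-in that only borrows the names from the message; it is not the real `merge_schema`.

```python
def merge_schema(shard, mutations, *, reload):
    """Hypothetical stand-in: work is always forwarded to shard 0.

    `reload` is keyword-only and has no default, so a call site that forgets
    to forward it fails with a TypeError instead of silently using a default.
    """
    if shard != 0:
        # The forwarding call must pass the flag along explicitly.
        return merge_schema(0, mutations, reload=reload)
    return {"shard": shard, "reload": reload, "applied": len(mutations)}


print(merge_schema(3, ["m1", "m2"], reload=True))
```

With a default such as `reload=False`, a forwarding call written as `merge_schema(0, mutations)` would still run and quietly drop the caller's choice, which is the kind of bug described above.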
Avi Kivity	40eed1f1c5	Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs schema_ptr of exact schema version requested by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if request flow has low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d` (5.2.0), it is more likely to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change introduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in c2ba94dc39e4add9db213751295fb17b95e6b962. 
Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485. Manually tested using ccm on cluster upgrade scenarios and node restarts. Closes #14441 * github.com:scylladb/scylladb: test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled schema_mutations, migration_manager: Ignore empty partitions in per-table digest migration_manager, schema_tables: Implement migration_manager::reload_schema() schema_tables: Avoid crashing when table selector has only one kind of tables (cherry picked from commit `cf81eef370`)	2023-11-21 01:29:28 +01:00
Gleb Natapov	f233c8a9e4	database: fix do_apply_many() to handle empty array of mutations Currently the code will assert because cl pointer will be null and it will be null because there is no mutations to initialize it from. Message-Id: <20230212144837.2276080-3-gleb@scylladb.com> (cherry picked from commit `941407b905`) Backport needed by #4485.	2023-11-21 01:29:17 +01:00
Botond Dénes	0f3e31975d	api/storage_service: start/stop native transport in the statement sg Currently, it is started/stopped in the streaming/maintenance sg, which is what the API itself runs in. Starting the native transport in the streaming sg, will lead to severely degraded performance, as the streaming sg has significantly less CPU/disk shares and reader concurrency semaphore resources. Furthermore, it will lead to multi-paged reads possibly switching between scheduling groups mid-way, triggering an internal error. To fix, use `with_scheduling_group()` for both starting and stopping native transport. Technically, it is only strictly necessary for starting, but I added it for stop as well for consistency. Also apply the same treatment to RPC (Thrift). Although no one uses it, best to fix it, just to be on the safe side. I think we need a more systematic approach for solving this once and for all, like passing the scheduling group to the protocol server and have it switch to it internally. This allows the server to always run on the correct scheduling group, not depending on the caller to remember using it. However, I think this is best done in a follow-up, to keep this critical patch small and easily backportable. Fixes: #15485 Closes scylladb/scylladb#16019 (cherry picked from commit `dfd7981fa7`)	2023-11-20 20:00:56 +02:00
Takuya ASADA	c98b22afce	scylla_post_install.sh: detect RHEL correctly $ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL itself. To detect RHEL correctly, we also need to check $ID = "rhel". Fixes #16040 Closes scylladb/scylladb#16041 (cherry picked from commit `338a9492c9`)	2023-11-20 19:36:22 +02:00
Marcin Maliszkiewicz	900754d377	db: view: run local materialized view mutations on a separate smp service group When a base write triggers an mv write and it needs to be sent to another shard, it used the same service group and we could end up with a deadlock. This fix also affects alternator's secondary indexes. Testing was done using a (yet) not committed framework for easy alternator performance testing: https://github.com/scylladb/scylladb/pull/13121. I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and then ran: ./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \ --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \ --duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000 Without the patch, when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests), after a couple of seconds scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb: p seastar::get_smp_service_groups_semaphore(2,0)._count $1 = 0 With the patch I wasn't able to observe the problem, even with 2x concurrency. I was able to make the process hang with 10x concurrency but I think it's hitting a different limit as there wasn't any depleted smp service group semaphore and it was happening also on non mv loads. Fixes https://github.com/scylladb/scylladb/issues/15844 Closes scylladb/scylladb#15845 (cherry picked from commit `020a9c931b`)	2023-11-19 18:54:46 +02:00
Botond Dénes	fbb356aa88	repair/repair.cc: do_repair_ranges(): prevent stalls when skipping ranges We have observed do_repair_ranges() receiving tens of thousands of ranges to repair on occasion. do_repair_ranges() repairs all ranges in parallel, with parallel_for_each(). This is normally fine, as the lambda inside parallel_for_each() takes a semaphore and this will result in limited concurrency. However, in some instances, it is possible that most of these ranges are skipped. In this case the lambda will become synchronous, only logging a message. This can cause stalls because there are no opportunities to yield. Solve this by adding an explicit yield. Fixes: #14330 Closes scylladb/scylladb#15879 (cherry picked from commit `90a8489809`)	2023-11-08 21:10:30 +02:00
Michał Jadwiszczak	e8871c02a1	cql3:statements:describe_statement: check pointer to UDF/UDA While looking for specific UDF/UDA, result of `functions::functions::find()` needs to be filtered out based on function's type. Fixes: #14360 (cherry picked from commit `d498451cdf`)	2023-11-08 20:16:41 +02:00
Pavel Emelyanov	f76ba217e7	Merge 'api: failure_detector: invoke on shard 0' from Kamil Braun These APIs may return stale or simply incorrect data on shards other than 0. Newer versions of Scylla are better at maintaining cross-shard consistency, but we need a simple fix that can be backported to older versions easily and without risk; this is the fix. Add a simple test to check that the `failure_detector/endpoints` API returns nonzero generation. Fixes: scylladb/scylladb#15816 Closes scylladb/scylladb#15970 * github.com:scylladb/scylladb: test: rest_api: test that generation is nonzero in `failure_detector/endpoints` api: failure_detector: fix indentation api: failure_detector: invoke on shard 0 (cherry picked from commit `9443253f3d`)	2023-11-07 15:12:12 +01:00
Botond Dénes	17e4d535db	test/cql-pytest/nodetool.py: no_autocompaction_context: use the correct API This `with` context is supposed to disable, then re-enable autocompaction for the given keyspaces, but it used the wrong API for it, it used the column_family/autocompaction API, which operates on column families, not keyspaces. This oversight led to a silent failure because the code didn't check the result of the request. Both are fixed in this patch: * switch to use `storage_service/auto_compaction/{keyspace}` endpoint * check the result of the API calls and report errors as exceptions Fixes: #13553 Closes #13568 (cherry picked from commit `66ee73641e`)	2023-11-07 13:59:01 +02:00
Aleksandra Martyniuk	75b792e260	repair: release resources of shard_repair_task_impl Before integration with task manager the state of one shard repair was kept in repair_info. repair_info object was destroyed immediately after shard repair was finished. In the integration process repair_info's fields were moved to shard_repair_task_impl as the two served similar purposes. However, shard_repair_task_impl isn't immediately destroyed, but is kept in task manager for task_ttl seconds after it's complete. Thus, some of repair_info's fields have their lifetime prolonged, which makes the repair state change delayed. Release shard_repair_task_impl resources immediately after shard repair is finished. Fixes: #15505. (cherry picked from commit `0474e150a9`) Closes #15875	2023-11-07 09:40:05 +02:00
Tomasz Grabiec	573ef87245	Merge ' tool/scylla-sstable: more flexibility in obtaining the schema' from Botond Dénes scylla-sstable currently has two ways to obtain the schema: * via a `schema.cql` file. * load schema definition from memory (only works for system tables). This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates from. Yet in many cases the sstable is inspected on the same host where it originates from. In these cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file. This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool checks the usual locations of the scylla data dir, the scylla configuration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override. If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong. A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes. This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. 
I don't think this will break anybody's workflow as this tool is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change. Example: ``` $ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db {"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}} ``` As seen above, subdirectories like quarantine, staging etc are also supported. Fixes: https://github.com/scylladb/scylladb/issues/10126 Closes #13448 * github.com:scylladb/scylladb: test/cql-pytest: test_tools.py: add tests for schema loading test/cql-pytest: add no_autocompaction_context docs: scylla-sstable.rst: remove accidentally added copy-pasta docs: scylla-sstable.rst: remove paragraph with schema limitations docs: scylla-sstable.rst: update schema section test/cql-pytest: nodetool.py: add flush_keyspace() tools/scylla-sstable: reform schema loading mechanism tools/schema_loader: add load_schema_from_schema_tables() db/schema_tables: expose types schema (cherry picked from commit `952b455310`) Closes #15386	2023-11-02 17:25:18 +02:00
Beni Peled	454e5a7110	release: prepare for 5.2.10	2023-11-02 15:08:11 +00:00
Avi Kivity	9967c0bda4	Update tools/python3 submodule (tar file timestamps) * tools/python3 cf7030a...6ad2e5a (1): > create-relocatable-package.py: fix timestamp of executable files Fixes #13415.	2023-11-02 12:37:09 +01:00
Botond Dénes	48509c5c00	Merge '[Backport 5.2] properly update storage service after schema changes' from Benny Halevy This is a backport of https://github.com/scylladb/scylladb/pull/14158 to branch 5.2 Closes #15872 * github.com:scylladb/scylladb: migration_notifier: get schema_ptr by value migration_manager: propagate listener notification exceptions storage_service: keyspace_changed: execute only on shard 0 database: modify_keyspace_on_all_shards: execute func first on shard 0 database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards database: add modify_keyspace_on_all_shards schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace database: create_keyspace_on_all_shards database: update_keyspace_on_all_shards database: drop_keyspace_on_all_shards	2023-10-31 10:27:08 +02:00
Botond Dénes	d606e9bfa2	Merge '[branch-5.2] Enable incremental compaction on off-strategy' from Raphael "Raph" Carvalho Off-strategy suffers from a 100% space overhead, as it adopted a sort of all or nothing approach. Meaning all input sstables, living in maintenance set, are kept alive until they're all reshaped according to the strategy criteria. Input sstables in off-strategy are very likely to be mostly disjoint, so it can greatly benefit from incremental compaction. The incremental compaction approach is not only good for decreasing disk usage, but also memory usage (as metadata of input and output live in memory), and file desc count, which takes memory away from OS. Turns out that this approach also greatly simplifies the off-strategy impl in compaction manager, as it no longer has to maintain new unused sstables and mark them for deletion on failure, and also unlink intermediary sstables used between reshape rounds. Fixes https://github.com/scylladb/scylladb/issues/14992. Backport notes: relatively easy to backport, had to include replica: Make compaction_group responsible for deleting off-strategy compaction input and compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0 Closes #15793 * github.com:scylladb/scylladb: test: Verify that off-strategy can do incremental compaction compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0 compaction: Clear pending_replacement list when tombstone GC is disabled compaction: Enable incremental compaction on off-strategy compaction: Extend reshape type to allow for incremental compaction compaction: Move reshape_compaction in the source compaction: Enable incremental compaction only if replacer callback is engaged replica: Make compaction_group responsible for deleting off-strategy compaction input	2023-10-30 12:00:54 +02:00
Benny Halevy	cd7abb3833	migration_notifier: get schema_ptr by value To prevent use-after-free as seen in https://github.com/scylladb/scylladb/issues/15097 where a temp schema_ptr retrieved from a global_schema_ptr gets destroyed when the notification function yielded. Capturing the schema_ptr on the coroutine frame is inexpensive since it's a shared ptr and it makes sure that the schema remains valid throughout the coroutine lifetime. Fixes scylladb/scylladb#15097 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #15098 (cherry picked from commit `0f54e24519`)	2023-10-29 19:39:17 +02:00
Benny Halevy	8064fface9	migration_manager: propagate listener notification exceptions `1e29b07e40` claimed to make event notification exception safe, but swallowing the exceptions isn't safe at all, as this might leave the node in an inconsistent state if e.g. storage_service::keyspace_changed fails on any of the shards. Propagating the exception here will cause abort, but it is better than leaving the node up, but in an inconsistent state. We keep notifying other listeners even if any of them failed. Based on `1e29b07e40`: ``` If one of the listeners throws an exception, we must ensure that other listeners are still notified. ``` The decision about swallowing exceptions can't be made in such a generic layer. Specific notification listeners that may ignore exceptions, like in transport/event_notifier, may decide to swallow their local exceptions on their own (as done in this patch). Refs #3389 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `825d617a53`)	2023-10-29 19:32:55 +02:00
Benny Halevy	0cf6891c6d	storage_service: keyspace_changed: execute only on shard 0 Previously all shards called `update_topology_change_info` which in turn calls `mutate_token_metadata`, ending up in quadratic complexity. Now that the notifications are called after all database shards are updated, we can apply the changes on token metadata / effective replication map only on shard 0 and count on replicate_to_all_cores to propagate those changes to all other shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `a690f0e81f`)	2023-10-29 19:27:52 +02:00
Benny Halevy	16a594d564	database: modify_keyspace_on_all_shards: execute func first on shard 0 When creating or altering a keyspace, we create a new effective_replication_map instance. It is more efficient to do that first on shard 0 and then on all other shards, otherwise multiple shards might need to calculate to new e_r_m (and reach the same result). When the new e_r_m is "seeded" on shard 0, other shards will find it there and clone a local copy of it - which is more efficient. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `13dd92e618`)	2023-10-29 19:22:01 +02:00
Benny Halevy	096c312821	database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards When creating, updating, or dropping keyspaces, first execute the database internal function to modify the database state, and only when all shards are updated, run the listener notifications, to make sure they would operate when the database shards are consistent with each other. Fixes #13137 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `ba15786059`)	2023-10-29 19:21:34 +02:00
Benny Halevy	5c27dacad5	database: add modify_keyspace_on_all_shards Run all keyspace create/update/drop ops via `modify_keyspace_on_all_shards` that will standardize the execution on all shards in the coming patches. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `3b8c913e61`)	2023-10-29 19:16:56 +02:00
Benny Halevy	14113dc23e	schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace Similar to create_keyspace_on_all_shards, `extract_scylla_specific_keyspace_info` and `create_keyspace_from_schema_partition` can be called once in the upper layer, passing keyspace_metadata& down to database::update_keyspace_on_all_shards which now would only make the per-shard keyspace_metadata from the reference it gets from the schema_tables layer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `dc9b0812e9`)	2023-10-29 19:14:06 +02:00
Benny Halevy	4d5a99f3b8	database: create_keyspace_on_all_shards Part of moving the responsibility for applying and notifying keyspace schema changes from schema_tables to the database so that the database can control the order of applying the changes across shards and when to notify its listeners. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `3520c786bd`)	2023-10-29 19:13:55 +02:00
Benny Halevy	ffe28b3e3f	database: update_keyspace_on_all_shards Part of moving the responsibility for applying and notifying keyspace schema changes from schema_tables to the database so that the database can control the order of applying the changes across shards and when to notify its listeners. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `53a6ea8616`)	2023-10-29 19:06:45 +02:00
Benny Halevy	1459306603	database: drop_keyspace_on_all_shards Part of moving the responsibility for applying and notifying keyspace schema changes from schema_tables to the database so that the database can control the order of applying the changes across shards and when to notify its listeners. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `9d40305ef6`)	2023-10-29 19:06:25 +02:00
Kefu Chai	30a4eb0ea7	sstables: writer: delegate flush() in checksummed_file_data_sink_impl before this change, `checksummed_file_data_sink_impl` just inherits the `data_sink_impl::flush()` from its parent class. but as a wrapper around the underlying `_out` data_sink, this is not only an unusual design decision in a layered design of an I/O system, but also could be problematic. to be more specific, the typical user of `data_sink_impl` is a `data_sink`, whose `flush()` member function is called when the user of `data_sink` wants to ensure that the data sent to the sink is pushed to the underlying storage / channel. this in general works, as the typical user of `data_sink` is in turn `output_stream`, which calls `data_sink.flush()` before closing the `data_sink` with `data_sink.close()`. and the operating system will eventually flush the data after application closes the corresponding fd. to be more specific, almost none of the popular local filesystems implements the file_operations.op, hence, it's safe even if the `output_stream` does not flush the underlying data_sink after writing to it. this is the use case when we write to sstables stored on local filesystem. but as explained above, if the data_sink is backed by a network filesystem, a layered filesystem or a storage connected via a buffered network device, then it is crucial to flush in a timely manner, otherwise we could risk data loss if the application / machine / network breaks when the data is considered persisted but it is _not_! but the `data_sink` returned by `client::make_upload_jumbo_sink` is a little bit different. multipart upload is used under the hood, and we have to finalize the upload once all the parts are uploaded by calling `close()`. but if the caller fails / chooses to close the sink before flushing it, the upload is aborted, and the partially uploaded parts are deleted. 
the default-implemented `checksummed_file_data_sink_impl::flush()` breaks `upload_jumbo_sink` which is the `_out` data_sink being wrapped by `checksummed_file_data_sink_impl`. as the `flush()` calls are shortcircuited by the wrapper, the `close()` call always aborts the upload. that's why the data and index components just fail to upload with the S3 backend. in this change, we just delegate the `flush()` call to the wrapped class. Fixes #15079 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15134 (cherry picked from commit `d2d1141188`)	2023-10-26 16:48:17 +03:00
Avi Kivity	ea198d884d	cql3: grammar: reject intValue with no contents The grammar mistakenly allows nothing to be parsed as an intValue (itself accepted in LIMIT and similar clauses). Easily fixed by removing the empty alternative. A unit test is added. Fixes #14705. Closes #14707 (cherry picked from commit `e00811caac`)	2023-10-25 19:15:28 +03:00
Raphael S. Carvalho	b8c8794e14	test: Verify that off-strategy can do incremental compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 17:05:33 -03:00
Benny Halevy	0c2bb5f0b3	compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0 Prevent div-by-zero by returning const level 1 if max_sstable_size is zero, as configured by cleanup_incremental_compaction_test, before it's extended to cover also offstrategy compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `b1e164a241`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 17:05:33 -03:00
Raphael S. Carvalho	61316d8e88	compaction: Clear pending_replacement list when tombstone GC is disabled pending_replacement list is used by incremental compaction to communicate to other ongoing compactions about exhausted sstables that must be replaced in the sstable set they keep for tombstone GC purposes. Reshape doesn't enable tombstone GC, so that list will not be cleared, which prevents incremental compaction from releasing sstables referenced by that list. It's not a problem until now where we want reshape to do incremental compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 17:05:33 -03:00
Raphael S. Carvalho	b8e2739596	compaction: Enable incremental compaction on off-strategy Off-strategy suffers from a 100% space overhead, as it adopted a sort of all or nothing approach. Meaning all input sstables, living in maintenance set, are kept alive until they're all reshaped according to the strategy criteria. Input sstables in off-strategy are very likely to be mostly disjoint, so it can greatly benefit from incremental compaction. The incremental compaction approach is not only good for decreasing disk usage, but also memory usage (as metadata of input and output live in memory), and file desc count, which takes memory away from OS. Turns out that this approach also greatly simplifies the off-strategy impl in compaction manager, as it no longer has to maintain new unused sstables and mark them for deletion on failure, and also unlink intermediary sstables used between reshape rounds. Fixes #14992. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `42050f13a0`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 17:05:29 -03:00
Raphael S. Carvalho	ba87dfefd1	compaction: Extend reshape type to allow for incremental compaction That's done by inheriting regular_compaction, which implements incremental compaction. But reshape still implements its own methods for creating writer and reader. One reason is that reshape is not driven by controller, as input sstables to it live in maintenance set. Another reason is customization of things like sstable origin, etc. stop_sstable_writer() is extended because that's used by regular_compaction to check for possibility of removing exhausted sstables earlier whenever an output sstable is sealed. Also, incremental compaction will be unconditionally enabled for ICS/LCS during off-strategy. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `db9ce9f35a`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 15:00:07 -03:00
Raphael S. Carvalho	6b8499f4d8	compaction: Move reshape_compaction in the source That's in preparation to next change that will make reshape inherit from regular compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 15:00:07 -03:00
Raphael S. Carvalho	8c8a80a03d	compaction: Enable incremental compaction only if replacer callback is engaged That's needed for enabling incremental compaction to operate, and needed for subsequent work that enables incremental compaction for off-strategy, which in turn uses reshape compaction type. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 15:00:07 -03:00
Raphael S. Carvalho	fdec5e62d0	replica: Make compaction_group responsible for deleting off-strategy compaction input Compaction group is responsible for deleting SSTables of "in-strategy" compactions, i.e. regular, major, cleanup, etc. Both in-strategy and off-strategy compaction have their completion handled using the same compaction group interface, which is compaction_group::table_state::on_compaction_completion(..., sstables::offstrategy offstrategy) So it's important to bring symmetry there, by moving the responsibility of deleting off-strategy input, from manager to group. Another important advantage is that off-strategy deletion is now throttled and gated, allowing for better control, e.g. table waiting for deletion on shutdown. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #13432 (cherry picked from commit `457c772c9c`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 15:00:06 -03:00
Raphael S. Carvalho	6798f9676f	Resurrect optimization to avoid bloom filter checks during compaction Commit `8c4b5e4` introduced an optimization which only calculates max purgeable timestamp when a tombstone satisfies the grace period. Commit 'repair: Get rid of the gc_grace_seconds' inverted the order, probably under the assumption that getting grace period can be more expensive than calculating max purgeable, as repair-mode GC will look up into history data in order to calculate gc_before. This caused a significant regression on tombstone heavy compactions, where most of the tombstones are still newer than grace period. A compaction which used to take 5s, now takes 35s. 7x slower. The reason is simple, now calculation of max purgeable happens for every single tombstone (once for each key), even the ones that cannot be GC'ed yet. And each calculation has to iterate through (i.e. check the bloom filter of) every single sstable that doesn't participate in compaction. Flame graph makes it very clear that bloom filter is a heavy path without the optimization: 45.64% 45.64% sstable_compact sstable_compaction_test_g [.] utils::filter::bloom_filter::is_present With its resurrection, the problem is gone. This scenario can easily happen, e.g. after a deletion burst, and tombstones becoming only GC'able after they reach upper tiers in the LSM tree. Before this patch, a compaction can be estimated to have this # of filter checks: (# of keys containing any tombstone) * (# of uncompacting sstable runs[1]) [1] It's # of runs, as each key tends to overlap with only one fragment of each run. After this patch, the estimation becomes: (# of keys containing a GC'able tombstone) * (# of uncompacting runs). With repair mode for tombstone GC, the assumption, that retrieval of gc_before is more expensive than calculating max purgeable, is kept. We can revisit it later. But the default mode, which is the "timeout" (i.e. 
gc_grace_seconds) one, we still benefit from the optimization of deferring the calculation until needed. Cherry picked from commit `38b226f997` Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes #14091. Closes #13908 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #15744	2023-10-20 09:34:53 +03:00
Botond Dénes	2642f32c38	Merge '[5.2 backport] doc: remove recommended image upgrade with OS from previous releases' from Anna Stuchlik This is a backport of PR https://github.com/scylladb/scylladb/pull/15740. This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted). The scope of this commit: - Remove the information from the 5.0-to.-5.1 upgrade guide and replace with general info. - Remove the information from the 4.6-to.-5.1 upgrade guide and replace with general info. - Remove the information from the 5.x.y-to.-5.x.z upgrade guide and replace with general info. - Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides. /upgrade/_common/upgrade-image-opensource.rst /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst Closes #15768 * github.com:scylladb/scylladb: doc: remove wrong image upgrade info (5.x.y-to-5.x.y) doc: remove wrong image upgrade info (4.6-to-5.0) doc: remove wrong image upgrade info (5.0-to-5.1)	2023-10-19 12:30:55 +03:00
Anna Stuchlik	fcbcf1eafd	doc: remove wrong image upgrade info (5.x.y-to-5.x.y) This commit removes the invalid information about the recommended way of upgrading ScyllaDB images (by updating ScyllaDB and OS packages in one step) from the 5.x.y-to-5.x.y upgrade guide. This upgrade procedure is not supported (it was implemented, but then reverted). Refs https://github.com/scylladb/scylladb/issues/15733 In addition, the following files are removed as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides. /upgrade/_common/upgrade-image-opensource.rst /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst (cherry picked from commit `dd1207cabb`)	2023-10-19 08:47:25 +02:00
Anna Stuchlik	3a14fd31d0	doc: remove wrong image upgrade info (4.6-to-5.0) This commit removes the invalid information about the recommended way of upgrading ScyllaDB images (by updating ScyllaDB and OS packages in one step) from the 4.6-to-5.0 upgrade guide. This upgrade procedure is not supported (it was implemented, but then reverted). Refs https://github.com/scylladb/scylladb/issues/15733 (cherry picked from commit `526d543b95`)	2023-10-19 08:41:24 +02:00
Anna Stuchlik	c7b6152a81	doc: remove wrong image upgrade info (5.0-to-5.1) This commit removes the invalid information about the recommended way of upgrading ScyllaDB images (by updating ScyllaDB and OS packages in one step) from the 5.0-to-5.1 upgrade guide. This upgrade procedure is not supported (it was implemented, but then reverted). Refs https://github.com/scylladb/scylladb/issues/15733 (cherry picked from commit `9852130c5b`)	2023-10-19 08:40:27 +02:00
Asias He	ac45d8d092	repair: Use the updated estimated_partitions to create writer The estimated_partitions is estimated after the repair_meta is created. Currently, the default estimated_partitions was used to create the writer, which is not correct. To fix, use the updated estimated_partitions. Reported by Petr Gusev Closes #14179 Fixes #15748 (cherry picked from commit `4592bbe182`)	2023-10-18 13:58:28 +03:00
Anna Stuchlik	d319c2a83f	doc: remove recommended image upgrade with OS This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted). The scope of this commit: - Remove the information from the 5.1-to.-5.2 upgrade guide and replace with general info. - Remove the information from the Image Upgrade page. - Remove outdated info (about previous releases) from the Image Upgrade page. - Rename "AMI Upgrade" as "Image Upgrade" in the page tree. Refs: https://github.com/scylladb/scylladb/issues/15733 (cherry picked from commit `f6767f6d6e`) Closes #15754	2023-10-18 13:57:08 +03:00
Nadav Har'El	cb7e7f15ac	Cherry-pick Seastar patch Backported Seastar commit 4f4e84bb2cec5f11b4742396da7fc40dbb3f162f: > sstring: refactor to_sstring() using fmt::format_to() Refs https://github.com/scylladb/scylladb/issues/15127 Closes #15663	2023-10-09 10:02:21 +03:00
Raphael S. Carvalho	00d431bd20	reader_concurrency_semaphore: Fix stop() in face of evictable reads becoming inactive Scylla can crash due to a complicated interaction of service level drop, evictable readers, inactive read registration path. 1) service level drop invoke stop of reader concurrency semaphore, which will wait for in flight requests 2) turns out it stops first the gate used for closing readers that will become inactive. 3) proceeds to wait for in-flight reads by closing the reader permit gate. 4) one of evictable reads take the inactive read registration path, and finds the gate for closing readers closed. 5) flat mutation reader is destroyed, but finds the underlying reader was not closed gracefully and triggers the abort. By closing permit gate first, evictable readers becoming inactive will be able to properly close underlying reader, therefore avoiding the crash. Fixes #15534. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#15535 (cherry picked from commit `914cbc11cf`)	2023-09-29 09:24:37 +03:00
Botond Dénes	ca8723a6fd	Merge 'gossiper: add get_unreachable_members_synchronized and use over api' from Benny Halevy Modeled after get_live_members_synchronized, get_unreachable_members_synchronized calls replicate_live_endpoints_on_change to synchronize the state of unreachable_members on all shards. Fixes #12261 Fixes #15088 Also, add rest_api unit test for those apis Closes #15093 * github.com:scylladb/scylladb: test: rest_api: add test_gossiper gossiper: add get_unreachable_members_synchronized (cherry picked from commit `57deeb5d39`) Backport note: `gossiper::lock_endpoint_update_semaphore` helper function was missing, replaced with `get_units(g._endpoint_update_semaphore, 1)`	2023-09-27 15:09:32 +02:00
Beni Peled	5709d00439	release: prepare for 5.2.9	2023-09-20 12:34:43 +03:00
Konstantin Osipov	7202634789	raft: do not update raft address map with obsolete gossip data It is possible that a gossip message from an old node is delivered out of order during a slow boot and the raft address map overwrites a new IP address with an obsolete one, from the previous incarnation of this node. Take into account the node restart counter when updating the address map. A test case requires a parameterized error injection, which we don't support yet. Will be added as a separate commit. Fixes #14257 Refs #14357 Closes #14329 (cherry picked from commit `b9c2b326bc`) Backport note: replaced `gms::generation_type` with `int64_t` because the branch is missing the refactor which introduced `generation_type` (`7f04d8231d`)	2023-09-19 11:12:38 +02:00
Avi Kivity	34e0afb18a	Merge "auth: do not grant permissions to creator without actually creating" from Wojciech Mitros Currently, when creating the table, permissions may be mistakenly granted to the user even if the table is already existing. This can happen in two cases: The query has a IF NOT EXISTS clause - as a result no exception is thrown after encountering the existing table, and the permission granting is not prevented. The query is handled by a non-zero shard - as a result we accept the query with a bounce_to_shard result_message, again without preventing the granting of permissions. These two cases are now avoided by checking the result_message generated when handling the query - now we only grant permissions when the query resulted in a schema_change message. Additionally, a test is added that reproduces both of the mentioned cases. CVE-2023-33972 Fixes #15467. * 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq: auth: do not grant permissions to creator without actually creating transport: add is_schema_change() method to result_message (cherry picked from commit `ab6988c52f`)	2023-09-19 01:47:27 +03:00
Anna Stuchlik	99e906499d	doc: fix internal links Fixes https://github.com/scylladb/scylladb/issues/14490 This commit fixes multiple links that were broken after the documentation is published (but not in the preview) due to incorrect syntax. I've fixed the syntax to use the :docs: and :ref: directive for pages and sections, respectively. Closes #14664 (cherry picked from commit `a93fd2b162`)	2023-09-18 09:32:12 +03:00
Anna Stuchlik	b8ff392e8b	doc: add info - support for FIPS-compliant systems This commit adds the information that ScyllaDB Enterprise supports FIPS-compliant systems in versions 2023.1.1 and later. The information is excluded from OSS docs with the "only" directive, because the support was not added in OSS. This commit must be backported to branch-5.2 so that it appears on version 2023.1 in the Enterprise docs. Closes #15415 (cherry picked from commit `fb635dccaa`)	2023-09-18 09:17:59 +03:00
Raphael S. Carvalho	a65e5120ab	compaction: base compaction throughput on amount of data read Today, we base compaction throughput on the amount of data written, but it should be based on the amount of input data compacted instead, to show the amount of data compaction had to process during its execution. A good example is a compaction which expires 99% of data, and today throughput would be calculated on the 1% written, which will mislead the reader to think that compaction was terribly slow. Fixes #14533. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14615 (cherry picked from commit `3b1829f0d8`)	2023-09-14 21:30:22 +03:00
Jan Ciolek	cd9458eeb1	cql.g: make the parser reject INSERT JSON without a JSON value We allow inserting column values using a JSON value, eg: ```cql INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}'; ``` When no JSON value is specified, the query should be rejected. Scylla used to crash in such cases. A recent change fixed the crash (https://github.com/scylladb/scylladb/pull/14706), it now fails on unwrapping an uninitialized value, but really it should be rejected at the parsing stage, so let's fix the grammar so that it doesn't allow JSON queries without JSON values. A unit test is added to prevent regressions. Refs: https://github.com/scylladb/scylladb/pull/14707 Fixes: https://github.com/scylladb/scylladb/issues/14709 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes #14785 (cherry picked from commit `cbc97b41d4`)	2023-09-14 21:07:21 +03:00
Nadav Har'El	e917b874f9	test/alternator: fix flaky test test_ttl_expiration_gsi_lsi The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky. The test incorrectly assumes that when we write an already expired item, it will be visible for a short time until being deleted by the TTL thread. But this doesn't need to be true - if the test is slow enough, it may go look for the item after it was already expired! So we fix this test by splitting it into two parts - in the first part we write a non-expiring item, and notice it eventually appears in the GSI, LSI, and base-table. Then we write the same item again, with an expiration time - and now it should eventually disappear from the GSI, LSI and base-table. This patch also fixes a small bug which prevented this test from running on DynamoDB. Fixes #14495 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14496 (cherry picked from commit `599636b307`)	2023-09-14 20:43:51 +03:00
Pavel Emelyanov	a27c391cba	Update seastar submodule * seastar 85147cfd...872e0bc6 (3): > rpc: Abort server connection streams on stop > rpc: Do not register stream to dying parent > rpc: Fix client-side stream registration race refs: #13100 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-06 12:33:30 +03:00
Beni Peled	455ab99b6c	release: prepare for 5.2.8	2023-08-31 22:02:06 +03:00
Michał Chojnowski	adcf296bcf	reader_concurrency_semaphore: fix a deadlock between stop() and execution_loop() Permits added to `_ready_list` remain there until executed by `execution_loop()`. But `execution_loop()` exits when `_stopped == true`, even though nothing prevents new permits from being added to `_ready_list` after `stop()` sets `_stopped = true`. Thus, if there are reads concurrent with `stop()`, it's possible for a permit to be added to `_ready_list` after `execution_loop()` has already quit. Such a permit will never be destroyed, and `stop()` will forever block on `_permit_gate.close()`. A natural solution is to dismiss `execution_loop()` only after it's certain that `_ready_list` won't receive any new permits. This is guaranteed by `_permit_gate.close()`. After this call completes, it is certain that no permits exist. After this patch, `execution_loop()` no longer looks at `_stopped`. It only exits when `_ready_list_cv` breaks, and this is triggered by `stop()` right after `_permit_gate.close()`. Fixes #15198 Closes #15199 (cherry picked from commit `2000a09859`)	2023-08-31 08:13:09 +03:00
Calle Wilund	198297a08a	generic_server: Handle TLS error codes indicating broken pipe Fixes #14625 In broken pipe detection, handle also TLS error codes. Requires https://github.com/scylladb/seastar/pull/1729 Closes #14626 (cherry picked from commit `890f1f4ad3`)	2023-08-29 15:38:21 +03:00
Botond Dénes	9a9b5b691d	Update seastar submodule * seastar 534cb38c...85147cfd (1): > tls: Export error_category instance used by tls + some common error codes Refs: #14625	2023-08-29 15:37:24 +03:00
Alejo Sanchez	610b682cf4	gms, service: replicate live endpoints on shard 0 Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of the shards. And get the list of live members from shard 0. Move lock to the callers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #13240 (cherry picked from commit `da00052ad8`)	2023-08-29 12:28:00 +02:00
Kamil Braun	05f4640360	Merge 'api: gossiper: get alive nodes after reaching current shard 0 version' from Alecco Add an API call to wait for all shards to reach the current shard 0 gossiper version. Throws when timeout is reached. Closes #12540 * github.com:scylladb/scylladb: api: gossiper: fix alive nodes gms, service: lock live endpoint copy gms, service: live endpoint copy method (cherry picked from commit `b919373cce`)	2023-08-29 12:27:52 +02:00
Kefu Chai	8ed58c7dca	sstable/writer: log sstable name and pk when capping ldt when the local_deletion_time is too large and beyond the epoch time of INT32_MAX, we cap it to INT32_MAX - 1. this is a signal of bad configuration or a bug in scylla. so let's add more information in the logging message to help track back to the source of the problem. Fixes #15015 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `9c24be05c3`) Closes #15150	2023-08-25 10:13:19 +03:00
Petr Gusev	a83c0a8bbc	test_secondary_index_collections: change insert/create index order Secondary index creation is asynchronous, meaning it takes time for existing data to be reflected within the index. However, new data added after the index is created should appear in it immediately. The test consisted of two parts. The first created a series of indexes for one table, added test data to the table, and then ran a series of checks. In the second part, several new indexes were added to the same table, and checks were made to make sure that already existing data would appear in them. This last part was flaky. The patch just moves the index creation statements from the second part to the first. Fixes: #14076 Closes #14090 (cherry picked from commit `0415ac3d5f`) Closes #15101	2023-08-24 14:09:08 +03:00
Botond Dénes	df71753498	Merge '[Backport 5.2] distributed_loader: process_sstable_dir: do not verify snapshots' from Benny Halevy This mini-series backports the fix for #12010 along with low-risk patches it depends on. Fixes: #12010 Closes #15137 * github.com:scylladb/scylladb: distributed_loader: process_sstable_dir: do not verify snapshots utils/directories: verify_owner_and_mode: add recursive flag utils: Restore indentation after previous patch utils: Coroutinize verify_owner_and_mode()	2023-08-23 15:50:29 +03:00
Benny Halevy	6588ecd66f	distributed_loader: process_sstable_dir: do not verify snapshots Skip over verification of owner and mode of the snapshots sub-directory as this might race with scylla-manager trying to delete old snapshots concurrently. Fixes #12010 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `845b6f901b`)	2023-08-23 13:19:55 +03:00
Benny Halevy	03640cc15b	utils/directories: verify_owner_and_mode: add recursive flag Allow the caller to verify only the top level directories so that sub-directories can be verified selectively (in particular, skip validation of snapshots). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `60862c63dd`)	2023-08-23 13:19:36 +03:00
Pavel Emelyanov	6d4d576460	utils: Restore indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `2eb88945ea`)	2023-08-23 13:19:36 +03:00
Pavel Emelyanov	96aca473b4	utils: Coroutinize verify_owner_and_mode() There's a helper verification_error() that prints a warning and returns an exceptional future. It is converted into a void, throwing one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `4ebb812df0`)	2023-08-23 13:19:30 +03:00
Aleksandra Martyniuk	29e6dc8c1b	compaction: do not swallow compaction_stopped_exception for reshape Loop in shard_reshaping_compaction_task_impl::run relies on whether sstables::compaction_stopped_exception is thrown from run_custom_job. The exception is swallowed for each type of compaction in compaction_manager::perform_task. Rethrow an exception in perform_task for reshape compaction. Fixes: #15058. (cherry picked from commit `e0ce711e4f`) Closes #15122	2023-08-23 12:11:58 +03:00
Vlad Zolotarov	9a414d440d	scylla_raid_setup: make --online-discard argument useful This argument was dead since its introduction and 'discard' was always configured regardless of its value. This patch allows actually configuring things using this argument. Fixes #14963 Closes #14964 (cherry picked from commit `e13a2b687d`)	2023-08-22 10:40:37 +03:00
Anna Mikhlin	e0ebc95025	release: prepare for 5.2.7	2023-08-21 14:44:56 +03:00
Botond Dénes	b7ab42b61c	Merge 'Ignore no such column family in repair' from Aleksandra Martyniuk While repair requested by user is performed, some tables may be dropped. When the repair proceeds to these tables, it should skip them and continue with others. When no_such_column_family is thrown during user requested repair, it is logged and swallowed. Then the repair continues with the remaining tables. Fixes: #13045 Closes #13068 * github.com:scylladb/scylladb: repair: fix indentation repair: continue user requested repair if no_such_column_family is thrown repair: add find_column_family_if_exists function (cherry picked from commit `9859bae54f`)	2023-08-20 19:49:21 +03:00
Botond Dénes	098baaef48	Merge 'cql: add missing functions for the COUNTER column type' from Nadav Har'El We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742). As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass. Fixes #14501 Fixes #14742 Closes #14745 * github.com:scylladb/scylladb: test/cql-pytest: test confirming that casting to counter doesn't work cql: support casting of counter to other types cql: implement missing counterasblob() and blobascounter() functions cql: implement missing type functions for "counters" type (cherry picked from commit `a637ddd09c`)	2023-08-13 14:53:48 +03:00
Nadav Har'El	e11561ef65	cql-pytest: translate Cassandra's tests for compact tables This is a translation of Cassandra's CQL unit test source file validation/operations/CompactStorageTest.java into our cql-pytest framework. This very large test file includes 86 tests for various types of operations and corner cases of WITH COMPACT STORAGE tables. All 86 tests pass on Cassandra (except one using a deprecated feature that needs to be specially enabled). 30 of the tests fail on Scylla reproducing 7 already-known Scylla issues and 7 previously-unknown issues: Already known issues: Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE" Refs #4244: Add support for mixing token, multi- and single-column restrictions Refs #5361: LIMIT doesn't work when using GROUP BY Refs #5362: LIMIT is not doing it right when using GROUP BY Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax Refs #8627: Cleanly reject updates with indexed values where value > 64k New issues: Refs #12471: Range deletions on COMPACT STORAGE is not supported Refs #12474: DELETE prints misleading error message suggesting ALLOW FILTERING would work Refs #12477: Combination of COUNT with GROUP BY is different from Cassandra in case of no matches Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column Refs #12526: Support filtering on COMPACT tables Refs #12749: Unsupported empty clustering key in COMPACT table Refs #12815: Hidden column "value" in compact table isn't completely hidden Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12816 (cherry picked from commit `328cdb2124`)	2023-08-13 14:44:19 +03:00
Nadav Har'El	e03c21a83b	cql-pytest: translate Cassandra's tests for CAST operations This is a translation of Cassandra's CQL unit test source file functions/CastFctsTest.java into our cql-pytest framework. There are 13 tests, 9 of them currently xfail. The failures are caused by one recently-discovered issue: Refs #14501: Cannot Cast Counter To Double and by three previously unknown or undocumented issues: Refs #14508: SELECT CAST column names should match Cassandra's Refs #14518: CAST from timestamp to string not same as Cassandra on zero milliseconds Refs #14522: Support CAST function not only in SELECT Curiously, the careful translation of this test also caused me to find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647 which the test in Java missed because it made the same mistake as the implementation. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14528 (cherry picked from commit `f08bc83cb2`)	2023-08-13 14:41:36 +03:00
Nadav Har'El	79b5befe65	test/cql-pytest: add tests for data casts and inf in sums This patch adds tests to reproduce issue #13551. The issue, discovered by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast()) from varint type broke. So we add two tests in cql-pytest: 1. A new test file, test_cast_data.py, for testing data casts (a CAST (...) as ... in a SELECT), starting with testing casts from varint to other types. The test uncovers a lot of interesting cases (it is heavily commented to explain these cases) but nothing there is wrong and all tests pass on Scylla. 2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out that this caused #13551. In Cassandra and older Scylla, the sum returned a NaN. In Scylla today, it generates a misleading error message. As usual, the tests were run on both Cassandra (4.1.1) and Scylla. Refs #13551. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `78555ba7f1`)	2023-08-13 14:40:08 +03:00
Petr Gusev	aca9e41a44	topology.cc: remove_endpoint: _dc_racks removal fix The eps reference was reused to manipulate the racks dictionary. This resulted in assigning a set of nodes from the racks dictionary to an element of the _dc_endpoints dictionary. This is a backport of `bcb1d7c` to branch-5.2. Refs: #14184 Closes #14893	2023-08-11 14:29:37 +03:00
Pavel Emelyanov	ff22807ed2	Update seastar submodule * seastar 29a0e645...534cb38c (1): > rpc: Abort connection if send_entry() fails Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-08-09 11:30:57 +03:00
Botond Dénes	bcb8f6a8dd	Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak If semaphore mismatch occurs, check whether both semaphores belong to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error. Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and perform it on all `lookup__querier` methods. The mismatch can happen if user's scheduling group changed during a query. We don't want to throw an error then, but drop and reset cached reader. This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it. Refers: https://github.com/scylladb/scylla-enterprise/issues/3182 Refers: https://github.com/scylladb/scylla-enterprise/issues/3050 Closes: #14770 Closes #14736 github.com:scylladb/scylladb: querier_cache: add stats of scheduling group mismatches querier_cache: check semaphore mismatch during querier lookup querier_cache: add reference to `replica::database::is_user_semaphore()` replica:database: add method to determine if semaphore is user one (cherry picked from commit `a8feb7428d`)	2023-08-09 10:20:53 +03:00
Kefu Chai	9ce3695a0d	compaction_manager: prevent gc-only sstables from being compacted before this change, there are chances that the temporary sstables created for collecting the GC-able data create by a certain compaction can be picked up by another compaction job. this wastes the CPU cycles, adds write amplification, and causes inefficiency. in general, these GC-only SSTables are created with the same run id as those non-GC SSTables, but when a new sstable exhausts input sstable(s), we proactively replace the old main set with a new one so that we can free up the space as soon as possible. so the GC-only SSTables are added to the new main set along with the non-GC SSTables, but since the former have good chance to overlap the latter. these GC-only SSTables are assigned with different run ids. but we fail to register them to the `compaction_manager` when replacing the main sstable set. that's why future compactions pick them up when performing compaction, when the compaction which created them is not yet completed. so, in this change, * to prevent sstables in the transient stage from being picked up by regular compactions, a new interface class is introduced so that the sstable is always added to registration before it is added to sstable set, and removed from registration after it is removed from sstable set. the struct helps to consolidate the regitration related logic in a single place, and helps to make it more obvious that the timespan of an sstable in the registration should cover that in the sstable set. * use a different run_id for the gc sstable run, as it can overlap with the output sstable run. the run_id for the gc sstable run is created only when the gc sstable writer is created. because the gc sstables is not always created for all compactions. please note, all (indirect) callers of `compaction_task_executor::compact_sstables()` passes a non-empty `std::function` to this function, so there is no need to check for empty before calling it. 
so in this change, the check is dropped. Fixes #14560 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14725 (cherry picked from commit `fdf61d2f7c`) Closes #14827	2023-08-04 09:59:10 +03:00
Patryk Jędrzejczak	4cd5847761	config: add schema_commitlog_segment_size_in_mb variable In #14668, we have decided to introduce a new scylla.yaml variable for the schema commitlog segment size. The segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. Therefore, increasing the schema commitlog segment size is sometimes necessary. (cherry picked from commit `5b167a4ad7`)	2023-08-02 18:05:39 +02:00
Botond Dénes	2b7f1cd906	Update tools/java submodule * tools/java 83b2168b19...f8f556d802 (1): > Use EstimatedHistogram in metricPercentilesAsArray Fixes: #10089	2023-07-31 12:13:01 +03:00
Nadav Har'El	e34c62c567	Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions. This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed. Fixes: #14819 Closes #14821 * github.com:scylladb/scylladb: test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations db/view/view_updating_consumer: account for the size of mutations mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() mutation/mutation: add memory_usage() (cherry picked from commit `056d04954c`)	2023-07-31 03:43:44 -04:00
Nadav Har'El	992c50173a	Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek LWT queries with empty clustering range used to cause a crash. For example in: ```cql UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3 ``` The range of `c` is empty - there are no valid values. This caused a segfault when accessing the `first` range: ```c++ op.ranges.front() ``` Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved. We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different. We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change. A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash. Fixes: https://github.com/scylladb/scylladb/issues/13129 Closes #14429 * github.com:scylladb/scylladb: modification_statement: reject conditional statements with empty clustering key statements/cas_request: fix crash on empty clustering range in LWT (cherry picked from commit `49c8c06b1b`)	2023-07-31 09:14:55 +03:00
Beni Peled	58acf071bf	release: prepare for 5.2.6	2023-07-30 14:19:28 +03:00
Raphael S. Carvalho	d2369fc546	cached_file: Evict unused pages that aren't linked to LRU yet It was found that cached_file dtor can hit the following assert after OOM: cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed. cached_file's dtor iterates through all entries and evict those that are linked to LRU, under the assumption that all unused entries were linked to LRU. That's partially correct. get_page_ptr() may fetch more than 1 page due to read ahead, but it will only call cached_page::share() on the first page, the one that will be consumed now. share() is responsible for automatically placing the page into LRU once refcount drops to zero. If the read is aborted midway, before cached_file has a chance to hit the 2nd page (read ahead) in cache, it will remain there with refcount 0 and unlinked to LRU, in hope that a subsequent read will bring it out of that state. Our main user of cached_file is per-sstable index caching. If the scenario above happens, and the sstable and its associated cached_file is destroyed, before the 2nd page is hit, cached_file will not be able to clear all the cache because some of the pages are unused and not linked. A page read ahead will be linked into LRU so it doesn't sit in memory indefinitely. Also allowing for cached_file dtor to clear all cache if some of those pages brought in advance aren't fetched later. A reproducer was added. Fixes #14814. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14818 (cherry picked from commit `050ce9ef1d`)	2023-07-28 13:56:28 +02:00
Kamil Braun	6273c4df35	test: use correct timestamp resolution in `test_group0_history_clearing_old_entries` In `10c1f1dc80` I fixed `make_group0_history_state_id_mutation` to use correct timestamp resolution (microseconds instead of milliseconds) which was supposed to fix the flakiness of `test_group0_history_clearing_old_entries`. Unfortunately, the test is still flaky, although now it's failing at a later step -- this is because I was sloppy and I didn't adjust this second part of the test to also use microsecond resolution. The test is counting the number of entries in the `system.group0_history` table that are older than a certain timestamp, but it's doing the counting using millisecond resolution, causing it to give results that are off by one sometimes. Fix it by using microseconds everywhere. Fixes #14653 Closes #14670 (cherry picked from commit `9d4b3c6036`)	2023-07-27 15:46:37 +02:00
Raphael S. Carvalho	752984e774	Fix stack-use-after-return in mutation source excluding staging The new test detected a stack-use-after-return when using table's as_mutation_source_excluding_staging() for range reads. This doesn't really affect view updates that generate single key reads only. So the problem was only stressed in the recently added test. Otherwise, we'd have seen it when running dtests (in debug mode) that stress the view update path from staging. The problem happens because the closure was fed into a noncopyable_function that was taken by reference. For range reads, we defer before subsequent usage of the predicate. For single key reads, we only defer after we have finished using the predicate. The fix is to use the sstable_predicate type, so there won't be a need to construct a temporary object on stack. Fixes #14812. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14813 (cherry picked from commit `0ac43ea877`)	2023-07-26 14:30:32 +03:00
Raphael S. Carvalho	986491447b	table: Optimize creation of reader excluding staging for view building View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s to INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s Refs #14089. Fixes #14244. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `1d8cb32a5d`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14764	2023-07-20 16:46:15 +03:00
Takuya ASADA	a05bb26cd6	scylla_fstrim_setup: start scylla-fstrim.timer on setup Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and just enables it, so the timer starts only after a reboot. This is incorrect behavior; we should start it during the setup. Also, unmask is unnecessary for enabling the timer. Fixes #14249 Closes #14252 (cherry picked from commit `c70a9cbffe`) Closes #14421	2023-07-18 16:05:09 +03:00
Michał Chojnowski	41aef6dc96	partition_snapshot_reader.hh: fix iterator invalidation in do_refresh_state do_refresh_state() keeps iterators to rows_entry in a vector. This vector might be resized during the procedure, triggering memory reclaim and invalidating the iterators, which can cause arbitrarily long loops and/or a segmentation fault during make_heap(). To fix this, do_refresh_state has to always be called from the allocating section. Additionally, it turns out that the first do_refresh_state is useless, because reset_state() doesn't set _change_mark. This causes do_refresh_state to be needlessly repeated during a next_row() or next_range_tombstone() which happens immediately after it. Therefore this patch moves the _change_mark assignment from maybe_refresh_state to do_refresh_state, so that the change mark is properly set even after the first refresh. Fixes #14696 Closes #14697	2023-07-17 14:20:37 +02:00
Botond Dénes	aa5e904c40	repair: Release permit earlier when the repair_reader is done Consider - 10 repair instances take all the 10 _streaming_concurrency_sem - repair readers are done but the permits are not released since they are waiting for view update _registration_sem - view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them - deadlock happens Note, when the readers are done, i.e., reaching EOS, the repair reader replaces the underlying (evictable) reader with an empty reader. The empty reader is not evictable, so the resources cannot be forcibly released. To fix, release the permits manually as soon as the repair readers are done even if the repair job is waiting for _registration_sem. Fixes #14676 Closes #14677 (cherry picked from commit `1b577e0414`)	2023-07-14 18:18:43 +03:00
Marcin Maliszkiewicz	eff2fe79b1	alternator: close output_stream when exception is thrown during response streaming When an exception occurs and we omit closing the output_stream, the whole process is brought down by an assertion in ~output_stream. Fixes https://github.com/scylladb/scylladb/issues/14453 Relates https://github.com/scylladb/scylladb/issues/14403 Closes #14454 (cherry picked from commit `6424dd5ec4`)	2023-07-13 23:27:46 +03:00
Nadav Har'El	ee8b26167b	Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz Adds preemption points used in Alternator when: - sending bigger json response - building results for BatchGetItem I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to: auto start = std::chrono::steady_clock::now(); do { } while ((std::chrono::steady_clock::now() - start) < 100ms); and seeing reactor stall times. After the patch they were not increasing while before they kept building up due to no preemption. Refs #7926 Fixes #13689 Closes #12351 * github.com:scylladb/scylladb: alternator: remove redundant flush call in make_streamed utils: yield when streaming json in print() alternator: yield during BatchGetItem operation (cherry picked from commit `d2e089777b`)	2023-07-13 23:27:38 +03:00
Yaron Kaikov	02bc54d4b6	release: prepare for 5.2.5	2023-07-13 14:23:18 +03:00
Avi Kivity	c9a5c4c876	Merge ' message: match unknown tenants to the default tenant' from Botond Dénes On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name. If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen. This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster. This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore. I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before. 
Fixes: #13841 Fixes: #12552 Closes #13843 * github.com:scylladb/scylladb: message: match unknown tenants to the default tenant message: generalize per-tenant connection types (cherry picked from commit `a7c2c9f92b`)	2023-07-12 15:31:48 +03:00
Tomasz Grabiec	1a6f4389ae	Merge 'atomic_cell: compare value last' from Benny Halevy Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in `e225c88a65` ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is threefold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from either upsert, as all cells in the respective upsert will carry the same expiration time. 
Fixes scylladb/scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 Closes scylladb/scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit `87b4606cd6`) Closes #14649	2023-07-12 10:09:56 +03:00
Calle Wilund	1088c3e24a	storage_proxy: Make split_stats resilient to being called from different scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14636	2023-07-12 09:24:56 +03:00
Botond Dénes	c9cb8dcfd0	Merge '[backport 5.2] view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski View update routines accept mutation objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Backported from `c25201c1a3`. `view_build_test.cc` needed some tiny adjustments for the backport. Closes #14619 Fixes #14503 * github.com:scylladb/scylladb: test: view_build_test: add range tombstones to test_view_update_generator_buffering test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations view_updating_consumer: make buffer limit a variable view: fix range tombstone handling on flushes in view_updating_consumer	2023-07-11 15:04:23 +03:00
Takuya ASADA	91c1feec51	scylla_raid_setup: wipe filesystem signatures from specified disks The discussion on the thread says that when we reformat a volume with another filesystem, the kernel and libblkid may skip populating /dev/disk/by-* since they detect two filesystem signatures, because mkfs.xxx did not clear the previous filesystem signature. To avoid this, we need to run wipefs before running mkfs. Note that this runs wipefs twice, for the target disks and also for the RAID device. wipefs for the RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see a previous filesystem signature on /dev/mdX when we construct a RAID volume multiple times on the same disks). Also dropped the -f option from mkfs.xfs, so that it checks that wipefs is working as expected. Fixes #13737 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #13738 (cherry picked from commit `fdceda20cc`)	2023-07-11 15:00:03 +03:00
Piotr Dulikowski	57d0310dcc	combined: mergers: remove recursion in operator()() In mutation_reader_merger and clustering_order_reader_merger, the operator()() is responsible for producing mutation fragments that will be merged and pushed to the combined reader's buffer. Sometimes, it might have to advance existing readers, open new and / or close some existing ones, which requires calling a helper method and then calling operator()() recursively. In some unlucky circumstances, a stack overflow can occur: - Readers have to be opened incrementally, - Most or all readers must not produce any fragments and need to report end of stream without preemption, - There have to be enough readers opened within the lifetime of the combined reader (~500), - All of the above needs to happen within a single task quota. In order to prevent such a situation, the code of both reader merger classes was modified not to perform recursion at all. Most of the code of the operator()() was moved to maybe_produce_batch, which does not recur if it is not possible for it to produce a fragment; instead it returns std::nullopt and operator()() calls this method in a loop via seastar::repeat_until_value. A regression test is added. Fixes: scylladb/scylladb#14415 Closes #14452 (cherry picked from commit `ee9bfb583c`) Closes #14605	2023-07-11 11:09:25 +03:00
Michał Chojnowski	78f25f2d36	test: view_build_test: add range tombstones to test_view_update_generator_buffering This patch adds a full-range tombstone to the compacted mutation. This raises the coverage of the test. In particular, it reproduces issue #14503, which should have been caught by this test, but wasn't.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	14fa3ee34e	test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations A random mutation test for view_updating_consumer's buffering logic. Reproduces #14503.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	75933b9906	view_updating_consumer: make buffer limit a variable The limit doesn't change at runtime, but this patch makes it a variable for unit testing purposes.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	fc7b02c8e4	view: fix range tombstone handling on flushes in view_updating_consumer View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening it at the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes #14503	2023-07-11 09:44:00 +02:00
Jan Ciolek	0f4f8638c5	forward_service: fix forgetting case-sensitivity in aggregates There was a bug that caused aggregates to fail when used on case-sensitive columns. For example: ``` SELECT SUM("SomeColumn") FROM ks.table; ``` would fail, with a message saying that there is no column "somecolumn". This is because the case-sensitivity got lost on the way. For non case-sensitive column names we convert them to lowercase, but for case-sensitive names we have to preserve the name as originally written. The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it. This converted the name to lowercase, and later such a column couldn't be found. To fix it, let's make the `column_identifier` case-sensitive. It will preserve the name, without converting it to lowercase. Fixes: https://github.com/scylladb/scylladb/issues/14307 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> (cherry picked from commit `7fca350075`)	2023-07-10 15:22:58 +03:00
Botond Dénes	0ba37fa431	Merge 'doc: fix rollback in the 4.3-to-2021.1, 5.0-to-2022.1, and 5.1-to-2022.2 upgrade guides' from Anna Stuchlik This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents). This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1. Refs: https://github.com/scylladb/scylla-enterprise/issues/3046 - [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1) - [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1) - [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1) (see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864) Closes #14444 * github.com:scylladb/scylladb: doc: fix rollback in 4.3-to-2021.1 upgrade guide doc: fix rollback in 5.0-to-2022.1 upgrade guide doc: fix rollback in 5.1-to-2022.2 upgrade guide (cherry picked from commit `8a7261fd70`)	2023-07-10 15:16:24 +03:00
Raphael S. Carvalho	55edbded47	compaction: avoid excessive reallocation during input list formatting With off-strategy, input list size can be close to 1k, which will lead to unneeded reallocations when formatting the list for logging. In the past, we faced stalls in this area, and excessive reallocation (log2 ~1k = ~10) may have contributed to that. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #13907 (cherry picked from commit `5544d12f18`) Fixes scylladb/scylladb#14071	2023-07-09 23:54:18 +03:00
Marcin Maliszkiewicz	9f79c9f41d	docs: link general repairs page to RBNO page Information was duplicated before and the version on this page was outdated - RBNO is enabled for replace operation already. Closes #12984 (cherry picked from commit `bd7caefccf`)	2023-07-07 16:38:32 +02:00
Kamil Braun	6dd09bb4ea	storage_proxy: query_partition_key_range_concurrent: don't access empty range `query_partition_range_concurrent` implements an optimization when querying a token range that intersects multiple vnodes. Instead of sending a query for each vnode separately, it sometimes sends a single query to cover multiple vnodes - if the intersection of replica sets for those vnodes is large enough to satisfy the CL and good enough in terms of the heat metric. To check the latter condition, the code would take the smallest heat metric of the intersected replica set and compare it to the smallest heat metrics of replica sets calculated separately for each vnode. Unfortunately, there was an edge case that the code didn't handle: the intersected replica set might be empty and the code would access an empty range. This was caught by an assertion added in `8db1d75c6c` by the dtest `test_query_dc_with_rf_0_does_not_crash_db`. The fix is simple: check if the intersected set is empty - if so, don't calculate the heat metrics because we can decide early that the optimization doesn't apply. Also change the `assert` to `on_internal_error`. Fixes #14284 Closes #14300 (cherry picked from commit `732feca115`) Backport note: the original `assert` was never added to branch-5.2, but the fix is still applicable, so I backported the fix and the `on_internal_error` check.	2023-07-05 13:14:24 +02:00
Mikołaj Grzebieluch	f431345ab6	raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs Another node can stop after it joined the group0 but before it advertised itself in gossip. `get_inet_addrs` will try to resolve all IPs and `wait_for_peers_to_enter_synchronize_state` will loop indefinitely. But `wait_for_peers_to_enter_synchronize_state` can return early if one of the nodes confirms that the upgrade procedure has finished. For that, it doesn't need the IPs of all group 0 members - only the IP of some nodes which can do the confirmation. This commit restructures the code so that IPs of nodes are resolved inside the `max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs. Then, even if some IPs won't be resolved, but one of the nodes confirms a successful upgrade, we can continue. Fixes #13543 (cherry picked from commit `a45e0765e4`)	2023-07-05 13:01:57 +02:00
Anna Stuchlik	009601d374	doc: fix rollback in 5.2-to-2023.1 upgrade guide This commit fixes the Restore System Tables section in the 5.2-to-2023.1 upgrade guide by adding a command to clean upgraded SStables during rollback. This is a bug (an incomplete command) and must be backported to branch-5.3 and branch-5.2. Refs: https://github.com/scylladb/scylla-enterprise/issues/3046 Closes #14373 (cherry picked from commit `f4ae2c095b`)	2023-06-29 12:07:41 +03:00
Botond Dénes	8e63b2f3e3	Merge 'readers: evictable_reader: don't accidentally consume the entire partition' from Kamil Braun The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between. The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction. So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end, which may lead to OOMs such as in #13491. There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order. Fix the comparison and adjust one of the tests (added in #13563) to detect this case. After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s. Fixes #13491 Closes #14375 * github.com:scylladb/scylladb: readers: evictable_reader: don't accidentally consume the entire partition test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position (cherry picked from commit `586102b42e`)	2023-06-29 12:04:35 +03:00
Benny Halevy	483c0b183a	repair: use fmt::join to print ks_erms|boost::adaptors::map_keys This is a minimal fix for #13146 for branch-5.2 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14405	2023-06-27 14:15:28 +03:00
Anna Stuchlik	d5063e6347	doc: add Ubuntu 22 to 2021.1 OS support Fixes https://github.com/scylladb/scylla-enterprise/issues/3036 This commit adds support for Ubuntu 22.04 to the list of OSes supported by ScyllaDB Enterprise 2021.1. This commit fixes a bug and must be backported to branch-5.3 and branch-5.2. Closes #14372 (cherry picked from commit `74fc69c825`)	2023-06-26 13:43:58 +03:00
Anna Stuchlik	543fa04e4d	doc: update the OSS docs landing page Fixes https://github.com/scylladb/scylladb/issues/14333 This commit replaces the documentation landing page with the Open Source-only documentation landing page. This change is required as now there is a separate landing page for the ScyllaDB documentation, so the page is duplicated, creating a bad user experience. (cherry picked from commit `f60f89df17`) Closes #14370	2023-06-23 14:00:33 +02:00
Anna Mikhlin	cebbf6c5df	release: prepare for 5.2.4	2023-06-22 16:23:46 +03:00
Avi Kivity	73b8669953	Update seastar submodule (default priority class shares) * seastar 32ab15cda6...29a0e64513 (1): > reactor: change shares for default IO class from 1 to 200 Fixes #13753. In 5.3: `37e6e65211`	2023-06-21 21:23:14 +03:00
Botond Dénes	9efca96cf2	Merge 'Backport 5.2 test.py stability/UX improvements' from Kamil Braun Backport the following improvements for test.py topology tests for CI stability: - https://github.com/scylladb/scylladb/pull/12652 - https://github.com/scylladb/scylladb/pull/12630 - https://github.com/scylladb/scylladb/pull/12619 - https://github.com/scylladb/scylladb/pull/12686 - picked from https://github.com/scylladb/scylladb/pull/12726: `9ceb6aba81` - picked from https://github.com/scylladb/scylladb/pull/12173: `fc60484422` - https://github.com/scylladb/scylladb/pull/12765 - https://github.com/scylladb/scylladb/pull/12804 - https://github.com/scylladb/scylladb/pull/13342 - https://github.com/scylladb/scylladb/pull/13589 - picked from https://github.com/scylladb/scylladb/pull/13135: `7309a1bd6b` - picked from https://github.com/scylladb/scylladb/pull/13134: `21b505e67c`, `a4411e9ec4`, `c1d0ee2bce`, `8e3392c64f`, `794d0e4000`, `e407956e9f` - https://github.com/scylladb/scylladb/pull/13271 - https://github.com/scylladb/scylladb/pull/13399 - picked from https://github.com/scylladb/scylladb/pull/12699: `3508a4e41e`, `08d754e13f`, `62a945ccd5`, `041ee3ffdd` - https://github.com/scylladb/scylladb/pull/13438 (but skipped the test_mutation_schema_change.py fix since I didn't backport this new test) - https://github.com/scylladb/scylladb/pull/13427 - https://github.com/scylladb/scylladb/pull/13756 - https://github.com/scylladb/scylladb/pull/13789 - https://github.com/scylladb/scylladb/pull/13933 (but skipped the test_snapshot.py fix since I didn't backport this new test) Closes #14215 * github.com:scylladb/scylladb: test: pylib: fix `read_barrier` implementation test: pylib: random_tables: perform read barrier in `verify_schema` test: issue a read barrier before checking ring consistency Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr test/pylib: ManagerClient helpers to wait for... 
test: pylib: Add a way to create cql connections with particular coordinators test/pylib: get gossiper alive endpoints test/topology: default replication factor 3 test/pylib: configurable replication factor scylla_cluster.py: optimize node logs reading test/pylib: RandomTables.add_column with value column scylla_cluster.py: add start flag to server_add ServerInfo: drop host_id scylla_cluster.py: add config to server_add scylla_cluster.py: add expected_error to server_start scylla_cluster.py: ScyllaServer.start, refactor error reporting scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed test: improve logging in ScyllaCluster test: topology smp test with custom cluster test/pylib: topology: support clusters of initial size 0 Merge 'test/pylib: split and refactor topology tests' from Alecco Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun test: Increase START_TIMEOUT test/pylib: one-shot error injection helper test: topology: wait for token ring/group 0 consistency after decommission test: topology: verify that group 0 and token ring are consistent Merge 'pytest: start after ungraceful stop' from Alecco Merge 'test.py: improve test failure handling' from Kamil Braun	2023-06-15 07:19:39 +03:00
Pavel Emelyanov	210e3d1999	Backport 'Merge 'Enlighten messaging_service::shutdown()'' This includes seastar update titled 'Merge 'Split rpc::server stop into two parts'' * br-5.2-backport-ms-shutdown: messaging_service: Shutdown rpc server on shutdown messaging_service: Generalize stop_servers() messaging_service: Restore indentation after previous patch messaging_service: Coroutinize stop() messaging_service: Coroutinize stop_servers() Update seastar submodule refs: #14031	2023-06-14 09:14:06 +03:00
Pavel Emelyanov	702d622b38	messaging_service: Shutdown rpc server on shutdown The RPC server now has a lighter .shutdown() method that does just what messaging_service's shutdown() needs, so call it. On stop, call the regular stop to finalize the stopping process. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-14 09:04:04 +03:00
Pavel Emelyanov	db44630254	messaging_service: Generalize stop_servers() Make it do_with_servers() and make it accept method to call and message to print. This gives the ability to reuse this helper in next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-14 09:03:59 +03:00
Pavel Emelyanov	5d3d64bafe	messaging_service: Restore indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-14 09:03:53 +03:00
Pavel Emelyanov	079f5d8eca	messaging_service: Coroutinize stop() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-14 09:03:48 +03:00
Pavel Emelyanov	fd7310b104	messaging_service: Coroutinize stop_servers() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-14 09:03:42 +03:00
Pavel Emelyanov	991d00964d	Update seastar submodule * seastar 8c86e6de...32ab15cd (1): > rpc: Introduce server::shutdown() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-14 09:02:46 +03:00
Anna Stuchlik	0137ddaec8	doc: remove support for Ubuntu 18 Fixes https://github.com/scylladb/scylladb/issues/14097 This commit removes support for Ubuntu 18 from platform support for ScyllaDB Enterprise 2023.1. The update is in sync with the change made for ScyllaDB 5.2. This commit must be backported to branch-5.2 and branch-5.3. Closes #14118 (cherry picked from commit `b7022cd74e`)	2023-06-13 12:06:56 +03:00
Raphael S. Carvalho	58f88897c8	compaction: Fix incremental compaction for sstable cleanup After `c7826aa910`, sstable runs are cleaned up together. The procedure which executes cleanup was holding reference to all input sstables, such that it could later retry the same cleanup job on failure. Turns out it was not taking into account that incremental compaction will exhaust the input set incrementally. Therefore cleanup is affected by the 100% space overhead. To fix it, cleanup will now have the input set updated, by removing the sstables that were already cleaned up. On failure, cleanup will retry the same job with the remaining sstables that weren't exhausted by incremental compaction. New unit test reproduces the failure, and passes with the fix. Fixes #14035. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14038 (cherry picked from commit `23443e0574`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14193	2023-06-13 09:53:46 +03:00
Kamil Braun	f4115528d6	test: pylib: fix `read_barrier` implementation The previous implementation didn't actually do a read barrier, because the statement failed on an early prepare/validate step which happened before read barrier was even performed. Change it to a statement which does not fail and doesn't perform any schema change but requires a read barrier. This breaks one test which uses `RandomTables.verify_schema()` when only one node is alive, but `verify_schema` performs a read barrier. Unbreak it by skipping the read barrier in this case (it makes sense in this particular test). Closes #13933 (cherry picked from commit `64dc76db55`) Backport note: skipped the test_snapshot.py change, as the test doesn't exist on this branch.	2023-06-12 12:40:22 +02:00
Kamil Braun	9c941aba0b	test: pylib: random_tables: perform read barrier in `verify_schema` `RandomTables.verify_schema` is often called in topology tests after performing a schema change. It compares the schema tables fetched from some node to the expected latest schema stored by the `RandomTables` object. However there's no guarantee that the latest schema change has already propagated to the node which we query. We could have performed the schema change on a different node and the change may not have been applied yet on all nodes. To fix that, pick a specific node and perform a read barrier on it, then use that node to fetch the schema tables. Fixes #13788 Closes #13789 (cherry picked from commit `3f3dcf451b`)	2023-06-12 12:40:22 +02:00
Konstantin Osipov	094bcac399	test: issue a read barrier before checking ring consistency Raft replication doesn't guarantee that all replicas see identical Raft state at all times, it only guarantees the same order of events on all replicas. When comparing raft state with gossip state on a node, first issue a read barrier to ensure the node has the latest raft state. To issue a read barrier it is sufficient to alter a non-existing state: in order to validate the DDL the node needs to sync with the leader and fetch its latest group0 state. Fixes #13518 (flaky topology test). Closes #13756 (cherry picked from commit `e7c9ca560b`)	2023-06-12 12:40:22 +02:00
Kamil Braun	e49a531aaa	Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr This is a follow-up to #13399, the patch addresses the issues mentioned there: * linesep can be split between blocks; * linesep can be part of UTF-8 sequence; * avoid excessively long lines, limit to 256 chars; * the logic of the function made simpler and more maintainable. Closes #13427 * github.com:scylladb/scylladb: pylib_test: add tests for read_last_line pytest: add pylib_test directory scylla_cluster.py: fix read_last_line scylla_cluster.py: move read_last_line to util.py (cherry picked from commit `70f2b09397`)	2023-06-12 12:40:22 +02:00
Alejo Sanchez	bcf99a37cd	test/pylib: ManagerClient helpers to wait for... server to see other servers after start/restart When starting/restarting a server, provide a way to wait for the server to see at least n other servers. Also leave the implementation methods available for manual use and update previous tests, one to wait for a specific server to be seen, and one to wait for a specific server to not be seen (down). Fixes #13147 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #13438 (cherry picked from commit `11561a73cb`) Backport note: skipped the test_mutation_schema_change.py fix as the test doesn't exist on this branch.	2023-06-12 12:40:08 +02:00
Tomasz Grabiec	fe4af95745	test: pylib: Add a way to create cql connections with particular coordinators Usage: await manager.driver_connect(server=servers[0]) manager.cql.execute(f"...", execution_profile='whitelist') (cherry picked from commit `041ee3ffdd`)	2023-06-12 12:38:15 +02:00
Alejo Sanchez	ac5dff7de0	test/pylib: get gossiper alive endpoints Helper to get list of gossiper alive endpoints from REST API. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> (cherry picked from commit `62a945ccd5`)	2023-06-12 12:38:15 +02:00
Alejo Sanchez	ad99456a9d	test/topology: default replication factor 3 For most tests there will be nodes down, increase replication factor to 3 to avoid having problems for partitions belonging to down nodes. Use replication factor 1 for raft upgrade tests. (cherry picked from commit `08d754e13f`)	2023-06-12 12:38:15 +02:00
Alejo Sanchez	937e890fba	test/pylib: configurable replication factor Make replication factor configurable for the RandomTables helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> (cherry picked from commit `3508a4e41e`)	2023-06-12 12:38:15 +02:00
Petr Gusev	12eec5bb2b	scylla_cluster.py: optimize node logs reading There are two occasions in scylla_cluster where we read the node logs, and in both of them we read the entire file into memory. This is not efficient and may cause an OOM. In the first case we need the last line of the log file, so we seek to the end and move backwards looking for a newline character. In the second case we look through the log file to find the expected_error. The readlines() method returns a Python list object, which means it reads the entire file into memory. It's sufficient to remove that call, since iterating over the file instance already yields lines lazily one by one. This is a follow-up for #13134. Closes #13399 (cherry picked from commit `09636b20f3`)	2023-06-12 12:38:15 +02:00
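The two log-reading approaches described in the commit above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from scylla_cluster.py: `read_last_line` borrows its name from the follow-up commit but the body is hypothetical, `log_contains` is a hypothetical helper name, and the 4096-byte block size and UTF-8 log encoding are assumptions.

```python
import os


def read_last_line(path, block_size=4096):
    """Return the last non-empty line of a file without reading the whole file.

    Illustrative sketch: reads fixed-size blocks backwards from the end until
    a newline that precedes the last line is found.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Ignore trailing line terminators, then look for the newline
            # that separates the last line from the one before it.
            stripped = tail.rstrip(b"\r\n")
            idx = stripped.rfind(b"\n")
            if idx != -1:
                return stripped[idx + 1:].decode("utf-8", errors="replace")
        # The whole file is a single line (or empty).
        return tail.rstrip(b"\r\n").decode("utf-8", errors="replace")


def log_contains(path, expected_error):
    """Return True if any line of the file contains expected_error.

    Iterating over the file object yields lines lazily, so the file is
    never held in memory as a list the way readlines() would hold it.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return any(expected_error in line for line in f)
```

Searching for the newline on raw bytes and decoding only the final slice keeps the backward scan from splitting a multi-byte UTF-8 character, since the byte 0x0A never occurs inside one.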
Alejo Sanchez	59847389d4	test/pylib: RandomTables.add_column with value column When adding extra columns in a test, make them value column. Name them with the "v_" prefix and use the value column number counter. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #13271 (cherry picked from commit `81b40c10de`)	2023-06-12 12:38:15 +02:00
Petr Gusev	7a8c5db55b	scylla_cluster.py: add start flag to server_add Sometimes when creating a node it's useful to just install it and not start it. For example, we may want to try to start it later with an expected error. The ScyllaServer.install method has been made exception-safe: if an exception occurs, it reverts to the original state. This avoids duplicating the try/except logic in two of its call sites. (cherry picked from commit `e407956e9f`)	2023-06-12 12:38:15 +02:00
Petr Gusev	15ea5bf53f	ServerInfo: drop host_id We are going to allow the ScyllaCluster.add_server function not to start the server if the caller has requested that with a special parameter. The host_id can only be obtained from a running node, so add_server won't be able to return it in this case. I've grepped the tests for host_id and there doesn't seem to be any reference to it in the code. (cherry picked from commit `794d0e4000`)	2023-06-12 12:38:15 +02:00
Petr Gusev	3ab610753e	scylla_cluster.py: add config to server_add Sometimes when creating a node it's useful to pass a custom node config. (cherry picked from commit `8e3392c64f`)	2023-06-12 12:38:15 +02:00
Petr Gusev	1959eddf86	scylla_cluster.py: add expected_error to server_start Sometimes it's useful to check that the node has failed to start for a particular reason. If server_start can't find expected_error in the node's log or if the node has started without errors, it throws an exception. (cherry picked from commit `c1d0ee2bce`)	2023-06-12 12:38:15 +02:00
Petr Gusev	43525aec83	scylla_cluster.py: ScyllaServer.start, refactor error reporting Extract the function that encapsulates all the error reporting logic. We are going to use it in several other places to implement expected_error feature. (cherry picked from commit `a4411e9ec4`)	2023-06-12 12:38:15 +02:00
Petr Gusev	930c4e65d6	scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed The ScyllaServer expects cmd to be None if the Scylla process is not running. Otherwise, if start failed and the test called update_config, the latter will try to send a signal to a non-existent process via cmd. (cherry picked from commit `21b505e67c`)	2023-06-12 12:38:15 +02:00
Konstantin Osipov	d2caaef188	test: improve logging in ScyllaCluster Print IP addresses and cluster identifiers in more log messages, it helps debugging. (cherry picked from commit `7309a1bd6b`)	2023-06-12 12:38:15 +02:00
Alejo Sanchez	6474edd691	test: topology smp test with custom cluster Instead of decommission of initial cluster, use custom cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #13589 (cherry picked from commit `ce87aedd30`)	2023-06-12 12:38:15 +02:00
Alejo Sanchez	b39cdadff3	test/pylib: topology: support clusters of initial size 0 To allow tests with custom clusters, allow configuration of initial cluster size of 0. Add a proof-of-concept test to be removed later. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #13342 (cherry picked from commit `e3b462507d`)	2023-06-12 12:38:15 +02:00
Nadav Har'El	7b60cddae7	Merge 'test/pylib: split and refactor topology tests' from Alecco Move long running topology tests out of `test_topology.py` and into their own files, so they can be run in parallel. While there, merge simple schema tests. Closes #12804 * github.com:scylladb/scylladb: test/topology: rename topology test file test/topology: lint and type for topology tests test/topology: move topology ip tests to own file test/topology: move topology test remove garbaje... test/topology: move topology rejoin test to own file test/topology: merge topology schema tests and... test/topology: isolate topology smp params test test/topology: move topology helpers to common file (cherry picked from commit `a24600a662`)	2023-06-12 12:38:15 +02:00
Botond Dénes	ea80fe20ad	Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun Recently we enabled RBNO by default in all topology operations. This made the operations a bit slower (repair-based topology ops are a bit slower than classic streaming - they do more work), and in debug mode with a large number of concurrent tests running, they might time out. The timeout for bootstrap was already increased before; do the same for decommission/removenode. The previously used timeout was 300 seconds (the default used by the aiohttp library when it makes HTTP requests); now use the TOPOLOGY_TIMEOUT constant from ScyllaServer, which is 1000 seconds. Closes #12765 * github.com:scylladb/scylladb: test/pylib: use larger timeout for decommission/removenode test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT (cherry picked from commit `e55f475db1`)	2023-06-12 12:38:15 +02:00
Asias He	f90fe6f312	test: Increase START_TIMEOUT It is observed that CI machine is slow to run the test. Increase the timeout of adding servers. (cherry picked from commit `fc60484422`)	2023-06-12 12:38:15 +02:00
Alejo Sanchez	6e2c547388	test/pylib: one-shot error injection helper Existing helper with async context manager only worked for non one-shot error injections. Fix it and add another helper for one-shot without a context manager. Fix tests using the previous helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> (cherry picked from commit `9ceb6aba81`)	2023-06-12 12:38:05 +02:00
Kamil Braun	91aa2cd8d7	test: topology: wait for token ring/group 0 consistency after decommission There was a check for immediate consistency after a decommission operation has finished in one of the tests, but it turns out that also after decommission it might take some time for token ring to be updated on other nodes. Replace the check with a wait. Also do the wait in another test that performs a sequence of decommissions. We won't attempt to start another decommission until every node learns that the previously decommissioned node has left. Closes #12686 (cherry picked from commit `40142a51d0`)	2023-06-12 11:58:32 +02:00
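Replacing an immediate consistency check with a wait, as the commit above describes, amounts to polling a condition until it holds or a deadline passes. The sketch below only illustrates that pattern: `wait_until`, its parameters, and the default timeout and polling period are hypothetical and are not claimed to match the helper used in test/pylib.

```python
import asyncio
import time


async def wait_until(predicate, timeout=5.0, period=0.1):
    """Poll an async predicate until it returns a truthy value.

    Hypothetical helper: raises TimeoutError if the predicate is still
    falsy once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = await predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        await asyncio.sleep(period)
```

A test would pass a predicate that compares, for example, the token ring members seen by a node against the expected set, and would only start the next decommission once the wait returns.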
Kamil Braun	05c3f7ecef	test: topology: verify that group 0 and token ring are consistent After topology changes like removing a node, verify that the set of group 0 members and token ring members is the same. Modify `get_token_ring_host_ids` to only return NORMAL members. The previous version which used the `/storage_service/host_id` endpoint might have returned non-NORMAL members as well. Fixes: #12153 Closes #12619 (cherry picked from commit `fa9cf81af2`)	2023-06-12 11:58:02 +02:00
Kamil Braun	3aa73e8b5a	Merge 'pytest: start after ungraceful stop' from Alecco If a server is stopped suddenly (i.e. not graceful), schema tables might be in inconsistent state. Add a test case and enable Scylla configuration option (force_schema_commit_log) to handle this. Fixes #12218 Closes #12630 * github.com:scylladb/scylladb: pytest: test start after ungraceful stop test.py: enable force_schema_commit_log (cherry picked from commit `5eadea301e`)	2023-06-12 11:57:09 +02:00
Nadav Har'El	a0ba3b3350	Merge 'test.py: improve test failure handling' from Kamil Braun Improve logging by printing the cluster at the end of each test. Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure. Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test. Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do. Closes #12652 * github.com:scylladb/scylladb: test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters test/topology: don't drop random_tables keyspace after a failed test test/pylib: mark cluster as dirty after a failed test test: pylib, topology: don't perform operations after test on a dirty cluster test/pylib: print cluster at the end of test (cherry picked from commit `2653865b34`)	2023-06-12 11:47:54 +02:00
Anna Mikhlin	ea08d409f1	release: prepare for 5.2.3	2023-06-08 22:04:50 +03:00
Avi Kivity	f32971b81f	Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an entire page worth of work. Also add a unit test checking that the mismatch is detected in the first place and that readers are closed. Fixes: #13784 Closes #13790 * github.com:scylladb/scylladb: test/boost/database_test: add unit test for semaphore mismatch on range scans partition_slice_builder: add set_specific_ranges() multishard_mutation_query: make reader_context::lookup_readers() exception safe multishard_mutation_query: lookup_readers(): make inner lambda a coroutine (cherry picked from commit `1c0e8c25ca`)	2023-06-08 04:29:51 -04:00
Michał Chojnowski	8872157422	data_dictionary: fix forgetting of UDTs on ALTER KEYSPACE Due to a simple programming oversight, one of keyspace_metadata constructors is using empty user_types_metadata instead of the passed one. Fix that. Fixes #14139 Closes #14143 (cherry picked from commit `1a521172ec`)	2023-06-06 21:52:47 +03:00
Kamil Braun	b5785ed434	auth: don't use infinite timeout in `default_role_row_satisfies` query A long long time ago there was an issue about removing infinite timeouts from distributed queries: #3603. There was also a fix: `620e950fc8`. But apparently some queries escaped the fix, like the one in `default_role_row_satisfies`. With the right conditions and timing this query may cause a node to hang indefinitely on shutdown. A node tries to perform this query after it starts. If we kill another node which is required to serve this query right before that moment, the query will hang; when we try to shut down the querying node, it will wait for the query to finish (it's a background task in auth service), which it never does due to infinite timeout. Use the same timeout configuration as other queries in this module do. Fixes #13545. Closes #14134 (cherry picked from commit `f51312e580`)	2023-06-06 19:39:29 +03:00
Pavel Emelyanov	70f93767fd	Update seastar submodule * seastar 98504c4b...8c86e6de (1): > rpc: Wait for server socket to stop before killing conns Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-30 20:10:44 +03:00
Tzach Livyatan	bb3751334c	Remove Ubuntu 18.04 support from 5.2 Ubuntu [18.04 will be soon out of standard support](https://ubuntu.com/blog/18-04-end-of-standard-support), and can be removed from 5.2 supported list https://github.com/scylladb/scylla-pkg/issues/3346 Closes #13529 (cherry picked from commit `e655060429`)	2023-05-30 16:25:42 +03:00
Beni Peled	9dd70a58c3	release: prepare for 5.2.2	2023-05-18 14:03:20 +03:00
Anna Stuchlik	0bc6694ac5	doc: fix the links to the Enterprise docs Fixes https://github.com/scylladb/scylladb/issues/13915 This commit fixes broken links to the Enterprise docs. They are links to the enterprise branch, which is not published. The links to the Enterprise docs should include "stable" instead of the branch name. This commit must be backported to branch-5.2, because the broken links are present in the published 5.2 docs. Closes #13917 (cherry picked from commit `6f4a68175b`)	2023-05-18 08:40:02 +03:00
Botond Dénes	486483b379	Merge '[Backport 5.2]: node ops backports' from Benny Halevy This branch backports to branch-5.2 several fixes related to node operations: - `ba919aa88a` (PR #12980; Fixes: #11011, #12969) - `53636167ca` (part of PR #12970; Fixes: #12764, #12956) - `5856e69462` (part of PR #12970) - `2b44631ded` (PR #13028; Fixes: #12989) - `6373452b31` (PR #12799; Fixes #12798) Closes #13531 * github.com:scylladb/scylladb: Merge 'Do not mask node operation errors' from Benny Halevy Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec storage_service: Wait for normal state handler to finish in replace storage_service: Wait for normal state handler to finish in bootstrap storage_service: Send heartbeat earlier for node ops	2023-05-17 16:46:49 +03:00
Tzach Livyatan	9afaec5b12	Update Azure recommended instances type from the Lsv2-series to the Lsv3-series Closes #13835 (cherry picked from commit `a73fde6888`)	2023-05-17 15:41:47 +03:00
Anna Stuchlik	9c99dc36b5	doc: add OS support for version 2023.1 Fixes https://github.com/scylladb/scylladb/issues/13857 This commit adds the OS support for ScyllaDB Enterprise 2023.1. The support is the same as for ScyllaDB Open Source 5.2, on which 2023.1 is based. After this commit is merged, it must be backported to branch-5.2. In this way, it will be merged to branch-2023.1 and available in the docs for Enterprise 2023.1 Closes: #13858 (cherry picked from commit `84ed95f86f`)	2023-05-16 10:11:21 +03:00
Tomasz Grabiec	548a7f73d3	Merge 'range_tombstone_change_generator: fix an edge case in flush()' from Michał Chojnowski range_tombstone_change_generator::flush() mishandles the case when two range tombstones are adjacent and flush(pos, end_of_range=true) is called with pos equal to the end bound of the lesser-position range tombstone. In such case, the start change of the greater-position rtc will be accidentally emitted, and there won't be an end change, which breaks reader assumptions by ending the stream with an unclosed range tombstone, triggering an assertion. This is due to a non-strict inequality used in a place where strict inequality should be used. The modified line was intended to close range tombstones which end exactly on the flush position, but this is unnecessary because such range tombstones are handled by the last `if` in the function anyway. Instead, this line caused range tombstones beginning right after the flush position to be emitted sometimes. Fixes https://github.com/scylladb/scylladb/issues/12462 Closes #13894 * github.com:scylladb/scylladb: tests: row_cache: Add reproducer for reader producing missing closing range tombstone range_tombstone_change_generator: fix an edge case in flush()	2023-05-15 23:29:08 +02:00
Raphael S. Carvalho	5c66875dbe	sstables: Fix use-after-move when making reader in reverse mode static report: sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed] legacy_reverse_slice_to_native_reverse_slice(schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor); Fixes #13394. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `213eaab246`)	2023-05-15 20:27:34 +03:00
Raphael S. Carvalho	26b4d2c3c1	db/view/build_progress_virtual_reader: Fix use-after-move use-after-free in ctor, which potentially leads to a failure when locating table from moved schema object. static report In file included from db/system_keyspace.cc:51: ./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed] _db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS), Fixes #13395. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `1ecba373d6`)	2023-05-15 20:26:01 +03:00
Raphael S. Carvalho	874062b72a	index/built_indexes_virtual_reader.hh: Fix use-after-move static report: ./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed] _db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS), Fixes #13396. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `f8df3c72d4`)	2023-05-15 20:24:35 +03:00
Raphael S. Carvalho	71ec750a59	replica: Fix use-after-move in table::make_streaming_reader Variant used by streaming/stream_transfer_task.cc: , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs)) as full slice is retrieved after schema is moved (clang evaluates left-to-right), the stream transfer task can be potentially working on a stale slice for a particular set of partitions. static report: In file included from replica/dirty_memory_manager.cc:6: replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed] return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice()); Fixes #13397. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `04932a66d3`)	2023-05-15 20:21:48 +03:00
Tomasz Grabiec	7c1bdc6553	tests: row_cache: Add reproducer for reader producing missing closing range tombstone Adds a reproducer for #12462. The bug manifests by reader throwing: std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}} The reason is that prior to the fix range_tombstone_change_generator::flush() was used with end_of_range=true to produce the closing range_tombstone_change and it did not handle correctly the case when there are two adjacent range tombstones and flush(pos, end_of_range=true) is called such that pos is the boundary between the two. Cherry-picked from `a717c803c7`.	2023-05-15 18:02:40 +02:00
Michał Chojnowski	24d966f806	range_tombstone_change_generator: fix an edge case in flush() range_tombstone_change_generator::flush() mishandles the case when two range tombstones are adjacent and flush(pos, end_of_range=true) is called with pos equal to the end bound of the lesser-position range tombstone. In such case, the start change of the greater-position rtc will be accidentally emitted, and there won't be an end change, which breaks reader assumptions by ending the stream with an unclosed range tombstone, triggering an assertion. This is due to a non-strict inequality used in a place where strict inequality should be used. The modified line was intended to close range tombstones which end exactly on the flush position, but this is unnecessary because such range tombstones are handled by the last `if` in the function anyway. Instead, this line caused range tombstones beginning right after the flush position to be emitted sometimes. Fixes #12462	2023-05-15 17:48:24 +02:00
Asias He	05a3a1bf55	tombstone_gc: Fix gc_before for immediate mode The immediate mode is similar to timeout mode with gc_grace_seconds zero. Thus, the gc_before returned should be the query_time instead of gc_clock::time_point::max in immediate mode. Setting gc_before to gc_clock::time_point::max, a row could be dropped by compaction even if the ttl is not expired yet. The following procedure reproduces the issue: - Start 2 nodes - Insert data ``` CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 }; CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'}; INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z') USING TTL 1000000; INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z') USING TTL 1000000; INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z') USING TTL 1000000; ``` - Run nodetool flush and nodetool compact - Compaction drops all data ``` ~128 total partitions merged to 0. ``` Fixes #13572 Closes #13800 (cherry picked from commit `7fcc403122`)	2023-05-15 10:33:29 +03:00
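The rule stated in the commit above (immediate mode behaves like timeout mode with gc_grace_seconds set to zero) can be modelled in a few lines. This is a simplified sketch, not the C++ implementation: the function names, the string mode values, and the `can_purge` comparison are assumptions made for illustration, and only the two modes named in the commit message are modelled.

```python
from datetime import datetime, timedelta


def gc_before(mode, query_time, gc_grace_seconds=0):
    """Simplified model of the cutoff before which data may be purged."""
    if mode == "immediate":
        # Same as timeout mode with gc_grace_seconds == 0.
        return query_time
    if mode == "timeout":
        return query_time - timedelta(seconds=gc_grace_seconds)
    raise ValueError(f"mode not modelled in this sketch: {mode}")


def can_purge(expiry_time, gc_before_time):
    """Assumed rule: data is purgeable only if it expired before the cutoff."""
    return expiry_time < gc_before_time
```

Under this model a row written with `USING TTL 1000000` has an expiry far after `query_time`, so it is kept when the cutoff is `query_time`, whereas a cutoff of `datetime.max` (standing in for `gc_clock::time_point::max`) would make it purgeable.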
Takuya ASADA	f148a6be1d	scylla_kernel_check: suppress verbose iotune messages Stop printing verbose iotune messages during the check; print only the error message. Fixes #13373. Closes #13362 (cherry picked from commit `160c184d0b`)	2023-05-14 21:25:57 +03:00
Benny Halevy	5785550e24	view: view_builder: start: demote sleep_aborted log error This is not really an error, so print it in debug log_level rather than error log_level. Fixes #13374 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #13462 (cherry picked from commit `cc42f00232`)	2023-05-14 21:21:59 +03:00
Avi Kivity	401de17c82	Update seastar submodule (condition_variable tasktrace fix) * seastar aa46b980ec...98504c4bb6 (1): > condition-variable: replace the coroutine wakeup task with a promise Fixes #13368	2023-05-14 21:12:12 +03:00
Raphael S. Carvalho	94c9553e8a	Fix use-after-move when initializing row cache with dummy entry Courtesy of clang-tidy: row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move] _partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema}); ^ row_cache.cc:1191:60: note: move occurred here _partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema}); ^ row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated _partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema}); The use-after-move is UB; whether it happens depends on evaluation order. We haven't hit it yet as clang is left-to-right. Fixes #13400. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #13401 (cherry picked from commit `d2d151ae5b`)	2023-05-14 21:02:24 +03:00
Anna Mikhlin	f1c45553bc	release: prepare for 5.2.1	2023-05-08 22:15:46 +03:00
Botond Dénes	1a288e0a78	Update seastar submodule * seastar 1488aaf8...aa46b980 (1): > core/on_internal_error: always log error with backtrace Fixes: #13786	2023-05-08 10:30:10 +03:00
Marcin Maliszkiewicz	a2fed1588e	db: view: use deferred_close for closing staging_sstable_reader When consume_in_thread throws the reader should still be closed. Related https://github.com/scylladb/scylla-enterprise/issues/2661 Closes #13398 Refs: scylladb/scylla-enterprise#2661 Fixes: #13413 (cherry picked from commit `99f8d7dcbe`)	2023-05-08 09:41:07 +03:00
Botond Dénes	f07a06d390	Merge 'service:forward_service: use long type instead of counter in function mocking' from Michał Jadwiszczak Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used. Fixes: #12939 Closes #12963 * github.com:scylladb/scylladb: test:boost: counter column parallelized aggregation test service:forward_service: use long type when column is counter (cherry picked from commit `61e67b865a`)	2023-05-07 14:27:29 +03:00
Anna Stuchlik	4ec531d807	doc: remove the sequential repair option from docs Fixes https://github.com/scylladb/scylladb/issues/12132 The sequential repair mode is not supported. This commit removes the incorrect information from the documentation. Closes #13544 (cherry picked from commit `3d25edf539`)	2023-05-07 14:27:29 +03:00
Asias He	4867683f80	storage_service: Fix removing replace node as pending Consider - n1, n2, n3 - n3 is down - n4 replaces n3 with the same ip address 127.0.0.3 - Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2 ``` auto host_id = _gossiper.get_host_id(endpoint); auto existing = tmptr->get_endpoint_for_host_id(host_id); ``` host_id = new host id existing = empty As a result, del_replacing_endpoint() will not be called. This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when replacing is done. This is wrong. This is a regression since commit `9942c60d93` (storage_service: do not inherit the host_id of a replaced a node), where replacing node uses a new host id than the node to be replaced. To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing is empty. Before: n1: storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3 token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3 storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3 storage_service - Node 127.0.0.3 state jump to normal storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3 After: n1: storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3 token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3 storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3 storage_service - Node 127.0.0.3 state jump to normal token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3 storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by 
node=127.0.0.3 Tests: https://github.com/scylladb/scylla-dtest/pull/3126 Closes #13677 Fixes: https://github.com/scylladb/scylla-enterprise/issues/2852 (cherry picked from commit `a8040306bb`)	2023-05-03 14:15:13 +03:00
Botond Dénes	0e42defe06	readers: evictable_reader: skip progress guarantee when next pos is partition start The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the last buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between. The code guaranteeing this forward progress has a bug: when the next expected position is a partition-start (another partition), the code would loop forever, effectively reading all there is from the underlying reader. To avoid this, add a special case to ignore the progress guarantee loop altogether when the next expected position is a partition start. In this case, progress is guaranteed anyway, because there is exactly one partition-start fragment in each partition. Fixes: #13491 Closes #13563 (cherry picked from commit `72003dc35c`)	2023-05-02 21:58:41 +03:00
Avi Kivity	f73d017f05	tools: toolchain: regenerate Fixes #13744	2023-05-02 13:16:59 +03:00
Pavel Emelyanov	3723678b82	scylla-gdb: Parse and eval _all_threads without quotes I've no idea why the quotes are there at all, it works even without them. However, with quotes gdb-13 fails to find the _all_threads static thread-local variable _unless_ it's printed with gdb "p" command beforehand. fixes: #13125 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13132 (cherry picked from commit `537510f7d2`)	2023-05-02 13:16:59 +03:00
Botond Dénes	ea506f50cc	Merge 'Do not mask node operation errors' from Benny Halevy This series handles errors when aborting node operations and prints them rather than letting them leak and be exposed to the user. Also, clean up the node_ops logging formats when aborting different node ops and add more error logging around errors in the "worker" nodes. Closes #12799 * github.com:scylladb/scylladb: storage_service: node_ops_signal_abort: print a warning when signaling abort storage_service: s/node_ops_singal_abort/node_ops_signal_abort/ storage_service: node_ops_abort: add log messages storage_service: wire node_ops_ctl for node operations storage_service: add node_ops_ctl class to formalize all node_ops flow repair: node_ops_cmd_request: add print function repair: do_decommission_removenode_with_repair: log ignore_nodes repair: replace_with_repair: get ignore_nodes as unordered_set gossiper: get_generation_for_nodes: get nodes as unordered_set storage_service: don't let node_ops abort failures mask the real error (cherry picked from commit `6373452b31`)	2023-04-30 18:58:28 +03:00
Kamil Braun	42fd3704e4	Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec This patch fixes a problem which affects decommission and removenode which may lead to data consistency problems under conditions which lead one of the nodes to unilaterally decide to abort the node operation without the coordinator noticing. If this happens during streaming, the node operation coordinator would proceed to make a change in the gossiper, and only later detect that one of the nodes aborted during sending of decommission_done or removenode_done command. That's too late, because the operation will be finalized by all the nodes once gossip propagates. It's unsafe to finalize the operation while another node aborted. The other node reverted to the old topology, with which they were running for some time, without considering the pending replica when handling requests. As a result, we may end up with consistency issues. Writes made by those coordinators may not be replicated to CL replicas in the new topology. Streaming may have missed to replicate those writes depending on timing. It's possible that some node aborts but streaming succeeds if the abort is not due to network problems, or if the network problems are transient and/or localized and affect only heartbeats. There is no way to revert after we commit the node operation to the gossiper, so it's ok to close node_ops sessions before making the change to the gossiper, and thus detect aborts and prevent later aborts after the change in the gossiper is made. This is already done during bootstrap (RBNO enabled) and replacenode. This patch changes removenode to also take this approach by moving sending of remove_done earlier. We cannot take this approach with decommission easily, because decommission_done command includes a wait for the node to leave the ring, which won't happen before the change to the gossiper is made. 
Separating this from decommission_done would require protocol changes. This patch adds a second-best solution, which is to check if sessions are still there right before making a change to the gossiper, leaving decommission_done where it was. The race can still happen, but the time window is now much smaller. The PR also lays down infrastructure which enables testing the scenarios. It makes node ops watchdog periods configurable, and adds error injections. Fixes #12989 Refs #12969 Closes #13028 * github.com:scylladb/scylladb: storage_service: node ops: Extract node_ops_insert() to reduce code duplication storage_service: Make node operations safer by detecting asymmetric abort storage_service: node ops: Add error injections service: node_ops: Make watchdog and heartbeat intervals configurable (cherry picked from commit `2b44631ded`)	2023-04-30 18:58:28 +03:00
Asias He	c9d19b3595	storage_service: Wait for normal state handler to finish in replace Similar to "storage_service: Wait for normal state handler to finish in bootstrap", this patch enables the check on the replace procedure. (cherry picked from commit `5856e69462`)	2023-04-30 18:58:28 +03:00
Asias He	9a873bf4b3	storage_service: Wait for normal state handler to finish in bootstrap In storage_service::handle_state_normal, storage_service::notify_joined will be called which drops the rpc connections to the node that becomes normal. This causes rpc calls with that node to fail with seastar::rpc::closed_error error. Consider this: - n1 in the cluster - n2 is added to join the cluster - n2 sees n1 is in normal status - n2 starts bootstrap process - notify_joined on n2 closes rpc connection to n1 in the middle of bootstrap - n2 fails to bootstrap For example, during bootstrap with RBNO, we saw repair failed in a test that sets ring_delay to zero and does not wait for gossip to settle. repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for keyspace=system_distributed_everywhere, status=failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed)}) This patch fixes the race by waiting for the handle_state_normal handler to finish before the bootstrap process. Fixes #12764 Fixes #12956 (cherry picked from commit `53636167ca`)	2023-04-30 18:58:28 +03:00
Asias He	51a00280a2	storage_service: Send heartbeat earlier for node ops Node ops has the following procedure: 1 for node in sync_nodes send prepare cmd to node 2 for node in sync_nodes send heartbeat cmd to node If any of the prepare cmds in step 1 takes longer than the heartbeat watchdog timeout, the heartbeat in step 2 will be too late to update the watchdog, and as a result the watchdog will abort the operation. To prevent a slow prepare cmd from killing the node operation, we can start the heartbeat earlier in the procedure. Fixes #11011 Fixes #12969 Closes #12980 (cherry picked from commit `ba919aa88a`)	2023-04-30 18:58:28 +03:00
Wojciech Mitros	b0a7c02e09	rust: update dependencies Cranelift-codegen 0.92.0 and wasmtime 5.0.0 have security issues potentially allowing malicious UDFs to read some memory outside the wasm sandbox. This patch updates them to versions 0.92.1 and 5.0.1 respectively, where the issues are fixed. Fixes #13157 Closes #13171 (cherry picked from commit `aad2afd417`)	2023-04-27 22:01:44 +03:00
Wojciech Mitros	f18c49dcc6	rust: update dependencies Wasmtime added some improvements in recent releases - particularly, two security issues were patched in version 2.0.2. There were no breaking changes for our use other than the strategy of returning Traps - all of them are now anyhow::Errors instead, but we can still downcast to them, and read the corresponding error message. The cxx, anyhow and futures dependency versions now match the versions saved in the Cargo.lock. Closes #12830 (cherry picked from commit `8b756cb73f`) Ref #13157	2023-04-27 22:00:54 +03:00
Anna Stuchlik	35dfec78d1	doc: fixes https://github.com/scylladb/scylladb/issues/12964 , removes the information that the CDC options are experimental Closes #12973 (cherry picked from commit `4dd1659d0b`)	2023-04-27 21:06:49 +03:00
Raphael S. Carvalho	dbd8ca4ade	replica: Fix undefined behavior in table::generate_and_propagate_view_updates() Undefined behavior because the evaluation order is undefined. With GCC, where evaluation is right-to-left, schema will be moved once it's forwarded to make_flat_mutation_reader_from_mutations_v2(). The consequence is that memory tracking of mutation_fragment_v2 (for tracking only permit used by view update), which uses the schema, can be incorrect. However, it's more likely that Scylla will crash when estimating memory usage for a row, which accesses schema column information using schema::column_at(), which in turn asserts that the requested column really exists. Fixes #13093. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #13092 (cherry picked from commit `3fae46203d`)	2023-04-27 19:56:38 +03:00
Anna Stuchlik	1be4afb842	doc: remove incorrect info about BYPASS CACHE Fixes https://github.com/scylladb/scylladb/issues/13106 This commit removes the information that BYPASS CACHE is an Enterprise-only feature and replaces that info with the link to the BYPASS CACHE description. Closes #13316 (cherry picked from commit `1cfea1f13c`)	2023-04-27 19:54:04 +03:00
Kefu Chai	7cc9f5a05f	dist/redhat: enforce dependency on %{release} also * tools/python3 279b6c1...cf7030a (1): > dist: redhat: provide only a single version s/%{version}/%{version}-%{release}/ in `Requires:` sections. this enforces the runtime dependencies of exactly the same releases between scylla packages. Fixes #13222 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `7165551fd7`)	2023-04-27 19:27:34 +03:00
Nadav Har'El	bf7fc9709d	test/rest_api: fix flaky test for toppartitions The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping was flaky. It tests the toppartition request, which unfortunately needs to choose a sampling duration in advance, and we chose 1 second which we considered more than enough - and indeed typically even 1ms is enough! But very rarely (we know of only one occurrence, in issue #13223) one second is not enough. Instead of increasing this 1 second and making this test even slower, this patch takes a retry approach: The test starts with a 0.01 second duration, and is then retried with increasing durations until it succeeds or a 5-second duration is reached. This retry approach has two benefits: 1. It de-flakes the test (allowing a very slow test to take 5 seconds instead of 1 second which wasn't enough), and 2. At the same time it makes a successful test much faster (it used to always take a full second, now it takes 0.07 seconds on a dev build on my laptop). A failed test may, in some cases, take 10 seconds after this patch (although in some other cases, an error will be caught immediately), but I consider this acceptable - this test should pass, after all, and a failure indicates a regression and taking 10 seconds will be the least of our worries in that case. Fixes #13223. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #13238 (cherry picked from commit `c550e681d7`)	2023-04-27 19:16:58 +03:00
Nadav Har'El	00a8c3a433	test/alternator: increase CQL connection timeout This patch increases the connection timeout in the get_cql_cluster() function in test/cql-pytest/run.py. This function is used to test that Scylla came up, and also test/alternator/run uses it to set up the authentication - which can only be done through CQL. The Python driver has 2-second and 5-second default timeouts that should have been more than enough for everybody (TM), but in #13239 we saw that in one case it apparently wasn't enough. So to be extra safe, let's increase the default connection-related timeouts to 60 seconds. Note this change only affects the Scylla boot in the test/*/run scripts, and it does not affect the actual tests - those have different code to connect to Scylla (see cql_session() in test/cql-pytest/util.py), and we already increased the timeouts there in #11289. Fixes #13239 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #13291 (cherry picked from commit `4fdcee8415`)	2023-04-27 19:15:39 +03:00
Tomasz Grabiec	c08ed39a33	direct_failure_detector: Avoid throwing exceptions in the success path sleep_abortable() is aborted on success, which causes sleep_aborted exception to be thrown. This causes scylla to throw every 100ms for each pinged node. Throwing may reduce performance if happens often. Also, it spams the logs if --logger-log-level exception=trace is enabled. Avoid by swallowing the exception on cancellation. Fixes #13278. Closes #13279 (cherry picked from commit `99cb948eac`)	2023-04-27 19:14:31 +03:00
Kefu Chai	04424f8956	test: cql-pytest: test_describe: clamp bloom filter's fp rate before this change, we use `round(random.random(), 5)` for the value of `bloom_filter_fp_chance` config option. there are chances that this expression could return a number lower than or equal to 6.71e-05. but we do have a minimum for this option, which is defined by `utils::bloom_calculations::probs`. and the minimal false positive rate is 6.71e-05. we are observing test failures where we are using 0 for the option, and scylla rightly rejected it with the error message of ``` bloom_filter_fp_chance must be larger than 6.71e-05 and less than or equal to 1.0 (got 0) ```. so, in this change, to address the test failure, we always use a number greater than or equal to a value slightly above the minimum to ensure that the randomly picked number is in the range of supported false positive rates. Fixes #13313 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13314 (cherry picked from commit `33f4012eeb`)	2023-04-27 19:12:53 +03:00
Beni Peled	429b696bbc	release: prepare for 5.2.0	2023-04-27 16:26:43 +03:00
Beni Peled	a89867d8c2	release: prepare for 5.2.0-rc5	2023-04-25 14:37:54 +03:00
Benny Halevy	6ad94fedf3	utils: clear_gently: do not clear null unique_ptr Otherwise the null pointer is dereferenced. Add a unit test reproducing the issue and testing this fix. Fixes #13636 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `12877ad026`)	2023-04-24 17:51:01 +03:00
Anna Stuchlik	a6188d6abc	doc: document `tombstone_gc` as not experimental The tombstone_gc was documented as experimental in version 5.0. It is no longer experimental in version 5.2. This commit updates the information about the option. Closes #13469 (cherry picked from commit `a68b976c91`)	2023-04-24 11:54:06 +03:00
Botond Dénes	50095cc3a5	Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun in `make_group0_history_state_id_mutation`, when adding a new entry to the group 0 history table, if the parameter `gc_older_than` is engaged, we create a range tombstone in the mutation which deletes entries older than the new one by `gc_older_than`. In particular if `gc_older_than = 0`, we want to delete all older entries. There was a subtle bug there: we were using millisecond resolution when generating the tombstone, while the provided state IDs used microsecond resolution. On a super fast machine it could happen that we managed to perform two schema changes in a single millisecond; this happened sometimes in `group0_test.test_group0_history_clearing_old_entries` on our new CI/promotion machines, causing the test to fail because the tombstone didn't clear the entry corresponding to the previous schema change when performing the next schema change (since they happened in the same millisecond). Use microsecond resolution to fix that. The consecutive state IDs used in group 0 mutations are guaranteed to be strictly monotonic at microsecond resolution (see `generate_group0_state_id` in service/raft/raft_group0_client.cc). Fixes #13594 Closes #13604 * github.com:scylladb/scylladb: db: system_keyspace: use microsecond resolution for group0_history range tombstone utils: UUID_gen: accept decimicroseconds in min_time_UUID (cherry picked from commit `10c1f1dc80`)	2023-04-23 16:03:02 +03:00
Botond Dénes	7b2215d8e0	Merge 'Backport bugfixes regarding UDT, UDF, UDA interactions to branch-5.2' from Wojciech Mitros This patch backports https://github.com/scylladb/scylladb/pull/12710 to branch-5.2. To resolve the conflicts that it's causing, it also includes * https://github.com/scylladb/scylladb/pull/12680 * https://github.com/scylladb/scylladb/pull/12681 Closes #13542 * github.com:scylladb/scylladb: uda: change the UDF used in a UDA if it's replaced functions: add helper same_signature method uda: return aggregate functions as shared pointers udf: also check reducefunc to confirm that a UDF is not used in a UDA udf: fix dropping UDFs that share names with other UDFs used in UDAs pytest: add optional argument for new_function argument types udt: disallow dropping a user type used in a user function	2023-04-19 01:38:08 -04:00
Botond Dénes	da9f90362d	Merge 'Compaction reevaluation bug fixes' from Raphael "Raph" Carvalho A problem in compaction reevaluation can cause the SSTable set to be left uncompacted for an indefinite amount of time, potentially causing space and read amplification to be suboptimal. Two reevaluation problems are being fixed, one after off-strategy compaction ended, and another in compaction manager which intends to periodically reevaluate a need for compaction. Fixes https://github.com/scylladb/scylladb/issues/13429. Fixes https://github.com/scylladb/scylladb/issues/13430. Closes #13431 * github.com:scylladb/scylladb: compaction: Make compaction reevaluation actually periodic replica: Reevaluate regular compaction on off-strategy completion (cherry picked from commit `9a02315c6b`)	2023-04-19 01:14:33 -04:00
Botond Dénes	c9a17c80f6	mutation/mutation_compactor: consume_partition_end(): reset _stop The purpose of `_stop` is to remember whether the consumption of the last partition was interrupted or it was consumed fully. In the former case, the compactor allows retrieving the compaction state for the given partition, so that its compaction can be resumed at a later point in time. Currently, `_stop` is set to `stop_iteration::yes` whenever the return value of any of the `consume()` methods is also `stop_iteration::yes`. Meaning, if the consuming of the partition is interrupted, this is remembered in `_stop`. However, a partition whose consumption was interrupted is not always continued later. Sometimes consumption of a partition is interrupted because the partition is not interesting and the downstream consumer wants to stop it. In these cases the compactor should not return an engaged optional from `detach_state()`, because there is no state to detach, the state should be thrown away. This was incorrectly handled so far and is fixed in this patch, by overwriting `_stop` in `consume_partition_end()` with whatever the downstream consumer returns. Meaning if they want to skip the partition, then `_stop` is reset to `stop_iteration::no` and `detach_state()` will return a disengaged optional as it should in this case. Fixes: #12629 Closes #13365 (cherry picked from commit `bae62f899d`)	2023-04-18 02:32:24 -04:00
Wojciech Mitros	7242c42089	uda: change the UDF used in a UDA if it's replaced Currently, if a UDA uses a UDF that's being replaced, the UDA will still keep using the old UDF until the node is restarted. This patch fixes this behavior by checking all UDAs when replacing a UDF and updating them if necessary. Fixes #12709 (cherry picked from commit `02bfac0c66`)	2023-04-17 13:14:46 +02:00
Wojciech Mitros	70ff69afab	functions: add helper same_signature method When deciding whether two functions have the same signature, we have to check if they have the same name and parameter types. Additionally, if they're represented by pointers, we need to check if any of them is a nullptr. This logic is used multiple times, so it's extracted to a separate function. To use this function, the `used_by_user_aggregate` method takes now a function instead of name and types list - we can do it because we always use it with an existing user function (that we're trying to drop). The method will also be useful when we'll be not dropping, but replacing a user function. (cherry picked from commit `58987215dc`)	2023-04-17 13:14:40 +02:00
Wojciech Mitros	5fd4bb853b	uda: return aggregate functions as shared pointers We will want to reuse the functions that we get from an aggregate without making a deep copy, and it's only possible if we get pointers from the aggregate instead of actual values. (cherry picked from commit `20069372e7`)	2023-04-17 13:14:24 +02:00
Wojciech Mitros	313649e86d	udf: also check reducefunc to confirm that a UDF is not used in a UDA When dropping a UDF we're checking if it's not being used in any UDAs and fail otherwise. However, we're only checking its state function and final function, and it may also be used as its reduce function. This patch adds the missing checks and a test for them. (cherry picked from commit `ef1dac813b`)	2023-04-17 13:14:16 +02:00
Wojciech Mitros	14d8cec130	udf: fix dropping UDFs that share names with other UDFs used in UDAs Currently, when dropping a function, we only check if there exists an aggregate that uses a function with the same name as its state function or final function. This may cause the drop to fail even when it's just another UDF with the same name that's used in the aggregate, even when the actual dropped function is not used there. This patch fixes this by checking not only the name of the UDA's sfunc and finalfunc, but also their argument types. (cherry picked from commit `49077dd144`)	2023-04-17 13:14:09 +02:00
Wojciech Mitros	203cbb79a1	pytest: add optional argument for new_function argument types When multiple functions with the same name but different argument types are created, the default drop statement for these functions will fail because it does not include the argument types. With this change, this problem can be worked around by specifying argument types when creating the function, as this will cause the drop statement to include them. (cherry picked from commit `8791b0faf5`)	2023-04-17 13:13:59 +02:00
Wojciech Mitros	51f19d1b8c	udt: disallow dropping a user type used in a user function Currently, nothing prevents us from dropping a user type used in a user function, even though doing so may make us unable to use the function correctly. This patch prevents this behavior by checking all function argument and return types when executing a drop type statement and preventing it from completing if the type is referenced by any of them. (cherry picked from commit `86c61828e6`)	2023-04-17 13:13:35 +02:00
Anna Stuchlik	83735ae77f	doc: update the metrics between 5.2 and 2023.1 Related: https://github.com/scylladb/scylla-enterprise/issues/2794 This commit adds the information about the metric changes in version 2023.1 compared to version 5.2. This commit is part of the 5.2-to-2023.1 upgrade guide and must be backported to branch-5.2. Closes #13506 (cherry picked from commit `989a75b2f7`)	2023-04-17 11:29:43 +02:00
Avi Kivity	9d384e3af2	Merge 'Backport "reader_concurrency_semaphore: don't evict inactive readers needlessly" to branch-5.2' from Botond Dénes The patch doesn't apply cleanly, so a targeted backport PR was necessary. I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change. Fixes: https://github.com/scylladb/scylladb/issues/11803 Closes #13515 * github.com:scylladb/scylladb: reader_concurrency_semaphore: don't evict inactive readers needlessly reader_concurrency_semaphore: add stats to record reason for queueing permits reader_concurrency_semaphore: can_admit_read(): also return reason for rejection	2023-04-17 12:25:21 +03:00
Nadav Har'El	0da0c94f49	cql: USING TTL 0 means unlimited, not default TTL Our documentation states that writing an item with "USING TTL 0" means it should never expire. This should be true even if the table has a default TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no USING TTL at all (i.e., it took the default TTL, instead of unlimited). We had two xfailing tests demonstrating that Scylla's behavior in this is different from Cassandra. Scylla's behavior in this case was also undocumented. By the way, Cassandra used to have the same bug (CASSANDRA-11207) but it was fixed already in 2016 (Cassandra 3.6). So in this patch we fix Scylla's "USING TTL 0" behavior to match the documentation and Cassandra's behavior since 2016. One xfailing test starts to pass and the second test passes this bug and fails on a different one. This patch also adds a third test for "USING TTL ?" with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a missing "USING TTL". The origin of this bug was that after parsing the statement, we saved the USING TTL in an integer, and used 0 for the case of no USING TTL given. This meant that we couldn't tell if we have USING TTL 0 or no USING TTL at all. This patch uses an std::optional so we can tell the case of a missing USING TTL from the case of USING TTL 0. Fixes #6447 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #13079 (cherry picked from commit `a4a318f394`)	2023-04-17 10:41:08 +03:00
Nadav Har'El	1a9f51b767	cql: fix empty aggregation, and add more tests This patch fixes #12475, where an aggregation (e.g., COUNT(*), MIN(v)) of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()") resulted in an internal error instead of the "zero" result that each aggregator expects (e.g., 0 for COUNT, null for MIN). The problem is that normally our aggregator forwarder picks the nodes which hold the relevant partition(s), forwards the request to each of them, and then combines these results. When there are no partitions, the query is sent to no node, and we end up with an empty result set instead of the "zero" results. So in this patch we recognize this case and build those "zero" results (as mentioned above, these aren't always 0 and depend on the aggregation function!). The patch also adds two tests reproducing this issue in a fairly general way (e.g., several aggregators, different aggregation functions) and confirming the patch fixes the bug. The test also includes two additional tests for COUNT aggregation, which uncovered an incompatibility with Cassandra which is still not fixed - so these tests are marked "xfail": Refs #12477: Combining COUNT with GROUP by results with empty results in Cassandra, and one result with empty count in Scylla. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12715 (cherry picked from commit `3ba011c2be`)	2023-04-17 10:41:08 +03:00
Raphael S. Carvalho	dba0e604a7	table: Fix disk-space related metrics total disk space used metric is incorrectly telling the amount of disk space ever used, which is wrong. It should tell the size of all sstables being used + the ones waiting to be deleted. live disk space used, by this definition, shouldn't account the ones waiting to be deleted. and live sstable count, shouldn't account sstables waiting to be deleted. Fix all that. Fixes #12717. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `529a1239a9`)	2023-04-16 22:14:01 +03:00
Michał Chojnowski	4ea67940cb	locator: token_metadata: get rid of a quadratic behaviour in get_address_ranges() Some callees of update_pending_ranges use the variant of get_address_ranges() which builds a hashmap of all <endpoint, owned range> pairs. For everywhere_topology, the size of this map is quadratic in the number of endpoints, making it big enough to cause contiguous allocations of tens of MiB for clusters of realistic size, potentially causing trouble for the allocator (as seen e.g. in #12724). This deserves a correction. This patch removes the quadratic variant of get_address_ranges() and replaces its uses with its linear counterpart. Refs #10337 Refs #10817 Refs #10836 Refs #10837 Fixes #12724 (cherry picked from commit `9e57b21e0c`)	2023-04-16 21:59:14 +03:00
Jan Ciolek	a8c49c44e5	cql/query_options: add a check for missing bind marker name There was a missing check in validation of named bind markers. Let's say that a user prepares a query like: ```cql INSERT INTO ks.tab (pk, ck, v) VALUES (:pk, :ck, :v) ``` Then they execute the query, but specify only values for `:pk` and `:ck`. We should detect that a value for :v is missing and throw an invalid_request_exception. Until now there was no such check, in case of a missing variable invalid `query_options` were created and Scylla crashed. Sadly it's impossible to create a regression test using `cql-pytest` or `boost`. `cql-pytest` uses the python driver, which silently ignores missing named bind variables, deciding that the user meant to send an UNSET_VALUE for them. When given values like `{'pk': 1, 'ck': 2}`, it will automatically extend them to `{'pk': 1, 'ck': 2, 'v': UNSET_VALUE}`. In `boost` I tried to use `cql_test_env`, but it only has methods which take valid `query_options` as a parameter. I could create a separate unit test for the creation and validation of `query_options` but it won't be a true end-to-end test like `cql-pytest`. The bug was found using the rust driver, the reproducer is available in the issue description. Fixes: #12727 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes #12730 (cherry picked from commit `2a5ed115ca`)	2023-04-16 21:57:28 +03:00
Nadav Har'El	12a29edf90	test/alternator: fix flaky test for partition-tombstone scan The test test_scan.py::test_scan_long_partition_tombstone_string checks that a full-table Scan operation ends a page in the middle of a very long string of partition tombstones, and does NOT scan the entire table in one page (if we did that, getting a single page could take an unbounded amount of time). The test is currently flaky, having failed in CI runs three times in the past two months. The reason for the flakiness is that we don't know exactly how long we need to make the sequence of partition tombstones in the test before we can be absolutely sure a single page will not read this entire sequence. For single-partition scans we have the "query_tombstone_page_limit" configuration parameter, which tells us exactly how long we need to make the sequence of row tombstones. But for a full-table scan of partition tombstones, the situation is more complicated - because the scan is done in parallel on several vnodes and each of them needs to read query_tombstone_page_limit before it stops. In my experiments, using query_tombstone_page_limit * 4 consecutive tombstones was always enough - I ran this test hundreds of times and it didn't fail once. But since it did fail on Jenkins very rarely (3 times in the last two months), maybe the multiplier 4 isn't enough. So this patch doubles it to 8. Hopefully this would be enough for anyone (TM). This makes this test even bigger and slower than it was. To make it faster, I changed this test's write isolation mode from the default always_use_lwt to forbid_rmw (not use LWT). This leaves the test's total run time similar to what it was before this patch - around 0.5 seconds in dev build mode on my laptop. Fixes #12817 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12819 (cherry picked from commit `14cdd034ee`)	2023-04-14 11:54:45 +03:00
Botond Dénes	3e10c3fc89	reader_concurrency_semaphore: don't evict inactive readers needlessly Inactive readers should only be evicted to free up resources for waiting readers. Evicting them when waiters are not admitted for any other reason than resources is wasteful and leads to extra load later on when these evicted readers have to be recreated and requeued. This patch changes the logic on both the registering path and the admission path to not evict inactive readers unless there are readers actually waiting on resources. A unit-test is also added, reproducing the overly aggressive eviction and checking that it doesn't happen anymore. Fixes: #11803 Closes #13286 (cherry picked from commit `bd57471e54`)	2023-04-14 10:37:30 +03:00
Botond Dénes	f11deb5074	reader_concurrency_semaphore: add stats to record reason for queueing permits When diagnosing problems, knowing why permits were queued is very valuable. Record the reason in new stats, one for each reason a permit can be queued. (cherry picked from commit `7b701ac52e`)	2023-04-14 10:37:30 +03:00
Botond Dénes	1baf9dddd7	reader_concurrency_semaphore: can_admit_read(): also return reason for rejection So the caller can bump the appropriate counters or log the reason why the request cannot be admitted. (cherry picked from commit `bb00405818`)	2023-04-14 09:30:02 +03:00
Kamil Braun	9717ff5057	docs: cleaning up after failed membership change After a failed topology operation, like bootstrap / decommission / removenode, the cluster might contain a garbage entry in either token ring or group 0. This entry can be cleaned-up by executing removenode on any other node, pointing to the node that failed to bootstrap or leave the cluster. Document this procedure, including a method of finding the host ID of a garbage entry. Add references in other documents. Fixes: #13122 Closes #13186 (cherry picked from commit `c2a2996c2b`)	2023-04-13 10:35:02 +02:00
Anna Stuchlik	b293b1446f	doc: remove Enterprise upgrade guides from OSS doc This commit removes the Enterprise upgrade guides from the Open Source documentation. The Enterprise upgrade guides should only be available in the Enterprise documentation, with the source files stored in scylla-enterprise.git. In addition, this commit: - adds the links to the Enterprise user guides in the Enterprise documentation at https://enterprise.docs.scylladb.com/ - adds the redirections for the removed pages to avoid breaking any links. This commit must be reverted in scylla-enterprise.git. (cherry picked from commit `61bc05ae49`) Closes #13473	2023-04-11 14:26:35 +03:00
Yaron Kaikov	e6f7ac17f6	doc: update supported os for 2022.1 ubuntu22.04 is already supported on both `5.0` and `2022.1` updating the table Closes #13340 (cherry picked from commit `c80ab78741`)	2023-04-05 13:56:07 +03:00
Anna Stuchlik	36619fc7d9	doc: add upgrade guide from 5.2 to 2023.1 Related: https://github.com/scylladb/scylla-enterprise/issues/2770 This commit adds the upgrade guide from ScyllaDB Open Source 5.2 to ScyllaDB Enterprise 2023.1. This commit does not cover metric updates (the metrics file has no content, which needs to be added in another PR). As this is an upgrade guide, this commit must be merged to master and backported to branch-5.2 and branch-2023.1 in scylla-enterprise.git. Closes #13294 (cherry picked from commit `595325c11b`)	2023-04-05 06:43:01 +03:00
Anna Stuchlik	750414c196	doc: update Raft doc for versions 5.2 and 2023.1 Fixes https://github.com/scylladb/scylladb/issues/13345 Fixes https://github.com/scylladb/scylladb/issues/13421 This commit updates the Raft documentation page to be up to date in versions 5.2 and 2023.1. - Irrelevant information about previous releases is removed. - Some information is clarified. - Mentions of version 5.2 are either removed (if possible) or version 2023.1 is added. Closes #13426 (cherry picked from commit `447ce58da5`)	2023-04-05 06:42:28 +03:00
Botond Dénes	128050e984	Merge 'commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off' from Calle Wilund Fixes #12810 We did not update total_size_on_disk in commitlog totals when o_dsync was off. This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments. Closes #12950 * github.com:scylladb/scylladb: commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off commitlog: change type of stored size (cherry picked from commit `e70be47276`)	2023-04-03 08:57:43 +03:00
Yaron Kaikov	d70751fee3	release: prepare for 5.2.0-rc4	2023-04-02 16:40:56 +03:00
Tzach Livyatan	1fba43c317	docs: minor improvements to the Raft Handling Failures and recovery procedure sections Closes #13292 (cherry picked from commit `46e6c639d9`)	2023-03-31 11:22:20 +02:00
Botond Dénes	e380c24c69	Merge 'Improve database shutdown verbosity' from Pavel Emelyanov The `database::stop` method sometimes hangs and it's always hard to spot where exactly it sleeps. A few more logging messages would make this much simpler. refs: #13100 refs: #10941 Closes #13141 * github.com:scylladb/scylladb: database: Increase verbosity of database::stop() method large_data_handler: Increase verbosity on shutdown large_data_handler: Coroutinize .stop() method (cherry picked from commit `e22b27a107`)	2023-03-30 17:01:24 +03:00
Avi Kivity	76a76a95f4	Update tools/java submodule (hdrhistogram with Java 11) * tools/java 1c4e1e7a7d...83b2168b19 (1): > Fix cassandra-stress -log hdrfile=... with java 11 Fixes #13287	2023-03-29 14:10:27 +03:00
Anna Stuchlik	f6837afec7	doc: update the Ubuntu version used in the image Starting from 5.2 and 2023.1 our images are based on Ubuntu:22.04. See https://github.com/scylladb/scylladb/issues/13138#issuecomment-1467737084 This commit adds that information to the docs. It should be merged and backported to branch-5.2. Closes #13301 (cherry picked from commit `9e27f6b4b7`)	2023-03-27 14:08:57 +03:00
Botond Dénes	6350c8836d	Revert "repair: Reduce repair reader eviction with diff shard count" This reverts commit `c6087cf3a0`. Said commit can cause a deadlock when 2 or more repairs compete for locks on 2 or more nodes. Consider the following scenario: Node n1 and n2 in the cluster, 1 shard per node, rf = 2, each shard has 1 available unit for the reader lock n1 starts repair r1 r1-n1 (instance of r1 on node1) takes the reader lock on node1 n2 starts repair r2 r2-n2 (instance of r2 on node2) takes the reader lock on node2 r1-n2 will fail to take the reader lock on node2 r2-n1 will fail to take the reader lock on node1 As a result, r1 and r2 could not make progress and deadlock happens. The complexity comes from the fact that a repair job needs a lock on more than one node. It is not guaranteed that all the participant nodes could take the lock in one shot. There is no simple solution to this so we have to revert this locking mechanism and look for another way to prevent reader thrashing when repairing nodes with mismatching shard count. Fixes: #12693 Closes #13266 (cherry picked from commit `7699904c54`)	2023-03-24 09:44:16 +02:00
Avi Kivity	5457948437	Update seastar submodule (rpc cancellation during negotiation) * seastar 8889cbc198...1488aaf842 (1): > Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov Fixes #11507.	2023-03-23 17:15:00 +02:00
Avi Kivity	da41001b5c	.gitmodules: point seastar submodule at scylla-seastar.git This allows us to backport seastar commits. Ref #11507.	2023-03-23 17:11:43 +02:00
Anna Stuchlik	dd61e8634c	doc: related https://github.com/scylladb/scylladb/issues/12754 ; add the missing information about reporting latencies to the upgrade guide 5.1 to 5.2 Closes #12935 (cherry picked from commit `26bb36cdf5`)	2023-03-22 10:38:28 +02:00
Anna Stuchlik	b642b4c30e	doc: fix the service name in upgrade guides Fixes https://github.com/scylladb/scylladb/issues/13207 This commit fixes the service and package names in the upgrade guides 5.0-to-2022.1 and 5.1-to-2022.2. Service name: scylla-server Package name: scylla-enterprise Previous PRs to fix the same issue in other upgrade guides: https://github.com/scylladb/scylladb/pull/12679 https://github.com/scylladb/scylladb/pull/12698 This commit must be backported to branch-5.1 and branch-5.2. Closes #13225 (cherry picked from commit `922f6ba3dd`)	2023-03-22 10:37:12 +02:00
Botond Dénes	c013336121	db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts We currently don't clean up the system_distributed.view_build_status table after removed nodes. This can cause a false-positive check for whether view update generation is needed for streaming. The proper fix is to clean up this table, but that will be more involved, and even when done, it might not be immediate. So until then and to be on the safe side, filter out entries belonging to unknown hosts from said table. Fixes: #11905 Refs: #11836 Closes #11860 (cherry picked from commit `84a69b6adb`)	2023-03-22 09:03:50 +02:00
Kamil Braun	b6b35ce061	service: storage_proxy: sequence CDC preimage select with Paxos learn `paxos_response_handler::learn_decision` was calling `cdc_service::augment_mutation_call` concurrently with `storage_proxy::mutate_internal`. `augment_mutation_call` was selecting rows from the base table in order to create the preimage, while `mutate_internal` was writing rows to the table. It was therefore possible for the preimage to observe the update that it accompanied, which doesn't make any sense, because the preimage is supposed to show the state before the update. Fix this by performing the operations sequentially. We can still perform the CDC mutation write concurrently with the base mutation write. `cdc_with_lwt_test` was sometimes failing in debug mode due to this bug and was marked flaky. Unmark it. Fixes #12098 (cherry picked from commit `1ef113691a`)	2023-03-21 20:23:19 +02:00
Petr Gusev	069e38f02d	transport server: fix unexpected server errors handling If request processing ended with an error, it is worth sending the error to the client through make_error/write_response. Previously in this case we just wrote a message to the log and didn't handle the client connection in any way. As a result, the only thing the client got in this case was a timeout error. A new test_batch_with_error is added. It is quite difficult to reproduce the error condition in a test, so we use error injection instead. Passing injection_key in the body of the request ensures that the exception will be thrown only for this test request and will not affect other requests that the driver may send in the background. Closes: scylladb#12104 (cherry picked from commit `a4cf509c3d`)	2023-03-21 20:23:09 +02:00
Anna Mikhlin	61a8003ad1	release: prepare for 5.2.0-rc3	2023-03-20 10:10:27 +02:00
Botond Dénes	8a17066961	Merge 'doc: Updates the recommended OS to be Ubuntu 22.04' from Anna Stuchlik Fixes https://github.com/scylladb/scylladb/issues/13138 Fixes https://github.com/scylladb/scylladb/issues/13153 This PR: - Fixes outdated information about the recommended OS. Since version 5.2, the recommended OS should be Ubuntu 22.04 because that OS is used for building the ScyllaDB image. - Adds the OS support information for version 5.2. This PR (both commits) needs to be backported to branch-5.2. Closes #13188 * github.com:scylladb/scylladb: doc: Add OS support for version 5.2 doc: Updates the recommended OS to be Ubuntu 22.04 (cherry picked from commit `f4b5679804`)	2023-03-17 10:30:06 +02:00
Pavel Emelyanov	487ba9f3e1	Merge '[backport] reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()' from Botond Dénes This PR backports `2f4a793457` to branch-5.2. Said patch depends on some other patches that are not part of any release yet. This PR should apply to 5.1 and 5.0 too. Closes #13162 * github.com:scylladb/scylladb: reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict() reader_permit: expose operator<<(reader_permit::state) reader_permit: add get_state() accessor	2023-03-16 18:41:08 +03:00
Botond Dénes	bd4f9e3615	Merge 'readers/nonforwarding: don't emit partition_end on next_partition,fast_forward_to' from Gusev Petr The series fixes the `make_nonforwardable` reader: it shouldn't emit `partition_end` for the previous partition after `next_partition()` and `fast_forward_to()`. Fixes: #12249 Closes #12978 * github.com:scylladb/scylladb: flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE make_nonforwardable: test through run_mutation_source_tests make_nonforwardable: next_partition and fast_forward_to when single_partition is true make_forwardable: fix next_partition flat_mutation_reader_v2: drop forward_buffer_to nonforwardable reader: fix indentation nonforwardable reader: refactor, extract reset_partition nonforwardable reader: add more tests nonforwardable reader: no partition_end after fast_forward_to() nonforwardable reader: no partition_end after next_partition() nonforwardable reader: no partition_end for empty reader row_cache: pass partition_start through nonforwardable reader (cherry picked from commit `46efdfa1a1`)	2023-03-16 10:42:03 +02:00
Botond Dénes	c68deb2461	reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict() Instead of open-coding the same, in an incomplete way. clear_inactive_reads() does incomplete eviction in several ways: * it doesn't decrement _stats.inactive_reads * it doesn't set the permit to evicted state * it doesn't cancel the ttl timer (if any) * it doesn't call the eviction notifier on the permit (if there is one) The list goes on. We already have an evict() method that does all this correctly, use that instead of the current badly open-coded alternative. This patch also enhances the existing test for clear_inactive_reads() and adds a new one specifically for `stop()` being called while having inactive reads. Fixes: #13048 Closes #13049 (cherry picked from commit `2f4a793457`)	2023-03-14 09:50:16 +02:00
Botond Dénes	dd96d3017a	reader_permit: expose operator<<(reader_permit::state) (cherry picked from commit `ec1c615029`)	2023-03-14 09:50:16 +02:00
Botond Dénes	6ca80ee118	reader_permit: add get_state() accessor (cherry picked from commit `397266f420`)	2023-03-14 09:40:11 +02:00
Jan Ciolek	eee8f750cc	cql3: preserve binary_operator.order in search_and_replace There was a bug in `expr::search_and_replace`. It doesn't preserve the `order` field of binary_operator. `order` field is used to mark relations created using the SCYLLA_CLUSTERING_BOUND. It is a CQL feature used for internal queries inside Scylla. It means that we should handle the restriction as a raw clustering bound, not as an expression in the CQL language. Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues, the database could end up selecting the wrong clustering ranges. Fixes: #13055 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes #13056 (cherry picked from commit `aa604bd935`)	2023-03-09 12:52:39 +02:00
Botond Dénes	8d5206e6c6	sstables/sstable: validate_checksums(): force-check EOF EOF is only guaranteed to be set if one tried to read past the end of the file. So when checking for EOF, also try to read some more. This should force the EOF flag into a correct value. We can then check that the read yielded 0 bytes. This should ensure that `validate_checksums()` will not falsely declare the validation to have failed. Fixes: #11190 Closes #12696 (cherry picked from commit `693c22595a`)	2023-03-09 12:30:44 +02:00
Anna Stuchlik	cfa40402f4	doc: Update the documentation landing page This commit makes the following changes to the docs landing page: - Adds the ScyllaDB enterprise docs as one of three tiles. - Modifies the three tiles to reflect the three flavors of ScyllaDB. - Moves the "New to ScyllaDB? Start here!" under the page title. - Renames "Our Products" to "Other Products" to list the products other than ScyllaDB itself. In addition, the boxes are enlarged to large-4 to look better. The major purpose of this commit is to expose the ScyllaDB documentation. docs: fix the link (cherry picked from commit `27bb8c2302`) Closes #13086	2023-03-06 14:18:15 +02:00
Botond Dénes	2d170e51cf	Merge 'doc: specify the versions where Alternator TTL is no longer experimental' from Anna Stuchlik This PR adds a note to the Alternator TTL section to specify in which Open Source and Enterprise versions the feature was promoted from experimental to non-experimental. The challenge here is that OSS and Enterprise are (still) documented together, but they're not in sync in promoting the TTL feature: it's still experimental in 5.1 (released) but no longer experimental in 2022.2 (to be released soon). We can take one of the following approaches: a) Merge this PR with master and ask the 2022.2 users to refer to master. b) Merge this PR with master and then backport to branch-5.1. If we choose this approach, it is necessary to backport https://github.com/scylladb/scylladb/pull/11997 beforehand to avoid conflicts. I'd opt for a) because it makes more sense from the OSS perspective and helps us avoid mess and backporting. Closes #12295 * github.com:scylladb/scylladb: doc: fix the version in the comment on removing the note doc: specify the versions where Alternator TTL is no longer experimental (cherry picked from commit `d5dee43be7`)	2023-03-02 12:09:16 +02:00
Anna Stuchlik	860e79e4b1	doc: fixes https://github.com/scylladb/scylladb/issues/12954 , adds the minimal version from which the 2021.1-to-2022.1 upgrade is supported for Ubuntu, Debian, and image Closes #12974 (cherry picked from commit `91b611209f`)	2023-02-28 13:02:05 +02:00
Anna Mikhlin	908a82bea0	release: prepare for 5.2.0-rc2	2023-02-28 10:13:06 +02:00
Gleb Natapov	39158f55d0	lwt: do not destroy capture in upgrade_if_needed lambda since the lambda is used more than once If on the first call the capture is destroyed the second call may crash. Fixes: #12958 Message-Id: <Y/sks73Sb35F+PsC@scylladb.com> (cherry picked from commit `1ce7ad1ee6`)	2023-02-27 14:19:37 +02:00
Raphael S. Carvalho	22c1685b3d	sstables: Temporarily disable loading of first and last position metadata It's known that reading large cells in reverse causes large allocations. Source: https://github.com/scylladb/scylladb/issues/11642 The loading is preliminary work for splitting large partitions into fragments composing a run and then being able to later read such a run in an efficient way using the position metadata. The splitting is not turned on yet, anywhere. Therefore, we can temporarily disable the loading, as a way to avoid regressions in stable versions. Large allocations can cause stalls due to foreground memory eviction kicking in. The default values for position metadata say that first and last position include all clustering rows, but they aren't used anywhere other than by sstable_run to determine if a run is disjoint at clustering level, but given that no splitting is done yet, it does not really matter. Unit tests relying on position metadata were adjusted to enable the loading, such that they can still pass. Fixes #11642. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12979 (cherry picked from commit `d73ffe7220`)	2023-02-27 08:58:34 +02:00
Botond Dénes	9ba6fc73f1	mutation_compactor: only pass consumed range-tombstone-change to validator Currently all consumed range tombstone changes are unconditionally forwarded to the validator, even if they are shadowed by a higher level tombstone and/or purgeable. This can result in a situation where a range tombstone change was seen by the validator but not passed to the consumer. The validator expects the range tombstone change to be closed by end-of-partition but the end fragment won't come as the tombstone was dropped, resulting in a false-positive validation failure. Fix by only passing tombstones to the validator that are actually passed to the consumer too. Fixes: #12575 Closes #12578 (cherry picked from commit `e2c9cdb576`)	2023-02-23 22:52:47 +02:00
Botond Dénes	f2e2c0127a	types: unserialize_value for multiprecision_int,bool: don't read uninitialized memory Check the first fragment before dereferencing it, the fragment might be empty, in which case move to the next one. Found by running range scan tests with random schema and random data. Fixes: #12821 Fixes: #12823 Fixes: #12708 Closes #12824 (cherry picked from commit `ef548e654d`)	2023-02-23 22:38:03 +02:00
Gleb Natapov	363ea87f51	raft: abort applier fiber when a state machine aborts After `5badf20c7a` applier fiber does not stop after it gets abort error from a state machine which may trigger an assertion because previous batch is not applied. Fix it. Fixes #12863 (cherry picked from commit `9bdef9158e`)	2023-02-23 14:12:12 +02:00
Kefu Chai	c49fd6f176	tools/schema_loader: do not return ref to a local variable we should never return a reference to local variable. so in this change, a reference to a static variable is returned instead. this should address following warning from Clang 17: ``` /home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address] return {}; ^~ ``` Fixes #12875 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12876 (cherry picked from commit `6eab8720c4`)	2023-02-22 22:02:43 +02:00
Takuya ASADA	3114589a30	scylla_coredump_setup: fix coredump timeout settings We currently configure only TimeoutStartSec, but probably it's not enough to prevent coredump timeout, since TimeoutStartSec is the maximum waiting time for service startup, and there is another directive to specify the maximum service running time (RuntimeMaxSec). To fix the problem, we should specify RuntimeMaxSec and TimeoutSec (it configures both TimeoutStartSec and TimeoutStopSec). Fixes #5430 Closes #12757 (cherry picked from commit `bf27fdeaa2`)	2023-02-19 21:13:36 +02:00
Anna Stuchlik	34f68a4c0f	doc: related https://github.com/scylladb/scylladb/issues/12658 , fix the service name in the upgrade guide from 2022.1 to 2022.2 Closes #12698 (cherry picked from commit `826f67a298`)	2023-02-17 12:17:48 +02:00
Botond Dénes	b336e11f59	Merge 'doc: fix the service name from "scylla-enterprise-server" to "scylla-server"' from Anna Stuchlik Related https://github.com/scylladb/scylladb/issues/12658. This issue fixes the bug in the upgrade guides for the released versions. Closes #12679 * github.com:scylladb/scylladb: doc: fix the service name in the upgrade guide for patch releases versions 2022 doc: fix the service name in the upgrade guide from 2021.1 to 2022.1 (cherry picked from commit `325246ab2a`)	2023-02-17 12:16:52 +02:00
Anna Stuchlik	9ef73d7e36	doc: fixes https://github.com/scylladb/scylladb/issues/12754 , document the metric update in 5.2 Closes #12891 (cherry picked from commit `bcca706ff5`)	2023-02-17 12:16:13 +02:00
Botond Dénes	8700a72b4c	Merge 'Backport compaction-backlog-tracker fixes to branch-5.2' from Raphael "Raph" Carvalho Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall, on a special event like schema change. No conflicts when backporting. Regression since `1d9f53c881`, which is present in branch 5.1 onwards. Closes #12851 * github.com:scylladb/scylladb: compaction: Fix inefficiency when updating LCS backlog tracker table: Fix quadratic behavior when inserting sstables into tracker on schema change	2023-02-15 07:22:25 +02:00
Raphael S. Carvalho	886dd3e1d2	compaction: Fix inefficiency when updating LCS backlog tracker LCS backlog tracker uses STCS tracker for L0. Turns out LCS tracker is calling STCS tracker's replace_sstables() with empty arguments even when higher levels (> 0) only had sstables replaced. This unnecessary call to STCS tracker will cause it to recompute the L0 backlog, yielding the same value as before. As LCS has a fragment size of 0.16G on higher levels, we may be updating the tracker multiple times during incremental compaction, which operates on SSTables on higher levels. Inefficiency is fixed by only updating the STCS tracker if any L0 sstable is being added or removed from the table. This may be fixing a quadratic behavior during boot or refresh, as new sstables are loaded one by one. Higher levels have a substantially higher number of sstables, therefore updating STCS tracker only when level 0 changes significantly reduces the number of times L0 backlog is recomputed. Refs #12499. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12676 (cherry picked from commit `1b2140e416`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-02-14 12:14:27 -03:00
Raphael S. Carvalho	f565f3de06	table: Fix quadratic behavior when inserting sstables into tracker on schema change Each time backlog tracker is informed about a new or old sstable, it will recompute the static part of backlog, whose complexity is proportional to the total number of sstables. On schema change, we're calling backlog_tracker::replace_sstables() for each existing sstable, therefore it produces O(N ^ 2) complexity. Fixes #12499. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12593 (cherry picked from commit `87ee547120`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-02-14 12:14:21 -03:00
Anna Stuchlik	76ff6d981c	doc: related https://github.com/scylladb/scylladb/issues/12754 , add the requirement to upgrade Monitoring to version 4.3 Closes #12784 (cherry picked from commit `c7778dd30b`)	2023-02-10 10:28:35 +02:00
Botond Dénes	f924f59055	Merge 'Backport test.py improvements to 5.2' from Kamil Braun Backport the following improvements for test.py efficiency and user experience: - https://github.com/scylladb/scylladb/pull/12542 - https://github.com/scylladb/scylladb/pull/12560 - https://github.com/scylladb/scylladb/pull/12564 - https://github.com/scylladb/scylladb/pull/12563 - https://github.com/scylladb/scylladb/pull/12588 - https://github.com/scylladb/scylladb/pull/12613 - https://github.com/scylladb/scylladb/pull/12569 - https://github.com/scylladb/scylladb/pull/12612 - https://github.com/scylladb/scylladb/pull/12549 - https://github.com/scylladb/scylladb/pull/12678 Fixes #12617 Closes #12770 * github.com:scylladb/scylladb: test/pylib: put UNIX-domain socket in /tmp Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun Merge 'test.py: manual cluster pool handling for Python suite' from Alecco Merge 'test.py: handle broken clusters for Python suite' from Alecco test/pylib: scylla_cluster: don't leak server if stopping it fails Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun test/pylib: scylla_cluster: return error details from test framework endpoints test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot test: disable commitlog O_DSYNC, preallocation	2023-02-08 15:09:09 +02:00
Nadav Har'El	d5cef05810	test/pylib: put UNIX-domain socket in /tmp The "cluster manager" used by the topology test suite uses a UNIX-domain socket to communicate between the cluster manager and the individual tests. The socket is currently located in the test directory but there is a problem: In Linux the length of the path used as a UNIX-domain socket address is limited to just a little over 100 bytes. In Jenkins runs, the test directory names are very long, and we sometimes go over this length limit and the result is that test.py fails to create this socket. In this patch we simply put the socket in /tmp instead of the test directory. We only need to make this change in one place - the cluster manager, as it already passes the socket path to the individual tests (using the "--manager-api" option). Tested by cloning Scylla in a very long directory name. A test like ./test.py --mode=dev test_concurrent_schema fails before this patch, and passes with it. Fixes #12622 Closes #12678 (cherry picked from commit `681a066923`)	2023-02-07 17:12:14 +01:00
Nadav Har'El	e0f4e99e9b	Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun `ScyllaClusterManager` is used to run a sequence of test cases from a single test file. Between two consecutive tests, if the previous test left the cluster 'dirty', meaning the cluster cannot be reused, it would free up space in the pool (using `steal`), stop the cluster, then get a new cluster from the pool. Between the `steal` and the `get`, a concurrent test run (with its own instance of `ScyllaClusterManager`) would start, because there was free space in the pool. This resulted in undesirable behavior when we ran tests with `--repeat X` for a large `X`: we would start with e.g. 4 concurrent runs of a test file, because the pool size was 4. As soon as one of the runs freed up space in the pool, we would start another concurrent run. Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so on. We would have a large number of concurrent runs, even though the original 4 runs didn't finish yet. All of these concurrent runs would compete waiting on the pool, and waiting for space in the pool would take longer and longer (the duration is linear w.r.t. the number of concurrent competing runs). Tests would then time out because they would have to wait too long. Fix that by using the new `replace_dirty` function introduced to the pool. This function frees up space by returning a dirty cluster and then immediately takes it away to be used for a new cluster. Thanks to this, we will only have at most as many concurrent runs as the pool size. For example with --repeat 8 and pool size 4, we would run 4 concurrent runs and start the 5th run only when one of the original 4 runs finishes, then the 6th run when a second run finishes and so on.
The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)` and a `destroy` function passed to the pool (now the pool is responsible for stopping the cluster and releasing its IPs). Fixes #11757 Closes #12549 * github.com:scylladb/scylladb: test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests test/pylib: pool: introduce `replace_dirty` test/pylib: pool: replace `steal` with `put(is_dirty=True)` (cherry picked from commit `132af20057`)	2023-02-07 17:08:17 +01:00
Kamil Braun	6795715011	Merge 'test.py: manual cluster pool handling for Python suite' from Alecco From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid using `async with` and access the `Pool` of clusters with `get()`/`put()`. Closes #12612 * github.com:scylladb/scylladb: test.py: manual cluster handling for PythonSuite test.py: stop cluster if PythonSuite fails to start test.py: minor fix for failed PythonSuite test (cherry picked from commit `5bc7f0732e`)	2023-02-07 17:07:43 +01:00
Nadav Har'El	aa9e91c376	Merge 'test.py: handle broken clusters for Python suite' from Alecco If the after test check fails (is_after_test_ok is False), discard the cluster and raise exception so context manager (pool) does not recycle it. Ignore exception re-raised by the context manager. Fixes #12360 Closes #12569 * github.com:scylladb/scylladb: test.py: handle broken clusters for Python suite test.py: Pool discard method (cherry picked from commit `54f174a1f4`)	2023-02-07 17:07:36 +01:00
Kamil Braun	ddfb9ebab2	test/pylib: scylla_cluster: don't leak server if stopping it fails `ScyllaCluster.server_stop` had this piece of code: ``` server = self.running.pop(server_id) if gracefully: await server.stop_gracefully() else: await server.stop() self.stopped[server_id] = server ``` We observed `stop_gracefully()` failing due to a server hanging during shutdown. We then ended up in a state where neither `self.running` nor `self.stopped` had this server. Later, when releasing the cluster and its IPs, we would release that server's IP - but the server might have still been running (all servers in `self.running` are killed before releasing IPs, but this one wasn't in `self.running`). Fix this by popping the server from `self.running` only after `stop_gracefully`/`stop` finishes. Make an analogous fix in `server_start`: put `server` into `self.running` before we actually start it. If the start fails, the server will be considered "running" even though it isn't necessarily, but that is OK - if it isn't running, then trying to stop it later will simply do nothing; if it is actually running, we will kill it (which we should do) when clearing after the cluster; and we don't leak it. Closes #12613 (cherry picked from commit `a0ff33e777`)	2023-02-07 17:05:20 +01:00
Nadav Har'El	d58a3e4d16	Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability. Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection. Closes #12588 * github.com:scylladb/scylladb: test/pylib: scylla_cluster: better logging for timeout on server startup test/pylib: scylla_cluster: use less expensive query to check for CQL availability (cherry picked from commit `ccc2c6b5dd`)	2023-02-07 17:05:02 +01:00
Kamil Braun	2ebac52d2d	test/pylib: scylla_cluster: return error details from test framework endpoints If an endpoint handler throws an exception, the details of the exception are not returned to the client. Normally this is desirable so that information is not leaked, but in this test framework we do want to return the details to the client so it can log a useful error message. Do it by wrapping every handler into a catch clause that returns the exception message. Also modify a bit how HTTPErrors are rendered so it's easier to discern the actual body of the error from other details (such as the params used to make the request etc.) Before: ``` E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error E E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff ``` After: ``` E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body: E Failed to start server at host 127.155.129.1. E Check the log files: E /home/kbraun/dev/scylladb/testlog/test.py.dev.log E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log ``` Closes #12563 (cherry picked from commit `2f84e820fd`)	2023-02-07 17:04:37 +01:00
Kamil Braun	b536614913	test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager When we obtained a new cluster for a test case after the previous test case left a dirty cluster, we would release the old cluster's used IP addresses (`_before_test` function). However, we would not release the last cluster's IP after the last test case. We would run out of IPs with sufficiently many test files or `--repeat` runs. Fix this. Also reorder the operations a bit: stop the cluster (and release its IPs) before freeing up space in the cluster pool (i.e. call `self.cluster.stop()` before `self.clusters.steal()`). This reduces concurrency a bit - fewer Scyllas running at the same time, which is good (the pool size gives a limit on the desired max number of concurrently running clusters). Killing a cluster is quick so it won't make a significant difference for the next guy waiting on the pool. Closes #12564 (cherry picked from commit `3ed3966f13`)	2023-02-07 17:04:19 +01:00
Kamil Braun	85df0fd2b1	test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot If a cluster fails to boot, it saves the exception in `self.start_exception` variable; the exception will be rethrown when a test tries to start using this cluster. As explained in `before_test`: ``` def before_test(self, name) -> None: """Check that the cluster is ready for a test. If there was a start error, throw it here - the server is running when it's added to the pool, which can't be attributed to any specific test, throwing it here would stop a specific test.""" ``` It's arguable whether we should blame some random test for a failure that it didn't cause, but nevertheless, there's a problem here: the `start_exception` will be rethrown and the test will fail, but then the cluster will be simply returned to the pool and the next test will attempt to use it... and so on. Prevent this by marking the cluster as dirty the first time we rethrow the exception. Closes #12560 (cherry picked from commit `147dd73996`)	2023-02-07 17:03:56 +01:00
Avi Kivity	cdf9fe7023	test: disable commitlog O_DSYNC, preallocation Commitlog O_DSYNC is intended to make Raft and schema writes durable in the face of power loss. To make O_DSYNC performant, we preallocate the commitlog segments, so that the commitlog writes only change file data and not file metadata (which would require the filesystem to commit its own log). However, in tests, this causes each ScyllaDB instance to write 384MB of commitlog segments. This overloads the disks and slows everything down. Fix this by disabling O_DSYNC (and therefore preallocation) during the tests. They can't survive power loss, and run with --unsafe-bypass-fsync anyway. Closes #12542 (cherry picked from commit `9029b8dead`)	2023-02-07 17:02:59 +01:00
Beni Peled	8ff4717fd0	release: prepare for 5.2.0-rc1	2023-02-06 22:13:53 +02:00
Kamil Braun	291b1f6e7f	service/raft: raft_group0: prevent double abort There was a small chance that we called `timeout_src.request_abort()` twice in the `with_timeout` function, first by timeout and then by shutdown. `abort_source` fails on an assertion in this case. Fix this. Fixes: #12512 Closes #12514 (cherry picked from commit `54170749b8`)	2023-02-05 18:31:50 +02:00
Kefu Chai	b2699743cc	db: system_keyspace: take the reserved_memory into account before this change, we return the total memory managed by Seastar in the "total" field in system.memory. but this value only reflects the total memory managed by Seastar's allocator. if `reserve_additional_memory` is set when starting app_template, Seastar's memory subsystem just reserves a chunk of memory of this specified size for system, and takes the remaining memory. since `f05d612da8`, we set this value to 50MB for wasmtime runtime. hence the test of `TestRuntimeInfoTable.test_default_content` in dtest fails. the test expects the size passed via the option of `--memory` to be identical to the value reported by system.memory's "total" field. after this change, the "total" field takes the reserved memory for wasm udf into account. the "total" field should reflect the total size of memory used by Scylla, no matter how we use a certain portion of the allocated memory. Fixes #12522 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12573 (cherry picked from commit `4a0134a097`)	2023-02-05 18:30:05 +02:00
Botond Dénes	50ae73a4bd	types: is_tuple(): handle reverse types Currently reverse types match the default case (false), even though they might be wrapping a tuple type. One user-visible effect of this is that a schema, which has a reversed<frozen<UDT>> clustering key component, will have this component incorrectly represented in the schema cql dump: the UDT will lose the frozen attribute. When attempting to recreate this schema based on the dump, it will fail as only frozen UDTs are allowed in primary key components. Fixes: #12576 Closes #12579 (cherry picked from commit `ebc100f74f`)	2023-02-05 18:20:21 +02:00
Calle Wilund	c3dd4a2b87	alterator::streams: Sort tables in list_streams to ensure no duplicates Fixes #12601 (maybe?) Sort the set of tables on ID. This should ensure we never generate duplicates in a paged listing here. Can obviously miss things if they are added between paged calls and end up with a "smaller" UUID/ARN, but that is to be expected. (cherry picked from commit `da8adb4d26`)	2023-02-05 17:44:00 +02:00
Benny Halevy	0f9fe61d91	view: row_lock: lock_ck: find or construct row_lock under partition lock Since we're potentially searching the row_lock in parallel to acquiring the read_lock on the partition, we're racing with row_locker::unlock that may erase the _row_locks entry for the same clustering key, since there is no lock to protect it up until the partition lock has been acquired and the lock_partition future is resolved. This change moves the code to search for or allocate the row lock _after_ the partition lock has been acquired to make sure we're synchronously starting the read/write lock function on it, without yielding, to prevent this use-after-free. This adds an allocation for copying the clustering key in advance even if a row_lock entry already exists, that wasn't needed before. It only slows us down (a bit) when there is contention and the lock already existed when we want to go locking. In the fast path there is no contention and then the code already had to create the lock and copy the key. In any case, the penalty of copying the key once is tiny compared to the rest of the work that view updates are doing. This is required on top of `5007ded2c1` as seen in https://github.com/scylladb/scylladb/issues/12632 which is closely related to #12168 but demonstrates a different race causing use-after-free. Fixes #12632 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `4b5e324ecb`)	2023-02-05 17:22:31 +02:00
Anna Stuchlik	59d30ff241	docs: fixes https://github.com/scylladb/scylladb/issues/12654 , update the links to the Download Center Closes #12655 (cherry picked from commit `64cc4c8515`)	2023-02-05 17:19:56 +02:00
Anna Stuchlik	fb82dff89e	doc: fixes https://github.com/scylladb/scylladb/issues/12672 , fix the redirects to the Cloud docs Closes #12673 (cherry picked from commit `2be131da83`)	2023-02-05 17:17:35 +02:00
Kefu Chai	b588b19620	cql3/selection: construct string_view using char* not size before this change, we construct a sstring from a comma statement, which evaluates to the return value of `name.size()`, but what we expect is `sstring(const char*, size_t)`. in this change instead of passing the size of the string_view, both its address and size are used * `std::string_view` is constructed instead of sstring, for better performance, as we don't need to perform a deep copy the issue is reported by GCC-13: ``` In file included from cql3/selection/selectable.cc:11: cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result] auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size())); ^~~~~~~~~~ ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12666 (cherry picked from commit `186ceea009`) Fixes #12739.	2023-02-05 13:50:48 +02:00
Michał Chojnowski	608ef92a71	commitlog: fix total_size_on_disk accounting after segment file removal Currently, segment file removal first calls `f.remove_file()` and does `total_size_on_disk -= f.known_size()` later. However, `remove_file()` resets `known_size` to 0, so in effect the freed space is not accounted for. `total_size_on_disk` is not just a metric. It is also responsible for deciding whether a segment should be recycled -- it is recycled only if `total_size_on_disk - known_size < max_disk_size`. Therefore this bug has dire performance consequences: if `total_size_on_disk - known_size` ever exceeds `max_disk_size`, the recycling of commitlog segments will stop permanently, because `total_size_on_disk - known_size` will never go back below `max_disk_size` due to the accounting bug. All new segments from this point will be allocated from scratch. The bug was uncovered by a QA performance test. It isn't easy to trigger -- it took the test 7 hours of constant high load to step into it. However, the fact that the effect is permanent, and degrades the performance of the cluster silently, makes the bug potentially quite severe. The bug can be easily spotted with Prometheus as infinitely rising `commitlog_total_size_on_disk` on the affected shards. Fixes #12645 Closes #12646 (cherry picked from commit `fa7e904cd6`)	2023-02-01 21:54:37 +02:00
Kamil Braun	d2732b2663	Merge 'Enable Raft by default in new clusters' from Kamil Braun New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster. People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade. Also update the docs: cluster creation and node addition procedures. Fixes #12572. Closes #12585 * github.com:scylladb/scylladb: docs: mention `consistent_cluster_management` for creating cluster and adding node procedures conf: enable `consistent_cluster_management` by default (cherry picked from commit `5c886e59de`)	2023-01-26 12:21:55 +01:00
Anna Mikhlin	34ab98e1be	release: prepare for 5.2.0-rc0	2023-01-18 14:54:36 +02:00
Tomasz Grabiec	563998b69a	Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`: - In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that doesn't reduce the availability of group 0. - As above but for `decommission`. - Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`. - Add an API to query the current group 0 configuration. Fixes #11723. Closes #12502 * github.com:scylladb/scylladb: test: test_topology: test for removing garbage group 0 members test/pylib: move some utility functions to util.py db: system_keyspace: add a virtual table with raft configuration db: system_keyspace: improve system.raft_snapshot_config schema service: storage_service: better error handling in `decommission` service: storage_service: fix indentation in removenode service: storage_service: make `removenode` work for group 0 members which are not token ring members service/raft: raft_group0: perform read_barrier in wait_for_raft service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove service/raft: raft_group0: link to Raft docs where appropriate service/raft: raft_group0: more logging service/raft: raft_group0: separate function for checking and waiting for Raft	2023-01-17 21:23:15 +01:00
Kamil Braun	d134c458e5	test/pylib: increase timeout when waiting for cluster before test Increase the timeout from default 5 minutes to 10 minutes. Sent as a workaround for #12546 to unblock next promotions. Closes #12547	2023-01-17 21:03:09 +02:00
Kamil Braun	4f1c317bdc	test: test_raft_upgrade: stop servers gracefully in test_recovery_after_majority_loss This test is frequently failing due to a timeout when we try to restart one of the nodes. The shutdown procedure apparently hangs when we try to stop the `hints_manager` service, e.g.: ``` INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Stopped INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped INFO 2023-01-13 03:22:56,997 [shard 0] hints_manager - Stopped ``` observe the 5 minute delay at the end. There is a known issue about `hints_manager` stop hanging: #8079. Now, for some reason, this is the only test case that is hitting this issue. We don't completely understand why. There is one significant difference between this test case and others: this is the only test case which kills 2 (out of 3) servers in the cluster and then tries to gracefully shutdown the last server. There's a hypothesis that the last server gets stuck trying to send hints to the killed servers. We weren't able to prove/falsify it yet. But if it's true, then this patch will: - unblock next promotions, - give us some important information when we see that the issue stops appearing. In the patch we shutdown all servers gracefully instead of killing them, like we do in the other test cases. Closes #12548	2023-01-17 20:51:09 +02:00
Pavel Emelyanov	4f415413d2	raft: Fix non-existing state_machine::apply_entry in docs The docs mention that method, but it doesn't exist. Instead, the state_machine interface defines plain .apply() one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12541	2023-01-17 12:53:05 +01:00
Kamil Braun	5545547d07	test: test_topology: test for removing garbage group 0 members Verify that `removenode` can remove group 0 members which are not token ring members.	2023-01-17 12:28:00 +01:00
Kamil Braun	c959ec455a	test/pylib: move some utility functions to util.py They were used in test_raft_upgrade, but we want to use them in other test files too.	2023-01-17 12:28:00 +01:00
Kamil Braun	a483915c62	db: system_keyspace: add a virtual table with raft configuration Add a new virtual table `system.raft_state` that shows the currently operating Raft configuration for each present group. The schema is the same as `system.raft_snapshot_config` (the latter shows the config from the last snapshot). In the future we plan to add more columns to this table, showing more information (like the current leader and term), hence the generic name. Adding the table requires some plumbing of `sharded<raft_group_registry>&` through function parameters to make it accessible from `register_virtual_tables`, but it's mostly straightforward. Also added some APIs to `raft_group_registry` to list all groups and find a given group (returning `nullptr` if one isn't found, not throwing an exception).	2023-01-17 12:28:00 +01:00
Kamil Braun	2bfe85ce9b	db: system_keyspace: improve system.raft_snapshot_config schema Remove the `ip_addr` column which was not used. IP addresses are not part of Raft configuration now and they can change dynamically. Swap the `server_id` and `disposition` columns in the clustering key, so when querying the configuration, we first obtain all servers with the current disposition and then all servers with the previous disposition (note that a server may appear both in current and previous).	2023-01-17 12:28:00 +01:00
Kamil Braun	c3ed82e5fb	service: storage_service: better error handling in `decommission` Improve the error handling in `decommission` in case `leave_group0` fails, informing the user what they should do (i.e. call `removenode` to get rid of the group 0 member), and allowing decommission to finish; it does not make sense to let the node continue to run after it leaves the token ring. (And I'm guessing it's also not safe. Or maybe impossible.)	2023-01-17 12:28:00 +01:00
Kamil Braun	beb0eee007	service: storage_service: fix indentation in removenode	2023-01-17 12:28:00 +01:00
Kamil Braun	aba33dd352	service: storage_service: make `removenode` work for group 0 members which are not token ring members Due to failures we might end up in a situation where we have a group 0 member which is not a token ring member: a decommission/removenode which failed after leaving/removing a node from the token ring but before leaving / removing a node from group 0. There was no way to get rid of such a group 0 member. A node that left the token ring must not be allowed to run further (or it can cause data loss, data resurrection and maybe other fun stuff), so we can't run decommission a second time (even if we tried, it would just say that "we're not a member of the token ring" and abort). And `removenode` would also not work, because it proceeds only if the node requested to be removed is a member of the token ring. We modify `removenode` so it can run in this situation and remove the group 0 member. The parts of `removenode` related to token ring modification are now conditioned on whether the node was a member of the token ring. The final `remove_from_group0` step is in its own branch. Some minor refactors were necessary. Some log messages were also modified so it's easier to understand which messages correspond to the "token movement" part of the procedure. The `make_nonvoter` step happens only if token ring removal happens, otherwise we can skip directly to `remove_from_group0`. We also move `remove_from_group0` outside the "try...catch", fixing #11723. The "node ops" part of the procedure is related strictly to token ring movement, so it makes sense for `remove_from_group0` to happen outside. Indentation is broken in this commit for easier reviewability, fixed in the following commit. Fixes: #11723	2023-01-17 12:28:00 +01:00
Kamil Braun	ec2cd29e42	service/raft: raft_group0: perform read_barrier in wait_for_raft Right now wait_for_raft is called before performing group 0 configuration changes. We want to also call it before checking for membership, for that it's desirable to have the most recent information, hence call read_barrier. In the existing use cases it's not strictly necessary, but it doesn't hurt.	2023-01-17 12:28:00 +01:00
Kamil Braun	db734cd74f	service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode removenode currently works roughly like this: 1. stream/repair data so it ends up on new replica sets (calculated without the node we want to remove) 2. remove the node from the token ring 3. remove the node from group 0 configuration. If the procedure fails after step 2 but before step 3 finishes, we're in trouble: the cluster is left with an additional voting group 0 member, which reduces group 0's availability, and there is no way to remove this member because `removenode` no longer considers it to be part of the cluster (it consults the token ring to decide). Improve this failure scenario by including a new step at the beginning: make the node a non-voter in group 0 configuration. Then, even if we fail after removing the node from the token ring but before removing it from group 0, we'll only be left with a non-voter which doesn't reduce availability. We make a similar change for `decommission`: between `unbootstrap()` (which streams data) and `leave_ring()` (which removes our tokens from the ring), become a non-voter. The difference here is that we don't become a non-voter at the beginning, but only after streaming/repair. In `removenode` it's desirable to make the node a non-voter as soon as possible because it's already dead. In decommission it may be desirable for us to remain a voter if we fail during streaming because we're still alive and functional in that case. In a later commit we'll also make it possible to retry `removenode` to remove a node that is only a group 0 member and not a token ring member.	2023-01-17 12:28:00 +01:00
Kamil Braun	1eee349a17	test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove The test would create a scenario where one node was down while the others started the Raft upgrade procedure. The procedure would get stuck, but it was possible to `removenode` the downed node using one of the alive nodes, which would unblock the Raft upgrade procedure. This worked because: 1. the upgrade procedure starts by ensuring that all peers can be contacted, 2. `removenode` starts by removing the node from the token ring. After removing the node from the token ring, the upgrade procedure becomes able to contact all peers (the peers set no longer contains the down node). At the end, after removing the node from the token ring, `removenode` would actually get stuck for a while, waiting for the upgrade procedure to finish before removing the peer from group 0. After the upgrade procedure finished, `removenode` would also finish. (so: first the upgrade procedure waited for removenode, then removenode waited for the upgrade procedure). We want to modify the `removenode` procedure and include a new step before removing the node from the token ring: making the node a non-voter. The purpose is to improve the possible failure scenarios. Previously, if the `removenode` procedure failed after removing the node from the token ring but before removing it from group 0, the cluster would contain a 'garbage' group 0 member which is a voter - reducing group 0's availability. If the node is made a non-voter first, then this failure will not be as big of a problem, because the leftover group 0 member will be a non-voter. However, to correctly perform group 0 operations including making someone a nonvoter, we must first wait for the Raft upgrade procedure to finish (or at least wait until everyone joins group 0). 
Therefore by including this 'make the node a non-voter' step at the beginning of `removenode`, we make it impossible to remove a token ring member in the middle of the upgrade procedure, on which the test case relied. The test case would get stuck waiting for the `removenode` operation to finish, which would never finish because it would wait for the upgrade procedure to finish, which would not finish because of the dead peer. We remove the test case; it was "lucky" to pass in the first place. We have a dedicated mechanism for handling dead peers during Raft upgrade procedure: the manual Raft group 0 RECOVERY procedure. There are other test cases in this file which are using that procedure.	2023-01-17 12:28:00 +01:00
Kamil Braun	4f0801406e	service/raft: raft_group0: link to Raft docs where appropriate Resolve some TODOs.	2023-01-17 12:28:00 +01:00
Kamil Braun	2befbaa341	service/raft: raft_group0: more logging Make the logs in leave_group0 consistent with logs in remove_from_group0.	2023-01-17 12:28:00 +01:00
Kamil Braun	77dc1c4c70	service/raft: raft_group0: separate function for checking and waiting for Raft leave_group0 and remove_from_group0 functions both start with the following steps: - if Raft is disabled or in RECOVERY mode, print a simple log message and abort - if Raft cluster feature flag is not yet enabled, print a complex log message and abort - wait for Raft upgrade procedure to finish - then perform the actual group 0 reconfiguration. Refactor these preparation steps to a separate function, `wait_for_raft`. This reduces code duplication; the function will also be used in more operations later (becoming a nonvoter or turning another server into a nonvoter). We also change the API so that the preparation function is called from outside by the caller before they call the reconfiguration function. This is because in later commits, some of the call sites (mainly `removenode`) will want to check explicitly whether Raft is enabled and wait for Raft's availability, then perform a sequence of steps related to group 0 configuration depending on the result. Also add a private function `raft_upgrade_complete()` which we use to assert that Raft is ready to be used.	2023-01-17 12:27:58 +01:00
Wojciech Mitros	5f45b32bfa	forward_service: prevent heap use-after-free of forward_aggregates Currently, we create `forward_aggregates` inside a function that returns the result of a future lambda that captures these aggregates by reference. As a result, the aggregates may be destructed before the lambda finishes, resulting in a heap use-after-free. To prolong the lifetime of these aggregates, we cannot use a move capture, because the lambda is wrapped in a with_thread_if_needed() call on these aggregates. Instead, we fix this by wrapping the entire return statement in a do_with(). Fixes #12528 Closes #12533	2023-01-17 13:25:57 +02:00
Gleb Natapov' via ScyllaDB development	15ebd59071	lwt: upgrade stored mutations to the latest schema during prepare Currently they are upgraded during learn on a replica. There are two problems with this. First the column mapping may not exist on a replica if it missed this particular schema (because it was down for instance) and the mapping history is not part of the schema. In this case "Failed to look up column mapping for schema version" will be thrown. Second lwt request coordinator may not have the schema for the mutation as well (because it was freed from the registry already) and when a replica tries to retrieve the schema from the coordinator the retrieval will fail causing the whole request to fail with "Schema version XXXX not found" Both of those problems can be fixed by upgrading stored mutations during prepare on a node it is stored at. To upgrade the mutation its column mapping is needed and it is guaranteed that it will be present at the node the mutation is stored at since it is a prerequisite for storing it that the corresponding schema is available. After that the mutation is processed using latest schema that will be available on all nodes. Fixes #10770 Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>	2023-01-17 11:14:46 +01:00
Raphael S. Carvalho	f2f839b9cc	compaction: LCS: don't reshape all levels if only a single breaks disjointness LCS reshape is compacting all levels if a single one breaks disjointness. That's unnecessary work because rewriting that single level is enough to restore disjointness. If multiple levels break disjointness, they'll each be reshaped in its own iteration, so reducing operation time for each step and disk space requirement, as input files can be released incrementally. Incremental compaction is not applied to reshape yet, so we need to avoid "major compaction", to avoid the space overhead. But space overhead is not the only problem, the inefficiency, when deciding what to reshape when overlapping is detected, motivated this patch. Fixes #12495. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12496	2023-01-17 09:55:15 +02:00
Michał Chojnowski	9e17564c70	types: add some missing explicit instantiations Some functions defined by a template in types.cc are used in other translation units (via `cql3/untyped_result_set.hh`), but aren't explicitly instantiated. Therefore their linking can fail, depending on inlining decisions. (I experienced this when playing with compiler options). Fix that. Closes #12539	2023-01-17 10:46:01 +02:00
Nadav Har'El	5bf94ae220	cql: allow disabling of USING TIMESTAMP sanity checking As requested by issue #5619, commit `2150c0f7a2` added a sanity check for USING TIMESTAMP - the number specified in the timestamp must not be more than 3 days into the future (when viewed as a number of microseconds since the epoch). This sanity checking helps avoid some annoying client-side bugs and mis-configurations, but some users genuinely want to use arbitrary or futuristic-looking timestamps and are hindered by this sanity check (which Cassandra doesn't have, by the way). So in this patch we add a new configuration option, restrict_future_timestamp. If set to "true", futuristic timestamps (more than 3 days into the future) are forbidden. The "true" setting is the default (as has been the case since #5619). Setting this option to "false" will allow using any 64-bit integer as a timestamp, as is allowed in Cassandra (and was allowed in Scylla prior to #5619). The error message in the case where a futuristic timestamp is rejected now mentions the configuration parameter that can be used to disable this check (this, and the option's name "restrict_*", is similar to other so-called "safe mode" options). This patch also includes a test, which works in Scylla and Cassandra, with either setting of restrict_future_timestamp, checking the right thing in all these cases (the futuristic timestamp can either be written and read, or can't be written). I used this test to manually verify that the new option works, defaults to "true", and when set to "false" Scylla behaves like Cassandra. Fixes #12527 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12537	2023-01-16 23:18:56 +02:00
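The check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not Scylla's implementation: the function name `timestamp_accepted` and the constant `MAX_FUTURE_US` are hypothetical, and only the 3-day limit, the microsecond unit and the `restrict_future_timestamp` option name come from the commit message.

```python
# Hypothetical constant: 3 days expressed in microseconds.
MAX_FUTURE_US = 3 * 24 * 60 * 60 * 1_000_000


def timestamp_accepted(timestamp_us, now_us, restrict_future_timestamp=True):
    """Return True if a USING TIMESTAMP value would pass the sanity check.

    Both arguments are microseconds since the epoch. With the option set
    to False, any integer timestamp is accepted.
    """
    if not restrict_future_timestamp:
        return True
    # Reject only timestamps more than 3 days ahead of "now".
    return timestamp_us - now_us <= MAX_FUTURE_US
```

With the option enabled, a timestamp exactly 3 days ahead is still accepted in this sketch; whether the real boundary is inclusive is an assumption.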
Kefu Chai	114f30016a	main: use std::shift_left() to consume tool name for better readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12536	2023-01-16 21:01:34 +02:00
Nadav Har'El	feef3f9dda	test/cql-pytest: test more than one restriction on same clustering column Cassandra refuses a request with more than one relation to the same clustering column, for example DELETE FROM tbl WHERE p = ? and c = ? AND c > ? complains that c cannot be restricted by more than one relation if it includes an Equal. But it produces different error messages for different operators and even order. Currently, Scylla doesn't consider such requests an error. Whether or not we should be compatible with Cassandra here is discussed in issue #12472. But as long as we do accept these queries, we should be sure we do the right thing: "WHERE c = 1 AND c > 2" should match nothing, "WHERE c = 1 AND c > 0" should match the matches of c = 1, and so on. This patch adds a test to verify that these requests indeed yield correct results. The test is scylla_only because, as explained above, Cassandra doesn't support these requests at all. Refs #12472 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12498	2023-01-16 20:41:16 +02:00
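The expected semantics in the commit above amount to taking the conjunction of all restrictions on the column. A small model of that, with a hypothetical `select` helper operating on plain integers rather than on a real table:

```python
import operator

# Map of the comparison operators used in the examples to Python functions.
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def select(values, restrictions):
    """Return the clustering-key values that satisfy every restriction.

    `restrictions` is a list of (operator, operand) pairs, all applying
    to the same clustering column.
    """
    return [v for v in values if all(OPS[op](v, rhs) for op, rhs in restrictions)]
```

For example, `select(range(4), [("=", 1), (">", 2)])` is empty, while `select(range(4), [("=", 1), (">", 0)])` returns the same rows as `c = 1` alone.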
Kefu Chai	86b451d45c	SCYLLA-VERSION-GEN: remove unnecessary bashism remove unnecessary bashism, so that this script can be interpreted by a POSIX shell. /bin/sh is specified in the shebang line. on debian derivatives, /bin/sh is dash, which is POSIX compliant. but this script is written in the bash dialect. before this change, we could run into following build failure when building the tree on Debian: [7/904] ./SCYLLA-VERSION-GEN ./SCYLLA-VERSION-GEN: 37: [[: not found after this change, the build is able to proceed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12530	2023-01-16 20:34:01 +02:00
Avi Kivity	0b418fa7cf	cql3, transport, tests: remove "unset" from value type system The CQL binary protocol introduced "unset" values in version 4 of the protocol. Unset values can be bound to variables, which cause certain CQL fragments to be skipped. For example, the fragment `SET a = :var` will not change the value of `a` if `:var` is bound to an unset value. Unsets, however, are very limited in where they can appear. They can only appear at the top-level of an expression, and any computation done with them is invalid. For example, `SET list_column = [3, :var]` is invalid if `:var` is bound to unset. This causes the code to be littered with checks for unset, and there are plenty of tests dedicated to catching unsets. However, a simpler way is possible - prevent the infiltration of unsets at the point of entry (when evaluating a bind variable expression), and introduce guards to check for the few cases where unsets are allowed. This is what this long patch does. It performs the following: (general) 1. unset is removed from the possible values of cql3::raw_value and cql3::raw_value_view. (external->cql3) 2. query_options is fortified with a vector of booleans, unset_bind_variable_vector, where each boolean corresponds to a bind variable index and is true when it is unset. 3. To avoid churn, two compatibility structs are introduced: cql3::raw_value{,_view}_vector_with_unset, which can be constructed from a std::vector<raw_value{,_view/}>, which is what most callers have. They can also be constructed with explicit unset vectors, for the few cases they are needed. (cql3->variables) 4. query_options::get_value_at() now throws if the requested bind variable is unset. This replaces all the throwing checks in expression evaluation and statement execution, which are removed. 5. A new query_options::is_unset() is added for the users that can tolerate unset; though it is not used directly. 6. A new cql3::unset_operation_guard class guards against unsets. 
It accepts an expression, and can be queried whether an unset is present. Two conditions are checked: the expression must be a singleton bind variable, and at runtime it must be bound to an unset value. 7. The modification_statement operations are split into two, via two new subclasses of cql3::operation. cql3::operation_no_unset_support ignores unsets completely. cql3::operation_skip_if_unset checks if an operand is unset (luckily all operations have at most one operand that tolerates unset) and applies unset_operation_guard to it. 8. The various sites that accept expressions or operations are modified to check for should_skip_operation(). These are the loops around operations in update_statement and delete_statement, and the checks for unset in attributes (LIMIT and PER PARTITION LIMIT) (tests) 9. Many unset tests are removed. It's now impossible to enter an unset value into the expression evaluation machinery (there's just no unset value), so it's impossible to test for it. 10. Other unset tests now have to be invoked via bind variables, since there's no way to create an unset cql3::expr::constant. 11. Many tests have their exception message match strings relaxed. Since unsets are now checked very early, we don't know the context where they happen. It would be possible to reintroduce it (by adding a format string parameter to cql3::unset_operation_guard), but it seems not to be worth the effort. Usage of unsets is rare, and it is explicit (at least with the Python driver, an unset cannot be introduced by omission). I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't recognize unsets) with cql3::maybe_unset_value (that does), but that caused huge amounts of churn, so I abandoned that in favor of the current approach. Closes #12517	2023-01-16 21:10:56 +02:00
Kamil Braun	7510144fba	Merge 'Add replace-node-first-boot option' from Benny Halevy Allow replacing a node given its Host ID rather than its ip address. This series adds a replace_node_first_boot option to db/config and makes use of it in storage_service. The new option takes priority over the legacy replace_address* options. When the latter are used, a deprecation warning is printed. Documentation updated respectively. And a cql unit_test is added. Ref #12277 Closes #12316 * github.com:scylladb/scylladb: docs: document the new replace_node_first_boot option dist/docker: support --replace-node-first-boot db: config: describe replace_address* options as deprecated test: test_topology: test replace using host_id test: pylib: ServerInfo: add host_id storage_service: get rid of get_replace_address storage_service: is_replacing: rely directly on config options storage_service: pass replacement_info to run_replace_ops storage_service: pass replacement_info to booststrap storage_service: join_token_ring: reuse replacement_info.address storage_service: replacement_info: add replace address init: do not allow cfg.replace_node_first_boot of seed node db: config: add replace_node_first_boot option	2023-01-16 15:08:31 +01:00
Michał Sala	bbbe12af43	forward_service: fix timeout support in parallel aggregates The `forward_request` verb carried information about timeouts using `lowres_clock::time_point` (which came from the local steady clock `seastar::lowres_clock`). The time point was produced on one node and later compared against another node's `lowres_clock`. That behavior was wrong (`lowres_clock::time_point`s produced with different `lowres_clock`s cannot be compared) and could lead to a delayed or premature timeout. To fix this issue, `lowres_clock::time_point` was replaced with `lowres_system_clock::time_point` in the `forward_request` verb. The representation to which both time point types serialize is the same (64-bit integer denoting the count of elapsed nanoseconds), so it was possible to do an in-place switch of those types using logic suggested by @avikivity: - using steady_clock is just broken, so we aren't taking anything from users by breaking it further - once all nodes are upgraded, it magically starts to work Closes #12529	2023-01-16 12:08:13 +02:00
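Why steady-clock time points cannot be compared across nodes can be shown with a toy model. The `Node` class below is hypothetical and assumes, for illustration only, that each node's steady clock counts nanoseconds since that node booted, while the system clock counts from a shared epoch.

```python
class Node:
    """Toy node with a steady clock (per-node origin) and a system clock."""

    def __init__(self, boot_wall_ns):
        # Wall-clock instant at which this node's steady clock started.
        self.boot_wall_ns = boot_wall_ns

    def steady_now(self, wall_ns):
        # Steady clock: nanoseconds since this node booted.
        return wall_ns - self.boot_wall_ns

    def system_now(self, wall_ns):
        # System clock: nanoseconds since the shared epoch.
        return wall_ns


coordinator = Node(boot_wall_ns=1_000)
replica = Node(boot_wall_ns=5_000)
wall_ns, timeout_ns = 10_000, 2_000

# Deadline computed on the coordinator, then interpreted on the replica.
steady_deadline = coordinator.steady_now(wall_ns) + timeout_ns
system_deadline = coordinator.system_now(wall_ns) + timeout_ns

remaining_steady = steady_deadline - replica.steady_now(wall_ns)  # 6000: too long
remaining_system = system_deadline - replica.system_now(wall_ns)  # 2000: correct
```

In this model the steady-clock deadline is off by the difference between the two boot times, which matches the "delayed or premature timeout" described above. The system-clock variant still assumes the nodes' wall clocks are reasonably synchronized.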
Botond Dénes	3d9ab1d9eb	Merge 'Get recursive tasks' statuses with task manager api call' from Aleksandra Martyniuk The PR adds an api call allowing to get the statuses of a given task and all its descendants. The parent-child tree is traversed in BFS order and the list of statuses is returned to user. Closes #12317 * github.com:scylladb/scylladb: test: add test checking recursive task status api: get task statuses recursively api: change retrieve_status signature	2023-01-16 11:44:50 +02:00
Tzach Livyatan	073f0f00c6	Add Scylla Summit 2023 in the top banner Closes #12519	2023-01-16 08:05:20 +02:00
Avi Kivity	5a07641b95	Update python3 submodule (license file fix) * tools/python3 548e860...279b6c1 (1): > create-relocatable-package: s/pyhton3-libs/python3-libs/	2023-01-15 17:59:27 +02:00
Benny Halevy	de3142e540	docs: document the new replace_node_first_boot option And mention that replacing a node using the legacy replace_addr* options is deprecated. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:41:44 +02:00
Benny Halevy	d4f1563369	dist/docker: support --replace-node-first-boot And mention that replace_address_first_boot is deprecated Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:09 +02:00
Benny Halevy	1577aa8098	db: config: describe replace_address* options as deprecated The replace_address options are still supported, but mention in their description that they are now deprecated and the user should use replace_node_first_boot instead. While at it fix a typo in ignore_dead_nodes_for_replace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:09 +02:00
Benny Halevy	90faeedb77	test: test_topology: test replace using host_id Add test cases exercising the --replace-node-first-boot option by replacing nodes using their host_id rather than ip address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:09 +02:00
Benny Halevy	7d0d9e28f1	test: pylib: ServerInfo: add host_id Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:07 +02:00
Benny Halevy	db2b76beb5	storage_service: get rid of get_replace_address It is unused now. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:34:29 +02:00
Benny Halevy	17f70e4619	storage_service: is_replacing: rely directly on config options Rather than on get_replace_address, before we remove the latter. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:34:29 +02:00
Benny Halevy	7282d58d11	storage_service: pass replacement_info to run_replace_ops So it won't need to call get_replace_address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:34:09 +02:00
Benny Halevy	08598e4f64	storage_service: pass replacement_info to booststrap So it won't need to call get_replace_address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Benny Halevy	b863f7a75f	storage_service: join_token_ring: reuse replacement_info.address Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Benny Halevy	add2f209b8	storage_service: replacement_info: add replace address Populate replacement_info.address in prepare_replacement_info as a first step towards getting rid of get_replace_address(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Benny Halevy	75c8a5addc	init: do not allow cfg.replace_node_first_boot of seed node Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Benny Halevy	32e79185d4	db: config: add replace_node_first_boot option For replacing a node given its (now unique) Host ID. The existing options for replace_address* will be deprecated in the following patches and eventually we will stop supporting them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Tomasz Grabiec	abc43f97c9	Merge 'Simplify some Raft tables' from Kamil Braun Rename `system.raft_config` to `system.raft_snapshot_config` to make it clearer what the table stores. Remove the `my_server_id` partition key column from `system.raft_snapshot_config` and a corresponding column from `system.raft_snapshots` which would store the Raft server ID of the local node. It's unnecessary, all servers running on a given node in different groups will use the same ID - the Raft ID of the node which is equal to its Host ID. There will be no multiple servers running in a single Raft group on the same node. Closes #12513 * github.com:scylladb/scylladb: db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'	2023-01-13 00:23:21 +01:00
Botond Dénes	4e41e7531c	docs/dev/debugging.md: recommend open-coredump.sh for opening coredumps Leave the guide for manual opening in though, the script might not work in all cases. Also update the version example, we changed how development versions look like. Closes #12511	2023-01-12 19:30:59 +02:00
Botond Dénes	ab8171ffd5	open-coredump.sh: handle dev versions Like: 5.2.0~dev, which really means master. Don't try to checkout branch-5.2 in this case, it doesn't exist yet, checkout master instead. Closes #12510	2023-01-12 19:28:58 +02:00
Kamil Braun	be390285b6	db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG A single node will run a single Raft server in any given Raft group, so this column is not necessary.	2023-01-12 16:48:50 +01:00
Kamil Braun	bed555d1e5	db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config' Make it clear that the table stores the snapshot configuration, which is not necessarily the currently operating configuration (the last one appended to the log). In the future we plan to have a separate virtual table for showing the currently operating configuration, perhaps we will call it `system.raft_config`.	2023-01-12 16:21:26 +01:00
Botond Dénes	f87e3993ef	Merge 'configure.py: a bunch of clean-up changes' from Michał Chojnowski The planned integration of cross-module optimizations in scylladb/scylladb-enterprise requires several changes to `configure.py`. To minimize the divergence between the `configure.py`s of both repositories, this series upstreams some of these changes to scylladb/scylladb. The changes mostly remove dead code and fix some traps for the unaware. Closes #12431 * github.com:scylladb/scylladb: configure.py: prevent deduplication of seastar compile options configure.py: rename clang_inline_threshold() configure.py: rework the seastar_cflags variable configure.py: hoist the pkg_config() call for seastar-testing.pc configure.py: unify the libs variable for tests and non-tests configure.py: fix indentation configure.py: remove a stale code path for .a artifacts	2023-01-12 16:40:02 +02:00
Wojciech Mitros	082bfea187	rust: use depfile and Cargo.lock to avoid building rust when unnecessary Currently, we call cargo build every time we build scylla, even when no rust files have been changed. This is avoided by adding a depfile to the ninja rule for the rust library. The depfile is generated by default during cargo build, but it uses the full paths of all dependencies that it includes, and we use relative paths. This is fixed by specifying CARGO_BUILD_DEP_INFO_BASEDIR='.', which makes it so the current path is subtracted from all generated paths. Instead of using 'always' when specifying when to run the cargo build, a dependency on Cargo.lock is added in addition to the depfile. As a result, the rust files are recompiled not only when the source files included in the depfile are modified, but also when some rust dependency is updated. Cargo may put an old cached file as a result of the build even when the Cargo.lock was recently updated. Because of that, the build result may be older than the Cargo.lock file even if the build was just performed. This may cause ninja to rebuild the file every following time. To avoid this, we 'touch' the build result, so that its last modification time is up to date. Because the dependency on Cargo.lock was added, the new command for the build does not modify it. Instead, the developer must update it when modifying the dependencies - the docs are updated to reflect that. Closes #12489 Fixes #12508	2023-01-12 14:44:11 +02:00
Kefu Chai	77baea2add	docs/architecture: fix typo of SyllaDB s/SyllaDB/ScyllaDB/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12505	2023-01-12 12:25:53 +02:00
Michał Chojnowski	1ff4abef4a	configure.py: prevent deduplication of seastar compile options In its infinite wisdom, CMake deduplicates the options passed to `target_compile_options`, making it impossible to pass options which require duplication, such as -mllvm. Passing e.g. `-mllvm;-pgso=false;-mllvm;-inline-threshold=2500` invokes the compiler `-mllvm -pgso=false -inline-threshold=2500`, breaking the options. As a workaround, CMake added the `SHELL:` syntax, which makes it possible to pass the list of options not as a CMake list, but as a shell-quoted string. Let's use it, so we can pass multiple -mllvm options.	2023-01-12 11:24:10 +01:00
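The effect described in the commit above can be modelled in a few lines. This is only a simulation of the behavior, not CMake code: `dedup` and `expand` are hypothetical helpers, and the model assumes de-duplication keeps the first occurrence of each list item and that `SHELL:` items are split like a shell-quoted string after de-duplication.

```python
import shlex


def dedup(options):
    """Order-preserving de-duplication of a list of option strings."""
    seen, out = set(), []
    for opt in options:
        if opt not in seen:
            seen.add(opt)
            out.append(opt)
    return out


def expand(options):
    """Turn a de-duplicated option list into compiler arguments."""
    args = []
    for opt in dedup(options):
        if opt.startswith("SHELL:"):
            # A SHELL: item is one list entry that expands to several arguments.
            args.extend(shlex.split(opt[len("SHELL:"):]))
        else:
            args.append(opt)
    return args


plain = ["-mllvm", "-pgso=false", "-mllvm", "-inline-threshold=2500"]
grouped = ["SHELL:-mllvm -pgso=false", "SHELL:-mllvm -inline-threshold=2500"]
```

Here `expand(plain)` loses the second `-mllvm`, while `expand(grouped)` keeps both, because each `SHELL:` item is a distinct list entry.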
Michał Chojnowski	85facefe45	configure.py: rename clang_inline_threshold() There's a global variable (the CLI argument) with the same name. Rename one of the two to avoid accidental mixups.	2023-01-12 11:24:10 +01:00
Michał Chojnowski	d9de78f6d3	configure.py: rework the seastar_cflags variable The name of this variable is misleading. What it really does is pass flags to static libraries compiled by us, not just to seastar. We will need this capability to implement cross-artifact optimizations in our build. We will also need to pass linker flags, and we will need to vary those flags depending on the build mode. This patch splits the seastar_cflags variable into per-mode lib_cflags and lib_ldflags variables. It shouldn't change the resulting build.ninja for now, but will be needed by later planned patches.	2023-01-12 11:24:10 +01:00
Michał Chojnowski	ee462a9d3c	configure.py: hoist the pkg_config() call for seastar-testing.pc Put the pkg_config() for seastar-testing.pc in the same area as the call for seastar.pc, outside of the loop. This is a cosmetic change aimed at making following commits cleaner.	2023-01-12 11:24:10 +01:00
Michał Chojnowski	c9aeeeae11	configure.py: unify the libs variable for tests and non-tests This is a cosmetic change aimed at making following commits in the same area cleaner.	2023-01-12 11:24:09 +01:00
Michał Chojnowski	10ac881ef1	configure.py: fix indentation Fix indentation after the preceding commit.	2023-01-12 11:23:32 +01:00
Michał Chojnowski	be419adaf8	configure.py: remove a stale code path for .a artifacts Scylla hasn't had `.a` artifacts for a long time (since the Urchin days, I believe), and the piece of code responsible for them is stale and untested. Remove it.	2023-01-12 11:22:49 +01:00
Botond Dénes	8a86f8d4ef	gdbinit: add ignore clause for SIG35 Another real-time event often raised in scylla, making debugging a live process annoying. Closes #12507	2023-01-12 12:13:04 +02:00
Avi Kivity	7a8a442c1e	transport: drop some dead code around v1 and v2 protocols In `424dbf43f` ("transport: drop cql protocol versions 1 and 2"), we dropped support for protocols 1 and 2, but some code remains that checks for those versions. It is now dead code, so remove it. Closes #12497	2023-01-12 12:52:19 +02:00
Avi Kivity	4de2524a42	build: update toolchain for scylla-driver package Pull updated scylla-driver package, fixing an IP change related bug [1]. [1] https://github.com/scylladb/python-driver/issues/198 Closes #12501	2023-01-11 22:16:35 +02:00
Nadav Har'El	7192283172	Merge 'doc: add the upgrade guide for ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/12315 This PR adds the upgrade guide from ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2. Instead of adding separate guides per platform, I've merged the information to create one platform-agnostic guide, similar to what we did for [OSS->OSS](https://docs.scylladb.com/stable/upgrade/upgrade-opensource/upgrade-guide-from-5.0-to-5.1/) and [Enterprise->Enterprise ](https://github.com/scylladb/scylladb/pull/12339)guides. Closes #12450 * github.com:scylladb/scylladb: doc: add the new upgrade guide to the toctree and fix its name docs: add the upgrade guide from ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2	2023-01-11 21:01:34 +02:00
Avi Kivity	cb2cb8a606	utils: small_vector: mark throw_out_of_range() const It can be called from the const version of small_vector::at. Closes #12493	2023-01-11 20:58:53 +02:00
Nadav Har'El	04d6402780	docs: cql-extensions.md: explain our NULL handling Our handling of NULLs in expressions is different from Cassandra's, and more uniform. For example, the filter "WHERE x = NULL" is an error in Cassandra, but supported in Scylla. Let's explain how and why. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12494	2023-01-11 20:56:50 +02:00
Wojciech Mitros	95031074a5	configure: fix the order of rust header generation Currently, no rule enforces that the cxx.h rust header is generated before compiling the .cc files generated from rust. This patch adds this dependency. Closes #12492	2023-01-11 16:55:53 +02:00
Botond Dénes	210738c9ce	Merge 'test.py: improve logging' from Kamil Braun Make it easy to see which clusters are operated on by which tests in which build modes and so on. Add some additional logs. These improvements would have saved me a lot of debugging time if I had them last week and we would have https://github.com/scylladb/scylladb/pull/12482 much faster. Closes #12483 * github.com:scylladb/scylladb: test.py: harmonize topology logs with test.py format test/pylib: additional logging during cluster setup test/pylib: prefix cluster/manager logs with the current test name test/pylib: pool: pass args and *kwargs to the build function from get() test.py: include mode in ScyllaClusterManager logs	2023-01-11 16:32:56 +02:00
Aleksandra Martyniuk	fcb3f76e78	test: add test checking recursive task status Rest api test checking whether task manager api returns recursive tasks' statuses properly in BFS order.	2023-01-11 12:34:17 +01:00
Aleksandra Martyniuk	6b79c92cb7	api: get task statuses recursively Sometimes to debug some task manager module, we may want to inspect the whole tree of descendants of some task. To make it easier, an api call getting a list of statuses of the requested task and all its descendants in BFS order is added.	2023-01-11 12:34:06 +01:00
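The BFS traversal mentioned in the commit above can be sketched as follows. The data layout (a `children` mapping and a `status` mapping keyed by task id) and the function name are hypothetical and do not reflect the task manager's actual structures.

```python
from collections import deque


def statuses_bfs(root, children, status):
    """Return the statuses of `root` and all its descendants in BFS order.

    `children` maps a task id to a list of child task ids;
    `status` maps a task id to its status record.
    """
    result = []
    queue = deque([root])
    while queue:
        task = queue.popleft()
        result.append(status[task])
        # Children are visited only after all tasks of the current depth.
        queue.extend(children.get(task, []))
    return result
```

For a tree where task `a` has children `b` and `c`, and `b` has child `d`, the returned order is the statuses of `a`, `b`, `c`, `d`.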
Konstantin Osipov	f3440240ee	test.py: harmonize topology logs with test.py format We need millisecond resolution in the log to be able to correlate test log with test.py log and scylla logs. Harmonize the log format for tests which actively manage scylla servers.	2023-01-11 10:09:42 +01:00
Kamil Braun	79712185d5	test/pylib: additional logging during cluster setup This would have saved me a lot of debugging time.	2023-01-11 10:09:42 +01:00
Kamil Braun	4f7e5ee963	test/pylib: prefix cluster/manager logs with the current test name The log file produced by test.py combines logs coming from multiple concurrent test runs. Each test has its own log file as well, but this "global" log file is useful when debugging problems with topology tests, since many events related to managing clusters are stored there. Make the logs easier to read by including information about the test case that's currently performing operations such as adding new servers to clusters and so on. This includes the mode, test run name and the name of the test case. We do this by using custom `Logger` objects (instead of calling `logging.info` etc. which uses the root logger) with `LoggerAdapter`s that include the prefixes. A bit of boilerplate 'plumbing' through function parameters is required but it's mostly straightforward. This doesn't apply to all events, e.g. boost test cases which don't setup a "real" Scylla cluster. These events don't have additional prefixes. Example: ``` 17:41:43.531 INFO> [dev/topology.test_topology.1] Cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) adding server... 17:41:43.531 INFO> [dev/topology.test_topology.1] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-10... 17:41:43.603 INFO> [dev/topology.test_topology.1] starting server at host 127.40.246.10 in scylla-10... 17:41:43.614 INFO> [dev/topology.test_topology.2] Cluster ScyllaCluster(name: 7a497fce-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(2, 127.40.246.2, f59d3b1d-efbb-4657-b6d5-3fa9e9ef786e), ScyllaServer(5, 127.40.246.5, 9da16633-ce53-4d32-8687-e6b4d27e71eb), ScyllaServer(9, 127.40.246.9, e60c69cd-212d-413b-8678-dfd476d7faf5), stopped: ) adding server... 
17:41:43.614 INFO> [dev/topology.test_topology.2] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-11... 17:41:43.670 INFO> [dev/topology.test_topology.2] starting server at host 127.40.246.11 in scylla-11... ```	2023-01-11 10:09:39 +01:00
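A prefixing `LoggerAdapter` of the kind described above might look roughly like this. The class name `TestPrefixAdapter` and the helper `make_test_logger` are hypothetical; only `logging.LoggerAdapter` and its `process()` hook are standard library API, and the prefix format is modelled on the example log lines.

```python
import logging


class TestPrefixAdapter(logging.LoggerAdapter):
    """Prepend a '[mode/test-run]' prefix to every message."""

    def process(self, msg, kwargs):
        return f"[{self.extra['prefix']}] {msg}", kwargs


def make_test_logger(base_logger, mode, test_run):
    """Wrap `base_logger` so its messages carry the mode and test run name."""
    return TestPrefixAdapter(base_logger, {"prefix": f"{mode}/{test_run}"})


logger = make_test_logger(logging.getLogger("cluster"), "dev", "topology.test_topology.1")
logger.info("starting server at host %s in %s...", "127.40.246.10", "scylla-10")
```

Passing the adapter through function parameters, instead of calling `logging.info` on the root logger, is what lets concurrent test runs be told apart in the shared log file.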
Avi Kivity	de0c31b3b6	cql3: query_options: simplify batch query_options constructor The batch constructor uses an unnecessarily complicated template, where in fact it only needs vector<vector<raw_value | raw_value_view>>. Simplify the constructor to allow exactly that. Delete some confusing comments around it. Closes #12488	2023-01-11 07:54:54 +02:00
Kamil Braun	2bda0f9830	test/pylib: pool: pass args and *kwargs to the build function from get() This will be used to specify a custom logger when building new clusters before starting tests, allowing to easily pinpoint which tests are waiting for clusters to be built and what's happening to these particular clusters.	2023-01-10 17:41:54 +01:00
Kamil Braun	ff2c030bf9	test.py: include mode in ScyllaClusterManager logs The logs often mention the test run and the current test case in a given run, such as `test_topology.1` and `test_topology.1::test_add_server_add_column`. However, if we run test.py in multiple modes, the different modes might be running the same test case and the logs become confusing. To disambiguate, prefix the test run/case names with the mode name. Example: ``` Leasing Scylla cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) for test dev/topology.test_topology.1::test_add_server_add_column ```	2023-01-10 17:41:54 +01:00
Wojciech Mitros	e558c7d988	functions: initialize aggregates on scylla start Currently, UDAs can't be reused if Scylla has been restarted since they have been created. This is caused by the missing initialization of saved UDAs that should have inserted them to the cql3::functions::functions::_declared map, that should store all (user-)created functions and aggregates. This patch adds the missing implementation in a way that's analogous to the method of inserting UDF to the _declared map. Fixes #11309	2023-01-10 17:44:18 +02:00
Wojciech Mitros	d1b809754c	database: wrap lambda coroutines used as arguments in coroutine::lambda Using lambda coroutines as arguments can lead to a use-after-free. Currently, the way these lambdas were used in do_parse_schema_tables did not lead to such a problem, but it's better to be safe and wrap them in coroutine::lambda(), so that they can't lead to this problem as long as we ensure that the lambda finishes in the do_parse_schema_tables() statement (for example using co_await). Closes #12487	2023-01-10 17:24:52 +02:00
Nadav Har'El	0edb090c67	test/cql-pytest: add simple tests for SELECT DISTINCT This patch adds a few simple functional tests for the SELECT DISTINCT feature, and how it interacts with other features, especially GROUP BY. 2 of the 5 new tests are marked xfail, and reproduce one old and one newly-discovered issue: Refs #5361: LIMIT doesn't work when using GROUP BY (the test here uses LIMIT and GROUP BY together with SELECT DISTINCT, so the LIMIT isn't honored). Refs #12479: SELECT DISTINCT doesn't refuse GROUP BY with clustering column. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12480	2023-01-10 13:29:26 +02:00
Michał Radwański	dcab289656	boost/mvcc_test: use failure_injecting_allocation_strategy where it is meant to In test_apply_is_atomic, a basic form of exception testing is used. There is failure_injecting_allocation_strategy, which however is not used for any allocation, since for some reason, `with_allocator(r.allocator()` is used instead of `with_allocator(alloc`. Fix that. Closes #12354	2023-01-10 12:01:36 +01:00
Tomasz Grabiec	ebcd736343	cache: Fix undefined behavior when populating with non-full keys Regression introduced in `23e4c8315`. view_and_holder position_in_partition::after_key() triggers undefined behavior when the key was not full because the holder is moved, which invalidates the view. Fixes #12367 Closes #12447	2023-01-10 12:51:54 +02:00
Jan Ciolek	8d7e35caef	cql3: expr: remove reference to temporary in get_rhs_receiver The function underlying_type() returns a data_type by value, but the code assigned it to a reference. At first I was sure this is an error (assigning temporary value to a reference), but it turns out that this is most likely correct due to C++ lifetime extension rules. I think it's better to avoid such unintuitive tricks. Assigning to value makes it clearer that the code is correct and there are no dangling references. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes #12485	2023-01-10 09:42:49 +02:00
Raphael "Raph" Carvalho	407c7fdaf2	docs: Fix command to create a symbolic link to relocatable pkg dir Closes #12481	2023-01-10 07:09:14 +02:00
Kamil Braun	822410c49b	test/pylib: scylla_cluster: release IPs when cluster is no longer needed With sufficiently many test cases we would eventually run out of IP addresses, because IPs (which are leased from a global host registry) would only be released at the end of an entire test suite. In fact we already hit this during next promotions, causing much pain indeed. Release IPs when a cluster, after being marked dirty, is stopped and thrown away. Closes #12482	2023-01-10 06:59:41 +02:00
Avi Kivity	e71e1dc964	Merge 'tools/scylla-sstable: add lua scripting support' from Botond Dénes Introduce a new "script" operation, which loads a script from the specified path, then feeds the mutation fragment stream to it. The script can then extract, process and present information from the sstable as it wishes. For now only Lua scripts are supported for the simple reason that Lua is easy to write bindings for, it is simple and lightweight and more importantly we already have Lua included in the Scylla binary as it is used as the implementation language for UDF/UDA. We might consider WASM support in the future, but for now we don't have any language support in WASM available. Example: ```lua function new_stats(key) return { partition_key = key, total = 0, partition = 0, static_row = 0, clustering_row = 0, range_tombstone_change = 0, }; end total_stats = new_stats(nil); function inc_stat(stats, field) stats[field] = stats[field] + 1; stats.total = stats.total + 1; total_stats[field] = total_stats[field] + 1; total_stats.total = total_stats.total + 1; end function on_new_sstable(sst) max_partition_stats = new_stats(nil); if sst then current_sst_filename = sst.filename; else current_sst_filename = nil; end end function consume_partition_start(ps) current_partition_stats = new_stats(ps.key); inc_stat(current_partition_stats, "partition"); end function consume_static_row(sr) inc_stat(current_partition_stats, "static_row"); end function consume_clustering_row(cr) inc_stat(current_partition_stats, "clustering_row"); end function consume_range_tombstone_change(crt) inc_stat(current_partition_stats, "range_tombstone_change"); end function consume_partition_end() if current_partition_stats.total > max_partition_stats.total then max_partition_stats = current_partition_stats; end end function on_end_of_sstable() if current_sst_filename then print(string.format("Stats for sstable %s:", current_sst_filename)); else print("Stats for stream:"); end print(string.format("\t%d 
fragments in %d partitions - %d static rows, %d clustering rows and %d range tombstone changes", total_stats.total, total_stats.partition, total_stats.static_row, total_stats.clustering_row, total_stats.range_tombstone_change)); print(string.format("\tPartition with max number of fragments (%d): %s - %d static rows, %d clustering rows and %d range tombstone changes", max_partition_stats.total, max_partition_stats.partition_key, max_partition_stats.static_row, max_partition_stats.clustering_row, max_partition_stats.range_tombstone_change)); end ``` Running this script will yield the following: ``` $ scylla sstable script --script-file fragment-stats.lua --system-schema system_schema.columns /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-1-big-Data.db Stats for sstable /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//me-1-big-Data.db: 397 fragments in 7 partitions - 0 static rows, 362 clustering rows and 28 range tombstone changes Partition with max number of fragments (180): system - 0 static rows, 179 clustering rows and 0 range tombstone changes ``` Fixes: https://github.com/scylladb/scylladb/issues/9679 Closes #11649 * github.com:scylladb/scylladb: tools/scylla-sstable: consume_reader(): improve pause heuristics test/cql-pytest/test_tools.py: add test for scylla-sstable script tools: add scylla-sstable-scripts directory tools/scylla-sstable: remove custom operation tools/scylla-sstable: add script operation tools/sstable: introduce the Lua sstable consumer dht/i_partitioner.hh: ring_position_ext: add weight() accessor lang/lua: export Scylla <-> lua type conversion methods lang/lua: use correct lib name for string lib lang/lua: fix type in aligned_used_data (meant to be user_data) lang/lua: use lua_State* in Scylla type <-> Lua type conversions tools/sstable_consumer: more consistent method naming tools/scylla-sstable: extract sstable_consumer interface into own header tools/json_writer: add accessor to 
underlying writer tools/scylla-sstable: fix indentation tools/scylla-sstable: export mutation_fragment_json_writer declaration tools/scylla-sstable: mutation_fragment_json_writer un-implement sstable_consumer tools/scylla-sstable: extract json writing logic from json_dumper tools/scylla-sstable: extract json_writer into its own header tools/scylla-sstable: use json_writer::DataKey() to write all keys tools/scylla-types: fix use-after-free on main lambda captures	2023-01-09 20:54:42 +02:00
Raphael S. Carvalho	05ffb024bb	replica: Kill table::calculate_shard_from_sstable_generation() Inferring shard from generation is long gone. We still use it in some scripts, but that's no longer needed in Scylla, when loading the SSTables, and it also conflicts with ongoing work of UUID-based generations. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12476	2023-01-09 20:17:57 +02:00
Takuya ASADA	548c9e36a1	main: add tcp_timestamps sanity check Check net.ipv4.tcp_timestamps, show warning message when it's not set to 1. Fixes #12144 Closes #12199	2023-01-09 19:08:21 +02:00
Nadav Har'El	d6e6820f33	Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity The CQL binary protocol version 3 was introduced in 2014. All Scylla versions support it, as do Cassandra versions 2.1 and newer. Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer use 32-bit collection sizes. Unfortunately, we implemented support for multiple serialization formats very intrusively, by pushing the format everywhere. This avoids the need to re-serialize (sometimes) but is quite obnoxious. It's also likely to be broken, since it's almost untested and it's too easy to write cql_serialization_format::internal() instead of propagating the client specified value. Since protocols 1 and 2 have been obsolete for 9 years, just drop them. It's easy to verify that they are no longer in use on a running system by examining the `system.clients` table before upgrade. Fixes #10607 Closes #12432 * github.com:scylladb/scylladb: treewide: drop cql_serialization_format cql: modification_statement: drop protocol check for LWT transport: drop cql protocol versions 1 and 2	2023-01-09 18:52:41 +02:00
Botond Dénes	bd42da6e69	tools/scylla-sstable: consume_reader(): improve pause heuristics The consume loop had some heuristics in place to determine whether, after pausing, the consumer wishes to skip just the partition or the remaining content of the sstable. These heuristics were flawed, so replace them with a non-heuristic method: track the last consumed fragment and look at it to determine what should be done.	2023-01-09 09:46:57 -05:00
Botond Dénes	1d222220e0	test/cql-pytest/test_tools.py: add test for scylla-sstable script To test the script operation, we use some of the example scripts from the example directory. Namely, dump.lua and slice.lua. These two scripts together have a very good coverage of the entire script API. Testing their functionality therefore also provides a good coverage of the lua bindings. A further advantage is that since both scripts dump output in identical format to that of the data-dump operation, it is trivial to do a comparison against this already tested operation. A targeted test is written for the sstable skip functionality of the consumer API.	2023-01-09 09:46:57 -05:00
Botond Dénes	ace42202df	tools: add scylla-sstable-scripts directory To be the home of example scripts for scylla-sstable. For now only a README.md is added describing the directory's purpose and with links to useful resources. One example script is added in this patch, more will come later.	2023-01-09 09:46:57 -05:00
Botond Dénes	7b40463f29	tools/scylla-sstable: remove custom operation We now have a script operation, the custom operation (poor man's script operation) has no reason to exist anymore.	2023-01-09 09:46:57 -05:00
Botond Dénes	e5071fdeab	tools/scylla-sstable: add script operation Loads the script from the specified path, then feeds the mutation fragment stream to it. For now only Lua scripts are supported for the simple reason that Lua is easy to write bindings for, it is simple and lightweight and more importantly we already have Lua included in the Scylla binary as it is used as the implementation language for UDF/UDA. We might consider WASM support in the future, but for now we don't have any language support in WASM available.	2023-01-09 09:46:57 -05:00
Botond Dénes	9dd5107919	tools/sstable: introduce the Lua sstable consumer The Lua sstable consumer loads a script from the specified path then feeds the mutation fragment stream to the script via the sstable_consumer methods, each method of which the script is allowed to define, effectively overloading the virtual method in Lua. This allows for very wide and flexible customization opportunities for what to extract from sstables and how to process and present them, without the need to recompile the scylla-sstable tool.	2023-01-09 09:46:57 -05:00
Botond Dénes	50b155e706	dht/i_partitioner.hh: ring_position_ext: add weight() accessor	2023-01-09 09:46:57 -05:00
Botond Dénes	8699fe5001	lang/lua: export Scylla <-> lua type conversion methods Currently hidden in lang/lua.cc, declare these in a header so others can use it.	2023-01-09 09:46:57 -05:00
Botond Dénes	e9a52837cf	lang/lua: use correct lib name for string lib AFAIK the mistake had no real consequence, but still it is nicer to have it correct.	2023-01-09 09:46:57 -05:00
Botond Dénes	76663d7774	lang/lua: fix typo in aligned_used_data (meant to be user_data)	2023-01-09 09:46:57 -05:00
Botond Dénes	943fc3b6f3	lang/lua: use lua_State* in Scylla type <-> Lua type conversions Instead of the lua_slice_state which is local to this file. We want to reuse the Scylla type <-> Lua type conversion functions but for that they have to use the more generic lua_State*. No functionality or convenience is lost with the switch, the code didn't make use of the other fields bundled in lua_slice_state.	2023-01-09 09:46:57 -05:00
Botond Dénes	8045751867	tools/sstable_consumer: more consistent method naming Use `consume_` consistently across the entire interface, instead of having some methods with `on_` and others with `consume_` prefixes.	2023-01-09 09:46:57 -05:00
Botond Dénes	8e117501ac	tools/scylla-sstable: extract sstable_consumer interface into own header So it can be used in code outside scylla-sstable.cc. This source file is quite large already, and as we have yet another large chunk of code to add, we want to add it in a separate file.	2023-01-09 09:46:57 -05:00
Botond Dénes	9b1c486051	tools/json_writer: add accessor to underlying writer	2023-01-09 09:46:57 -05:00
Botond Dénes	cfb5afbe9b	tools/scylla-sstable: fix indentation Left broken by previous patches.	2023-01-09 09:46:57 -05:00
Botond Dénes	d42b0bb5d5	tools/scylla-sstable: export mutation_fragment_json_writer declaration To json_writer.hh. Method definitions are left in scylla-sstable.cc. Indentation is left broken, will be fixed by the next patch.	2023-01-09 09:46:57 -05:00
Botond Dénes	517135e155	tools/scylla-sstable: mutation_fragment_json_writer un-implement sstable_consumer There is no point in the former implementing said interface. For one it is a futurized interface, which is not needed for something writing to the stdout. Rename the methods to follow the naming convention of rjson writers more closely.	2023-01-09 09:46:57 -05:00
Botond Dénes	0ee1c6ca57	tools/scylla-sstable: extract json writing logic from json_dumper We want to split this class into two parts: one with the actual logic converting mutation fragments to json, and a wrapper over this one, which implements the sstable_consumer interface. As a first step we extract the class as is (no changes) and just forward all calls from the now-empty wrapper to it.	2023-01-09 09:46:57 -05:00
Botond Dénes	55ef0ed421	tools/scylla-sstable: extract json_writer into its own header Other source files will want to use it soon.	2023-01-09 09:46:57 -05:00
Botond Dénes	8623818a8d	tools/scylla-sstable: use json_writer::DataKey() to write all keys This method was renamed from its previous name of PartitionKey. Since in json partition keys and clustering keys look alike, with the only difference being that the former may also have a token, it makes sense to have a single method to write them (with an optional token parameter). This was the case at some point, json_dumper::write_key() taking this role. However at a later point, json_writer::PartitionKey() was introduced and now the code uses both. Standardize on the latter and give it a more generic name.	2023-01-09 09:46:57 -05:00
Botond Dénes	602fca0a12	tools/scylla-types: fix use-after-free on main lambda captures The main lambda of scylla-types, the one passed to app_template::run(), was recently made a coroutine. app_template::run() however doesn't keep this lambda alive and hence after the first suspension point, accessing the lambda's captures triggers use-after-free. The simple fix is to convert the coroutine into a continuation chain.	2023-01-09 09:46:57 -05:00
Tomasz Grabiec	f97268d8f2	row_cache: Fix violation of the "oldest version are evicted first" when evicting last dummy Consider the following MVCC state of a partition: v2: ==== <7> [entry2] ==== <9> ===== <last dummy> v1: ================================ <last dummy> [entry1] Where === means a continuous range and --- means a discontinuous range. After two LRU items are evicted (entry1 and entry2), we will end up with: v2: ---------------------- <9> ===== <last dummy> v1: ================================ <last dummy> [entry1] This will cause readers to incorrectly think there are no rows before entry <9>, because the range is continuous in v1, and continuity of a snapshot is a union of continuous intervals in all versions. The cursor will see the interval before <9> as continuous and the reader will produce no rows. This is only temporary, because current MVCC merging rules are such that the flag on the latest entry wins, so we'll end up with this once v1 is no longer needed: v2: ---------------------- <9> ===== <last dummy> ...and the reader will go to sstables to fetch the evicted rows before entry <9>, as expected. The bug is in rows_entry::on_evicted(), which treats the last dummy entry in a special way, and doesn't evict it, and doesn't clear the continuity by omission. The situation is not easy to trigger because it requires certain eviction pattern concurrent with multiple reads of the same partition in different versions, so across memtable flushes. Closes #12452	2023-01-09 16:10:52 +02:00
Avi Kivity	1bb1855757	Merge 'replica/database: fix read related metrics' from Botond Dénes Sstable read related metrics have been broken for a long time now. First, the introduction of inactive reads (https://github.com/scylladb/scylladb/issues/1865) diluted this metric, as it now also contained inactive reads (contrary to the metric's name). Then, after moving the semaphore in front of the cache (`3d816b7c1`) this metric became completely broken as this metric now contains all kinds of reads: disk, in-memory and inactive ones too. This series aims to remedy this: * `scylla_database_active_reads` is fixed to only include active reads. * `scylla_database_active_reads_memory_consumption` is renamed to `scylla_database_reads_memory_consumption` and its description is brought up-to-date. * `scylla_database_disk_reads` is added to track current reads that have gone to disk. * `scylla_database_sstables_read` is added to track the number of sstables read currently. Fixes: https://github.com/scylladb/scylladb/issues/10065 Closes #12437 * github.com:scylladb/scylladb: replica/database: add disk_reads and sstables_read metrics sstables: wire in the reader_permit's sstable read count tracking reader_concurrency_semaphore: add disk_reads and sstables_read stats replica/database: fix active_reads_memory_consumption_metric replica/database: fix active_reads metric	2023-01-09 12:18:49 +02:00
Pavel Emelyanov	e20738cd7d	azure_snitch: Handle empty zone returned from IMDS Azure metadata API may return empty zone sometimes. If that happens shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to other shards. Empty zones response should be handled regardless. refs: #12185 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12274	2023-01-09 11:57:45 +02:00
Nadav Har'El	2d845b6244	test/cql-pytest: a test for more than one equality in WHERE Cassandra refuses a request with more than one equality relation to the same column, for example `DELETE FROM tbl WHERE partitionKey = ? AND partitionKey = ?`. It complains that "partitionkey cannot be restricted by more than one relation if it includes an Equal". Currently, Scylla doesn't consider such requests an error. Whether or not we should be compatible with Cassandra here is discussed in issue #12472. But as long as we do accept this query, we should be sure we do the right thing: "WHERE p = 1 AND p = 2" should match nothing (not the first, or last, value being tested..), and "WHERE p = 1 AND p = 1" should match the matches of p = 1. This patch adds a test to verify that these requests indeed yield correct results. The test is scylla_only because, as explained above, Cassandra doesn't support this feature at all. Refs #12472 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12473	2023-01-09 11:56:39 +02:00
Anna Stuchlik	b61515c871	doc: replace Scylla with ScyllaDB on the menu tree and major links; related: https://github.com/scylladb/scylla-docs/issues/3962 Closes #12456	2023-01-09 08:39:50 +02:00
Avi Kivity	42575340ba	Update seastar submodule * seastar ca586cfb8d...8889cbc198 (14): > http: request_parser: fix grammar ambiguity in field_content Fixes #12468 > sstring: use fold expression to simply copy_str_to() > sstring: use fold expression to simply str_len() > metrics: capture by move in make_function() > metrics: replace homebrew is_callable<> with is_invocable_v<> > reactor: use std::move() to avoid copy. > reactor: remove redundant semicolon. > reactor: use mutable to make std::move() work. > build: install liburing explicitly on ArchLinux. > reactor: use a for loop for submitting ios > metrics: add spaces around '=' > parallel utils: align concept with implementation > reactor: s/resize(0)/clear()/ > reactor: fix a typo in comment Closes #12469	2023-01-08 18:56:00 +02:00
Alejo Sanchez	d632e1aa7a	test/pytest: add missing import, remove unused import Add missed import time and remove unused name import. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12446	2023-01-08 17:38:46 +02:00
Avi Kivity	5ffe4fee6d	Merge 'Remove legacy half reverse' from Michał Radwański This commit removes consume_in_reverse::legacy_half_reverse, an option once used to indicate that the given key ranges are sorted descending, based on the clustering key of the start of the range, and that the range tombstones inside the partition would be sorted (descending, as all the mutation fragments would) according to their end (but range tombstones would still be stored according to their start bound). As it turns out, mutation::consume, when called with the legacy_half_reverse option, produces an invalid fragment stream, one where all the row tombstone changes come after all the clustering rows. This was not an issue, since when constructing results from the query, Scylla would not pass the tombstones to the client, but instead compact data beforehand. In this commit, the consume_in_reverse::legacy_half_reverse is removed, along with all the uses. As for the swap out in mutation_partition.cc in query_mutation and to_data_query_result: The downstream was not prepared to deal with legacy_half_reverse. mutation::consume contains ``` if (reverse == consume_in_reverse::yes) { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } else { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } ``` So why did it work at all? to_data_query_result deals with a single slice. The used consumer (compact_for_query_v2) compacts away the range tombstone changes, and thus the only difference between the consume_in_reverse::no and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys and the second one was ordered decreasing. This property is maintained if we swap out for the consume_in_reverse::yes format. 
Refs: #12353 Closes #12453 * github.com:scylladb/scylladb: mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes mutation: move consume_in_reverse def to mutation_consumer.hh	2023-01-08 15:42:00 +02:00
Botond Dénes	c4688563e3	sstables: track decompressed buffers Convert decompressed temporary buffers into tracked buffers just before returning them to the upper layer. This ensures these buffers are known to the reader concurrency semaphore and it has an accurate view of the actual memory consumption of reads. Fixes: #12448 Closes #12454	2023-01-08 15:34:28 +02:00
Kamil Braun	b77df84543	test: test_topology: make test_nodes_with_different_smp less hacky The test would use a trick to start a separate Scylla cluster from the one provided originally by the test framework. This is not supported by the test framework and may cause unexpected problems. Change the test to perform regular node operations. Instead of starting a fresh cluster of 3 nodes, we join the first of these nodes to the original framework-provided cluster, then decommission the original nodes, then bootstrap the other 2 fresh nodes. Also add some logging to the test. Refs: #12438, #12442 Closes #12457	2023-01-08 15:33:17 +02:00
Avi Kivity	02c9968e73	Merge 'Add WASM UDF implementation in Rust' from Wojciech Mitros This series adds the implementation and usage of rust wasmtime bindings. The WASM UDFs introduced by this patch are interruptible and use memory allocated using the seastar allocator. This series includes #11102 (the first two commits) because #11102 required disabling wasm UDFs completely. This patch disables them in the middle of the series, and enables them again at the end. After this patch, `libwasmtime.a` can be removed from the toolchain. This patch also removes the workaround for https://github.com/scylladb/scylladb/issues/9387 but it hasn't been tested with ARM yet - if the ARM test causes issues I'll revert this part of the change. Closes #11351 * github.com:scylladb/scylladb: build: remove references to unused c bindings of wasmtime test: assert that WASM allocations can fail without crashing wasm: limit memory allocated using mmap wasm: add configuration options for instance cache and udf execution test: check that wasmtime functions yield wasm: use the new rust bindings of wasmtime rust: add Wasmtime bindings rust: add build profiles more aligned with ninja modes rust: adjust build according to cxxbridge's recommendations tools: toolchain: dbuild: prepare for sharing cargo cache	2023-01-08 15:31:09 +02:00
Nadav Har'El	f5cda3cfc3	test/cql-pytest: add more tests for "timestamp" column type In issue #3668, a discussion spanning several years theorized that several things are wrong with the "timestamp" type. This patch begins by adding several tests that demonstrate that Scylla is in fact behaving correctly, and mostly identically to Cassandra except one esoteric error handling case. However, after eliminating the red herrings, we are left with the real issue that prompted opening #3668, which is a duplicate of issues #2693 and #2694, and this patch also adds a reproducer for that. The issue is that Cassandra 4 added support for arithmetic expressions on values, and durations can be added to timestamps, for example: '2011-02-03 04:05:12.345+0000' - 1d is a valid timestamp - and we don't currently support this syntax. So the new test - which passes on Cassandra 4 and fails on Scylla (or Cassandra 3) - is marked xfail. Refs #2693 Refs #2694 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12436	2023-01-08 15:00:49 +02:00
Michał Chojnowski	08b3a9c786	configure: don't reduce parsers' optimization level to 1 in release The line modified in this patch was supposed to increase the optimization levels of parsers in debug mode to 1, because they were too slow otherwise. But as a side effect, it also reduced the optimization level in release mode to 1. This is not a problem for the CQL frontend, because statement preparation is not performance-sensitive, but it is a serious performance problem for Alternator, where it lies in the hot path. Fix this by only applying the -O1 to debug modes. Fixes #12463 Closes #12460	2023-01-06 18:04:36 +02:00
Wojciech Mitros	903c4874d0	build: remove references to unused c bindings of wasmtime Before the changes introducing the new wasmtime bindings we relied on a downloaded static library libwasmtime.a. Now that the bindings are introduced, we do not rely on it anymore, so all references to it can be removed.	2023-01-06 14:07:29 +01:00
Wojciech Mitros	996a942e05	test: assert that WASM allocations can fail without crashing The main source of big allocations in the WASM UDF implementation is the WASM Linear Memory. We do not want Scylla to crash even if a memory allocation for the WASM Memory fails, so we assert that an exception is thrown instead. The wasmtime runtime does not actually fail on an allocation failure (assuming the memory allocator does not abort and returns nullptr instead - which our seastar allocator does). What happens then depends on the failed allocation handling of the code that was compiled to WASM. If the original code threw an exception or aborted, the resulting WASM code will trap. To make sure that we can handle the trap, we need to allow wasmtime to handle SIGILL signals, because that is what is used to carry information about WASM traps. The new test uses a special WASM Memory allocator that fails after n allocations, and the allocations include both memory growth instructions in WASM, as well as growing memory manually using the wasmtime API. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2023-01-06 14:07:29 +01:00
Wojciech Mitros	f05d612da8	wasm: limit memory allocated using mmap The wasmtime runtime allocates memory for the executable code of the WASM programs using mmap and not the seastar allocator. As a result, the memory that Scylla actually uses becomes not only the memory preallocated for the seastar allocator but the sum of that and the memory allocated for executable codes by the WASM runtime. To keep limiting the memory used by Scylla, we measure how much memory the WASM programs use and if they use too much, compiled WASM UDFs (modules) that are currently not in use are evicted to make room. To evict a module it is required to evict all instances of this module (the underlying implementation of modules and instances uses shared pointers to the executable code). For this reason, we add reference counts to modules. Each instance using a module is a reference. When an instance is destroyed, a reference is removed. If all references to a module are removed, the executable code for this module is deallocated. The eviction of a module is actually achieved by eviction of all its references. When we want to free memory for a new module we repeatedly evict instances from the wasm_instance_cache using its LRU strategy until some module loses all its instances. This process may not succeed if the instances currently in use (so not in the cache) use too much memory - in this case the query also fails. Otherwise the new module is added to the tracking system. This strategy may evict some instances unnecessarily, but evicting modules should not happen frequently, and any more efficient solution requires an even bigger intervention into the code.	2023-01-06 14:07:29 +01:00
Wojciech Mitros	b8d28a95bf	wasm: add configuration options for instance cache and udf execution Different users may require different limits for their UDFs. This patch allows them to configure the size of their cache of wasm, the maximum size of individual instances stored in the cache, the time after which the instances are evicted, the fuel that all wasm UDFs are allowed to consume before yielding (for the control of latency), the fuel that wasm UDFs are allowed to consume in total (to allow performing longer computations in the UDF without detecting an infinite loop) and the hard limit of the size of UDFs that are executed (to avoid large allocations).	2023-01-06 14:07:27 +01:00
Wojciech Mitros	3214f5c2db	test: check that wasmtime functions yield The new implementation for WASM UDFs allows executing the UDFs in pieces. This commit adds a test asserting that the UDF is in fact divided and that each of the execution segments takes no longer than 1ms.	2023-01-06 14:05:53 +01:00
Wojciech Mitros	3146807192	wasm: use the new rust bindings of wasmtime This patch replaces all dependencies on the wasmtime C++ bindings with our new ones. The wasmtime.hh and wasm_engine.hh files are deleted. The libwasmtime.a library is no longer required by configure.py. The SCYLLA_ENABLE_WASMTIME macro is removed and wasm udfs are now compiled by default on all architectures. In terms of implementation, most of the code using wasmtime was moved to the Rust source files. The remaining code uses names from the new bindings (which are mostly unchanged). Most wasmtime objects are now stored as a rust::Box<>, to make them compatible with rust lifetime requirements. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2023-01-06 14:05:53 +01:00
Wojciech Mitros	50b24cf036	rust: add Wasmtime bindings The C++ bindings provided by wasmtime are lacking a crucial capability: asynchronous execution of the wasm functions. This forces us to stop the execution of the function after a short time to prevent increasing the latency. Fortunately, this feature is implemented in the native language of Wasmtime - Rust. Support for Rust was recently added to scylla, so we can implement the async bindings ourselves, which is done in this patch. The bindings expose all the objects necessary for creating and calling wasm functions. The majority of code implemented in Rust is a translation of code that was previously present in C++. Types exported from Rust are currently required to be defined by the same crate that contains the bridge using them, so wasmtime types can't be exported directly. Instead, for each class that was supposed to be exported, a wrapper type is created, where its first member is the wasmtime class. Note that the members are not visible from C++ anyway, the difference only applies to Rust code. Aside from wasmtime types and methods, two additional types are exported with some associated methods. - The first one is ValVec, which is a wrapper for a rust Vec of wasmtime Vals. The underlying vector is required by wasmtime methods for calling wasm functions. By having it exported we avoid multiple conversions from a Val wrapper to a wasmtime Val, as would be required if we exported a rust Vec of Val wrappers (the rust Vec itself does not require wrappers if the type it contains is already wrapped) - The second one is Fut. This class represents a computation that may or may not be ready. We're currently using it to control the execution of wasm functions from C++. This class exposes one method: resume(), which returns a bool that signals whether the computation is finished or not. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2023-01-06 14:05:53 +01:00
Wojciech Mitros	33c97de25c	rust: add build profiles more aligned with ninja modes A cargo profile is created for each of the build modes: dev, debug, sanitize, release and coverage. The names of cargo profiles are prefixed by "rust-" because cargo does not allow separate "dev" and "debug" profiles. The main difference between profiles is their optimization levels, which correlate to the levels used in configure.py. The debug info is stripped only in the dev mode, and only this mode uses "incremental" compilation to speed it up.	2023-01-06 14:05:53 +01:00
Wojciech Mitros	4d7858e66d	rust: adjust build according to cxxbridge's recommendations Currently, the rust build system in Scylla creates a separate static library for each included rust package. This could cause duplicate symbol issues when linking against multiple libraries compiled from rust. This issue is fixed in this patch by creating a single static library to link against, which combines all rust packages implemented in Scylla. The Cargo.lock for the combined build is now tracked, so that all users of the same scylla version also use the same versions of imported rust modules. Additionally, the rust package implementation and usage docs are modified to be compatible with the build changes. This patch also adds a new header file 'rust/cxx.hh' that contains definitions of additional rust types available in c++.	2023-01-06 14:05:53 +01:00
Avi Kivity	eeaa475de9	tools: toolchain: dbuild: prepare for sharing cargo cache Rust's cargo caches downloaded sources in ~/.cargo. However dbuild won't provide access to this directory since it's outside the source directory. Prepare for sharing the cargo cache between the host and the dbuild environment by: - Creating the cache if it doesn't already exist. This is likely if the user only builds in a dbuild environment. - Propagating the cache directory as a mounted volume. - Respecting the CARGO_HOME override.	2023-01-06 14:05:53 +01:00
Avi Kivity	6868dcf30b	tools: toolchain: drop s390x from prepare script architecture list It's been a long while since we built ScyllaDB for s390x, and in fact the last time I checked it was broken on the ragel parser generator generating bad source files for the HTTP parser. So just drop it from the list. I kept s390x in the architecture mapping table since it's still valid. Closes #12455	2023-01-06 09:08:01 +02:00
Michał Radwański	1fbf433966	mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse This commit removes consume_in_reverse::legacy_half_reverse, an option once used to indicate that the given key ranges are sorted descending, based on the clustering key of the start of the range, and that the range tombstones inside partition would be sorted (descending, as all the mutation fragments would) according to their end (but range tombstone would still be stored according to their start bound). As it turns out, mutation::consume, when called with legacy_half_reverse option produces invalid fragment stream, one where all the row tombstone changes come after all the clustering rows. This was not an issue, since when constructing results from the query, Scylla would not pass the tombstones to the client, but instead compact data beforehand. In this commit, the consume_in_reverse::legacy_half_reverse is removed, along with all the uses. As for the swap out in mutation_partition.cc in query_mutation and to_data_query_result: The downstream was not prepared to deal with legacy_half_reverse. mutation::consume contains ``` if (reverse == consume_in_reverse::yes) { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } else { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } ``` So why did it work at all? to_data_query_result deals with a single slice. The used consumer (compact_for_query_v2) compacts-away the range tombstone changes, and thus the only difference between the consume_in_reverse::no and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys and the second one was ordered decreasing. This property is maintained if we swap out for the consume_in_reverse::yes format.	2023-01-05 18:48:55 +01:00
Botond Dénes	2612f98a6c	Merge 'Abort repair tasks' from Aleksandra Martyniuk Aborting of repair operation is fully managed by task manager. Repair tasks are aborted: - on shutdown; top level repair tasks subscribe to global abort source. On shutdown all tasks are aborted recursively - through node operations (applies to data_sync_repair_task_impls and their descendants only); data_sync_repair_task_impl subscribes to node_ops_info abort source - with task manager api (top level tasks are abortable) - with storage_service api and on failure; these cases were modified to be aborted the same way as the ones from above are. Closes #12085 * github.com:scylladb/scylladb: repair: make top level repair tasks abortable repair: unify a way of aborting repair operations repair: delete sharded abort source from node_ops_info repair: delete unused node_ops_info from data_sync_repair_task_impl repair: delete redundant abort subscription from shard_repair_task_impl repair: add abort subscription to data sync task tasks: abort tasks on system shutdown	2023-01-05 15:21:35 +01:00
Avi Kivity	cc6010b512	Merge 'Make restore_replica_count abortable' from Benny Halevy Similar to the way we allow aborting streaming-based removenode, subscribe to storage_service::_abort_source to request abort locally and pass a shared_ptr<abort_source> to `node_ops_info`, used to abort removenode_with_repair on shutdown. Fixes #12429 Closes #12430 * github.com:scylladb/scylladb: storage_service: restore_replica_count: demote status_checker related logging to debug level storage_service: restore_replica_count: allow aborting removenode_with_repair storage_service: coroutinize restore_replica_count storage_service: restore_replica_count: undefer stop_status_checker storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification storage_service: restore_replica_count: coroutinize status_checker	2023-01-05 15:21:35 +01:00
Kamil Braun	09da661eeb	Merge 'raft: replace experimental raft option with dedicated flag' from Gleb Natapov Unlike other experimental features, we want raft to be opt-in even after it leaves experimental mode. For that we need to have a separate option to enable it. The patch adds the binary option "consistent-cluster-management" for that. * 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev: raft: replace experimental raft option with dedicated flag main: move supervisor notification about group registry start where it actually starts	2023-01-05 15:21:35 +01:00
Anna Stuchlik	44e6f18d1b	doc: add the new upgrade guide to the toctree and fix its name	2023-01-05 14:13:33 +01:00
Anna Stuchlik	0ad2e3e63a	docs: add the upgrade guide from ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2	2023-01-05 13:30:10 +01:00
Aleksandra Martyniuk	dcb91457da	api: change retrieve_status signature Sometimes we may need task status to be nothrow move constructible. httpd::task_manager_json::task_status does not satisfy this requirement. retrieve_status returns future<full_task_status> instead of future<task_status> to provide an intermediate struct with better properties. An argument is passed by reference to prevent the necessity to copy foreign_ptr.	2023-01-05 13:28:51 +01:00
Kamil Braun	df72536fc5	Merge 'docs: add the upgrade guide for Enterprise from 2022.1 to 2022.2' from Anna Stuchlik Fixes https://github.com/scylladb/scylladb/issues/12314 This PR adds the upgrade guide for ScyllaDB Enterprise - from version 2022.1 to 2022.2. Using this opportunity, I've replaced "Scylla" with "ScyllaDB" in the upgrade-enterprise index file. In previous releases, we added several upgrade guides - one per platform (and version). In this PR, I've merged the information for different platforms to create one generic upgrade guide. It is similar to what @kbr- added for the Open Source upgrade guide from 5.0 to 5.1. See https://docs.scylladb.com/stable/upgrade/upgrade-opensource/upgrade-guide-from-5.0-to-5.1/. Closes #12339 * github.com:scylladb/scylladb: docs: add the info about minor release docs: add the new upgade guide 2022.1 to 2022.2 to the index and the toctree docs: add the index file for the new upgrage guide from 2022.1 to 2022.2 docs: add the metrics update file to the upgrade guide 2022.1 to 2022.2 docs: add the upgrade guide for ScyllaDB Enterprise from 2022.1 to 2022.2	2023-01-04 18:07:00 +01:00
Benny Halevy	086546f575	storage_service: restore_replica_count: demote status_checker related logging to debug level The status_checker is not the main line of business of restore_replica_count; starting and stopping it do not seem to deserve info level logging, which might have been useful in the past to debug issues surrounding that. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	3879ee1db8	storage_service: restore_replica_count: allow aborting removenode_with_repair Similar to the way we allow aborting streaming-based removenode, subscribe to storage_service::_abort_source to request abort locally and pass a shared_ptr<abort_source> to `node_ops_info`, used to abort removenode_with_repair on shutdown. Fixes #12429 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	afece5bdc4	storage_service: coroutinize restore_replica_count and unwrap the async thread started for streaming. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	d1eadc39c1	storage_service: restore_replica_count: undefer stop_status_checker Now that all exceptions in the rest of the function are swallowed, just execute the stop_status_checker deferred action serially before returning, on the way to coroutinizing restore_replica_count (since we can't co_await status_checker inside the deferred action). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	788ecb738d	storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification On the way to coroutinizing restore_replica_count, extract awaiting stream_async and send_replication_notification into try/catch blocks so we can later undefer stop_status_checker. The exception is still returned as an exceptional future which is logged by the caller as warning. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:02:42 +02:00
Benny Halevy	b54d121dfd	storage_service: restore_replica_count: coroutinize status_checker There is no need to start a thread for the status_checker; it can be implemented using a background coroutine. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:02:20 +02:00
Botond Dénes	1d273a98b9	readers/multishard: shard_reader::close() silence read-ahead timeouts Timeouts are benign, especially on a read-ahead that turned out to be not needed at all. They just introduce noise in the logs, so silence them. Fixes: #12435 Closes #12441	2023-01-04 16:10:09 +02:00
Kamil Braun	4268b1bbc2	Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr raft_group0 used to register RPC verbs only on shard 0. This worked on clusters with the same --smp setting on all nodes, since RPCs in this case are processed on the same shard as the calling code, and raft_group0 methods only run on shard 0. A new test test_nodes_with_different_smp was added to identify the problem. Since --smp can only be specified via the command line, a corresponding parameter was added to the ManagerClient.server_add method. It allows to override the default parameters set by the SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting individual items. Fixes: #12252 Closes #12374 * github.com:scylladb/scylladb: raft: raft_group0, register RPC verbs on all shards raft: raft_append_entries, copy entries to the target shard test.py, allow to specify the node's command line in test	2023-01-04 11:11:21 +01:00
Marcin Maliszkiewicz	61a9816bad	utils/rjson: enable inlining in rapidjson library Due to lack of NDEBUG macro inlining was disabled. It's important for parsing and printing performance. Testing with perf_simple_query shows that it reduced around 7000 insns/op, thus increasing median tps by 4.2% for the alternator frontend. Because inlined functions are called for every character in json this scales with request/response size. When default write size is increased by around 7x (from ~180 to ~ 1255 bytes) then the median tps increased by 12%. Running: ./build/release/test/perf/perf_simple_query_g --smp 1 \ --alternator forbid --default-log-level error \ --random-seed=1235000092 --duration=60 --write Results before the patch: median 46011.50 tps (197.1 allocs/op, 12.1 tasks/op, 170989 insns/op, 0 errors) median absolute deviation: 296.05 maximum: 46548.07 minimum: 42955.49 Results after the patch: median 47974.79 tps (197.1 allocs/op, 12.1 tasks/op, 163723 insns/op, 0 errors) median absolute deviation: 303.06 maximum: 48517.53 minimum: 44083.74 The change affects both json parsing and printing. Closes #12440	2023-01-04 10:27:35 +02:00
Michał Jadwiszczak	83bb77b8bb	test/boost/cql_query_test: enable `parallelized_aggregation` Run tests for parallelized aggregation with `enable_parallelized_aggregation` set always to true, so the tests work even if the default value of the option is false. Closes #12409	2023-01-04 10:11:25 +02:00
Anna Stuchlik	c4d779e447	doc: Fix https://github.com/scylladb/scylla-doc-issues/issues/854 - update the procedure to update topology strategy when nodes are on different racks Closes #12439	2023-01-04 09:50:10 +02:00
Avi Kivity	2739ac66ed	treewide: drop cql_serialization_format Now that we don't accept cql protocol version 1 or 2, we can drop cql_serialization_format everywhere, except when in the IDL (since it's part of the inter-node protocol). A few functions had duplicate versions, one with and one without a cql_serialization_format parameter. They are deduplicated. Care is taken that `partition_slice`, which communicates the cql_serialization_format across nodes, still presents a valid cql_serialization_format to other nodes when transmitting itself and rejects protocol 1 and 2 serialization format when receiving. The IDL is unchanged. One test checking the 16-bit serialization format is removed.	2023-01-03 19:54:13 +02:00
Avi Kivity	654b96660a	cql: modification_statement: drop protocol check for LWT CQL protocol 1 did not support LWT, but since we don't support it any more, we can drop the check and the supporting get_protocol_version() helper.	2023-01-03 19:51:57 +02:00
Avi Kivity	424dbf43f3	transport: drop cql protocol versions 1 and 2 Version 3 was introduced in 2014 (Cassandra 2.1) and was supported in the very first version of Scylla (`2a7da21481` "CQL binary protocol"). Cassandra 3.0 (2015) dropped protocols 1 and 2 as well. It's safe enough to drop it now, 9 years after introduction of v3 and 7 years after Cassandra stopped supporting it. Dropping it allows dropping cql_serialization_format, which causes quite a lot of pain, and is probably broken. This will be dropped in the following patch.	2023-01-03 19:47:49 +02:00
Avi Kivity	f600ad5c1b	Update seastar submodule * seastar 3db15b5681...ca586cfb8d (28): > reactor: trim returned buffer to received number of bytes > util/process: include used header > build: drop unused target_include_directories() > build: use BUILD_IN_SOURCE instead chdir <SOURCE_DIR> > build: specify CMake policy CMP0135 to new > tests: only destroy allocated pending connections > build: silence the output when generating private keys > tests, httpd: Limit loopback connection factory sharding > lw_shared_ptr: Add nullptr_t comparing operators > noncopyable_function: Add concept for (Func func) constructor > reactor: add process::terminate() and process::kill() > Merge 'tests, include: include headers without ".." in path' from Kefu Chai > build: customize toolset for building Boost > build: use different toolset base on specified compiler > allocator: add an option to reserve additional memory for the OS > Merge 'build: pass cflags and ldflags to cooking.sh' from Kefu Chai > build: build static library of cryptopp > gate: add gate holders debugging > build: detect debug build of yaml-cpp also > build: do not use pkg_search_module(IMPORTED_TARGET) for finding yaml-cpp > build: bump yaml-cpp to 0.7.0 in cooking_recipe > build: bump cryptopp to 8.7.0 in cooking_recipe > build: bump boost to 1.81.0 in cooking_recipe > build: bump fmtlib to 9.1.0 in cooking_recipe > shared_ptr: add overloads for fmt::ptr() > chunked_fifo: const_iterator: use the base class ctor > build: s/URING_LIBARIES/URING_LIBRARIES/ > build: export the full path of uring with URING_LIBRARIES Closes #12434	2023-01-03 17:58:31 +02:00
Alejo Sanchez	889acf710c	test/python: increase CQL connection timeout for... test_ssl In very slow debug builds the default driver timeouts are too low and tests might fail. Bump up the values to a more reasonable time. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12408	2023-01-03 17:10:46 +02:00
Nadav Har'El	1c96d2134f	docs,alternator: link to issue about missing ACL feature The alternator compatibility.md document mentions the missing ACL (access control) feature, but unlike other missing features we forgot to link to the open issue about this missing feature. So let's add that link. Refs #5047. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12399	2023-01-03 16:50:33 +02:00
Kamil Braun	fc57626afa	Merge 'docs: remove auto_bootstrap option from the documentation' from Anna Stuchlik Fixes https://github.com/scylladb/scylladb/issues/12318 This PR removes all occurrences of the `auto_bootstrap` option in the docs. In most cases, I've simply removed the option name and its definition, but sometimes additional changes were necessary: - In node-joined-without-any-data.rst, I removed the `auto_bootstrap` option as one of the causes of the problem. - In rebuild-node.rst, I removed the first step in the procedure (enabling the `auto_bootstrap` option). - In admin.rst, I removed the section about manual bootstrapping - it's based on setting `auto_bootstrap` to false, which is not possible now. Closes #12419 * github.com:scylladb/scylladb: docs: remove the auto_bootstrap option from the admin procedures - involves removing the Manual Bootstraping section docs: remove the auto_bootstrap option from the procedure to replace a dead node docs: remove the auto_bootstrap option from the Troubleshooting article about a node joining with no data docs: remove the auto_bootstrap option from the procedure to rebuild a node after losing the data volume docs: remove the auto_bootstrap option from the procedures to create a cluster or add a DC	2023-01-03 15:44:00 +01:00
Botond Dénes	e4d5b2a373	replica/database: add disk_reads and sstables_read metrics Tracking the current number of reads gone to disk and the current number of sstables read by all such reads respectively.	2023-01-03 09:37:29 -05:00
Botond Dénes	2acfa950d7	sstables: wire in the reader_permit's sstable read count tracking Hook in the relevant methods when creating and destroying sstable readers.	2023-01-03 09:37:29 -05:00
Botond Dénes	2c0de50969	reader_concurrency_semaphore: add disk_reads and sstables_read stats And the infrastructure to reader_permit to update them. The infrastructure is not wired in yet. These metrics will be used to count the number of reads gone to disk and the number of sstables read currently respectively.	2023-01-03 09:37:29 -05:00
Botond Dénes	dcd2deb5af	replica/database: fix active_reads_memory_consumption_metric Rename to reads_memory_consumption and drop the "active" from the description as well. This metric tracks the memory consumption of all reads: active or inactive. We don't even currently have a way to track the memory consumption of only active reads. Drop the part of the description which explains the interaction with other metrics: this part is outdated and the new interactions are much more complicated, no way to explain in a metric description. Also ask the semaphore to calculate the memory amount, instead of doing it in the metric itself.	2023-01-03 09:25:47 -05:00
Petr Gusev	8417840647	raft: raft_group0, register RPC verbs on all shards raft_group0 used to register RPC verbs only on shard 0. This worked on clusters with the same --smp setting on all nodes, since RPCs in this case are (usually) processed on the same shard as the calling code, and raft_group0 methods only run on shard 0. A new test test_nodes_with_different_smp was added to identify the problem. Fixes: #12252	2023-01-03 17:04:07 +03:00
Anna Stuchlik	00ef20c3df	docs: remove the auto_bootstrap option from the admin procedures - involves removing the Manual Bootstraping section	2023-01-03 14:48:01 +01:00
Anna Stuchlik	b7d62b2fc7	docs: remove the auto_bootstrap option from the procedure to replace a dead node	2023-01-03 14:47:55 +01:00
Anna Stuchlik	bc62e61df1	docs: remove the auto_bootstrap option from the Troubleshooting article about a node joining with no data	2023-01-03 14:46:38 +01:00
Anna Stuchlik	1602f27cd7	docs: remove the auto_bootstrap option from the procedure to rebuild a node after losing the data volume	2023-01-03 14:45:08 +01:00
Botond Dénes	929481ea9c	replica/database: fix active_reads metric This metric has been broken for a long time, since inactive reads were introduced. As calculated currently, it includes all permits that passed admission, including inactive reads. On the other hand, it excludes permits created bypassing admission. Fix by using the newly introduced (in this patch) reader_concurrency_semaphore::active_reads() as the basis of this metric: this now includes all permits (reads) that are currently active, excluding waiters and inactive reads.	2023-01-03 08:12:25 -05:00
Petr Gusev	7725e03a09	raft: raft_append_entries, copy entries to the target shard If append_entries RPC was received on a non-zero shard, we may need to pass it to a zero (or, potentially, some other) shard. The problem is that raft::append_request contains entries in the form of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't support cross-shard reference counting. In debug mode it contains a special ref-counting facility debug_shared_ptr_counter_type, which resorts to on_internal_error if it detects such a case. To solve this, we just copy log entries to the target shard if it isn't equal to the current one. In most cases, if --smp setting is the same on all nodes, RPC will be handled on zero shard, so there will be no overhead.	2023-01-03 15:25:00 +03:00
Petr Gusev	1c23390f12	test.py, allow to specify the node's command line in test An optional parameter cmdline has been added to the ManagerClient.server_add method. It allows you to override the default parameters set by the SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting individual items. To change or add a parameter just specify its name and value one after the other. To remove parameter use the special keyword __remove__ as a value. To set a parameter without a value (such as --overprovisioned) use the special keyword __missing__ as the value.	2023-01-03 15:24:54 +03:00
Nadav Har'El	eb85f136c8	cql-pytest: document how to write new cql-pytest tests Add to test/cql-pytest/README.md an explanation of the philosophy of the cql-pytest test suite, and some guidelines on how to write good tests in that framework. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12400	2023-01-03 12:13:22 +02:00
Anna Stuchlik	994bc33147	docs: fix the command on the Manager-Monitoring Integration troubleshooting page Closes #12375	2023-01-03 11:41:16 +02:00
Anna Stuchlik	9d17d812c0	docs: Fix https://github.com/scylladb/scylla-doc-issues/issues/870 , update the nodetool rebuild command Closes #12416	2023-01-03 11:40:40 +02:00
Gleb Natapov	1688163233	raft: replace experimental raft option with dedicated flag Unlike other experimental features, we want raft to be optional even after it leaves experimental mode. For that we need to have a separate option to enable it. The patch adds the binary option "consistent-cluster-management" for that.	2023-01-03 11:15:11 +02:00
Gleb Natapov	29060cc235	main: move supervisor notification about group registry start where it actually starts `99fe580068` moved raft_group_registry::start call a bit later, but forgot to move supervisor notification call. Do it now.	2023-01-03 11:09:30 +02:00
Botond Dénes	2ef71e9c70	Merge 'Improve verbosity of task manager api' from Aleksandra Martyniuk The PR introduces changes to task manager api: - extends tasks' list returned with get_tasks with task type, keyspace, table, entity, and sequence number - extends status returned with get_task_status and wait_task with a list of children's ids Closes #12338 * github.com:scylladb/scylladb: api: extend status in task manager api api: extend get_tasks in task manager api	2023-01-03 10:39:41 +02:00
Botond Dénes	82101b786d	Merge 'docs: document scylla-api-client' from Anna Stuchlik Fixes https://github.com/scylladb/scylladb/issues/11999. This PR adds a description of scylla-api-cli. Closes #12392 * github.com:scylladb/scylladb: docs: fix the description of the system log POST example docs: uptate the curl tool name docs: describe how to use the scylla-api-client tool docs: fix the scylla-api-client tool name docs: document scylla-api-cli	2023-01-03 10:30:04 +02:00
Benny Halevy	63c2cdafe8	sstables: index_reader: close(index_bound&) reset current_list When closing _lower_bound and *_upper_bound in the final close() call, they are currently left with an engaged current_list member. If the index_reader uses a _local_index_cache, it is evicted with evict_gently which will, rightfully, see the respective pages as referenced, and they won't be evicted gently (only later when the index_reader is destroyed). Reset index_bound.current_list on close(index_bound&) to free up the reference. Ref #12271 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12370	2023-01-02 16:42:33 +01:00
Avi Kivity	767b7be8be	Merge 'Get rid of handle_state_replacing' from Benny Halevy Since [repair: Always use run_replace_ops](`2ec1f719de`), nodes no longer publish HIBERNATE state so we don't need to support handling it. Replace is now always done using node operations (using repair or streaming). so nodes are never expected to change status to HIBERNATE. Therefore storage_service:handle_state_replacing is not needed anymore. This series gets rid of it and updates documentation related to STATUS:HIBERNATE respectively. Fixes #12330 Closes #12349 * github.com:scylladb/scylladb: docs: replace-dead-node: get rid of hibernate status storage_service: get rid of handle_state_replacing	2023-01-02 13:35:29 +02:00
Gleb Natapov	28952d32ff	storage_service: move leave_ring outside of unbootstrap() We want to reuse the latter without the call. Message-Id: <20221228144944.3299711-17-gleb@scylladb.com>	2023-01-02 12:03:29 +02:00
Gleb Natapov	229cef136d	raft: add trace logging to raft::server::start Allows to see initial state of the server during start. Message-Id: <20221228144944.3299711-15-gleb@scylladb.com>	2023-01-02 11:57:53 +02:00
Gleb Natapov	96453ff75f	service: raft: improve group0_state_machine::apply logging Trace how many entries are applied as well. Message-Id: <20221228144944.3299711-14-gleb@scylladb.com>	2023-01-02 11:57:16 +02:00
Gleb Natapov	dbd5b97201	storage_service: improve logging in update_pending_ranges() function We pass the reason for the change. Log it as well. Message-Id: <20221228144944.3299711-11-gleb@scylladb.com>	2023-01-02 11:54:03 +02:00
Gleb Natapov	04ab673359	messaging: check that a node knows its own topology before accessing it We already check if the remote node's topology is missing before creating a connection, but the local node's topology can be missing too when we use raft to manage it. Raft needs to be able to create connections before the topology is known. Message-Id: <20221228144944.3299711-7-gleb@scylladb.com>	2023-01-02 11:53:14 +02:00
Gleb Natapov	6f104982e1	topology: use std::erase_if on std::map instead of ad-hoc loop There is std::erase_if since c++20. We can use it here. Message-Id: <20221228144944.3299711-6-gleb@scylladb.com>	2023-01-02 11:45:52 +02:00
Gleb Natapov	84eb5924ac	system_keyspace: remove redundant include storage_proxy.hh is included twice Message-Id: <20221228144944.3299711-4-gleb@scylladb.com>	2023-01-02 11:39:22 +02:00
Gleb Natapov	5182543df2	raft: fix typo in read_barrier logging The log logs applied index not append one. Message-Id: <20221228144944.3299711-3-gleb@scylladb.com>	2023-01-02 11:38:47 +02:00
Gleb Natapov	5a96751534	storage_service: remove start_leaving since it is no longer used Message-Id: <20221228144944.3299711-2-gleb@scylladb.com>	2023-01-02 11:37:48 +02:00
Raphael S. Carvalho	b4e4bbd64a	database_test: Reduce x_log2_compaction_group values to avoid timeout database_test is timing out because it's having to run the tests calling do_with_cql_env_and_compaction_groups 3x, one for each compaction group setting. Reduce it to 2 settings instead of 3 if running in debug mode. Refs #12396. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12421	2023-01-01 13:56:18 +02:00
Raphael S. Carvalho' via ScyllaDB development	a7c4a129cb	sstables: Bump row_reads metrics for mx version Metric was always 0 despite a row was processed by mx reader. Fixes #12406. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20221227220202.295790-1-raphaelsc@scylladb.com>	2022-12-30 18:38:30 +01:00
Anna Stuchlik	601aeb924a	docs: remove the auto_bootstrap option from the procedures to create a cluster or add a DC	2022-12-30 13:10:06 +01:00
Avi Kivity	8635d24424	build: drop abseil submodule, replace with distribution abseil This lets us carry fewer things and rely on the distribution for maintenance. The frozen toolchain is updated. Incidental updates include clang 15.0.6, and pytest that doesn't need workarounds. Closes #12397	2022-12-28 19:02:23 +02:00
Avi Kivity	eced91b575	Revert "view: coroutinize maybe_mark_view_as_built" This reverts commit `ac2e2f8883`. It causes a regression ("std::bad_variant_access in load_view_build_progress"). Commit `2978052113` (a reindent) is also reverted as part of the process. Fixes #12395	2022-12-28 15:36:05 +02:00
Nadav Har'El	200bc82913	test/cql-pytest: exit immediately if Scylla is down In commit `acfa180766` we added to test/cql-pytest a mechanism to detect when Scylla crashes in the middle of a test function - in which case we report the culprit test and exit immediately to avoid having a hundred more tests report that they failed as well just because Scylla was down. However, if Scylla was never up - e.g., if the user ran "pytest" without ever running Scylla - we still report hundreds of tests as having failed, which is confusing and not helpful. So with this patch, if a connection cannot be made to Scylla at all, the test exits immediately, explaining what went wrong, not blaming any specific test: $ pytest ... ! _pytest.outcomes.Exit: Cannot connect to Scylla at --host=localhost --port=9042 ! ============================ no tests ran in 0.55s ============================= Beyond being a helpful reminder for a developer who runs "pytest" without having started Scylla first (or using test/cql-pytest/run or test.py to start Scylla easily), this patch is also important when running tests through test.py if it reuses an instance of Scylla that crashed during an earlier pytest file's run. This patch does not fix test.py - it can still try to run pytest with a dead Scylla server without checking. But at least with this patch pytest will notice this problem immediately and won't report hundreds of test functions having failed. The only report the user will see will be the last test which crashed Scylla, which will make it easier to find this failure without being hidden between hundreds of spurious failures. Fixes #12360 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12401	2022-12-28 13:04:28 +02:00
Anna Stuchlik	d0db1a27c3	docs: fix the description of the system log POST example	2022-12-28 11:25:54 +01:00
Anna Stuchlik	b7ec99b10b	docs: uptate the curl tool name	2022-12-28 10:33:07 +01:00
Asias He	b9e5e340aa	streaming: Enable offstrategy for all classic streaming based node ops This patch enables offstrategy compaction for all classic streaming based node ops. We can use this method because tables are streamed one after another. As long as there is still streamed data for a given table, we update the automatic trigger timer. When all the streaming has finished, the trigger timer will timeout and fire the offstrategy compaction for the given table. I checked with this patch, rebuild is 3X faster. There was no compaction in the middle of the streaming. The streamed sstables are compacted together after streaming is done. Time Before: INFO 2022-11-25 10:06:08,213 [shard 0] range_streamer - Rebuild succeeded, took 67 seconds, nr_ranges_remaining=0 Time After: INFO 2022-11-25 09:42:50,943 [shard 0] range_streamer - Rebuild succeeded, took 23 seconds, nr_ranges_remaining=0 Compaction Before: 88 sstables were written -> 88 sstables were added into main set Compaction After: 88 sstables written -> after offstrategy 2 sstables were added into main set Closes #11848	2022-12-28 11:12:02 +02:00
Michał Chojnowski	5e79d6b30b	tasks: task_manager: move invoke_on_task<> to .hh invoke_on_task is used in translation units where its definition is not visible, yet it has no explicit instantiations. If the compiler always decides to inline the definition, not to instantiate it implicitly, linking invoke_on_task will fail. (It happened to me when I turned up inline-threshold). Fix that. Closes #12387	2022-12-28 10:55:43 +02:00
Alejo Sanchez	d408b711e3	test/python: increase CQL connection timeouts In very slow debug builds the default driver timeouts are too low and tests might fail. Bump up the values to more reasonable time. These timeout values are the same as used in topology tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12405	2022-12-28 10:06:33 +02:00
Anna Stuchlik	39ade2f5a5	docs: describe how to use the scylla-api-client tool	2022-12-27 14:46:16 +01:00
Anna Stuchlik	2789501023	docs: fix the scylla-api-client tool name	2022-12-27 14:28:27 +01:00
Alejo Sanchez	1bfe234133	test/pylib: API get/set logger level of Scylla server Provide helpers to get and set logger level for Scylla servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12394	2022-12-25 13:58:43 +02:00
Anna Stuchlik	ea7e23bf92	docs: fix the option name from compaction to compression on the Data Definition page Fixes the option name in the "Other table options" table on the Data Definition page. Fixes #12334 Closes #12382	2022-12-25 11:24:56 +02:00
Botond Dénes	b0d95948e1	mutation_compactor: reset stop flag on page start When the mutation compactor has all the rows it needs for a page, it saves the decision to stop in a member flag: _stop. For single partition queries, the mutation compactor is kept alive across pages and so it has a method, start_new_page() to reset its state for the next page. This method didn't clear the _stop flag. This meant that the value set at the end of the previous page could cause the new page and subsequently the entire query to be stopped prematurely. This can happen if the new page starts with a row that is covered by a higher level tombstone and is completely empty after compaction. Reset the _stop flag in start_new_page() to prevent this. This commit also adds a unit test which reproduces the bug. Fixes: #12361 Closes #12384	2022-12-24 13:52:45 +02:00
Takuya ASADA	642d035067	docker: prevent hostname -i failure when server address is specified On some docker instance configurations, hostname resolution does not work, so our script will fail on startup because we use hostname -i to construct cqlshrc. To prevent the error, we can use --rpc-address or --listen-address for the address since it should be the same. Fixes #12011 Closes #12115	2022-12-24 13:52:16 +02:00
Asias He	d819d98e78	storage_service: Ignore dropped table for repair_updater In case a table is dropped, we should ignore it in the repair_updater, since we can not update off strategy trigger for a dropped table. Refs #12373 Closes #12388	2022-12-24 13:48:25 +02:00
Raphael S. Carvalho	67ebd70e6e	compaction_manager: Fix reactor stalls during periodic submissions Every 1 hour, compaction manager will submit all registered table_state for a regular compaction attempt, all without yielding. This can potentially cause a reactor stall if there are 1000s of table states, as compaction strategy heuristics will run on behalf of each, and processing all buckets and picking the best one is not cheap. This problem can be magnified with compaction groups, as each group is represented by a table state. This might appear in dashboard as periodic stalls, every 1h, misleading the investigator into believing that the problem is caused by a chronological job. This is fixed by piggybacking on compaction reevaluation loop which can yield between each submission attempt if needed. Fixes #12390. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12391	2022-12-24 13:43:16 +02:00
Anna Stuchlik	74fd776751	docs: document scylla-api-cli	2022-12-23 11:27:37 +01:00
Benny Halevy	8797958dfc	schema: operator<<: print also tombstone_gc_options They are currently missing from the printout when a table is created, but they are essential to understanding the mode with which tombstones are to be garbage-collected in the table. gcGraceSeconds alone is no longer enough since the introduction of tombstone_gc_option in `a8ad385ecd`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12381	2022-12-22 16:40:18 +02:00
Anna Stuchlik	7e8977bf2d	docs: add the info about minor release	2022-12-22 10:26:33 +01:00
Nadav Har'El	ef2e5675ed	materialized views, test: add tests for CLUSTERING ORDER BY In issue #10767, concerns were raised that the CLUSTERING ORDER BY clause is handled incorrectly in a CREATE MATERIALIZED VIEW definition. The tests in this patch try to explore the different ways in which CLUSTERING ORDER BY can be used in CREATE MATERIALIZED VIEW and allow us to compare Scylla's behavior to Cassandra, and to common sense. The tests discover that the CLUSTERING ORDER BY feature in materialized views generally works as expected, but there are three differences between Scylla and Cassandra in this feature. We consider two differences to be bugs (and hence the test is marked xfail) and one a Scylla extension: 1. When a base table has a reverse-order clustering column and this clustering column is used in the materialized view, in Cassandra the view's clustering order inherits the reversed order. In Scylla, the view's clustering order reverts to the default order. Arguably, both behaviors can be justified, but usually when in doubt we should implement Cassandra's behavior - not pick a different behavior, even if the different behavior is also reasonable. So this test (test_mv_inherit_clustering_order()) is marked "xfail", and a new issue was created about this difference: #12308. If we want to fix this behavior to match Cassandra's we should also consider backward compatibility - what happens if we change this behavior in Scylla now, after we had the opposite behavior in previous releases? We may choose to enshrine Scylla's Cassandra-incompatible behavior here - and document this difference. 2. The CLUSTERING ORDER BY should, as its name suggests, only list clustering columns. In Scylla, specifying other things, like regular columns, partition-key columns, or non-existent columns, is silently ignored, whereas it should result in an Invalid Request error (as it does in Cassandra). So test_mv_override_clustering_order_error() is marked "xfail".
This is the difference already discovered in #10767. 3. When a materialized view has several clustering columns, Cassandra requires that a CLUSTERING ORDER BY clause, if present, must specify the order of all clustering columns. Scylla, in contrast, allows the user to override the order of only some of these columns - and the rest get the default order. I consider this to be a legitimate Scylla extension, and not a compatibility bug, so marked the test with "scylla_only", and no issue was opened about it. Refs #10767 Refs #12308 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12307	2022-12-22 09:48:16 +02:00
Nadav Har'El	6d2e146aa6	test/cql-pytest.py: add scylla_inject_error() utility This patch adds a scylla_inject_error(), a context manager which tests can use to temporarily enable some error injection while some test code is running. It can be used to write tests that artificially inject certain errors instead of trying to reach the elaborate (and often requiring precise timing or high amounts of data) situation where they occur naturally. The error-injection API is Scylla-specific (it uses the Scylla REST API) and does not work on "release"-mode builds (all other modes are supported), so when Cassandra or release-mode build are being tested, the test which uses scylla_inject_error() gets skipped. Example usage: ```python from rest_api import scylla_inject_error with scylla_inject_error(cql, "injection_name", one_shot=True): # do something here ... ``` Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12264	2022-12-22 09:39:10 +02:00
Nadav Har'El	01f0644b22	Merge 'scylla-gdb.py: introduce `scylla get-config-value`' from Botond Dénes Retrieves the configuration item with the given name and prints its value as well as its metadata. Example: (gdb) scylla get-config-value compaction_static_shares value: 100, type: "float", source: SettingsFile, status: Used, live: MustRestart Closes #12362 * github.com:scylladb/scylladb: scylla-gdb.py: add scylla get-config-value gdb command scylla-gdb.py: extract $downcast_vptr logic to standalone method test: scylla-gdb/run: improve diagnostics for failed tests	2022-12-21 18:38:23 +02:00
Aleksandra Martyniuk	599fce16cf	repair: make top level repair tasks abortable	2022-12-21 11:52:58 +01:00
Aleksandra Martyniuk	e77de463e4	repair: unify a way of aborting repair operations	2022-12-21 11:52:53 +01:00
Aleksandra Martyniuk	f56e886127	repair: delete sharded abort source from node_ops_info Sharded abort source in node_ops_info is no longer needed since its functionality is provided by task manager's tasks structure.	2022-12-21 11:37:03 +01:00
Aleksandra Martyniuk	18efe0a4e8	repair: delete unused node_ops_info from data_sync_repair_task_impl	2022-12-21 11:28:30 +01:00
Aleksandra Martyniuk	ee13a5dde8	api: extend status in task manager api Status of tasks returned with get_task_status and wait_task is extended with the list of ids of child tasks.	2022-12-21 10:54:56 +01:00
Aleksandra Martyniuk	697af4ccf2	api: extend get_tasks in task manager api Each task stats in a list returned from tm::get_task api call is extended with info about: task type, keyspace, table, entity, and sequence number.	2022-12-21 10:54:50 +01:00
Michał Chojnowski	19049150ef	configure.py: remove --static, --pie, --so These options have been nonsense since 2017. --pie and --so are ignored, --static disables (sic!) static linking of libraries. Remove them. Closes #12366	2022-12-21 11:01:56 +02:00
Botond Dénes	29d49e829e	scylla-gdb.py: add scylla get-config-value gdb command Retrieves the configuration item with the given name and prints its value as well as its metadata. Example: (gdb) scylla get-config-value compaction_static_shares value: 100, type: "float", source: SettingsFile, status: Used, live: MustRestart	2022-12-21 03:05:56 -05:00
Botond Dénes	0cdb89868a	scylla-gdb.py: extract $downcast_vptr logic to standalone method So it can be reused by regular python code.	2022-12-21 03:05:56 -05:00
Botond Dénes	24022c19a6	test: scylla-gdb/run: improve diagnostics for failed tests By instructing gdb to print the full python stack in case of errors.	2022-12-21 03:05:56 -05:00
Michał Chojnowski	d9269abf5b	sstables: index_reader: always evict the local cache gently Due to an oversight, the local index cache isn't evicted gently when _upper_bound existed. This is a source of reactor stalls. Fix that. Fixes #12271 Closes #12364	2022-12-20 18:23:27 +02:00
Michał Radwański	e7fbcd6c9d	mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes The consume_in_reverse::legacy_half_reverse format is soon to be phased out. This commit starts treating frozen_mutations from replicas for reversed queries so that they are consumed with consume_in_reverse::yes.	2022-12-20 17:05:02 +01:00
Benny Halevy	1adb2bff18	mutation: move consume_in_reverse def to mutation_consumer.hh To be used also by frozen_mutation consumer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-20 16:23:10 +01:00
Avi Kivity	bb731b4f52	Merge 'docs: move documentation of tools online' from Botond Dénes Currently the scylla tools (`scylla-types` and `scylla-sstable`) have documentation in two places: high level documentation can be found at `docs/operating-scylla/admin-tools/scylla-{types,sstable}.rst`, while low level, more detailed documentation is embedded in the tool itself. This is especially pronounced for `scylla-sstable`, which only has a short description of its operations online, all details being found only in the command-line help. We want to move away from this model, such that all documentation can be found online, with the command-line help being reserved to documenting how the various switches and flags work, on top of a short description of the operation and a link to the detailed online docs. Closes #12284 * github.com:scylladb/scylladb: tool/scylla-sstable: move documentation online docs: scylla-sstable.rst: add sstable content section docs: scylla-{sstable,types}.rst: drop Syntax section	2022-12-20 17:04:47 +02:00
Avi Kivity	3fce43124a	Merge 'Static compaction groups' from Raphael "Raph" Carvalho Allows static configuration of number of compaction groups per table per shard. To bootstrap the project, config option x_log2_compaction_groups was added which controls both number of groups and partitioning within a shard. With a value of 0 (default), it means 1 compaction group, therefore all tokens go there. With a value of 3, it means 8 compaction groups, and 3 most-significant-bits of tokens being used to decide which group owns the token. And so on. It's still missing: - integration with repair / streaming - integration with reshard / reshape. perf/perf_simple_query --smp 1 --memory 1G BEFORE ----- median 61358.55 tps ( 71.1 allocs/op, 12.2 tasks/op, 56375 insns/op, 0 errors) median 61322.80 tps ( 71.1 allocs/op, 12.2 tasks/op, 56391 insns/op, 0 errors) median 61058.58 tps ( 71.1 allocs/op, 12.2 tasks/op, 56386 insns/op, 0 errors) median 61040.94 tps ( 71.1 allocs/op, 12.2 tasks/op, 56381 insns/op, 0 errors) median 61118.40 tps ( 71.1 allocs/op, 12.2 tasks/op, 56379 insns/op, 0 errors) AFTER ----- median 61656.12 tps ( 71.1 allocs/op, 12.2 tasks/op, 56486 insns/op, 0 errors) median 61483.29 tps ( 71.1 allocs/op, 12.2 tasks/op, 56495 insns/op, 0 errors) median 61638.05 tps ( 71.1 allocs/op, 12.2 tasks/op, 56494 insns/op, 0 errors) median 61726.09 tps ( 71.1 allocs/op, 12.2 tasks/op, 56509 insns/op, 0 errors) median 61537.55 tps ( 71.1 allocs/op, 12.2 tasks/op, 56491 insns/op, 0 errors) Closes #12139 * github.com:scylladb/scylladb: test: mutation_test: Test multiple compaction groups test: database_test: Test multiple compaction groups test: database_test: Adapt it to compaction groups db: Add config for setting static number of compaction groups replica: Introduce static compaction groups test: sstable_test: Stop referencing single compaction group api: compaction_manager: Stop a compaction type for all groups api: Estimate pending tasks on all compaction groups api: 
storage_service: Run maintenance compactions on all compaction groups replica: table: Adapt assertion to compaction groups replica: database: stop and disable compaction on behalf of all groups replica: Introduce table::parallel_foreach_table_state() replica: disable auto compaction on behalf of all groups replica: table: Rework compaction triggers for compaction groups replica: Adapt table::get_sstables_including_compacted_undeleted() to compaction groups replica: Adapt table::rebuild_statistics() to compaction groups replica: table: Perform major compaction on behalf of all groups replica: table: Perform off-strategy compaction on behalf of all groups replica: table: Perform cleanup compaction on behalf of all groups replica: Extend table::discard_sstables() to operate on all compaction groups replica: table: Create compound sstable set for all groups replica: table: Set compaction strategy on behalf of all groups replica: table: Return min memtable timestamp across all groups replica: Adapt table::stop() to compaction groups replica: Adapt table::clear() to compaction groups replica: Adapt table::can_flush() to compaction groups replica: Adapt table::flush() to compaction groups replica: Introduce parallel_foreach_compaction_group() replica: Adapt table::set_schema() to compaction groups replica: Add memtables from all compaction groups for reads replica: Add memtable_count() method to compaction_group replica: table: Reserve reader list capacity through a callback replica: Extract addition of memtables to reader list into a new function replica: Adapt table::occupancy() to compaction groups replica: Adapt table::active_memtable() to compaction groups replica: Introduce table::compaction_groups() replica: Preparation for multiple compaction groups scylla-gdb: Fix backward compatibility of scylla_memtables command	2022-12-20 17:04:47 +02:00
Avi Kivity	623be22d25	Merge 'sstables: allow bypassing min max position metadata loading' from Botond Dénes Said mechanism broke tools and tests to some extent: the read it executes on sstable load time means that if the sstable is broken enough to fail this read, it will fail to load, preventing diagnostic tools from loading and examining it and preventing tests from producing broken sstables for testing purposes. Closes #12359 * github.com:scylladb/scylladb: sstables: allow bypassing first/last position metadata loading sstables: sstable::{load,open_data}(): fix indentation sstables: coroutinize sstable::open_data() sstables: sstable::open_data(): use clear_gently() to clear token ranges sstables: coroutinize sstable::load()	2022-12-20 17:04:47 +02:00
Aleksandra Martyniuk	60e298fda1	repair: change utils::UUID to node_ops_id Type of the id of node operations is changed from utils::UUID to node_ops_id. This way the id of node operations would be easily distinguished from the ids of other entities. Closes #11673	2022-12-20 17:04:47 +02:00
Avi Kivity	88a1fbd72f	Update seastar submodule * seastar 3a5db04197...3db15b5681 (27): > build: get the full path of c-ares > build: unbreak pkgconfig output > http: Add 206 Partial Content response code > http: Carry integer content_length on reply > tls_test: drop duplicated includes > tls_test: remove duplicated test case > reactor: define __NR_pidfd_open if not defined > sockets: Wait on socket peer closing the connection > tcp: Close connection when getting RST from server > Merge 'Enhance rpc tester with delays, timeouts and verbosity' from Pavel Emelyanov > Merge 'build: use pkg_search_module(.. IMPORTED_TARGET ..) ' from Kefu Chai > build: define GnuTLS_{LIBRARIES, INCLUDE_DIRS} only if GnuTLS is found > build: use pkg_search_module(.. IMPORTED_TARGET ..) > addr2line: extend asan regex > abort_source: move-assign operator: call base class unlink > coroutine: correct syntax error in doxygen comment > demo: Extend http connection demo with https > test: temporarily disable warning for tests triggering warnings > tests/unit/coroutine: Include <ranges> > sstring: Document why sstring exists at all > test: log error when read/write to pipe fails > test: use executables in /bin > tests: spawn_test: use BOOST_CHECK_EQUAL() for checking equality of temporary_buffer > docker: bump up to clang {14,15} and gcc {11,12} > shared_ptr: ignore false alarm from GCC-12 > build: check for fix of CWG2631 > circleci: use versioned container image Closes #12355	2022-12-20 17:04:47 +02:00
Botond Dénes	3c8949d34c	sstables: allow bypassing first/last position metadata loading When loading an sstable. Tests and tools might want to do this to be able to load a damaged sstable to do tests/diagnostics on it.	2022-12-20 01:45:38 -05:00
Botond Dénes	bba956c13c	sstables: sstable::{load,open_data}(): fix indentation	2022-12-20 01:45:38 -05:00
Botond Dénes	c85ff7945d	sstables: coroutinize sstable::open_data() Used once when sstable is opened on startup, not performance sensitive.	2022-12-20 01:45:38 -05:00
Botond Dénes	15966a0b1b	sstables: sstable::open_data(): use clear_gently() to clear token ranges Instead of an open-coded loop. It also makes the code easier to coroutinize (next patch).	2022-12-20 01:45:22 -05:00
Nadav Har'El	08c8e0d282	test/alternator: enable tests for long strings of consecutive tombstones In the past we had issue #7933 where very long strings of consecutive tombstones caused Alternator's paging to take an unbounded amount of time and/or memory for a single page. This issue was fixed (by commit `e9cbc9ee85`) but the two tests we had reproducing that issue were left with the "xfail" mark. They were also marked "veryslow" - each taking about 100 seconds - so they didn't run by default so nobody noticed they started to pass. In this patch I make these tests much faster (taking less than a second together), confirm that they pass - and remove the "xfail" mark and improve their descriptions. The trick to making these tests faster is to not create a million tombstones like we used to: We now know that after a string of just 10,000 tombstones ('query_tombstone_page_limit') the page should end, so we can check specifically this number. The story is more complicated for partition tombstones, but there too it should be a multiple of query_tombstone_page_limit. To make the tests even faster, we change run.py to lower the query_tombstone_page_limit from the default 10,000 to 1000. The tests work correctly even without this change, but they are ten times faster with it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12350	2022-12-20 07:08:36 +02:00
Botond Dénes	94f3fb341f	Merge 'Fix nix devenv' from Michael Livshin * Update Nixpkgs base * Clarify some comments * Get rid of custom-packaged cxxbridge (it's now present in Nixpkgs as cxx-rs) * Add missing libraries (libdeflate, libxcrypt) * Fix expected hash of the gdb patch * Fix a couple of small build problems Fixes #12259 Closes #12346 * github.com:scylladb/scylladb: build: fix Nix devenv cql3: mark several private fields as maybe_unused configure.py: link with more abseil libs	2022-12-20 07:01:06 +02:00
Michael Livshin	7c383c6249	build: fix Nix devenv * Update Nixpkgs base * Clarify some comments * Get rid of custom-packaged cxxbridge (it's now present in Nixpkgs as cxx-rs) * Add missing libraries (libdeflate, libxcrypt) * Fix expected hash of the gdb patch * Bump Python driver to 3.25.20-scylla Fixes #12259	2022-12-19 20:53:07 +02:00
Michael Livshin	4407828766	cql3: mark several private fields as maybe_unused Because they are indeed unused -- they are initialized, passed down through some layers, but not actually used. No idea why only Clang 12 in debug mode in Nix devenv complains about it, though.	2022-12-19 20:53:07 +02:00
Michael Livshin	c0c8afb79e	configure.py: link with more abseil libs Specifically libabsl_strings{,_internal}.a. This fixes failure to link tests in the Nix devenv; since presumably all is good in other setups, it must be something weird having to do with inlining? The extra linked libraries shouldn't hurt in any case.	2022-12-19 20:53:07 +02:00
Raphael S. Carvalho	e7380bea65	test: mutation_test: Test multiple compaction groups Extends mutation_test to run the tests with more than one compaction group, in addition to a single one (default). Piggyback on existing tests. Avoids duplication. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 12:36:07 -03:00
Raphael S. Carvalho	e3e7c3c7e5	test: database_test: Test multiple compaction groups Extends database_test to run the tests with more than one compaction group, in addition to a single one (default). Piggyback on existing tests. Avoids duplication. Caught a bug when snapshotting, in implementation of table::can_flush(), showing its usefulness. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 12:36:07 -03:00
Raphael S. Carvalho	e103e41c76	test: database_test: Adapt it to compaction groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 12:36:05 -03:00
Aleksandra Martyniuk	be529cc209	repair: delete redundant abort subscription from shard_repair_task_impl data_sync_repair_task_impl subscribes to corresponding node_ops_info abort source and then, when requested, all its descendants are aborted recursively. Thus, shard_repair_task_impl does not need to subscribe to the node_ops_info abort source, since the parent task will take care of aborting once it is requested. abort_subscription and connected attributes are deleted from the shard_repair_task_impl.	2022-12-19 16:07:28 +01:00
Aleksandra Martyniuk	e48ca62390	repair: add abort subscription to data sync task When node operation is aborted, same should happen with the corresponding task manager's repair task. Subscribe data_sync_repair_task_impl abort() to node_ops_info abort_source.	2022-12-19 15:57:35 +01:00
Aleksandra Martyniuk	2b35d7df1b	tasks: abort tasks on system shutdown When the system shuts down, all task manager's top level tasks are aborted. Responsibility for aborting child tasks is on their parents.	2022-12-19 15:57:35 +01:00
Botond Dénes	827cd0d37b	sstables: coroutinize sstable::load() It is nicely simplified by it. No regression expected, this method is supposedly only used by tests and tools.	2022-12-19 09:33:52 -05:00
Raphael S. Carvalho	d9ab59043e	db: Add config for setting static number of compaction groups This new option allows user to control the number of compaction groups per table per shard. It's 0 by default which implies a single compaction group, as is today. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:24 -03:00
Raphael S. Carvalho	9cf4dc7b62	replica: Introduce static compaction groups This is the initial support for multiple groups. _x_log2_compaction_groups controls the number of compaction groups and the partitioning strategy within a single table. The value in _x_log2_compaction_groups refers to log base 2 of the actual number of groups. 0 means 1 compaction group. 1 means 2 groups and 1 most significant bit of token being used to pick the target group. The group partitioner should be later abstracted for making tablet integration easier in the future. _x_log2_compaction_groups is still a constant but a config option will come next. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:23 -03:00
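As a rough illustration of the partitioning described in the entry above, the sketch below picks a group from the most-significant bits of a token. It is not ScyllaDB code: `group_count` and `group_for_token` are hypothetical names, and the token is assumed to be an unsigned 64-bit value.

```python
TOKEN_BITS = 64  # assumption: tokens are treated as unsigned 64-bit values


def group_count(x_log2: int) -> int:
    """Number of compaction groups for a given log2 setting (hypothetical helper)."""
    return 1 << x_log2


def group_for_token(token: int, x_log2: int) -> int:
    """Pick a group index from the x_log2 most-significant bits of the token."""
    if x_log2 == 0:
        return 0  # a single group owns every token
    unsigned = token & ((1 << TOKEN_BITS) - 1)  # clamp to 64 bits
    return unsigned >> (TOKEN_BITS - x_log2)
```

With a setting of 3, for example, `group_count(3)` gives 8 groups and the top three bits of the token select one of them.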
Raphael S. Carvalho	c807e61715	test: sstable_test: Stop referencing single compaction group Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:20 -03:00
Raphael S. Carvalho	254c38c4d2	api: compaction_manager: Stop a compaction type for all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:19 -03:00
Raphael S. Carvalho	4e836cb96c	api: Estimate pending tasks on all compaction groups Estimates # of compaction jobs to be performed on a table. Adaptation is done by adding estimation from all groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:17 -03:00
Raphael S. Carvalho	640436e72a	api: storage_service: Run maintenance compactions on all compaction groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:15 -03:00
Raphael S. Carvalho	e0c5cbee8d	replica: table: Adapt assertion to compaction groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:13 -03:00
Raphael S. Carvalho	d35cf88f09	replica: database: stop and disable compaction on behalf of all groups With compaction group model, truncate_table_on_all_shards() needs to stop and disable compaction for all groups. replica::table::as_table_state() will be removed once no user remains, as each table may map to multiple groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:12 -03:00
Raphael S. Carvalho	50b02ee0bd	replica: Introduce table::parallel_foreach_table_state() This will replace table::as_table_state(). The latter will be killed once its usage drops to zero. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:10 -03:00
Raphael S. Carvalho	fd69bd433e	replica: disable auto compaction on behalf of all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:08 -03:00
Raphael S. Carvalho	6fefbe5706	replica: table: Rework compaction triggers for compaction groups Allow table-wide compaction trigger, as well as fine-grained trigger like after flushing a memtable on behalf of a single group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:07 -03:00
Raphael S. Carvalho	6a6adea3ab	replica: Adapt table::get_sstables_including_compacted_undeleted() to compaction groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:05 -03:00
Raphael S. Carvalho	5919836da8	replica: Adapt table::rebuild_statistics() to compaction groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:04 -03:00
Raphael S. Carvalho	70b727db31	replica: table: Perform major compaction on behalf of all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:01 -03:00
Raphael S. Carvalho	e3ccdb17a0	replica: table: Perform off-strategy compaction on behalf of all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:00 -03:00
Raphael S. Carvalho	6efc9fd1f6	replica: table: Perform cleanup compaction on behalf of all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:58 -03:00
Raphael S. Carvalho	36e11eb2a5	replica: Extend table::discard_sstables() to operate on all compaction groups discard_sstables() runs on context of truncate, which is a table-wide operation today, and will remain so with multiple static groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:55 -03:00
Raphael S. Carvalho	24c3687c3f	replica: table: Create compound sstable set for all groups Avoids extra compound set for single-compaction-group table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:52 -03:00
Raphael S. Carvalho	eb620da981	replica: table: Set compaction strategy on behalf of all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:50 -03:00
Raphael S. Carvalho	7a0e4f900f	replica: table: Return min memtable timestamp across all groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:49 -03:00
Raphael S. Carvalho	ceaa8a1ef1	replica: Adapt table::stop() to compaction groups Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:47 -03:00
Raphael S. Carvalho	facf923440	replica: Adapt table::clear() to compaction groups clear() clears memtable content and cache. Cache is shared by groups, therefore adaptation happens by only clearing memtables of all groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:45 -03:00
Raphael S. Carvalho	a9c902cd5e	replica: Adapt table::can_flush() to compaction groups can_flush() is used externally to determine if a table has an active memtable that can be flushed. Therefore, adaptation happens by returning true if any of the groups can be flushed. A subsequent flush request will flush memtable of all groups that are ready for it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:44 -03:00
Raphael S. Carvalho	ea42090d47	replica: Adapt table::flush() to compaction groups Adaptation of flush() happens by triggering flush on memtable of all groups. table::seal_active_memtable() will bail out if memtable is empty, so it's not a problem to call flush on a group whose memtable is empty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:42 -03:00
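The can_flush() and flush() adaptations in the two entries above follow an "any group" / "all groups" pattern. A minimal sketch of that pattern, assuming hypothetical `Group` and `Table` classes that only model a memtable as a list (this is not the replica::table implementation):

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for a compaction group with one active memtable."""
    memtable: list = field(default_factory=list)
    flushed: list = field(default_factory=list)

    def can_flush(self) -> bool:
        return bool(self.memtable)

    def flush(self) -> None:
        if not self.memtable:
            return  # empty memtable: bail out, nothing to seal
        self.flushed.append(self.memtable)
        self.memtable = []


@dataclass
class Table:
    """Hypothetical table that owns several groups."""
    groups: list

    def can_flush(self) -> bool:
        # True if any group has something to flush
        return any(g.can_flush() for g in self.groups)

    def flush(self) -> None:
        # Trigger flush on every group; empty ones are skipped by Group.flush()
        for g in self.groups:
            g.flush()
```

Calling `flush()` on a table where only one group holds data leaves the empty groups untouched, which mirrors why flushing a group with an empty memtable is harmless.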
Raphael S. Carvalho	7274c83098	replica: Introduce parallel_foreach_compaction_group() This variant will be useful when iterating through groups and performing async actions on each. It guarantees that all groups are alive by the time they're reached in the loop. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:40 -03:00
Raphael S. Carvalho	89ab9d7227	replica: Adapt table::set_schema() to compaction groups set_schema() is used by the database to apply schema changes to table components which include memtables. Adaptation happens by setting schema to memtable(s) of all groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:38 -03:00
Raphael S. Carvalho	0022322ae3	replica: Add memtables from all compaction groups for reads Let's add memtables of all compaction groups. Point queries are optimized by picking a single group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:36 -03:00
Raphael S. Carvalho	e044001176	replica: Add memtable_count() method to compaction_group Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:34 -03:00
Raphael S. Carvalho	f2ea79f26c	replica: table: Reserve reader list capacity through a callback add_memtables_to_reader_list() will be adapted to compaction groups. For point queries, it will add memtables of a single group. With the callback, add_memtables_to_reader_list() can tell its caller the exact amount of memtable readers to be added, so it can reserve precisely the readers capacity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:33 -03:00
Raphael S. Carvalho	e841508685	replica: Extract addition of memtables to reader list into a new function Will make it easier for adding memtables of all compaction groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:19 -03:00
Raphael S. Carvalho	530956b2de	replica: Adapt table::occupancy() to compaction groups table::occupancy() provides accumulated occupancy stats from memtables. Adaptation happens by accumulating stats from memtables of all groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:17 -03:00
Raphael S. Carvalho	ef8f542d75	replica: Adapt table::active_memtable() to compaction groups active_memtable() was fine for a single group, but with multiple groups, there will be one active memtable per group. Let's change the interface to reflect that. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:14 -03:00
Raphael S. Carvalho	429c5aa2f9	replica: Introduce table::compaction_groups() Useful for iterating through all groups. This is an intermediary implementation which requires allocation as only one group is supported today. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:12 -03:00
Raphael S. Carvalho	514008f136	replica: Preparation for multiple compaction groups Adjusts scylla_memtables gdb command to multiple groups, while keeping backward compatibility. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:10 -03:00
Raphael S. Carvalho	52b94b6dd7	scylla-gdb: Fix backward compatibility of scylla_memtables command Fix it while refactoring the code for arrival of multiple compaction groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:07 -03:00
Anna Stuchlik	bbfb9556fc	doc: mark the in-memory tables feature as deprecated Closes #12286	2022-12-19 15:39:31 +02:00
Avi Kivity	c70a9b0166	test: make test xml filenames more unique `ea99750de7` ("test: give tests less-unique identifiers") made the disambiguating ids only be unambiguous within a single test case. This made all tests named "run" have the name "run.1". Fix that by adding the suite name everywhere: in test paths, and in junit test case names. Fixes #12310. Closes #12313	2022-12-19 15:03:51 +02:00
Botond Dénes	3e6ddf21bc	Merge 'storage_service: unbootstrap: avoid unnecessary copy of ranges_to_stream' from Benny Halevy `ranges_to_stream` is a map of `std::unordered_multimap<dht::token_range, inet_address>` per keyspace. On large clusters with a large number of keyspaces, copying it may cause reactor stalls as seen in #12332. This series eliminates this copy by using std::move and also turns `stream_ranges` into a coroutine, adding maybe_yield calls to avoid further stalls down the road. Fixes #12332 Closes #12343 * github.com:scylladb/scylladb: storage_service: stream_ranges: unshare streamer storage_service: stream_ranges: maybe_yield storage_service: coroutinize stream_ranges storage_service: unbootstrap: move ranges_to_stream_by_keyspace to stream_ranges	2022-12-19 12:53:16 +02:00
Benny Halevy	e8aa1182b2	docs: replace-dead-node: get rid of hibernate status With replace using node operations, the HIBERNATE gossip status is not used anymore. This change updates documentation to reflect that. During replace, the replacing node shows in gossipinfo in STATUS:NORMAL. Also, the replaced node shows as DN in `nodetool status` while being replaced, so remove paragraph showing it's not listed in `nodetool status`. Plus, tidy up the text alignment. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-19 12:19:10 +02:00
Benny Halevy	c9993f020d	storage_service: get rid of handle_state_replacing Since `2ec1f719de` nodes no longer publish HIBERNATE state so we don't need to support handling it. Replace is now always done using node operations (using repair or streaming). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-19 12:19:08 +02:00
Benny Halevy	60de7d28db	storage_service: stream_ranges: unshare streamer Now that stream_ranges is a coroutine, streamer can be an automatic variable on the coroutine stack frame. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-19 07:42:07 +02:00
Benny Halevy	9badcd56ca	storage_service: stream_ranges: maybe_yield Prevent stalls with a large number of keyspaces and token ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-19 07:42:07 +02:00
Benny Halevy	2cf75319b0	storage_service: coroutinize stream_ranges Before adding maybe_yield calls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-19 07:42:01 +02:00
Benny Halevy	82486bb5d2	storage_service: unbootstrap: move ranges_to_stream_by_keyspace to stream_ranges Avoid a potentially large memory copy causing a reactor stall with a large number of keyspaces. Fixes #12332 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-19 07:39:48 +02:00
Avi Kivity	7c7eb81a66	Merge 'Encapsulate filesystem access by sstable into filesystem_storage subclass' from Pavel Emelyanov This is to define the API sstable needs from the underlying storage. When implementing the object-storage backend it will need to implement those. The API looks like: future<> snapshot(const sstable& sst, sstring dir, absolute_path abs) const; future<> quarantine(const sstable& sst, delayed_commit_changes* delay); future<> move(const sstable& sst, sstring new_dir, generation_type generation, delayed_commit_changes* delay); void open(sstable& sst, const io_priority_class& pc); // runs in async context future<> wipe(const sstable& sst) noexcept; future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity); It doesn't have "list" or the like, because it's not a method of an individual sstable, but rather the one from sstables_manager. It will come as a separate PR. Closes #12217 * github.com:scylladb/scylladb: sstable, storage: Mark dir/temp_dir private sstable: Remove get_dir() (well, almost) sstable: Add quarantine() method to storage sstable: Use absolute/relative path marking for snapshot() sstable: Remove temp_... stuff from sstable sstable: Move open_component() on storage sstable: Mark rename_new_sstable_component_file() const sstable: Print filename(type) on open-component error sstable: Reorganize new_sstable_component_file() sstable: Mark filename() private sstable: Introduce index_filename() tests: Disclosure private filename() calls sstable: Move wipe_storage() on storage sstable: Remove temp dir in wipe_storage() sstable: Move unlink parts into wipe_storage sstable: Remove get_temp_dir() sstable: Move write_toc() to storage sstable: Shuffle open_sstable() sstable: Move touch_temp_dir() to storage sstable: Move move() to storage sstable: Move create_links() to storage sstable: Move seal_sstable() to storage sstable: Tossing internals of seal_sstable() sstable: Move remove_temp_dir() to storage sstable: Move create_links_common() to storage sstable: Move check_create_links_replay() to storage sstable: Remove one of create_links() overloads sstable: Remove create_links_and_mark_for_removal() sstable: Indentation fix after previous patch sstable: Coroutinize create_links_common() sstable: Rename create_links_common()'s "dir" argument sstable: Make mark_for_removal bool_class sstable, table: Add sstable::snapshot() and use in table::take_snapshot sstable: Move _dir and _temp_dir on filesystem_storage sstable: Use sync_directory() method test, sstable: Use component_basename in test sstables: Move read_{digest\|checksum} on sstable	2022-12-18 17:29:35 +02:00
Anna Stuchlik	6a8eb33284	docs: add the new upgrade guide 2022.1 to 2022.2 to the index and the toctree	2022-12-16 17:13:50 +01:00
Anna Stuchlik	36f4ef2446	docs: add the index file for the new upgrade guide from 2022.1 to 2022.2	2022-12-16 17:11:25 +01:00
Anna Stuchlik	8d8983e029	docs: add the metrics update file to the upgrade guide 2022.1 to 2022.2	2022-12-16 17:09:21 +01:00
Anna Stuchlik	252c2139c2	docs: add the upgrade guide for ScyllaDB Enterprise from 2022.1 to 2022.2	2022-12-16 17:07:00 +01:00
Michał Chojnowski	b52bd9ef6a	db: commitlog: remove unused max_active_writes() Dead and misleading code. Closes #12327	2022-12-16 10:23:03 +02:00
Nadav Har'El	327539b15d	Merge 'test.py: fix cql failure handling' from Alecco Fix a bug in failure handling and log level. Closes #12336 * github.com:scylladb/scylladb: test.py: convert param to str test.py: fix error level for CQL tests	2022-12-16 09:29:21 +02:00
Botond Dénes	cc03becf82	Merge 'tasks: get task's type with method' from Aleksandra Martyniuk Type of operation is related to a specific implementation of a task. Then, it should rather be accessed with a virtual method in tasks::task_manager::task::impl than be its attribute. Closes #12326 * github.com:scylladb/scylladb: api: delete unused type parameter from task_manager_test api tasks: repair: api: remove type attribute from task_manager::task::status tasks: add type() method to task_manager::task::impl repair: add reason attribute to repair_task	2022-12-16 09:20:26 +02:00
Aleksandra Martyniuk	f81ad2d66a	repair: make shard tasks internal Shard tasks should not be visible to users by default, thus they are made internal. Closes #12325	2022-12-16 09:05:30 +02:00
Aleksandra Martyniuk	bae887da3b	tasks: add virtual destructor to task_manager::module When an object of a class inheriting from task_manager::module is destroyed, destructor of the derived class should be called. Closes #12324	2022-12-16 08:59:26 +02:00
Raphael S. Carvalho	e6fb3b3a75	compaction: Delete atomically off-strategy input sstables After commit `a57724e711`, off-strategy no longer races with view building, therefore deletion code can be simplified and piggyback on mechanism for deleting all sstables atomically, meaning a crash midway won't result in some of the files coming back to life, which leads to unnecessary work on restart. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12245	2022-12-16 08:15:49 +02:00
Alejo Sanchez	9b65448d38	test.py: convert param to str The format_unidiff() function takes str, not pathlib PosixPath, so convert it to str. This prevented the diff output of an unexpected result from being shown in the log file. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-12-15 20:46:35 +01:00
Alejo Sanchez	5142d80bb1	test.py: fix error level for CQL tests If the test fails, use error log level. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-12-15 20:45:44 +01:00
Botond Dénes	64903ba7d5	test/cql-pytest: use pytest site-packages workaround Recently, the pytest script shipped by Fedora started invoking python with the `-s` flag, which disables python considering user site packages. This caused problems for our tests which install the cassandra driver in the user site packages. This was worked around in `e5e7780f32` by providing our own pytest interposer launcher script which does not pass the above mentioned flag to python. Said patch fixed test.py but not the run.py in cql-pytest. So if the cql-pytest suite is run via test.py it works fine, but if it is invoked via the run script, it fails because it cannot find the cassandra driver. This patch patches run.py to use our own pytest launcher script, so the suite can be run via the run script as well. Since run.py is shared with the alternator pytest suite, this patch fixes said test suite too. Closes #12253	2022-12-15 16:05:31 +02:00
Benny Halevy	639e247734	test: cql-pytest: test_describe: test_table_options_quoting: USE test_keyspace Without that, I often (but not always) get the following error: ``` __________________________ test_table_options_quoting __________________________ cql = <cassandra.cluster.Session object at 0x7f1aafb10650> test_keyspace = 'cql_test_1671103335055' def test_table_options_quoting(cql, test_keyspace): type_name = f"some_udt; DROP KEYSPACE {test_keyspace}" column_name = "col''umn -- @quoting test!!" comment = "table''s comment test!\"; DESC TABLES --quoting test" comment_plain = "table's comment test!\"; DESC TABLES --quoting test" #without doubling "'" inside comment > cql.execute(f"CREATE TYPE \"{type_name}\" (a int)") test/cql-pytest/test_describe.py:623: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cassandra/cluster.py:2699: in cassandra.cluster.Session.execute ??? _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > ??? E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename" ``` The CQL driver in use is the scylla driver version 3.25.10. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12329	2022-12-15 14:35:33 +02:00
Aleksandra Martyniuk	f0b2b00a15	api: delete unused type parameter from task_manager_test api	2022-12-15 10:50:30 +01:00
Aleksandra Martyniuk	5bc09daa7a	tasks: repair: api: remove type attribute from task_manager::task::status	2022-12-15 10:49:09 +01:00
Aleksandra Martyniuk	8d5377932d	tasks: add type() method to task_manager::task::impl	2022-12-15 10:41:58 +01:00
Aleksandra Martyniuk	329176c7bc	repair: add reason attribute to repair_task As a preparation to creating a type() method in task_manager::task::impl a streaming::stream_reason is kept in repair_task.	2022-12-15 10:38:38 +01:00
Botond Dénes	9713a5c314	tool/scylla-sstable: move documentation online The inline-help of operations will only contain a short summary of the operation and the link to the online documentation. The move is not a straightforward copy-paste. First and foremost because we move from simple markdown to RST. Informal references are also replaced with proper RST links. Some small edits were also done on the texts. The intent is the following: * the inline help serves as a quick reference for what the operation does and what flags it has; * the online documentation serves as the full reference manual, explaining all details;	2022-12-15 04:10:21 -05:00
Botond Dénes	3cf7afdf95	docs: scylla-sstable.rst: add sstable content section Provides a link to the architecture/sstable page for more details on the sstable format itself. It also describes the mutation-fragment stream, the parts of it that are relevant to the sstable operations. The purpose of this section is to provide a target for links that want to point to a common explanation on the topic. In particular, we will soon move the detailed documentation of the scylla-sstable operations into this file and we want to have a common explanation of the mutation fragment stream that these operations can point to.	2022-12-15 04:10:21 -05:00
Botond Dénes	641fb4c8bb	docs: scylla-{sstable,types}.rst: drop Syntax section In both files, the section hierarchy is as follows: Usage Syntax Sections with actual content This scheme uses up 3 levels of hierarchy, leaving not much room to expand the sections with actual content with subsections of their own. Remove the Syntax level altogether, directly embedding the sections with content under the Usage section.	2022-12-15 04:03:00 -05:00
Botond Dénes	8f8284783a	Merge 'Fix handling of non-full clustering keys in the read path' from Tomasz Grabiec This PR fixes several bugs related to handling of non-full clustering keys. One is in trim_clustering_row_ranges_to(), which is broken for non-full keys in reverse mode. It will trim the range to position_in_partition_view::after_key(full_key) instead of position_in_partition_view::before_key(key), hence it will include the key in the resulting range rather than exclude it. Fixes #12180 after_key() was creating a position which is after all keys prefixed by a non-full key, rather than a position which is right after that key. This issue will be caught by cql_query_test::test_compact_storage in debug mode when mutation_partition_v2 merging starts inserting sentinels at position after_key() on preemption. It probably already causes problems for such keys as after_key() is used in various parts in the read path. Refs #1446 Closes #12234 * github.com:scylladb/scylladb: position_in_partition: Make after_key() work with non-full keys position_in_partition: Introduce before_key(position_in_partition_view) db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order types: Fix comparison of frozen sets with empty values	2022-12-15 10:47:12 +02:00
Pavel Emelyanov	6d10a3448b	sstable, storage: Mark dir/temp_dir private Now all storage access via sstable happens with the help of storage class API so its internals can be finally made private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	6296ca3438	sstable: Remove get_dir() (well, almost) The sstable::get_dir() is now gone, no callers know that sstable lives in any path on a filesystem. There are only few callers left. One is several places in code that need sstable datafile, toc and index paths to print them in logs. The other one is sstable_directory that is to be patched separately. For both there's a storage.prefix() method that prepends component name with where the sstable is "really" located. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	7402787d16	sstable: Add quarantine() method to storage Moving sstable to quarantine has some specifics -- if the sstable is in staging/ directory it's anyway moved into root/quarantine dir, not into the quarantine subdir of its current location. Encapsulate this feature in storage class method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	f507271578	sstable: Use absolute/relative path marking for snapshot() The snapshotting code uses full paths to files to manipulate snapshotted sstables. Until this code is patched to use some proper snapshotting API from sstable/ module, it will continue doing so. However, to remove the get_dir() method from sstable() the seal_sstable() needs to put relative "backup" directory to storage::snapshot() method. This patch adds a temporary bool_class to distinguish the two. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	a46d378bee	sstable: Remove temp_... stuff from sstable There's a bunch of helpers around XFS-specific temp-dir sitting in public sstable part. Drop it altogether, no code needs it for real. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	adba24d8ae	sstable: Move open_component() on storage Obtaining a class file object to read/write sstable from/to is now storage-specific. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	4c22831d23	sstable: Mark rename_new_sstable_component_file() const It's in fact such. Next patch will need it const to call this method via const sstable reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	6bf3e3a921	sstable: Print filename(type) on open-component error The file path is going to disappear soon, so print the filename() on error. For now it's the same, but the meaning of the filename() returning string is changing to become "random label for the log reader". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	dc72bce6d7	sstable: Reorganize new_sstable_component_file() The helper consists of three stages: 1. open a file (probably in a temp dir) 2. decorate it with extensions and checked_file 3. optionally rename a file from temp dir The latter is done to make XFS allocate this file in separate block group if the file was created in temp dir on step 1. This patch swaps steps 2 and 3 to keep filesystem-specific opening next to each other. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	e55c740f49	sstable: Mark filename() private From now on no callers should use this string to access anything on disk Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	5f579eb405	sstable: Introduce index_filename() Currently the sstable::filename(Index) is used in several places that get the filename as a printable or throwable string and don't treat it as a real location of any file. For those, add the index_filename() helper symmetrical to toc_filename() and (in some sense) the get_filename() one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	bbbbd6dbfc	tests: Disclosure private filename() calls The sstable::filename() is going to become private method. Lots of tests call it, but tests do call a lot of other sstable private methods, that's OK. Make the sstable::filename() yet another one of that kind in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	4a91f3d443	sstable: Move wipe_storage() on storage Now when the filesystem cleaning code is sitting in one method, it can finally be made the storage class one. Exception-safe allocation of toc_name (spoiler: it's copied anyway one step later, so it's "not that safe" actually) is moved into storage as well. The caller is left with toc_filename() call in its exception handler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	c92d45eaa9	sstable: Remove temp dir in wipe_storage() When unlinking an sstable for whatever reason it's good to check if the temp dir is hanging around. In some cases it's not (compaction), but keeping the whole wiping code together makes it easier to move it on storage class in one go. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	88ede71320	sstable: Move unlink parts into wipe_storage Just move the code. This is to make the next patch smaller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	0336cb3bdd	sstable: Remove get_temp_dir() Only one private caller of it is left, it's better to open-code it there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	3326063b8b	sstable: Move write_toc() to storage This method initiates the sstable creation. Effectively it's the first step in sstable creation transaction implemented on top of rename() call. Thus this method is moved onto storage under respective name. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	636d49f1c1	sstable: Shuffle open_sstable() When an sstable is prepared to be written on disk the .write_toc() is called on it which creates a temporary toc file. Prior to this, the writer code calls generate_toc() to collect components on the sstable. This patch adds the .open_sstable() API call that does both. This prepares the write_toc() part to be moved to storage, because it's not just "write data into TOC file", it's the first step in transaction implemented on top of rename()s. The tests need care -- there's rewrite_toc_without_scylla_component() thing in utils that doesn't want the generate_toc() part to be called. It's not patched here and continues calling .write_toc(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	d3216b10d6	sstable: Move touch_temp_dir() to storage The continuation of the previously moved remove_temp_dir() one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Pavel Emelyanov	1a34cb98fc	sstable: Move move() to storage The sstable can be "moved" in two cases -- to move from staging or to move to quarantine. Both operations are sstable API ones, but the implementation is storage-specific. This patch makes the latter a method of storage class. One thing to note is that only quarantine() touched the target directly. Now also the move_to_new_dir() happening on load also does it, but that's harmless. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:47 +03:00
Pavel Emelyanov	18f6165993	sstable: Move create_links() to storage This method is currently used in two places: sstable::snapshot() and sstable::seal_sstable(). The latter additionally touches the target backup/ subdir. This patch moves the whole thing on storage and adds touch for all the cases. For snapshots this might be excessive, but harmless. Tests get their private-disclosure way to access sstable._storage in few places to call create_links directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	136a8681e0	sstable: Move seal_sstable() to storage Now the sstable sealing is split into storage part, internal-state part and the seal-with-backup kick. This move makes remove_temp_dir() private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	334d231f56	sstable: Tossing internals of seal_sstable() There are two of them -- one API call and the other one that just "seals" it. The latter one also changes the _marked_for_deletion bit on the sstable. This patch makes the latter method prepared to be moved onto storage, because sealing means committing TOC file on disk with the help of rename system call which is purely storage thing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	ce3a8a4109	sstable: Move remove_temp_dir() to storage This one is simple, it just accesses _temp_dir thing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	9027d137d2	sstable: Move create_links_common() to storage Same as previous patch. This move makes the previously moved check_create_links_replay() a private method of the storage class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	990032b988	sstable: Move check_create_links_replay() to storage It needs to get sstable const reference to get the filename(s) from it. Other than that it's pure filesystem-accessing method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	041a8c80ad	sstable: Remove one of create_links() overloads There are two -- one that accepts generation and the other one that does not. The latter is only called by the former, so no need in keeping both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	f1558b6988	sstable: Remove create_links_and_mark_for_removal() There's only one user of it, it can document its "and mark for removal" intention via dedicated bool_class argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	65f40b28e6	sstable: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	428adda4a9	sstable: Coroutinize create_links_common() Looks much shorter and easier-to-patch this way. The dst_dir argument is made value from const reference, old code copied it with do_with() anyway. Indentation is deliberately left broken until next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	ab13a99586	sstable: Rename create_links_common()'s "dir" argument The whole method is going to move onto newly introduced filesystem_storage that already has field of the same name onboard. To avoid confusion, rename the argument to dst_dir. No functional changes, _just_ s/dir/dst_dir/g throughout the method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	4977c73163	sstable: Make mark_for_removal bool_class Its meaning is comment-documented anyway. Also, next patches will remove the create_links_and_mark_for_removal() so callers need some verbose meaning of this boolean in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:45 +03:00
Pavel Emelyanov	f53d6804a6	sstable, table: Add sstable::snapshot() and use in table::take_snapshot The replica/ code now "knows" that snapshotting an sstable means creating a bunch of hard-links on disk. Abstract that via sstable::snapshot() method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:44 +03:00
Pavel Emelyanov	2803dcda6d	sstable: Move _dir and _temp_dir on filesystem_storage Those two fields define the way sstable is stored as collection of on-disk files. First step towards making the storage access abstract is in moving the paths onto filesystem_storage embedded class. Both are made public for now, the rest of the code is patched to access them via _storage.<smth>. The rest of the set moves parts of sstable:: methods into the filesystem_storage, then marks the paths private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:44 +03:00
Pavel Emelyanov	17c8ba6034	sstable: Use sync_directory() method The sstable::write_toc() executes sync_directory() by hand. Better to use the method directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:44 +03:00
Pavel Emelyanov	e934f42402	test, sstable: Use component_basename in test One case gets full sstable datafile path to get the basename from it. There's already the basename helper on the class sstable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:44 +03:00
Pavel Emelyanov	376915d406	sstables: Move read_{digest\|checksum} on sstable These methods access sstables as files on disk, in order to hide the "path on filesystem" meaning of sstables::filename() the whole method should be made sstable:: one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:13:44 +03:00
Pavel Emelyanov	d561495f0d	Merge 'topology: get rid of pending state' from Benny Halevy Now, with `a44ca06906`, is_normal_token_owner that replaced is_member does not rely anymore on the pending status of endpoints in topology. With that we can get rid of this state and just keep all endpoints we know about in the topology. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12294 * github.com:scylladb/scylladb: topology: get rid of pending state topology: debug log update and remove endpoint	2022-12-14 19:28:35 +03:00
Benny Halevy	bdb6550305	view: row_locker: add latency_stats_tracker Refactor the existing stats tracking and updating code into struct latency_stats_tracker and while at it, count lock_acquisitions only on success. Decrement operations_currently_waiting_for_lock in the destructor so it's always balanced with the unconditional increment in the ctor. As for updating estimated_waiting_for_lock, it is always updated in the dtor, both on success and failure since the wait for the lock happened, whether waiting timed out or not. Fixes #12190 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12225	2022-12-14 17:37:22 +02:00
Avi Kivity	9ee78975b7	Merge 'Fix topology mismatch on read-repair handler creation' from Pavel Emelyanov The schedule_repair() receives a bunch of endpoint:mutations pairs and tries to create handlers for those. When creating the handlers it re-obtains topology from schema->ks->effective_replication_map chain, but this new topology can be outdated as compared to the list of endpoints at hand. The fix is to carry the e.r.m. pointer used by read executor reconciliation all the way down to repair handlers creation. This requires some manipulations with mutate_internal() and mutate_prepare() argument lists. fixes: #12050 (it was the same problem) Closes #12256 * github.com:scylladb/scylladb: proxy: Carry replication map with repair mutation(s) proxy: Wrap read repair entries into read_repair_mutation proxy: Turn ref to forwardable ref in mutations iterator	2022-12-14 17:33:43 +02:00
Tomasz Grabiec	23e4c83155	position_in_partition: Make after_key() work with non-full keys This fixes a long standing bug related to handling of non-full clustering keys, issue #1446. after_key() was creating a position which is after all keys prefixed by a non-full key, rather than a position which is right after that key. This issue will be caught by cql_query_test::test_compact_storage in debug mode when mutation_partition_v2 merging starts inserting sentinels at position after_key() on preemption. It probably already causes problems for such keys.	2022-12-14 14:47:33 +01:00
Botond Dénes	16c50bed5e	Merge 'sstables: coroutinize update_info_for_opened_data' from Avi Kivity A complicated function (in continuation style) that benefits from this simplification. Closes #12289 * github.com:scylladb/scylladb: sstables: update_info_for_opened_data: reindent sstables: update_info_for_opened_data: coroutinize	2022-12-14 15:12:22 +02:00
Nadav Har'El	92d03be37b	materialized view: fix bug in some large modifications to base partitions Sometimes a single modification to a base partition requires updates to a large number of view rows. A common example is deletion of a base partition containing many rows. A large BATCH is also possible. To avoid large allocations, we split the large amount of work into batches of 100 (max_rows_for_view_updates) rows each. The existing code assumed an empty result from one of these batches meant that we are done. But this assumption was incorrect: There are several cases when a base-table update may not need a view update to be generated (see can_skip_view_updates()) so if all 100 rows in a batch were skipped, the view update stopped prematurely. This patch includes two tests showing when this bug can happen - one test using a partition deletion with a USING TIMESTAMP causing the deletion to not affect the first 100 rows, and a second test using a specially-crafted large BATCH. These use cases are fairly esoteric, but in fact hit a user in the wild, which led to the discovery of this bug. The fix is fairly simple: To detect when build_some() is done it is no longer enough to check if it returned zero view-update rows; Rather, it explicitly returns whether or not it is done as an std::optional. The patch includes several tests for this bug, which pass on Cassandra, failed on Scylla before this patch, and pass with this patch. Fixes #12297. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12305	2022-12-14 14:50:38 +02:00
Botond Dénes	e7d8855675	Merge 'Revert accidental submodule updates' from Benny Halevy The abseil and tools/java submodules were accidentally updated in `71bc12eecc` (merged to master in `51f867339e`) This series reverts those changes. Closes #12311 * github.com:scylladb/scylladb: Revert accidental update of tools/java submodule Revert accidental update of abseil submodule	2022-12-14 13:20:08 +02:00
Benny Halevy	865193f99a	Revert accidental update of tools/java submodule The tools/java submodule was accidentally updated in `71bc12eecc` Revert this change. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-14 13:06:30 +02:00
Benny Halevy	9911ba195b	Revert accidental update of abseil submodule The abseil module was accidentally updated in `71bc12eecc` Revert this change. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-14 13:05:04 +02:00
Pavel Emelyanov	ab8fc0e166	proxy: Carry replication map with repair mutation(s) The create_write_response_handler() for read repair needs the e.r.m. from the caller, because it effectively accepts list of endpoints from it. So this patch equips all read_repair_mutation-s with the e.r.m. pointer so that the handler creation can use it. It's the same for all mutations, so it's a waste of space, but it's not bad -- there's typically few mutations in this range and the entry passed there is temporary, so even lots of them won't occupy lots of memory for long. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-14 14:03:39 +03:00
Pavel Emelyanov	140f373e15	proxy: Wrap read repair entries into read_repair_mutation The schedule_repair() operates on a map of endpoint:mutations pairs. Next patch will need to extend this entry and it's going to be easier if the entry is wrapped in a helper structure in advance. This is where the forwardable reference cursor from the previous patch gets its user. The schedule_repair() produces a range of rvalue wrappers, but the create_write_response_handler accepting it is OK, it copies mutations anyway. The printing operator is added to facilitate mutations logging from mutate_internal() method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-14 14:01:12 +03:00
Pavel Emelyanov	014b563ef1	proxy: Turn ref to forwardable ref in mutations iterator The mutate_prepare() is iterating over range of mutation with 'auto&' cursor thus accepting only lvalues. This is very restrictive, the caller of mutate_prepare() may as well provide rvalues if the target create_write_response_handler() or lambda accepts it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-14 14:00:10 +03:00
Avi Kivity	3fa230fee4	Merge 'cql3: expr: make it possible to prepare and evaluate conjunctions' from Jan Ciołek This PR implements two things: * Getting the value of a conjunction of elements separated by `AND` using `expr::evaluate` * Preparing conjunctions using `prepare_expression` --- `NULL` is treated as an "unknown value" - maybe `true` maybe `false`. `TRUE AND NULL` evaluates to `NULL` because it might be `true` but also might be `false`. `FALSE AND NULL` evaluates to `FALSE` because no matter what value `NULL` acts as, the result will still be `FALSE`. Unset and empty values are not allowed. Usually in CQL the rule is that when `NULL` occurs in an operation the whole expression becomes `NULL`, but here we decided to deviate from this behavior. Treating `NULL` as an "unknown value" is the standard SQL way of handling `NULLs` in conjunctions. It works this way in MySQL and Postgres so we do it this way as well. The evaluation short-circuits. Once `FALSE` is encountered the function returns `FALSE` immediately without evaluating any further elements. It works this way in Postgres as well, for example: `SELECT true AND NULL AND 1/0 = 0` will throw a division by zero error, but `SELECT false AND 1/0 = 0` will successfully evaluate to `FALSE`. Closes #12300 * github.com:scylladb/scylladb: expr_test: add unit tests for prepare_expression(conjunction) cql3: expr: make it possible to prepare conjunctions expr_test: add tests for evaluate(conjunction) cql3: expr: make it possible to evaluate conjunctions	2022-12-14 09:48:26 +02:00
Botond Dénes	122b267478	Merge 'repair: coroutinize to_repair_rows_list' from Avi Kivity Simplify a somewhat complicated function. Closes #12290 * github.com:scylladb/scylladb: repair: to_repair_rows_list: reindent repair: to_repair_rows_list: coroutinize	2022-12-14 09:39:47 +02:00
Avi Kivity	c09583bcef	storage_proxy: coroutinize send_truncate_blocking Not particularly important, but a small simplification. Closes #12288	2022-12-14 09:39:33 +02:00
Tomasz Grabiec	132d5d4fa1	messaging: Shutdown on stop() if it wasn't shut down earlier All rpc::client objects have to be stopped before they are destroyed. Currently this is done in messaging_service::shutdown(). The cql_test_env does not call shutdown() currently. This can lead to use-after-free on the rpc::client object, manifesting like this: Segmentation fault on shard 0. Backtrace: column_mapping::~column_mapping() at schema.cc:? db::cql_table_large_data_handler::internal_record_large_cells(sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long) const at ./db/large_data_handler.cc:180 operator() at ./db/large_data_handler.cc:123 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long>(std::__invoke_other, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const&&, column_definition const&, unsigned long&&, unsigned long&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<seastar::future<void>, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, 
utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long>, seastar::future<void> >::type std::__invoke_r<seastar::future<void>, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long>(db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const&&, column_definition const&, unsigned long&&, unsigned long&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:114 (inlined by) std::_Function_handler<seastar::future<void> (sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long), db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1>::_M_invoke(std::_Any_data const&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const&&, column_definition const&, unsigned long&&, unsigned long&&) at 
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:290 std::function<seastar::future<void> (sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long)>::operator()(sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:591 (inlined by) db::cql_table_large_data_handler::record_large_cells(sstables::sstable const&, sstables::key const&, clustering_key_prefix const, column_definition const&, unsigned long, unsigned long) const at ./db/large_data_handler.cc:175 seastar::rpc::log_exception(seastar::rpc::connection&, seastar::log_level, char const, std::__exception_ptr::exception_ptr) at ./build/release/seastar/./seastar/src/rpc/rpc.cc:109 operator() at ./build/release/seastar/./seastar/src/rpc/rpc.cc:788 operator() at ./build/release/seastar/./seastar/include/seastar/core/future.hh:1682 (inlined by) void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14>(seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, 
seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14>(seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}&&) at ./build/release/seastar/./seastar/include/seastar/core/future.hh:2134 (inlined by) operator() at ./build/release/seastar/./seastar/include/seastar/core/future.hh:1681 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address 
const&, seastar::socket_address const&)::$_14>(seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/future.hh:781 seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2319 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2756 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2925 seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2808 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:265 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:156 operator() at ./build/release/seastar/./seastar/src/testing/test_runner.cc:75 (inlined by) void std::__invoke_impl<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>(std::__invoke_other, seastar::testing::test_runner::start_thread(int, char)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>, void>::type std::__invoke_r<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>(seastar::testing::test_runner::start_thread(int, char)::$_0&) at 
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:111 (inlined by) std::_Function_handler<void (), seastar::testing::test_runner::start_thread(int, char)::$_0>::_M_invoke(std::_Any_data const&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:290 std::function<void ()>::operator()() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:591 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:73 Fix by making sure that shutdown() is called prior to destruction. Fixes #12244 Closes #12276	2022-12-14 10:28:26 +03:00
Tzach Livyatan	7cd613fc08	Docs: Improve wording on the os-supported page v2 Closes #11871	2022-12-14 08:59:26 +02:00
Botond Dénes	31fcfe62e1	Merge 'doc: add the description of AzureSnitch to the documentation' from Anna Stuchlik Fixes https://github.com/scylladb/scylladb/issues/11712 Updates added with this PR: - Added a new section with the description of AzureSnitch (similar to others + examples and language improvements). - Fixed the headings so that they render properly. - Replaced "Scylla" with "ScyllaDB". Closes #12254 * github.com:scylladb/scylladb: docs: replace Scylla with ScyllaDB on the Snitches page docs: fix the headings on the Snitches page doc: add the description of AzureSnitch to the documentation	2022-12-14 08:58:48 +02:00
Lubos Kosco	3f9dca9c60	doc: print out the generated UUID for sending to support Closes #12176	2022-12-14 08:57:54 +02:00
guy9	a329fcd566	Updated University monitoring lesson link Closes #11906	2022-12-14 08:50:26 +02:00
Jan Ciolek	9afa9f0e50	expr_test: add unit tests for prepare_expression(conjunction) Add unit tests which ensure that preparing conjunctions works as expected. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-12-13 20:23:17 +01:00
Jan Ciolek	dde86a2da6	cql3: expr: make it possible to prepare conjunctions prepare_expression used to throw an error when encountering a conjunction. Now it's possible to use prepare_expression to prepare an expression that contains conjunctions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-12-13 20:23:17 +01:00
Jan Ciolek	5f5b1c4701	expr_test: add tests for evaluate(conjunction) Add unit tests which ensure that evaluating a conjunction behaves as expected. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-12-13 20:23:17 +01:00
Jan Ciolek	b3c16f6bc8	cql3: expr: make it possible to evaluate conjunctions Previously it was impossible to use expr::evaluate() to get the value of a conjunction of elements separated by ANDs. Now it has been implemented. NULL is treated as an "unknown value" - maybe true maybe false. `TRUE AND NULL` evaluates to NULL because it might be true but also might be false. `FALSE AND NULL` evaluates to FALSE because no matter what value NULL acts as, the result will still be FALSE. Unset and empty values are not allowed. Usually in CQL the rule is that when NULL occurs in an operation the whole expression becomes NULL, but here we decided to deviate from this behavior. Treating NULL as an "unknown value" is the standard SQL way of handling NULLs in conjunctions. It works this way in MySQL and Postgres so we do it this way as well. The evaluation short-circuits. Once FALSE is encountered the function returns FALSE immediately without evaluating any further elements. It works this way in Postgres as well, for example: `SELECT true AND NULL AND 1/0 = 0` will throw a division by zero error but `SELECT false AND 1/0 = 0` will successfully evaluate to FALSE. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-12-13 20:23:08 +01:00
Benny Halevy	e9e66f3ca7	database: drop_table_on_all_shards: limit truncated_at time The infinitely high time_point of `db_clock::time_point::max()` used in `ba42852b0e` is too high for some clients that can't represent that as a date_time string. Instead, limit it to 9999-12-31T00:00:00+0000, that is practically sufficient to ensure truncation of all sstables and should be within the clients' limits. Fixes #12239 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12273	2022-12-13 16:46:20 +02:00
Avi Kivity	919888fe60	Merge 'docs/dev: Add backport instructions for contributors' from Jan Ciołek Add instructions on how to backport a feature to an older version of Scylla. It contains a detailed step-by-step instruction so that people unfamiliar with intricacies of Scylla's repository organization can easily get the hang of it. This is the guide I wish I had when I had to do my first backport. I put it in backport.md because that looks like the file responsible for this sort of information. For a moment I thought about `CONTRIBUTING.md`, but this is a really short file with general information, so it doesn't really fit there. Maybe in the future there will be some sort of unification (see #12126) Closes #12138 * github.com:scylladb/scylladb: dev/docs: add additional git pull to backport docs docs/dev: add a note about cherry-picking individual commits docs/dev: use 'is merged into' instead of 'becomes' docs/dev: mention that new backport instructions are for the contributor docs/dev: Add backport instructions for contributors	2022-12-13 16:27:04 +02:00
Pavel Emelyanov	fe4cf231bc	snitch: Check http response codes to be OK Several snitch drivers make http requests to get region/dc/zone/rack/whatever from the cloud provider. They blindly rely on the response being successful and read the response body to parse the data they need from it. That's not nice, add checks that requests finish with http OK statuses. refs: #12185 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12287	2022-12-13 14:49:18 +02:00
Benny Halevy	68141d0aac	topology: get rid of pending state Now, with `a44ca06906`, is_normal_token_owner that replaced is_member does not rely anymore on the pending status of endpoints in topology. With that we can get rid of this state and just keep all endpoints we know about in the topology. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-13 14:17:18 +02:00
Benny Halevy	f2753eba30	topology: debug log update and remove endpoint Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-13 14:17:13 +02:00
Avi Kivity	c7cee0da40	Merge 'storage_service: handle_state_normal: always update_topology before update_normal_tokens' from Benny Halevy update_normal_tokens checks that the endpoint is in topology. Currently we call update_topology on this path only if it's not a normal_token_owner, but there are paths when the endpoint could be a normal token owner but still be pending in topology so always update it, just in case. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12080 * github.com:scylladb/scylladb: storage_service: handle_state_normal: always update_topology before update_normal_tokens storage_service: handle_state_normal: delete outdated comment regarding update pending ranges race	2022-12-13 13:41:10 +02:00
Avi Kivity	75e469193b	Merge 'Use Host ID as Raft ID' from Kamil Braun Thanks to #12250, Host IDs uniquely identify nodes. We can use them as Raft IDs which simplifies the code and makes reasoning about it easier, because Host IDs are always guaranteed to be present (while Raft IDs may be missing during upgrade). Fixes: https://github.com/scylladb/scylladb/issues/12204 Closes #12275 * github.com:scylladb/scylladb: service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0` gms, service: stop gossiping and storing RAFT_SERVER_ID Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round" service: use HOST_ID instead of RAFT_SERVER_ID during replace service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map main: use Host ID as Raft ID	2022-12-13 13:39:41 +02:00
Andrii Patsula	cd2e786d72	Report a warning when a server's IP cannot be found in ping. Fixes #12156 Closes #12206	2022-12-13 11:18:59 +01:00
Botond Dénes	51f867339e	Merge 'Docs: cleanup add-node-to-cluster' from Benny Halevy This series improves the add-node-to-cluster document, in particular around the documentation for the associated cleanup procedure, and the prerequisite steps. It also removes information about outdated releases. Closes #12210 * github.com:scylladb/scylladb: docs: operating-scylla: add-node-to-cluster: deleted instructions for unsupported releases docs: operating-scylla: add-node-to-cluster: cleanup: move tips to a note docs: operating-scylla: add-node-to-cluster: improve wording of cleanup instructions docs: operating-scylla: prerequisites: system_auth is a keyspace, not a table docs: operating-scylla: prerequisites: no Authentication status is gathered docs: operating-scylla: prerequisites: simplify grep commands docs: operating-scylla: add-node-to-cluster: prerequisites: number sub-sections docs: operating-scylla: add-node-to-cluster: describe other nodes in plural	2022-12-13 10:54:05 +02:00
Botond Dénes	4122854ae7	Merge 'repair: coroutinize repair_range' from Avi Kivity Nicer and simpler, but essentially cosmetic. Closes #12235 * github.com:scylladb/scylladb: repair: reindent repair_range repair: coroutinize repair_range	2022-12-13 08:16:05 +02:00
Avi Kivity	96890d4120	repair: to_repair_rows_list: reindent	2022-12-12 22:54:07 +02:00
Avi Kivity	e482cb1764	repair: to_repair_rows_list: coroutinize Simplifying a complicated function. It will also be a little faster due to fewer allocations, but not significantly.	2022-12-12 22:52:12 +02:00
Avi Kivity	c728de8533	sstables: update_info_for_opened_data: reindent Recover much-needed indent levels for future use.	2022-12-12 22:38:07 +02:00
Avi Kivity	eace9a226c	sstables: update_info_for_opened_data: coroutinize Nothing special, just simplifying a complicated function.	2022-12-12 22:35:46 +02:00
Michał Jadwiszczak	5985f22841	version: Reverse version increase Revert version change made by PR #11106, which increased it to `4.0.0` to enable server-side describe on latest cqlsh. Turns out that our tooling depends on it in some way (e.g. `sstableloader`) and it breaks dtests. Reverting only the version allows us to leave the describe code unchanged and it fixes the dtests. cqlsh 6.0.0 will return a warning when running `DESC ...` commands. Closes #12272	2022-12-12 18:45:32 +02:00
Kamil Braun	a26f62b37b	service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0` We no longer need to translate from IP to Raft ID using the address map, because Raft ID is now equal to the Host ID - which is always available at the call site of `remove_from_group0`.	2022-12-12 15:23:05 +01:00
Kamil Braun	bf6679906f	gms, service: stop gossiping and storing RAFT_SERVER_ID It is equal to (if present) HOST_ID and no longer used for anything. The application state was only gossiped if `experimental-features` contained `raft`, so we can free this slot. Similarly, `raft_server_id`s were only persisted in `system.peers` if the `SUPPORTS_RAFT` cluster feature was enabled, which happened only when `experimental-features` contained `raft`. The `raft_server_id` field in the schema was also introduced recently in `master` and didn't get to be in a release yet. Given either of these reasons, we can remove this field safely.	2022-12-12 15:20:30 +01:00
Kamil Braun	5dbe236339	Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round" This reverts commit `60217d7f50`. We no longer need RAFT_SERVER_ID.	2022-12-12 15:20:20 +01:00
Kamil Braun	3e58da0719	service: use HOST_ID instead of RAFT_SERVER_ID during replace Makes the code simpler because we can assume that HOST_ID is always there.	2022-12-12 15:18:56 +01:00
Kamil Braun	32c56920b4	service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map With the earlier commit, if gossiped RAFT_SERVER_ID is not empty then it's the same as HOST_ID.	2022-12-12 15:16:56 +01:00
Calle Wilund	e99626dc10	config: Change wording of "none" in encryption options to maybe reduce user confusion Fixes /scylladb/scylla-enterprise/issues#1262 Changes the somewhat ambiguous "none" into "not set" to clarify that "none" is not an option to be written out, but an absence of a choice (in which case you also have made a choice). Closes #12270	2022-12-12 16:14:53 +02:00
Kamil Braun	f3243ff674	main: use Host ID as Raft ID The Host ID now uniquely identifies a node (we no longer steal it during node replace) and Raft is still experimental. We can reuse the Host ID of a node as its Raft ID. This will allow us to remove and simplify a lot of code. With this we can already remove some dead code in this commit.	2022-12-12 15:14:51 +01:00
Botond Dénes	d44c5f5548	scripts: add open-coredump.sh Script for "one-click" opening of coredumps. It extracts the build-id from the coredump, retrieves metadata for that build, downloads the binary package, the source code and finally launches the dbuild container, with everything ready to load the coredump. The script is idempotent: running it after the preparatory steps will re-use what is already downloaded. The script is not trying to provide a debugging environment that caters to all the different ways and preferences of debugging. Instead, it just sets up a minimalistic environment for debugging, while providing opportunities for the user to customize it according to their preferences. I'm not entirely sure coredumps from the master branch will work, but we can address this later when we confirm they don't. Example: $ ~/ScyllaDB/scylla/worktree0/scripts/open-coredump.sh ./core.scylla.113.bac3650b616f4f09a4d1ab160574b6a5.4349.1669185225000000000000 Build id: 5009658b834aaf68970135bfc84f964b66ea4dee Matching build is scylla-5.0.5 0.20221009.5a97a1060 release-x86_64 Downloading relocatable package from http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz Extracting package scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz Cloning scylla.git Downloading scylla-gdb.py Copying scylla-gdb.py from /home/bdenes/ScyllaDB/storage/11961/open-coredump.sh.dir/scylla.repo Launching dbuild container. To examine the coredump with gdb: $ gdb -x scylla-gdb.py -ex 'set directories /src/scylla' --core ./core.scylla.113.bac3650b616f4f09a4d1ab160574b6a5.4349.1669185225000000000000 /opt/scylladb/libexec/scylla See https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md for more information on how to debug scylla. Good luck! [root@fedora workdir]# Closes #12223	2022-12-12 12:55:28 +02:00
Kamil Braun	dcba652013	Merge 'replacenode: do not inherit host_id' from Benny Halevy We want to always be able to distinguish between the replacing node and the replacee by using different, unique, host identifiers. This will allow us to use the host_id authoritatively to identify the node (rather than its endpoint ip address) for token mapping and node operations. Also, it will be used in the following patch to never allow the replaced node to rejoin the cluster, as its host_id should never be reused. This change does not affect #5523, the replaced node may still steal back its tokens if restarted. Refs #9839 Refs #12040 Closes #12250 * github.com:scylladb/scylladb: docs: replace-dead-node: update host_id of replacing node docs: replace-dead-node: fix alignment db: system_keyspace: change set_local_host_id to private set_local_random_host_id storage_service: do not inherit the host_id of a replaced node	2022-12-12 11:00:42 +01:00
Benny Halevy	c6f05b30e1	task_manager: task: impl: add virtual destructor The generic task holds and destroys a task::impl but we want the derived class's destructor to be called when the task is destroyed otherwise, for example, members like the abort_source subscription will not be destroyed (and auto-unlinked). Fixes #12183 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12266	2022-12-11 22:10:59 +02:00
Benny Halevy	36a9f62833	repair: repair_module: use mutable capture for func It is moved into the async thread so the encapsulating function should be defined mutable to move the func rather than copying it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12267	2022-12-11 22:10:28 +02:00
Nadav Har'El	0c26032e70	test/cql-pytest: translate more Cassandra tests This patch includes a translation of two more test files from Cassandra's CQL unit test directory cql3/validation/operations. All tests included here pass on Cassandra. Several tests fail on Scylla and are marked "xfail". These failures discovered two previously-unknown bugs: #12243: Setting USING TTL of "null" should be allowed #12247: Better error reporting for oversized keys during INSERT And also added reproducers for two previously-known bugs: #3882: Support "ALTER TABLE DROP COMPACT STORAGE" #6447: TTL unexpected behavior when setting to 0 on a table with default_time_to_live Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12248	2022-12-11 21:42:57 +02:00
Nadav Har'El	09a3c63345	cross-tree: allow std::source_location in clang 14 We recently (commit `6a5d9ff261`) started to use std::source_location instead of std::experimental::source_location. However, this does not work on clang 14, because libc++ 12's <source_location> only works if __builtin_source_location is available, and it is not available on clang 14. clang 15 is just three months old, and several relatively-recent distributions still carry clang 14 so it would be nice to support it as well. So this patch adds a trivial compatibility header file, which, when included and compiled with clang 14, aliases the functional std::experimental::source_location to std::source_location. It turns out it's enough to include the new header file from three headers that included <source_location> - I guess all other uses of source_location depend on those header files directly or indirectly. We may later need to include the compatibility header file in additional places, but for now we don't. Refs #12259 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12265	2022-12-11 20:28:49 +02:00
Avi Kivity	e6ffc22053	Merge 'cql3: Server-side DESC statement' from Michał Jadwiszczak This PR adds server-side `DESCRIBE` statement, which is required in latest cqlsh version. The only change from the user perspective is the `DESC ...` statement can be used with cqlsh version >= 6.0. Previously the statement was executed from client side, but starting with Cassandra 4.0 and cqlsh 6.0, execution of describe was moved to server side, so the user was unable to do `DESC ...` with Scylla and cqlsh 6.0. Implemented describe statements: - `DESC CLUSTER` - `DESC [FULL] SCHEMA` - `DESC [ONLY] KEYSPACE` - `DESC KEYSPACES/TYPES/FUNCTIONS/AGGREGATES/TABLES` - `DESC TYPE/FUNCTION/AGGREGATE/MATERIALIZED VIEW/INDEX/TABLE` - `DESC` [Cassandra's implementation for reference](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/DescribeStatement.java) Changes in this patch: - cql3::util: added `single_quote()` function - added `data_dictionary::keyspace_element` interface - implemented `data_dictionary::keyspace_element` for: - keyspace_metadata, - UDT, UDF, UDA - schema - cql3::functions: added `get_user_functions()` and `get_user_aggregates()` to get all UDFs/UDAs in specified keyspace - data_dictionary::user_types_metadata: added `has_type()` function - extracted `describe_ring()` from storage_service to standalone helper function in `locator/util.hh` - storage_proxy: added `describe_ring()` (implemented using helper function mentioned above) - extended CQL grammar to handle describe statement - increased version in `version.hh` to 4.0.0, so cqlsh will use server-side describe statement Referring: https://github.com/scylladb/scylla/issues/9571, https://github.com/scylladb/scylladb/issues/11475 Closes #11106 * github.com:scylladb/scylladb: version: Increasing version cql-pytest: Add tests for server-side describe statement cql-pytest: creating random elements for describe's tests cql3: Extend CQL grammar with server-side describe statement 
cql3:statements: server-side describe statement data_dictonary: add `get_all_keyspaces()` and `get_user_keyspaces()` storage_proxy: add `describe_ring()` method storage_service, locator: extract describe_ring() data_dictionary:user_types_metadata: add has_type() function cql3:functions: `get_user_functions()` and `get_user_aggregates()` implement `keyspace_element` interface data_dictionary: add `keyspace_element` interface cql3: single_quote() util function view: row_lock: lock_ck: reindent test/topology: enable replace tests service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0` service: handle replace correctly with Raft enabled gms/gossiper: fetch RAFT_SERVER_ID during shadow round service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace	2022-12-11 18:29:36 +02:00
Michał Jadwiszczak	8d88c9721e	version: Increasing version The `current()` version in version.hh has to be increased to at least 4.0.0, so server-side describe will be used. Otherwise, cqlsh returns warning that client-side describe is not supported.	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	3ddde7c5ad	cql-pytest: Add tests for server-side describe statement	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	f91d05df43	cql-pytest: creating random elements for describe's tests Add helper functions to create random elements (keyspaces, tables, types) to increase the coverage of describe statement's tests. This commit also adds `random_seed` fixture. The fixture should be always used when using random functions. In case of test's failure, the seed will be present in test's signature and the case can be easily recreated. After the test finishes, the fixture restores state of `random` to before-test state.	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	c563b2133c	cql3: Extend CQL grammar with server-side describe statement	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	e572d5f111	cql3:statements: server-side describe statement Starting from cqlsh 6.0.0, execution of the describe statement was moved from the client to the server. This patch implements server-side describe statement. It's done by simply fetching all needed keyspace elements (keyspace/table/index/view/UDT/UDF/UDA) and generating the desired description or list of names of all elements. The description of any element has to respect CQL restrictions (like name's quoting) to allow quickly recreating the schema by simply copy-pasting the description.	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	673393d88a	data_dictonary: add `get_all_keyspaces()` and `get_user_keyspaces()` Adds functions to `data_dictionary::database` in order to obtain names of all keyspaces/all user keyspaces.	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	360dbf98f1	storage_proxy: add `describe_ring()` method In order to execute `DESC CLUSTER`, there has to be a way to describe ring. `storage_service` is not available at query execution. This patch adds `describe_ring()` as a method of `storage_proxy()` (using helper function from `locator/util.hh`).	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	dd46a92e23	storage_service, locator: extract describe_ring() `describe_ring()` was implemented as a method of `storage_service`. This patch extracts it from there to a standalone helper function in `locator/util.hh`.	2022-12-10 12:51:05 +01:00
Michał Jadwiszczak	51a02e3bd7	data_dictionary:user_types_metadata: add has_type() function Adds `has_type()` function to `user_types_metadata`. The function determines whether UDT with given name exists.	2022-12-10 12:50:52 +01:00
Michał Jadwiszczak	06cd03d3cd	cql3:functions: `get_user_functions()` and `get_user_aggregates()` Helper functions to obtain UDFs/UDAs for certain keyspace.	2022-12-10 12:36:59 +01:00
Michał Jadwiszczak	29ad5a08a8	implement `keyspace_element` interface This patch implements `data_dictionary::keyspace_element` interface in: `keyspace_metadata`, `user_type_impl`, `user_function`, `user_aggregate` and schema.	2022-12-10 12:34:09 +01:00
Michał Jadwiszczak	f30378819d	data_dictionary: add `keyspace_element` interface A common interface for all keyspace elements, which are: keyspace, UDT, UDF, UDA, tables, views, indexes. The interface is to have a unified way to describe those elements.	2022-12-10 12:27:38 +01:00
Michał Jadwiszczak	0589116991	cql3: single_quote() util function `single_quote()` takes a string and transforms it to a string which can be safely used in CQL commands. Single quoting involves wrapping the name in single-quotes ('). A single-quote character itself is quoted by doubling it. Single quoting is necessary for dates, IP addresses or string literals.	2022-12-10 12:27:22 +01:00
Benny Halevy	9c2a5a755f	view: row_lock: lock_ck: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-10 12:27:22 +01:00
Kamil Braun	c43e64946a	test/topology: enable replace tests Also add some TODOs for enhancing existing tests.	2022-12-10 12:27:22 +01:00
Kamil Braun	b01cba8206	service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0` Also simplify the code and improve logging in general. The previous code did this: search for the ID in the address map. If it couldn't be found, perform a read barrier and search again. If it again couldn't be found, return. This algorithm depended on the fact that IP addresses were stored in group 0 configuration. The read barrier was used to obtain the most recent configuration, and if the IP was not a part of address map after the read barrier, that meant it's simply not a member of group 0. This logic no longer applies so we can simplify the code. Furthermore, when I was fixing the replace operation with Raft enabled, at some point I had a "working" solution with all tests passing. But I was suspicious and checked if the replaced node got removed from group 0. It wasn't. So the replace finished "successfully", but we had an additional (voting!) member of group 0 which didn't correspond to a token ring member. The last version of my fixes ensures that the node gets removed by the replacing node. But the system is fragile and nothing prevents us from breaking this again. At least log an error for now. Regression tests will be added later.	2022-12-10 12:27:22 +01:00
Kamil Braun	c65f4ae875	service: handle replace correctly with Raft enabled We must place the Raft ID obtained during the shadow round in the address map. It won't be placed by the regular gossiping route if we're replacing using the same IP, because we override the application state of the replaced node. Even if we replace a node with a different IP, it is not guaranteed that background gossiping manages to update the address map before we need it, especially in tests where we set ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round, on the other hand, performs a synchronous request (and if it fails during bootstrap, bootstrap will fail - because we also won't be able to obtain the tokens and Host ID of the replaced node). Fetch the Raft ID of the replaced node in `prepare_replacement_info`, which runs the shadow round. Return it in `replacement_info`. Then `join_token_ring` passes it to `setup_group0`, which stores it in the address map. It does that after `join_group0` so the entry is non-expiring (the replaced node is a member of group 0). Later in the replace procedure, we call `remove_from_group0` for the replaced node. `remove_from_group0` will be able to reverse-translate the IP of the replaced node to its Raft ID using the address map.	2022-12-10 12:27:22 +01:00
Kamil Braun	60217d7f50	gms/gossiper: fetch RAFT_SERVER_ID during shadow round During the replace operation we need the Raft ID of the replaced node. The shadow round is used for fetching all necessary information before the replace operation starts.	2022-12-10 12:27:22 +01:00
Kamil Braun	b424cc40fa	service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace Most of the sleeps related to gossiping are based on `ring_delay`, which is configurable and can be set to lower value e.g. during tests. But for some reason there was one case where we slept for a hardcoded value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds. Use `2 * get_ring_delay()` instead. With the default value of `ring_delay` (30 seconds) this will give the same behavior.	2022-12-10 12:27:22 +01:00
Anna Stuchlik	8d1050e834	docs: replace Scylla with ScyllaDB on the Snitches page	2022-12-09 13:34:18 +01:00
Anna Stuchlik	5cb191d5b0	docs: fix the headings on the Snitches page	2022-12-09 13:26:36 +01:00
Anna Stuchlik	a699904374	doc: add the description of AzureSnitch to the documentation	2022-12-09 13:22:01 +01:00
Nadav Har'El	e47794ed98	test/cql-pytest: regression test for index scan with start token When we have a table with partition key p and an indexed regular column v, the test included in this patch checks the query SELECT p FROM table WHERE v = 1 AND TOKEN(p) > 17 This can work and not require ALLOW FILTERING, because the secondary index posting-list of "v=1" is ordered in p's token order (to allow SELECT with and without an index to return the same order - this is explained in issue #7443). So this test should pass, and indeed it does on both current Scylla, and Cassandra. However, it turns out that this was a bug - issue #7043 - in older versions of Scylla, and only fixed in Scylla 4.6. In older versions, the SELECT wasn't accepted, claiming it requires ALLOW FILTERING, and if ALLOW FILTERING was added, the TOKEN(p) > 17 part was silently ignored. The fix for issue #7043 actually included regression tests, C++ tests in test/boost/secondary_index_test.cc. But in this patch we also add a Python test in test/cql-pytest. One of the benefits of cql-pytest is that we can (and I did) run the same test on Cassandra to verify we're not implementing a wrong feature. Another benefit is that we can run a new test on an old version, and not even require re-compilation: You can run this new test on any existing installation of Scylla to check if it still has issue #7043. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12237	2022-12-09 09:33:16 +02:00
Benny Halevy	018dedcc0c	docs: replace-dead-node: update host_id of replacing node The replacing node no longer assumes the host_id of the replacee. It will continue to use a random, unique host_id. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-09 08:23:31 +02:00
Benny Halevy	37d75e5a21	docs: replace-dead-node: fix alignment	2022-12-09 08:23:31 +02:00
Benny Halevy	89920d47d6	db: system_keyspace: change set_local_host_id to private set_local_random_host_id Now that the local host_id is never changed externally (by the storage_service upon replace-node), the method can be made private and be used only for initializing the local host_id to a random one. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-09 08:23:31 +02:00
Benny Halevy	9942c60d93	storage_service: do not inherit the host_id of a replaced a node We want to always be able to distinguish between the replacing node and the replacee by using different, unique, host identifiers. This will allow us to use the host_id authoritatively to identify the node (rather than its endpoint ip address) for token mapping and node operations. Also, it will be used in the following patch to never allow the replaced node to rejoin the cluster, as its host_id should never be reused. Refs #9839 Refs #12040 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-09 08:23:31 +02:00
Pavel Emelyanov	7197757750	broadcast_tables: Forward-declare storage_proxy in lang.hh Currently the header includes storage_proxy.hh and spreads this over the code via raft_group0_client.hh -> group0_state_machine.hh -> lang.hh Forward-declaring the proxy class eliminates ~100 indirect dependencies on storage_proxy.hh via this chain. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12241	2022-12-09 01:23:51 +02:00
Pavel Emelyanov	6075e01312	test/lib: Remove sstable_utils.hh from simple_schema.hh The latter is pretty popular test/lib header that disseminates the former one over a whole lot of unit tests. The former, in turn, naturally includes sstables.hh thus making tons of unrelated tests depend on sstables class unused by them. However, simple removal doesn't work, because of local_shard_only bool class definition in sstable_utils.hh used in simple_schema.hh. This thing, in turn, is used in keys making helpers that don't belong to sstable utils, so these are moved into simple_schema as well. When done, this affects the mutation_source_test.hh, which needs the local_shard_only bool class (and helps spreading the sstables.hh throughout more unrelated tests) and a bunch of .cc test sources that used sstable_utils.hh to indirectly include various headers they need. After patching, sstables.hh touches 2x fewer tests. As a side effect, 2x fewer tests depend on sstables_manager.hh as well. Continuation of `9bdea110a6` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12240	2022-12-08 15:37:33 +02:00
Tomasz Grabiec	4e7ddb6309	position_in_partition: Introduce before_key(position_in_partition_view)	2022-12-08 13:41:28 +01:00
Tomasz Grabiec	536c0ab194	db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order trim_clustering_row_ranges_to() is broken for non-full keys in reverse mode. It will trim the range to position_in_partition_view::after_key(full_key) instead of position_in_partition_view::before_key(key), hence it will include the key in the resulting range rather than exclude it. Fixes #12180 Refs #1446	2022-12-08 13:41:28 +01:00
Tomasz Grabiec	232ce699ab	types: Fix comparison of frozen sets with empty values A frozen set can be part of the clustering key, and with compact storage, the corresponding key component can have an empty value. Comparison was not prepared for this, the iterator attempts to deserialize the item count and will fail if the value is empty. Fixes #12242	2022-12-08 13:41:11 +01:00
Nadav Har'El	4cdaba778d	Merge 'Secondary indexes on static columns' from Piotr Dulikowski This pull request introduces support for global secondary indexes based on static columns. Local secondary indexes based on static columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require full partition key when querying, such indexes wouldn't be very useful and would only waste resources. The index table for secondary indexes on static columns, unlike other secondary indexes, does not contain clustering keys from the base table. A static column's value determines a set of full partitions, so the clustering keys would only be unnecessary. The already existing logic for querying using secondary indexes works after introducing minimal notifications. The view update generation path now works on a common representation of static and clustering rows, but the new representation allowed to keep most of the logic intact. New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed. Materialized view with static rows as a key are __not__ implemented in this PR. 
Fixes: #2963 Closes #11166 * github.com:scylladb/scylladb: test_materialized_view: verify that static columns are not allowed test_secondary_index: add (currently failing) test for static index paging test_secondary_index: add more tests for secondary indexes on static columns cassandra_tests: enable existing tests for static columns create_index_statement: lift restriction on secondary indexes on static rows db/view: fetch and process static rows when building indexes gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature create_index_statement: disallow creation of local indexes with static columns select_statement: prepare paging for indexes on static columns select_statement: do not attempt to fetch clustering columns from secondary index's table secondary_index_manager: don't add clustering key columns to index table of static column index replica/table: adjust the view read-before-write to return static rows when needed db/view: process static rows in view_update_builder::on_results db/view: adjust existing view update generation path to use clustering_or_static_row column_computation: adjust to use clustering_or_static_row db/view: add clustering_or_static_row deletable_row: add column_kind parameter to is_live view_info: adjust view_column to accept column_kind db/view: base_dependent_view_info: split non-pk columns into regular and static	2022-12-08 09:54:05 +02:00
Konstantin Osipov	02c30ab5d6	build: fix link error (abseil) on ubuntu toolchain with clang 15 abseil::hash depends on abseil::city and declares CityHash32 as an external symbol. The city static library, however, precedes hash in the link list, which apparently makes the linker simply drop it from the object list, since its symbols are not used elsewhere. Fix the linker ordering to help linker see that CityHash32 is used. Closes #12231	2022-12-08 09:47:16 +02:00
Avi Kivity	d6457778f1	Merge 'Coroutinize some table functions in preparation to static compaction groups' from Raphael "Raph" Carvalho Extracted from https://github.com/scylladb/scylladb/pull/12139 Closes #12236 * github.com:scylladb/scylladb: replica: table: Fix indentation replica: coroutinize table::discard_sstables() replica: Coroutinize table::flush()	2022-12-08 09:29:58 +02:00
Piotr Dulikowski	4883e43677	test_materialized_view: verify that static columns are not allowed Adds a test which verifies that static columns are not allowed in materialized views. Although we added support for static columns in secondary indexes, which share a lot of code with materialized views, static columns in materialized views are not yet ready to use.	2022-12-08 07:41:33 +01:00
Piotr Dulikowski	f864944dcb	test_secondary_index: add (currently failing) test for static index paging Currently, when executing queries accelerated by an index on a static column, paging is unable to break base table partitions across pages and is forced to return them in whole. This will cause problems if such a query must return a very large base table partition because it will have to be loaded into memory. Fixing this issue will require a more sophisticated approach than what was done in the PR. For the time being, an xfailing pytest is added which should start passing after paging is improved.	2022-12-08 07:41:33 +01:00
Piotr Dulikowski	4f836115fd	test_secondary_index: add more tests for secondary indexes on static columns Adds cql-pytests which test the secondary index on static columns feature.	2022-12-08 07:41:32 +01:00
Botond Dénes	897b501ba3	Merge 'doc: update the 5.1 upgrade guide with the mode-related information' from Anna Stuchlik This PR adds the link to the KB article about updating the mode after the upgrade to the 5.1 upgrade guide. In addition, I have: - updated the KB article to include the versions affected by that change. - fixed the broken link to the page about metric updates (it is not related to the KB article, but I fixed it in the same PR to limit the number of PRs that need to be backported). Related: https://github.com/scylladb/scylladb/pull/11122 Closes #12148 * github.com:scylladb/scylladb: doc: update the releases in the KB about updating the mode after upgrade doc: fix the broken link in the 5.1 upgrade guide doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide	2022-12-08 07:32:10 +02:00
Tomasz Grabiec	992a73a861	row_cache: Destroy coroutine under region's allocator The reason is alloc-dealloc mismatch of position_in_partition objects allocated by cursors inside coroutine object stored in the update variable in row_cache::do_update() It is allocated under cache region, but in case of exception it will be destroyed under the standard allocator. If update is successful, it will be cleared under region allocator, so there is no problem in the normal case. Fixes #12068 Closes #12233	2022-12-07 21:44:21 +02:00
Raphael S. Carvalho	9ae0d8ba28	replica: table: Fix indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-07 15:53:22 -03:00
Raphael S. Carvalho	b9a33d5a91	replica: coroutinize table::discard_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-07 15:52:36 -03:00
Raphael S. Carvalho	192b64a5ac	replica: Coroutinize table::flush() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-07 15:52:27 -03:00
Benny Halevy	a076ceef97	view: row_lock: lock_ck: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 19:27:30 +02:00
Avi Kivity	909fbfdd2f	repair: reindent repair_range	2022-12-07 18:17:21 +02:00
Avi Kivity	796ec5996f	repair: coroutinize repair_range	2022-12-07 18:13:10 +02:00
Benny Halevy	78c5961114	docs: operating-scylla: add-node-to-cluster: deleted instructions for unsupported releases 2.3 and 2018.1 ended their life and are long gone. No need to have instructions for them in the master version of this document. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:07:35 +02:00
Benny Halevy	adeb03e60f	docs: operating-scylla: add-node-to-cluster: cleanup: move tips to a note And be more verbose about why the tips are recommended and their ramifications. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:07:18 +02:00
Benny Halevy	6e324137bd	docs: operating-scylla: add-node-to-cluster: improve wording of cleanup instructions "use `nodetool cleanup` cleanup command" repeats words, change to "run the `nodetool cleanup` command". Also, improve the description of the cleanup action and how it relates to the bootstrapping process. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:07:08 +02:00
Benny Halevy	eeed330647	docs: operating-scylla: prerequisites: system_auth is a keyspace, not a table Fix the phrase referring to it as a table respectively. Also, do some minor phrasing touch-ups in this area. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:06:54 +02:00
Benny Halevy	5d840d4232	docs: operating-scylla: prerequisites: no Authentication status is gathered Authentication status isn't gathered from scylla.yaml, only the authenticator, so change the caption accordingly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:06:48 +02:00
Benny Halevy	9cb7056d3e	docs: operating-scylla: prerequisites: simplify grep commands Writing `cat X \| grep Y` is both inefficient and somewhat unprofessional. The grep command works very well on a file argument so `grep Y X` will do the job perfectly without the need for a pipe. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:06:36 +02:00
Benny Halevy	71bc12eecc	docs: operating-scylla: add-node-to-cluster: prerequisites: number sub-sections To improve their readability. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:06:35 +02:00
Benny Halevy	16db7bea82	docs: operating-scylla: add-node-to-cluster: describe other nodes in plural Typically data will be streamed from multiple existing nodes to the new node, not from a single one. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 17:03:23 +02:00
Tomasz Grabiec	a46b2e4e4c	Merge 'Make node replace procedure work with Raft' from Kamil Braun We need to obtain the Raft ID of the replaced node during the shadow round and place it in the address map. It won't be placed by the regular gossiping route if we're replacing using the same IP, because we override the application state of the replaced node. Even if we replace a node with a different IP, it is not guaranteed that background gossiping manages to update the address map before we need it, especially in tests where we set ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round, on the other hand, performs a synchronous request (and if it fails during bootstrap, bootstrap will fail - because we also won't be able to obtain the tokens and Host ID of the replaced node). Fetch the Raft ID of the replaced node in `prepare_replacement_info`, which runs the shadow round. Return it in `replacement_info`. Then `join_token_ring` passes it to `setup_group0`, which stores it in the address map. It does that after `join_group0` so the entry is non-expiring (the replaced node is a member of group 0). Later in the replace procedure, we call `remove_from_group0` for the replaced node. `remove_from_group0` will be able to reverse-translate the IP of the replaced node to its Raft ID using the address map. Also remove an unconditional 60 seconds sleep from the replace code. Make it dependent on ring_delay. Enable the replace tests. Modify some code related to removing servers from group 0 which depended on storing IP addresses in the group 0 configuration. Closes #12172 * github.com:scylladb/scylladb: test/topology: enable replace tests service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0` service: handle replace correctly with Raft enabled gms/gossiper: fetch RAFT_SERVER_ID during shadow round service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace	2022-12-07 15:30:27 +01:00
Pavel Emelyanov	9bdea110a6	code: Reduce fanout of sstables(_manager)?.hh over headers This change removes sstables.hh from some other headers replacing it with version.hh and shared_sstable.hh. Also this drops sstables_manager.hh from some more headers, because this header propagates sstables.hh via self. That change is pretty straightforward, but has a ricochet in database.hh that needs disk-error-handler.hh. Without the patch, touching sstables/sstable.hh results in recompilation of 409 targets; with the patch, 299 targets. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12222	2022-12-07 14:34:19 +02:00
Botond Dénes	57a4971962	Merge 'dirty_memory_manager: tidy up' from Avi Kivity Tidy up namespaces, move code to the right file, and move the whole thing to the replica module where it belongs. Closes #12219 * github.com:scylladb/scylladb: dirty_memory_manager: move implementaton from database.cc dirty_memory_manager: move to replica module test: dirty_memory_manager_test: disambiguate classes named 'test_region_group' dirty_memory_manager: stop using using namespace	2022-12-07 14:25:59 +02:00
Avi Kivity	f7f5700289	dirty_memory_manager: move implementaton from database.cc A few leftover method implementations were left in database.cc when dirty_memory_manager.cc was created, move them to their correct place now.	2022-12-06 22:28:54 +02:00
Avi Kivity	444de2831e	dirty_memory_manager: move to replica module It's a replica-side thing, so move it there. The related flush_permit and sstable_write_permit are moved alongside.	2022-12-06 22:24:17 +02:00
Avi Kivity	a038a35ad6	test: dirty_memory_manager_test: disambiguate classes named 'test_region_group' There are two similarly named classes: ::test_region_group and dirty_memory_manager_logalloc::test_region_group. Rename the former to ::raii_region_group (that's what it's for) and the latter to ::test_region_group, to reduce confusion.	2022-12-06 22:20:38 +02:00
Avi Kivity	dfdae5ffa9	dirty_memory_manager: stop using using namespace `using namespace` is pretty bad, especially in a header, as it pollutes the namespace for everyone. Stop using it and qualify names instead.	2022-12-06 21:37:38 +02:00
Avi Kivity	47a8fad2a2	Merge 'scylla-types: add serialize action' from Botond Dénes Serializes the value that is an instance of a type. The opposite of `deserialize` (previously known as `print`). All other actions operate on serialized values, yet up to now we were missing a way to go from human readable values to serialized ones. This prevented for example using `scylla types tokenof $pk` if one only had the human readable key value. Example: ``` $ scylla types serialize -t Int32Type -- -1286905132 b34b62d4 $ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b 16 0010d00819896f6b11ea00000000001c571b000400000010 $ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b 0010d00819896f6b11ea00000000001c571b ``` Closes #12029 * github.com:scylladb/scylladb: docs: scylla-types.rst: add mention of per-operation --help tools/scylla-types: add serialize operation tools/scylla-types: prepare for action handlers with string arguments tools/scylla-types: s/print/deserialize/ operation docs: scylla-types.rst: document tokenof and shardof docs: scylla-types.rst: fix typo in compare operation description	2022-12-06 19:27:15 +02:00
Nadav Har'El	f275bfd57b	Update CODEOWNERS file Update the CODEOWNERS file with some people who joined different parts of the project, and one person that left. Note that despite its name, CODEOWNERS does not list "ownership" in any strict sense of the word - it is more about who is willing and/or knowledgeable enough to participate in reviewing changes to particular files or directories. Github uses this file to automatically suggest who should review a pull request. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12216	2022-12-06 19:26:03 +02:00
Benny Halevy	5007ded2c1	view: row_lock: lock_ck: serialize partition and row locking The problematic scenario this patch fixes might happen due to unfortunate serialization of locks/unlocks between lock_pk and lock_ck, as follows: 1. lock_pk acquires an exclusive lock on the partition. 2.a lock_ck attempts to acquire shared lock on the partition and any lock on the row. both cases currently use a fiber returning a future<rwlock::holder>. 2.b since the partition is locked, the lock_partition times out returning an exceptional future. lock_row has no such problem and succeeds, returning a future holding a rwlock::holder, pointing to the row lock. 3.a the lock_holder previously returned by lock_pk is destroyed, calling `row_locker::unlock` 3.b row_locker::unlock sees that the partition is not locked and erases it, including the row locks it contains. 4.a when_all_succeeds continuation in lock_ck runs. Since the lock_partition future failed, it destroys both futures. 4.b the lock_row future is destroyed with the rwlock::holder value. 4.c ~holder attempts to return the semaphore units to the row rwlock, but the latter was already destroyed in 3.b above. Acquiring the partition lock and row lock in parallel doesn't help anything, but it complicates error handling as seen above. This patch serializes acquiring the row lock in lock_ck after locking the partition to prevent the above race. This way, erasing the unlocked partition is never expected to happen while any of its row locks is held. Fixes #12168 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12208	2022-12-06 16:29:46 +02:00
Botond Dénes	f017e9f1c6	docs: document the reader concurrency semaphore diagnostics dump The diagnostics dumped by the reader concurrency semaphore are pretty common-sight in logs, as soon as a node becomes problematic. The reason is that the reader concurrency semaphore acts as the canary in the coal mine: it is the first that starts screaming when the node or workload is unhealthy. This patch adds documentation of the content of the diagnostics and how to diagnose common problems based on it. Fixes: #10471 Closes #11970	2022-12-06 16:24:44 +02:00
Botond Dénes	c35cee7e2b	docs: scylla-types.rst: add mention of per-operation --help	2022-12-06 14:47:28 +02:00
Botond Dénes	4f9799ce4f	tools/scylla-types: add serialize operation Takes human readable values and converts them to serialized hex encoded format. Only regular atomic types are supported for now, no collection/UDT/tuple support, not even in frozen form.	2022-12-06 14:46:53 +02:00
Botond Dénes	7c87655b4b	tools/scylla-types: prepare for action handlers with string arguments Currently all action handlers have bytes arguments, parsed from hexadecimal string representations. We plan on adding a serialize command which will require raw string arguments. Prepare the infrastructure for supporting both types of action handlers.	2022-12-06 14:45:30 +02:00
Botond Dénes	15452730fb	tools/scylla-types: s/print/deserialize/ operation Soon we will have a serialize operation. Rename the current print operation to deserialize in preparation to that. We want the two operations (serialize and deserialize) to reflect their relation in their names too.	2022-12-06 14:45:30 +02:00
Botond Dénes	f98e6552b4	docs: scylla-types.rst: document tokenof and shardof These new actions were added recently but without the accompanying documentation change. Make up for this now.	2022-12-06 14:45:30 +02:00
Botond Dénes	30c047cae6	docs: scylla-types.rst: fix typo in compare operation description	2022-12-06 14:45:23 +02:00
Piotr Dulikowski	680423ad9d	cassandra_tests: enable existing tests for static columns Removes the "xfail" marker from the now-passing tests related to secondary indexes on static columns.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	cc3af3190d	create_index_statement: lift restriction on secondary indexes on static rows Secondary indexes on static columns should work now. This commit lifts the existing restriction after the cluster is fully upgraded to a version which supports such indexes.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	86dad30b66	db/view: fetch and process static rows when building indexes This commit modifies the view builder and its consumer so that static rows are always fetched and properly processed during view build. Currently, the view builder will always fetch both static and clustering rows, regardless of the type of indexes being built. For indexes on static columns this is wasteful and could be improved so that only the types of rows relevant to indexes being built are fetched - however, doing this sounds a bit complicated and I would rather start with something simpler which has a better chance of working.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	25fec0acce	gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature The new feature will prevent secondary indexes on static columns from being created unless the whole cluster is ready to support them.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	9f14f0ac09	create_index_statement: disallow creation of local indexes with static columns Local indexes on static columns don't make sense because there is only one static row per partition. It's always better to just run SELECT DISTINCT on the base table. Allowing for such an index would only make such queries slower (due to double lookup), would take unnecessary space and could pose potential consistency problems, so this commit explicitly forbids them.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	8c4cdfc2db	select_statement: prepare paging for indexes on static columns When performing a query on a table which is accelerated by a secondary index, the paging state returned along with the query contains a partition key and a clustering key of the secondary index table. The logic wasn't prepared to handle the case of secondary indexes on static columns - notably, it tried to put base table's clustering key columns into the paging state which caused problems in other places. This commit fixes the paging logic so that the PK and CK of a secondary index table is calculated correctly. However, this solution has a major drawback: because it is impossible to encode clustering key of the base table in the paging state, partitions returned by queries accelerated by secondary indexes on static columns will _not_ be split by paging. This can be problematic in case there are large partitions in the base table. The main advantage of this fix is that it is simple. Moreover, the problem described above is not unique to static column indexes, but also happens e.g. in case of some indexes on clustering columns (see case 2 of scylladb/scylla#7432). Fixing this issue will require a more sophisticated solution and may affect more than only secondary indexes on static columns, so this is left for a followup.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	ba390072c5	select_statement: do not attempt to fetch clustering columns from secondary index's table The previous commit made sure that the index table for secondary indexes on static tables don't have columns corresponding to clustering rows in the base table - therefore, we must make sure that we don't try to fetch them when querying the index table.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	983b440a81	secondary_index_manager: don't add clustering key columns to index table of static column index The implementation of secondary indexes on static columns relies on the fact that the index table only includes partition key columns of the base table, but not clustering key columns. A static column's value determines a set of full partitions, so including the clustering key would only be redundant. It would also generate more work as a single static column update would require a large portion of the index to be updated. This commit makes sure that clustering columns are not included in the index table for indexes based on a static column.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	6ab41d76e6	replica/table: adjust the view read-before-write to return static rows when needed Adjusts the read-before-write query issued in `table::do_push_view_replica_updates` so that, when needed, requests static columns and makes sure that the static row is present.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	18be90b1e6	db/view: process static rows in view_update_builder::on_results The `view_update_builder::on_results()` function is changed to react to static rows when comparing read-before-write results with the base table mutation.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	2dd95d76f1	db/view: adjust existing view update generation path to use clustering_or_static_row The view update path is modified to use `clustering_or_static_row` instead of just `clustering_row`.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	b0a31bb7a7	column_computation: adjust to use clustering_or_static_row Adjusts the column_computation interface so that it is able to accept both clustering and static rows through the common db::view::clustering_or_static_row interface.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	986ab6034c	db/view: add clustering_or_static_row Adds a `clustering_or_static_row`, which is a common, immutable representation of either a static or clustering row. It will allow to handle view update generation based on static or clustering rows in a uniform way.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	05d4328f02	deletable_row: add column_kind parameter to is_live While deletable_row is used to hold regular columns of a clustering row, its name or implementation doesn't suggest that it is a requirement. In fact, some of its methods already take a column_kind parameter which is used to interpret the kind of columns held in the row. This commit removes the assumption about the column kind from the `deletable_row::is_live` method.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	27c81432cd	view_info: adjust view_column to accept column_kind The `view_info::view_column()` and `view_column` in view.cc allow to get a view's column definition which corresponds to given base table's column. They currently assume that the given column id corresponds to a regular column. In preparation for secondary indexes based on static columns, this commit adjusts those functions so that they accept other kinds of columns, including static columns.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	f7b7724eaf	db/view: base_dependent_view_info: split non-pk columns into regular and static Currently, `base_dependent_view_info::_base_non_pk_columns_in_view_pk` field keeps a list of non-primary-key columns from the base table which are a part of the view's primary key. Because the current code does not allow indexes on static columns yet, the columns kept in the aforementioned field are always assumed to be regular columns of the base table and are kept as `column_id`s which do not contain information about the column kind. This commit splits the `_base_non_pk_columns_in_view_pk` field into two, one for regular columns and the other for static columns, so that it is possible to keep both kinds of columns in `base_dependent_view_info` and the structure can be used for secondary indexes on static columns.	2022-12-06 11:21:16 +01:00
Botond Dénes	681bd62424	Update tools/java submodule * tools/java ecab7cf7d6...1c4e1e7a7d (2): > Merge "Cqlsh serverless v2" from Karol Baryla > Update Java Driver version to 3.11.2.4	2022-12-06 09:06:09 +02:00
Botond Dénes	6a1dbffaaa	Merge 'compaction_manager: coroutinize postponed_compactions_reevaluation' from Avi Kivity Three lambdas were removed, simplifying the code. Closes #12207 * github.com:scylladb/scylladb: compaction_manager: reindent postponed_compactions_reevaluation() compaction_manager: coroutinize postponed_compactions_reevaluation() compaction_manager: make postponed_compactions_reevaluation() return a future	2022-12-06 08:08:36 +02:00
Avi Kivity	2339a3fa06	database: remove continuation for updating statistics update_write_metrics() is a continuation added solely for updating statistics. Fold it into do_update to reduce an allocation in the write path. ```console $ ./artifacts/before --write --smp 1 2<&1 | grep insn 189930.77 tps ( 57.2 allocs/op, 13.2 tasks/op, 50994 insns/op, 0 errors) 189954.18 tps ( 57.2 allocs/op, 13.2 tasks/op, 51086 insns/op, 0 errors) 188623.86 tps ( 57.2 allocs/op, 13.2 tasks/op, 51083 insns/op, 0 errors) 190115.01 tps ( 57.2 allocs/op, 13.2 tasks/op, 51092 insns/op, 0 errors) 190173.71 tps ( 57.2 allocs/op, 13.2 tasks/op, 51083 insns/op, 0 errors) median 189954.18 tps ( 57.2 allocs/op, 13.2 tasks/op, 51086 insns/op, 0 errors) ``` vs ```console $ ./artifacts/after --write --smp 1 2<&1 | grep insn 190358.38 tps ( 56.2 allocs/op, 12.2 tasks/op, 50754 insns/op, 0 errors) 185222.78 tps ( 56.2 allocs/op, 12.2 tasks/op, 50789 insns/op, 0 errors) 184508.09 tps ( 56.2 allocs/op, 12.2 tasks/op, 50842 insns/op, 0 errors) 142099.47 tps ( 56.2 allocs/op, 12.2 tasks/op, 50825 insns/op, 0 errors) 190447.22 tps ( 56.2 allocs/op, 12.2 tasks/op, 50811 insns/op, 0 errors) ``` One allocation and ~300 cycles saved. update_write_metrics() is still called from other call sites, so it is not removed. Closes #12108	2022-12-06 07:04:17 +02:00
Botond Dénes	6daa1e973f	Merge 'alternator: fix hangs related to TTL scanning' from Nadav Har'El The first patch in this small series fixes a hang during shutdown when the expired-item scanning thread can hang in a retry loop instead of quitting. These hangs were seen in some test runs (issue #12145). The second patch is a failsafe against additional bugs like those solved by the first patch: If any bug causes the same page fetch to repeatedly time out, let's stop the attempts after 10 retries instead of retrying forever. When we stop the retries, a warning will be printed to the log, Scylla will wait until the next scan period and start a new scan from scratch - from a random position in the database, instead of hanging potentially-forever waiting for the same page. Closes #12152 * github.com:scylladb/scylladb: alternator ttl: in scanning thread, don't retry the same page too many times alternator: fix hang during shutdown of expiration-scanning thread	2022-12-06 06:44:22 +02:00
Botond Dénes	c5da96e6f7	Merge 'cql3: batch_statement: coroutinize get_mutations()' from Avi Kivity As it has a do_with(), coroutinizing it is an automatic win. Closes #12195 * github.com:scylladb/scylladb: cql3: batch_statement: reindent get_mutations() cql3: batch_statement: coroutinize get_mutations()	2022-12-06 06:41:44 +02:00
Avi Kivity	d2b1d2f695	compaction_manager: reindent postponed_compactions_reevaluation()	2022-12-05 22:02:27 +02:00
Avi Kivity	1669025736	compaction_manager: coroutinize postponed_compactions_reevaluation() So much nicer.	2022-12-05 22:01:41 +02:00
Avi Kivity	d2c44cba77	compaction_manager: make postponed_compactions_reevaluation() return a future postponed_compactions_reevaluation() runs until compaction_manager is stopped, checking if it needs to launch new compactions. Make it return a future instead of stashing its completion somewhere. This makes it easier to convert it to a coroutine.	2022-12-05 21:58:48 +02:00
Avi Kivity	fe4d7fbdf2	Update abseil submodule * abseil 7f3c0d78...4e5ff155 (125): > Add a compilation test for recursive hash map types > Add AbslStringify support for enum types in Substitute. > Use a c++14-style constexpr initialization if c++14 constexpr is available. > Move the vtable into a function to delay instantiation until the function is called. When the variable is a global the compiler is allowed to instantiate it more aggresively and it might happen before the types involved are complete. When it is inside a function the compiler can't instantiate it until after the functions are called. > Cosmetic reformatting in a test. > Reorder base64 unescape methods to be below the escaping methods. > Fixes many compilation issues that come from having no external CI coverage of the accelerated CRC implementation and some differences bewteen the internal and external implementation. > Remove static initializer from mutex.h. > Import of CCTZ from GitHub. > Remove unused iostream include from crc32c.h > Fix MSVC builds that reject C-style arrays of size 0 > Remove deprecated use of absl::ToCrc32c() > CRC: Make crc32c_t as a class for explicit control of operators > Convert the full parser into constexpr now that Abseil requires C++14, and use this parser for the static checker. This fixes some outstanding bugs where the static checker differed from the dynamic one. Also, fix `%v` to be accepted with POSIX syntax. > Write (more) directly into the structured buffer from StringifySink, including for (size_t, char) overload. > Avoid using the non-portable type __m128i_u. > Reduce flat_hash_{set,map} generated code size. > Use ABSL_HAVE_BUILTIN to fix -Wundef __has_builtin warning > Add a TODO for the deprecation of absl::aligned_storage_t > TSAN: Remove report_atomic_races=0 from CI now that it has been fixed > absl: fix Mutex TSan annotations > CMake: Remove trailing commas in `AbseilDll.cmake` > Fix AMD cpu detection. 
> CRC: Get CPU detection and hardware acceleration working on MSVC x86(_64) > Removing trailing period that can confuse a url in str_format.h. > Refactor btree iterator generation code into a base class rather than using ifdefs inside btree_iterator. > container.h: fix incorrect comments about the location of <numeric> algorithms. > Zero encoded_remaining when a string field doesn't fit, so that we don't leave partial data in the buffer (all decoders should ignore it anyway) and to be sure that we don't try to put any subsequent operands in either (there shouldn't be enough space). > Improve error messages when comparing btree iterators when generations are enabled. > Document the WebSafe* and WithPadding variants more concisely, as deltas from Base64Encode. > Drop outdated comment about LogEntry copyability. > Release structured logging. > Minor formatting changes in preparation for structured logging... > Update absl::make_unique to reflect the C++14 minimum > Update Condition to allocate 24 bytes for MSVC platform pointers to methods. > Add missing include > Refactor "RAW: " prefix formatting into FormatLogPrefix. > Minor formatting changes due to internal refactoring > Fix typos > Add a new API for `extract_and_get_next()` in b-tree that returns both the extracted node and an iterator to the next element in the container. > Use AnyInvocable in internal thread_pool > Remove absl/time/internal/zoneinfo.inc. It was used to guarantee availability of a few timezones for "time_test" and "time_benchmark", but (file-based) zoneinfo is now secured via existing Bazel data/env attributes, or new CMake environment settings. > Updated documentation on use of %v Also updated documentation around FormatSink and PutPaddedString > Use the correct Bazel copts in crc targets > Run the //absl/time timezone tests with a data dependency on, and a matching ${TZDIR} setting for, //absl/time/internal/cctz:zoneinfo. > Stop unnecessary clearing of fields in ~raw_hash_set. 
> Fix throw_delegate_test when using libc++ with shared libraries > CRC: Ensure SupportsArmCRC32PMULL() is defined > Improve error messages when comparing btree iterators. > Refactor the throw_delegate test into separate test cases > Replace std::atomic_flag with std::atomic<bool> to avoid the C++20 deprecation of ATOMIC_FLAG_INIT. > Add support for enum types with AbslStringify > Release the CRC library > Improve error messages when comparing swisstable iterators. > Auto increase inlined capacity whenever it does not affect class' size. > drop an unused dep > Factor out the internal helper AppendTruncated, which is used and redefined in a couple places, plus several more that have yet to be released. > Fix some invalid iterator bugs in btree_test.cc for multi{set,map} emplace{_hint} tests. > Force a conservative allocation for pointers to methods in Condition objects. > Fix a few lint findings in flags' usage.cc > Narrow some _MSC_VER checks to not catch clang-cl. > Small cleanups in logging test helpers > Import of CCTZ from GitHub. > Merge pull request abseil/abseil-cpp#1287 from GOGOYAO:patch-1 > Merge pull request abseil/abseil-cpp#1307 from KindDragon:patch-1 > Stop disabling some test warnings that have been fixed > Support logging of user-defined types that implement `AbslStringify()` > Eliminate span_internal::Min in favor of std::min, since Min conflicts with a macro in a third-party library. > Fix -Wimplicit-int-conversion. > Improve error messages when dereferencing invalid swisstable iterators. > Cord: Avoid leaking a node if SetExpectedChecksum() is called on an empty cord twice in a row. > Add a warning about extract invalidating iterators (not just the iterator of the element being extracted). > CMake: installed artifacts reflect the compiled ABI > Import of CCTZ from GitHub. > Import of CCTZ from GitHub. > Support empty Cords with an expected checksum > Move internal details from one source file to another more appropriate source file. 
> Removes `PutPaddedString()` function > Return uint8_t from CappedDamerauLevenshteinDistance. > Remove the unknown CMAKE_SYSTEM_PROCESSOR warning when configuring ABSL_RANDOM_RANDEN_COPTS > Enforce Visual Studio 2017 (MSVC++ 15.0) minumum > `absl::InlinedVector::swap` supports non-assignable types. > Improve b-tree error messages when dereferencing invalid iterators. > Mutex: Fix stall on single-core systems > Document Base64Unescape() padding > Fix sign conversion warnings in memory_test.cc. > Fix a sign conversion warning. > Fix a truncation warning on Windows 64-bit. > Use btree iterator subtraction instead of std::distance in erase_range() and count(). > Eliminate use of internal interfaces and make the test portable and expose it to OSS. > Fix various warnings for _WIN32. > Disables StderrKnobsDefault due to order dependency > Implement btree_iterator::operator-, which is faster than std::distance for btree iterators. > Merge pull request abseil/abseil-cpp#1298 from rpjohnst:mingw-cmake-build > Implement function to calculate Damerau-Levenshtein distance between two strings. > Change per_thread_sem_test from size medium to size large. > Support stringification of user-defined types in AbslStringify in absl::Substitute. > Fix "unsafe narrowing" warnings in absl, 12/12. > Revert change to internal 'Rep', this causes issues for gdb > Reorganize InlineData into an inner Rep structure. > Remove internal `VLOG_xxx` macros > Import of CCTZ from GitHub. > `absl::InlinedVector` supports move assignment with non-assignable types. > Change Cord internal layout, which reduces store-load penalties on ARM > Detects accidental multiple invocations of AnyInvocable<R(...)&&>::operator()&& by producing an error in debug mode, and clarifies that the behavior is undefined in the general case. > Fix a bug in StrFormat. This issue would have been caught by any compile-time checking but can happen for incorrect formats parsed via ParsedFormat::New. 
Specifically, if a user were to add length modifiers with 'v', for example the incorrect format string "%hv", the ParsedFormat would incorrectly be allowed. > Adds documentation for stringification extension > CMake: Remove check_target calls which can be problematic in case of dependency cycle > Changes mutex unlock profiling > Add static_cast<void> to the sources for trivial relocations to avoid spurious -Wdynamic-class-memaccess errors in the presence of other compilation errors. > Configure ABSL_CACHE_ALIGNED for clang-like and MSVC toolchains. > Fix "unsafe narrowing" warnings in absl, 11/n. > Eliminate use of internal interfaces > Merge pull request abseil/abseil-cpp#1289 from keith:ks/fix-more-clang-deprecated-builtins > Merge pull request abseil/abseil-cpp#1285 from jun-sheaf:patch-1 > Delete LogEntry's copy ctor and assignment operator. > Make sinks provided to `AbslStringify()` usable with `absl::Format()`. > Cast unused variable to void > No changes in OSS. > No changes in OSS > Replace the kPower10ExponentTable array with a formula. > CMake: Mark absl::cord_test_helpers and absl::spy_hash_state PUBLIC > Use trivial relocation for transfers in swisstable and b-tree. > Merge pull request abseil/abseil-cpp#1284 from t0ny-peng:chore/remove-unused-class-in-variant > Removes the legacy spellings of the thread annotation macros/functions by default. Closes #12201	2022-12-05 21:07:16 +02:00
Eliran Sinvani	5a5514d052	cql server: Only parallelize relevant cql requests The cql server uses an execution stage to process and execute queries, however, processing stage is best utilized when having a recurrent flow that needs to be called repeatedly since it better utilizes the instruction cache. Up until now, every request was sent through the processing stage, but most requests are not meant to be executed repeatedly with high volume. This change processes and executes the data queries asynchronously, through an execution stage, and all of the rest are processed one by one, only continuing once the request has been done end to end. Tests: Unit tests in dev and debug. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #12202	2022-12-05 21:06:58 +02:00
Takuya ASADA	b7851ab1ec	docker: fix locale on SSH shell `4ecc08c` broke locale settings on SSH shell, since we dropped "update-locale". To fix this without installing locales package, we need to manually specify LANG=C.UTF-8 in /etc/default/locale. see https://github.com/scylladb/scylla-cluster-tests/pull/5519 Closes #12197	2022-12-05 20:02:18 +02:00
Avi Kivity	6f2d060d12	Merge 'Make sstable_directory call sstable_manager for sstables' components' from Pavel Emelyanov This PR hits two goals for the "object storage" effort 1. Sstables loader "knows" that sstables components are stored in a Linux directory and uses utils/lister to access it. This is not going to work with sstables over object storage, the loader should be abstracted from the underlying storage. 2. Currently class keyspace and class column_family carry "datadir" and "all_datadirs" on board which are paths on the local filesystem where sstable files are stored (those usually start with /var/lib/scylla/data). The paths include subdirs like "snapshots", "staging", etc. This is not going to look nice for object storage, the /var/lib/ prefix is excessive and meaningless in this case. Instead, ks and cf should know their "location" and some other component should know the directory in which the files are stored. That said, this PR prepares distributed_loader and sstable_directory to stop using Linux paths explicitly by making both call sstables_manager to list and open sstable objects. After that it will be possible to teach the manager to list sstables from object storage. Also this opens the way to removing paths from keyspace and column_family classes and replacing those with relative "location"s. Closes #12128 * github.com:scylladb/scylladb: sstable_directory: Get components lister from manager sstable_directory: Extract directory lister sstable_directory: Remove sstable creation callback sstable_directory: Call manager to make sstables sstable_directory: Keep error handler generator sstable_directory: Keep schema_ptr sstable_directory: Use directory semaphore from manager sstable_directory: Keep reference on manager tests: Use sstables creation helper in some cases sstables_manager: Keep directory semaphore reference sstables, code: Wrap directory semaphore with concurrency	2022-12-05 18:54:17 +02:00
Gleb Natapov	022a825b33	raft: introduce not_a_member error and return it when non member tries to do add/modify_config Currently if a node that is outside of the config tries to add an entry or modify config transient error is returned and this causes the node to retry. But the error is not transient. If a node tries to do one of the operations above it means it was part of the cluster at some point, but since a node with the same id should not be added back to a cluster if it is not in the cluster now it will never be. Return a new error not_a_member to a caller instead. Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>	2022-12-05 17:11:04 +01:00
Benny Halevy	c61083852c	storage_service: handle_state_normal: calculate candidates_for_removal when replacing tokens We currently try to detect a replaced node so as to insert it into endpoints_to_remove when it has no owned tokens left. However, for each token we first generate a multimap using get_endpoint_to_token_map_for_reading(). There are 2 problems with that: 1. unless the replaced node owns a single token, this map will not be empty after erasing one token out of it, since the token metadata has not changed yet (this is done later with update_normal_tokens(owned_tokens, endpoint)). 2. generating this map for each token is inefficient, turning this algorithm complexity to quadratic in the number of tokens... This change copies the current token_to_endpoint map to a temporary map and erases replaced tokens from it, while maintaining a set of candidates_for_removal. After traversing all replaced tokens, we check again the `token_to_endpoint_map` erasing from `candidates_for_removal` any endpoint that still owns tokens. The leftover candidates are endpoints that own no tokens and so they are added to `hosts_to_remove`. Fixes #12082 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12141	2022-12-05 16:17:18 +01:00
Botond Dénes	3d620378d4	Merge 'view: coroutinize maybe_mark_view_as_built' from Avi Kivity Simplifying it a little. Closes #12171 * github.com:scylladb/scylladb: view: reindent maybe_mark_view_as_built view: coroutinize maybe_mark_view_as_built	2022-12-05 13:43:34 +02:00
Kamil Braun	3f8aaeeab9	test/topology: enable replace tests Also add some TODOs for enhancing existing tests.	2022-12-05 11:50:07 +01:00
Kamil Braun	ee19411783	service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0` Also simplify the code and improve logging in general. The previous code did this: search for the ID in the address map. If it couldn't be found, perform a read barrier and search again. If it again couldn't be found, return. This algorithm depended on the fact that IP addresses were stored in group 0 configuration. The read barrier was used to obtain the most recent configuration, and if the IP was not a part of address map after the read barrier, that meant it's simply not a member of group 0. This logic no longer applies so we can simplify the code. Furthermore, when I was fixing the replace operation with Raft enabled, at some point I had a "working" solution with all tests passing. But I was suspicious and checked if the replaced node got removed from group 0. It wasn't. So the replace finished "successfully", but we had an additional (voting!) member of group 0 which didn't correspond to a token ring member. The last version of my fixes ensure that the node gets removed by the replacing node. But the system is fragile and nothing prevents us from breaking this again. At least log an error for now. Regression tests will be added later.	2022-12-05 11:50:07 +01:00
Kamil Braun	4429885543	service: handle replace correctly with Raft enabled We must place the Raft ID obtained during the shadow round in the address map. It won't be placed by the regular gossiping route if we're replacing using the same IP, because we override the application state of the replaced node. Even if we replace a node with a different IP, it is not guaranteed that background gossiping manages to update the address map before we need it, especially in tests where we set ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round, on the other hand, performs a synchronous request (and if it fails during bootstrap, bootstrap will fail - because we also won't be able to obtain the tokens and Host ID of the replaced node). Fetch the Raft ID of the replaced node in `prepare_replacement_info`, which runs the shadow round. Return it in `replacement_info`. Then `join_token_ring` passes it to `setup_group0`, which stores it in the address map. It does that after `join_group0` so the entry is non-expiring (the replaced node is a member of group 0). Later in the replace procedure, we call `remove_from_group0` for the replaced node. `remove_from_group0` will be able to reverse-translate the IP of the replaced node to its Raft ID using the address map.	2022-12-05 11:50:07 +01:00
Kamil Braun	45bb5bfb52	gms/gossiper: fetch RAFT_SERVER_ID during shadow round During the replace operation we need the Raft ID of the replaced node. The shadow round is used for fetching all necessary information before the replace operation starts.	2022-12-05 11:50:07 +01:00
Kamil Braun	7222c2f9a1	service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace Most of the sleeps related to gossiping are based on `ring_delay`, which is configurable and can be set to a lower value e.g. during tests. But for some reason there was one case where we slept for a hardcoded value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds. Use `2 * get_ring_delay()` instead. With the default value of `ring_delay` (30 seconds) this will give the same behavior.	2022-12-05 11:50:07 +01:00
Pavel Emelyanov	b5ede873f2	sstable_directory: Get components lister from manager For now this is almost a no-op because manager just calls sstables_directory code back to create the lister. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	3f9b8c855d	sstable_directory: Extract directory lister Currently the utils/lister.cc code is in use to list regular files in a directory. This patch wraps the lister into more abstract components lister class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	abd3602b10	sstable_directory: Remove sstable creation callback It's no longer used. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	3d559391df	sstable_directory: Call manager to make sstables Now the directory code has everything it needs to create an sstable object and can stop using the external lambda. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	db657a8d1c	sstable_directory: Keep error handler generator Yet another continuation to previous patch -- IO error handlers generator is also needed to create sstables. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	4281f4af42	sstable_directory: Keep schema_ptr Continuation of one-before-previous patch. In order to create sstable without external lambda the directory code needs schema. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	8df1bcb907	sstable_directory: Use directory semaphore from manager After the previous patch, sstables_directory code no longer requires a semaphore argument, because it can get one from the manager. This makes the directory API shorter and simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	4da941e159	sstable_directory: Keep reference on manager The sstables_directory accesses /var/lib/scylla/data in two ways -- lists files in it and opens sstables. The latter is abstracted with the help of lambdas passed around, but the former (listing) is done by using directory listers from utils. Listing sstables components with a directory lister won't work for object storage, the directory code will need to call some abstraction layer instead. Opening sstables with the help of a lambda is a bit of overkill, having sstables manager at hand could make it much simpler. That said, this patch makes sstables_directory reference sstables_manager on start. This change will also simplify directory semaphore usage (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	784d78810a	tests: Use sstables creation helper in some cases Several test cases push sstables creation lambda into with_sstables_directory helper. There's a ready to use helper class that does the same. Next patch will make additional use of that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	5e13ce2619	sstables_manager: Keep directory semaphore reference Preparational patch. The semaphore will be used by sstables_directory in next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:18 +03:00
Pavel Emelyanov	be8512d7cc	sstables, code: Wrap directory semaphore with concurrency Currently this is a sharded<semaphore> started/stopped in main and referenced by database in order to be fed into sstables code. This semaphore always comes with the "concurrency" parameter that limits the parallel_for_each parallelism. This patch wraps both together into directory_semaphore class. This makes its usage simpler and will allow extending it in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 11:59:30 +03:00
Asias He	c6087cf3a0	repair: Reduce repair reader eviction with diff shard count When repair master and followers have different shard counts, the repair followers need to create multi-shard readers. Each multi-shard reader will create one local reader on each shard, N (smp::count) local readers in total. There is a hard limit on the number of readers that can work in parallel. When there are more readers than this limit, the readers will start to evict each other, causing buffers already read from disk to be dropped and readers to be recreated, which is not very efficient. To optimize and reduce reader eviction overhead, a global reader permit is introduced which accounts for the multi-shard reader bloat. With this patch, at any point in time, the number of readers created by repair will not exceed the reader limit. Test Results: 1) with stream sem 10, repair global sem 10, 5 ranges in parallel, n1=2 shards, n2=8 shards, memory wanted =1 1.1) [asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2) [2022-11-23 17:45:24,770] Starting repair command #1, repairing 1 ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true) [2022-11-23 17:45:53,869] Repair session 1 [2022-11-23 17:45:53,869] Repair session 1 finished real 0m30.212s user 0m1.680s sys 0m0.222s 1.2) [asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1) [2022-11-23 17:46:07,507] Starting repair command #1, repairing 1 ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true) [2022-11-23 17:46:30,608] Repair session 1 [2022-11-23 17:46:30,608] Repair session 1 finished real 0m24.241s user 0m1.731s sys 0m0.213s 2) with stream sem 10, repair global sem no_limit, 5 ranges in parallel, n1=2 shards, n2=8 shards, memory wanted =1 2.1) [asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2) [2022-11-23 17:49:49,301] Starting repair command #1, repairing 1 ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true) [2022-11-23 17:52:01,414] Repair session 1 [2022-11-23 
17:52:01,415] Repair session 1 finished real 2m13.227s user 0m1.752s sys 0m0.218s 2.2) [asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1) [2022-11-23 17:52:19,280] Starting repair command #1, repairing 1 ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true) [2022-11-23 17:52:42,387] Repair session 1 [2022-11-23 17:52:42,387] Repair session 1 finished real 0m24.196s user 0m1.689s sys 0m0.184s Comparing 1.1) and 2.1), it shows the eviction played a major role here. The patch gives 133s / 30s = 4.4X speed up in this setup. Comparing 1.1) and 1.2), it shows even if we limit the readers, starting on the lower shard is faster 30s / 24s = 1.25X (the total number of multishard readers is lower) Fixes #12157 Closes #12158	2022-12-05 10:47:36 +02:00
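The ratios in the repair test results above can be rechecked from the reported `real` wall-clock times. This is a minimal sketch, not part of the commit: the helper name `parse_real_time` and the dictionary keys are hypothetical, and only the four timings are taken from the log.

```python
import re


def parse_real_time(text: str) -> float:
    """Convert a `time` value such as '2m13.227s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# `real` times copied from the four runs quoted in the commit message.
runs = {
    "1.1": parse_real_time("0m30.212s"),  # global sem 10, repair on n2 (8 shards)
    "1.2": parse_real_time("0m24.241s"),  # global sem 10, repair on n1 (2 shards)
    "2.1": parse_real_time("2m13.227s"),  # no limit, repair on n2
    "2.2": parse_real_time("0m24.196s"),  # no limit, repair on n1
}

# Effect of limiting readers when repairing from the 8-shard node.
eviction_speedup = runs["2.1"] / runs["1.1"]
# Effect of starting the repair on the node with fewer shards.
lower_shard_speedup = runs["1.1"] / runs["1.2"]

print(f"limit vs. no limit on n2: {eviction_speedup:.2f}x")
print(f"n1 vs. n2 with limit:     {lower_shard_speedup:.2f}x")
```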
Botond Dénes	1e20095547	Update tools/java submodule * tools/java 1c06006447...ecab7cf7d6 (1): > Add VSCode files to gitignore	2022-12-05 09:54:51 +02:00
Botond Dénes	c4d72c8dd0	Merge 'cql3: select_statement: split and coroutinize process_results()' from Avi Kivity Split the simple (and common) case from the complex case, and coroutinize the latter. Hopefully this generates better code for the simple case, and it makes the complex case a little nicer. Closes #12194 * github.com:scylladb/scylladb: cql3: select_statement: reindent process_results_complex() cql3: select_statement: coroutinize process_results_complex() cql3: select_statement: split process_results() into fast path and complex path	2022-12-05 08:16:22 +02:00
Avi Kivity	a0a4711b74	snapshot: protect list operations against the lambda coroutine fiasco run_snapshot_list_operation() takes a continuation, so passing it a lambda coroutine without protection is dangerous. Protect the coroutine with coroutine::lambda so it doesn't lose its contents. Fixes #12192. Closes #12193	2022-12-05 08:14:39 +02:00
guy9	cb842b2729	Replacing the Docs top bar message from the LIVE event to the community forum announcement Closes #12189	2022-12-05 08:05:04 +02:00
Avi Kivity	6326be5796	cql3: batch_statement: reindent get_mutations()	2022-12-04 21:47:22 +02:00
Avi Kivity	2d74360de3	cql3: batch_statement: coroutinize get_mutations() It has a do_with(), so an automatic win.	2022-12-04 21:45:10 +02:00
Avi Kivity	0834bb0365	cql3: select_statement: reindent process_results_complex()	2022-12-04 21:36:17 +02:00
Avi Kivity	a63f98e3fc	cql3: select_statement: coroutinize process_results_complex() Not a huge gain, since it's just a do_with, but still a little better. Note the inner lambda is not a coroutine, so isn't susceptible to the lambda coroutine fiasco.	2022-12-04 21:34:51 +02:00
Avi Kivity	7f29efa0ad	cql3: select_statement: split process_results() into fast path and complex path This will allow us to coroutinize the complex path without adding an allocation to the fast path.	2022-12-04 21:30:45 +02:00
Avi Kivity	02b66bb31a	Merge 'Mark sstable::<directory accessing methods> private' from Pavel Emelyanov One of the prerequisites to make sstables reside on object-storage is not to let the rest of the code "know" the filesystem path they are located on (because sometimes they will not be on any filesystem path). This patch makes the methods that can reveal this path back private so that later they can be abstracted out. Closes #12182 * github.com:scylladb/scylladb: sstable: Mark some methods private test: Don't get sstable dir when known test: Use move_to_quarantine() helper test: Use sstable::filename() overload without dir name sstables: Reimplement batch directory sync after move table, tests: Make use of move_to_new_dir() default arg sstables: Remove fsync_directory() helper table: Simplify take_snapshot()'s collecting sstables names	2022-12-04 17:45:37 +02:00
Kamil Braun	b551cd254c	test: test_raft_upgrade: fix test_recover_stuck_raft_upgrade flakiness The test enables an error injection inside the Raft upgrade procedure on one of the nodes which will cause the node to throw an exception before entering `synchronize` state. Then it restarts other nodes with Raft enabled, waits until they enter `synchronize` state, puts them in RECOVERY mode, removes the error-injected node and creates a new Raft group 0. As soon as the other nodes enter `synchronize`, the test disabled the error injection (the rest of the test was outside the `async with inject_error(...)` block). There was a small chance that we disabled the error injection before the node reached it. In that case the node also entered `synchronize` and the cluster managed to finish the upgrade procedure. We encountered this during next promotion. Eliminate this possibility by extending the scope of the `async with inject_error(...)` block, so that the RECOVERY mode steps on the other nodes are performed within that block. Closes #12162	2022-12-02 21:26:44 +01:00
Avi Kivity	94f18b5580	test: sstable_conforms_to_mutation_source: use do_with_async() where needed The test clearly needs a thread (it converts a reader to a mutation without waiting), so give it one. Closes #12178	2022-12-02 20:48:37 +01:00
Pavel Emelyanov	084522d9eb	sstable: Mark some methods private There are several class sstable methods that reveal internal directory path to caller. It's not object-storage-friendly. Fortunately, all the callers of those methods had been patched not to work with full paths, so these can be marked private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:15:02 +03:00
Pavel Emelyanov	fb63850f2c	test: Don't get sstable dir when known The sstable_move_test creates sstables in its own temp directories and then requests these dirs' paths back from sstables. The test can use the paths it has at hand, no need to call sstables for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:13:58 +03:00
Pavel Emelyanov	4c742a658d	test: Use move_to_quarantine() helper Two places in tests move sstable to quarantine subdir by hand. There's the class sstable method that does the same, so use it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:13:19 +03:00
Pavel Emelyanov	d6244b7408	test: Use sstable::filename() overload without dir name The dir this place currently uses is the directory where the sstable was created, so dropping this argument would just render the same path. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:12:21 +03:00
Pavel Emelyanov	a702affd4d	sstables: Reimplement batch directory sync after move There's a table::move_sstables_from_staging() method that gets a bunch of sstables and moves them from staging subdir into table's root datadir. Not to flush the root dir for every sstable move, it asks the sstable::move_to_new_dir() not to flush, but collects staging dir names and flushes them and the root dir at the end altogether. In order to make it more friendly to object-storage and to remove one more caller of sstable::get_dir() the delayed_commit_changes struct is introduced. It collects _all_ the affected dir names in unordered_set, then allows flushing them. By default the move_to_new_dir() doesn't receive this object and flushes the directories instantly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:08:47 +03:00
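The batching idea described in the commit above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ScyllaDB implementation: the names `DelayedCommitChanges`, `move_to_new_dir` and `sync_directory` only mirror the C++ names from the message, and the directory sync relies on POSIX `fsync` of a directory file descriptor.

```python
import os
import tempfile


def sync_directory(path: str) -> None:
    """Flush directory metadata to disk (POSIX: fsync on a directory fd)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


class DelayedCommitChanges:
    """Collects every affected directory so each is synced only once."""

    def __init__(self) -> None:
        self.dirs = set()

    def commit(self) -> None:
        for directory in sorted(self.dirs):
            sync_directory(directory)
        self.dirs.clear()


def move_to_new_dir(src_path, new_dir, delay=None):
    """Move a file; sync right away unless a delay object is supplied."""
    old_dir = os.path.dirname(src_path)
    dst_path = os.path.join(new_dir, os.path.basename(src_path))
    os.rename(src_path, dst_path)
    if delay is None:
        sync_directory(old_dir)
        sync_directory(new_dir)
    else:
        delay.dirs.update((old_dir, new_dir))
    return dst_path


# Usage: move two files out of staging/ and sync both directories once.
with tempfile.TemporaryDirectory() as root:
    staging = os.path.join(root, "staging")
    os.mkdir(staging)
    delay = DelayedCommitChanges()
    for name in ("a-Data.db", "b-Data.db"):
        src = os.path.join(staging, name)
        open(src, "w").close()
        move_to_new_dir(src, root, delay)
    pending_before_commit = sorted(os.path.relpath(d, root) for d in delay.dirs)
    delay.commit()
    files_in_root = sorted(f for f in os.listdir(root) if f.endswith(".db"))
```

Using a set means the number of syncs depends on the number of distinct directories rather than on the number of moved files.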
Pavel Emelyanov	1b42d5fce3	table, tests: Make use of move_to_new_dir() default arg The method in question accepts boolean bit whether or not it should sync directories at the end. It's always true but in one case, so there's the default value for it. Make use of it. Anticipating the suggestion to replace bool with bool_class -- next patch will replace it with something else. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:07:16 +03:00
Pavel Emelyanov	339feb4205	sstables: Remove fsync_directory() helper The one effectively wraps existing seastar sync_directory() helper into two io_check-s. It's simpler just to call the latter directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:05:43 +03:00
Pavel Emelyanov	80f5d7393f	table: Simplify take_snapshot()'s collecting sstables names The method in question "snapshots" all sstables it can find, then writes their Datafile names into the manifest file. To get the list of file names it iterates over sstables list again and does silly conversion of full file path to file name with the help of the directory path length. This all can be made much simpler if just collecting component names directly at the time sstable is hardlinked. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:02:37 +03:00
Raphael S. Carvalho	d61b4f9dfb	compaction_manager: Delete compaction_state's move constructor compaction_state shouldn't be moved once emplaced. moving it could theoretically cause task's gate holder to have a dangling pointer to compaction_state's gate, but turns out gate's move ctor will actually fail under this assertion: assert(!_count && "gate reassigned with outstanding requests"); Cannot happen today, but let's make it more future proof. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12167	2022-12-02 20:56:57 +03:00
Tomasz Grabiec	1a6bf2e9ca	Merge 'service/raft: specialized verb for failure detector pinger' from Kamil Braun We used GOSSIP_ECHO verb to perform failure detection. Now we use a special verb DIRECT_FD_PING introduced for this purpose. There are multiple reasons to do so. One minor reason: we want to use the same connection as other Raft verbs: if we can't deliver Raft append_entries or vote messages somewhere, that endpoint should be marked dead; if we can, the endpoint should be marked alive. So putting pings on the same connection as the other Raft verbs is important when dealing with weird situations where some connections are available but others are not. Observe that in `do_get_rpc_client_idx`, we put the new verb in the right place. Another minor reason: we remove the awkward gossiper `echo_pinger` abstraction which required storing and updating gossiper generation numbers. This also removes one dependency from Raft service code to gossiper. Major reason 1: the gossip echo handler has a weird mechanism where a replacing node returns errors during the replace operation to some of the nodes. In Raft however, we want to mark servers as alive when they are alive, including a server running on a node that's replacing another node. Major reason 2, related to the previous one: when server B is replacing server A with the same IP, the failure detector will try to ping both servers. Both servers are mapped to the same IP by the address map, so pings to both servers will reach server B. We want server B to respond to the pings destined for server B, but not to pings destined for server A, so the sender can mark B alive but keep A marked dead. To do this, we include the destination's Raft ID in our RPCs. The destination compares the received ID with its own. If it's different, it returns a `wrong_destination` response, and the failure detector knows that the ping did not reach the destination (it reached someone else). 
Yet another reason: removes "Not ready to respond gossip echo message" log spam during replace. Closes #12107 * github.com:scylladb/scylladb: service/raft: specialized verb for failure detector pinger db: system_keyspace: de-staticize `{get,set}_raft_server_id` service/raft: make this node's Raft ID available early in group registry	2022-12-02 13:54:02 +01:00
Pavel Emelyanov	71179ff5ab	distributed_loader: Use coroutine::lambda in sleeping coroutine According to seastar/doc/lambda-coroutine-fiasco.md lambda that co_awaits once loses its capture frame. In distributed_loader code there's at least one of that kind. fixes: #12175 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12170	2022-12-02 13:06:33 +02:00
Pavel Emelyanov	1d91914166	sstables: Drop set_generation() method The method became unused since `70e5252a` (table: no longer accept online loading of SSTable files in the main directory) and the whole concept of reshuffling sstables was dropped later by `7351db7c` (Reshape upload files and reshard+reshape at boot). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12165	2022-12-01 22:17:10 +02:00
Avi Kivity	2978052113	view: reindent maybe_mark_view_as_built Several indentation levels were harmed during the preparation of this patch.	2022-12-01 22:09:21 +02:00
Avi Kivity	ac2e2f8883	view: coroutinize maybe_mark_view_as_built Somewhat simplifies complicated logic.	2022-12-01 22:04:51 +02:00
Kamil Braun	cbdcc944b5	service/raft: specialized verb for failure detector pinger We used GOSSIP_ECHO verb to perform failure detection. Now we use a special verb DIRECT_FD_PING introduced for this purpose. There are multiple reasons to do so. One minor reason: we want to use the same connection as other Raft verbs: if we can't deliver Raft append_entries or vote messages somewhere, that endpoint should be marked dead; if we can, the endpoint should be marked alive. So putting pings on the same connection as the other Raft verbs is important when dealing with weird situations where some connections are available but others are not. Observe that in `do_get_rpc_client_idx`, we put the new verb in the right place. Another minor reason: we remove the awkward gossiper `echo_pinger` abstraction which required storing and updating gossiper generation numbers. This also removes one dependency from Raft service code to gossiper. Major reason 1: the gossip echo handler has a weird mechanism where a replacing node returns errors during the replace operation to some of the nodes. In Raft however, we want to mark servers as alive when they are alive, including a server running on a node that's replacing another node. Major reason 2, related to the previous one: when server B is replacing server A with the same IP, the failure detector will try to ping both servers. Both servers are mapped to the same IP by the address map, so pings to both servers will reach server B. We want server B to respond to the pings destined for server B, but not to pings destined for server A, so the sender can mark B alive but keep A marked dead. To do this, we include the destination's Raft ID in our RPCs. The destination compares the received ID with its own. If it's different, it returns a `wrong_destination` response, and the failure detector knows that the ping did not reach the destination (it reached someone else). 
Yet another reason: removes "Not ready to respond gossip echo message" log spam during replace.	2022-12-01 20:54:18 +01:00
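The destination check described in "Major reason 2" above can be modelled with a small Python sketch. It is an illustration only: the class and function names are hypothetical, and the real handler is a C++ RPC verb (DIRECT_FD_PING) rather than a method call.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class PingResult(Enum):
    OK = "ok"
    WRONG_DESTINATION = "wrong_destination"


@dataclass
class Server:
    raft_id: uuid.UUID

    def handle_ping(self, dst_id: uuid.UUID) -> PingResult:
        # Answer positively only if the ping was addressed to this server.
        if dst_id == self.raft_id:
            return PingResult.OK
        return PingResult.WRONG_DESTINATION


def is_alive(address_map, servers_by_ip, dst_id) -> bool:
    """Ping the server with Raft ID dst_id through its mapped IP address."""
    ip = address_map[dst_id]
    server = servers_by_ip.get(ip)
    if server is None:
        return False  # nothing is listening on that address
    return server.handle_ping(dst_id) is PingResult.OK


# Server B replaces server A using the same IP address.
id_a, id_b = uuid.uuid4(), uuid.uuid4()
address_map = {id_a: "10.0.0.1", id_b: "10.0.0.1"}
servers_by_ip = {"10.0.0.1": Server(raft_id=id_b)}

a_alive = is_alive(address_map, servers_by_ip, id_a)
b_alive = is_alive(address_map, servers_by_ip, id_b)
```

Both pings reach B because the address map sends them to the same IP, but only the ping carrying B's own ID is acknowledged, so A stays marked dead.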
Kamil Braun	02c64becdc	db: system_keyspace: de-staticize `{get,set}_raft_server_id` Part of the anti-globals war.	2022-12-01 20:54:18 +01:00
Kamil Braun	99fe580068	service/raft: make this node's Raft ID available early in group registry Raft ID was loaded or created late in the boot procedure, in `storage_service::join_token_ring`. Create it earlier, as soon as it's possible (when `system_keyspace` is started), pass it to `raft_group_registry::start` and store it inside `raft_group_registry`. We will use this Raft ID stored in group registry in following patches. Also this reduces the number of disk accesses for this node's Raft ID. It's now loaded from disk once, stored in `raft_group_registry`, then obtained from there when needed. This moves `raft_group_registry::start` a bit later in the startup procedure - after `system_keyspace` is started - but it doesn't make a difference.	2022-12-01 20:54:18 +01:00
Nadav Har'El	6fcb5302a6	alternator-test: xfail a flaky test exposing a known bug In a recent commit `757d2a4`, we removed the "xfail" mark from the test test_manual_requests.py::test_too_large_request_content_length because it started to pass on more modern versions of Python, with a urllib3 bug fixed. Unfortunately, the celebration was premature: It turns out that although the test now usually passes, it sometimes fails. This is caused by a Seastar bug scylladb/seastar#1325, which I opened #12166 to track in this project. So unfortunately we need to add the "xfail" mark back to this test. Note that although the test will now be marked "xfail", it will actually pass most of the time, so will appear as "xpass" to people who run it. I put a note in the xfail reason string as a reminder why this is happening. Fixes #12143 Refs #12166 Refs scylladb/seastar#1325 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12169	2022-12-01 20:00:46 +02:00
Kamil Braun	3cd035d1b9	test/pylib: scylla_cluster: remove `ScyllaCluster.decommissioned` field The field was not used for anything. We can keep decommissioned server in `stopped` field. In fact it caused us a problem: since recently, we're using `ScyllaCluster.uninstall` to clean-up servers after test suite finishes (previously we were using `ScyllaServer.uninstall` directly). But `ScyllaCluster.uninstall` didn't look into the `decommissioned` field, so if a server got decommissioned, we wouldn't uninstall it, and it left us some unnecessary artifacts even for successful tests. This is now fixed. Closes #12163	2022-12-01 19:07:26 +02:00
Avi Kivity	a4b77a5691	Merge 'Cleanup sstables::test_env's manager usage' from Pavel Emelyanov Mainly this PR removes global db::config and feature service that are used by sstables::test_env as dependencies for embedded sstables_manager. Other than that -- drop unused methods, remove nested test_env-s and relax few cases that use two temp dirs at a time for no gain. Closes #12155 * github.com:scylladb/scylladb: test, utils: Use only one tempdir sstable_compaction_test: Dont create nested envs mutation_reader_test: Remove unused create_sstable() helper tests, lib: Move globals onto sstables::test_env tests: Use sstables::test_env.db_config() to access config features: Mark feature_config_from_db_config const sstable_3_x_test: Use env method to create sst sstable_3_x_test: Indentation fix after previous patch sstable_3_x_test: Use sstable::test_env test: Add config to sstable::test_env creation config: Add constexpr value for default murmur ignore bits	2022-12-01 17:47:25 +02:00
Pavel Emelyanov	4c6bfc078d	code: Use http::re(quest|ply) instead of httpd:: ones Recent seastar update deprecated those from httpd namespace. fixes: #12142 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12161	2022-12-01 17:33:35 +02:00
Pavel Emelyanov	adc6ee7ea8	test, utils: Use only one tempdir There's a do_with_cloned_tmp_directory that makes two temp dirs to toss sstables between them. Make it go with just one, all the more so it would resemble existing manipulations around staging/ subdir Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:57 +03:00
Pavel Emelyanov	15a7b9cafa	sstable_compaction_test: Dont create nested envs The "compact" test case runs in sstables::test_env and additionally wraps it with another instance provided by do_with_tmp_directory helper. It's simpler to create the temp dir by hand and use outer env. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:56 +03:00
Pavel Emelyanov	69fe5fd054	mutation_reader_test: Remove unused create_sstable() helper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:54 +03:00
Pavel Emelyanov	400bc2c11d	tests, lib: Move globals onto sstables::test_env There's a bunch of objects that are used by test_env as sstables_manager dependencies. Now when no other code needs those globals they better sit on the test_env next to the manager Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:36 +03:00
Pavel Emelyanov	6a294b9ad6	tests: Use sstables::test_env.db_config() to access config Currently some places use global test config, but it's going to be removed soon, so switch to using config from environment Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:30 +03:00
Pavel Emelyanov	b4e31ad359	features: Mark feature_config_from_db_config const It's in fact such. Other than that, next patch will call it with const config at hand and fail to compile without this fix Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:27 +03:00
Pavel Emelyanov	8178845ef3	sstable_3_x_test: Use env method to create sst Just to make it shorter and conform to other sst env tests Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:19 +03:00
Pavel Emelyanov	8d5d05012e	sstable_3_x_test: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:09 +03:00
Pavel Emelyanov	6628d801f2	sstable_3_x_test: Use sstable::test_env There are several cases there that construct sstables_manager by hand with the help of a bunch of global dependencies. It's nicer to use existing wrapper. (indentation left broken until next patch) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:38:46 +03:00
Pavel Emelyanov	1d8c76164f	test: Add config to sstable::test_env creation To make callers (tests) construct it with different options. In particular, one test will soon want to construct it with custom large data handler of its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:38:18 +03:00
Pavel Emelyanov	6d0c8fb6e2	config: Add constexpr value for default murmur ignore bits ... and use in some places of sstable_compaction_test. This will allow getting rid of global test_db_config thing later Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:38:15 +03:00
Botond Dénes	dbd00fd3e9	Merge 'Task manager shard repair tasks' from Aleksandra Martyniuk The PR introduces shard_repair_task_impl which represents a repair task that spans over a single shard repair. repair_info is replaced with shard_repair_task_impl, since both serve similar purpose. Closes #12066 * github.com:scylladb/scylladb: repair: reindent repair: replace repair_info with shard_repair_task_impl repair: move repair_info methods to shard_repair_task_impl repair: rename methods of repair_module repair: change type of repair_module::_repairs repair: keep a reference to shard_repair_task_impl in row_level_repair repair: move repair_range method to shard_repair_task_impl repair: make do_repair_ranges a method of shard_repair_task_impl repair: copy repair_info methods to shard_repair_task_impl repair: corutinize shard task creation repair: define run for shard_repair_task_impl repair: add shard_repair_task_impl	2022-12-01 10:04:31 +02:00
Nadav Har'El	5eda8ce4fd	alternator ttl: in scanning thread, don't retry the same page too many times Since fixing issue #11737, when the expiration scanner times out reading a page of data, it retries asking for the same page instead of giving up on the scan and starting anew later. This retry was infinite - which can cause problems if we have a bug in the code or several nodes down, which can lead to getting hung in the same place in the scan for a very long (potentially infinite) time without making any progress. An example of such a bug was issue #12145, where we forgot to handle shutdowns, so on shutdown of the cluster we just hung forever repeating the same request that will never succeed. It's better in this case to just give up on the current scan, and start it anew (from a random position) later. Refs #12145 (that issue was already fixed, by a different patch which stops the iteration when shutting down - not waiting for an infinite number of iterations and not even one more). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-11-30 18:42:37 +02:00
Nadav Har'El	d08eef5a30	alternator: fix hang during shutdown of expiration-scanning thread The expiration-scanning thread is a long-running thread which can scan data for hours, but checks for its abort-source before fetching each page to allow for timely shutdown. Recently, we added the ability to retry the page fetching in case of timeout, but forgot to check the abort source in this new retry loop - which led to an infinitely-long shutdown in some tests while the retry loop retries forever. In this patch we fix this bug by using sleep_abortable() instead of sleep(). sleep_abortable() will throw an exception if the abort source was triggered before or during the sleep - and this exception will stop the scan immediately. Fixes #12145 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-11-30 18:38:17 +02:00
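The two expiration-scanner fixes above (a bounded number of retries per page, and a sleep that an abort request can interrupt) can be sketched together in Python. This is an analogy, not the Alternator code: `threading.Event` stands in for Seastar's abort source, and `max_retries`, `retry_delay` and the function names are hypothetical values chosen for illustration.

```python
import threading


class SleepAborted(Exception):
    """Raised when an abort is requested before or during a sleep."""


def sleep_abortable(seconds: float, abort: threading.Event) -> None:
    # Event.wait() returns True at once if the event is already set,
    # or as soon as it becomes set while waiting.
    if abort.wait(timeout=seconds):
        raise SleepAborted()


def fetch_page_with_retries(fetch_page, abort, max_retries=3, retry_delay=0.01):
    """Retry a timed-out page fetch a bounded number of times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page()
        except TimeoutError:
            if attempt == max_retries:
                raise  # give up on this scan; the caller restarts it later
            sleep_abortable(retry_delay, abort)


# Usage: a fetch that times out twice and then succeeds.
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("page read timed out")
    return ["item-1", "item-2"]


page = fetch_page_with_retries(flaky_fetch, threading.Event())
```

With the abort event set, the first failed attempt raises `SleepAborted` instead of sleeping, which is what lets shutdown stop the scan immediately.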
Jan Ciolek	05ea0c1d60	dev/docs: add additional git pull to backport docs Botond noted that an additional git pull might be needed here: https://github.com/scylladb/scylladb/pull/12138#discussion_r1035857007 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-30 16:14:02 +01:00
Jan Ciolek	e74873408b	docs/dev: add a note about cherry-picking individual commits Some people prefer to cherry-pick individual commits so that they have fewer conflicts to resolve at once. Add a comment about this possibility. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-30 16:06:39 +01:00
Kamil Braun	0f9d0dd86e	Merge 'raft: support IP address change' from Konstantin Osipov This is the core of dynamic IP address support in Raft, moving out the IP address sourcing from Raft Group 0 configuration to gossip. At start of Raft, the raft id <> IP address translation map is tuned into the gossiper notifications and learns IP addresses of Raft hosts from them. The series intentionally doesn't contain the part which speeds up the initial cluster assembly by persisting the translation cache and using more sources besides gossip (discovery, RPC) to show correctness of the approach. Closes #12035 * github.com:scylladb/scylladb: raft: (rpc) do not throw in case of a missing IP address in RPC raft: (address map) actively maintain ip <-> raft server id map	2022-11-30 15:40:18 +01:00
Aleksandra Martyniuk	78a6193c01	repair: reindent	2022-11-30 13:53:52 +01:00
Aleksandra Martyniuk	b4ad914fe1	repair: replace repair_info with shard_repair_task_impl repair_info is deleted and all its attributes are moved to shard_repair_task_impl.	2022-11-30 13:53:52 +01:00
Aleksandra Martyniuk	f6ec2cec92	repair: move repair_info methods to shard_repair_task_impl	2022-11-30 13:53:18 +01:00
Jan Ciolek	32663e6adb	docs/dev: use 'is merged into' instead of 'becomes' The backport instructions said that after passing the tests next `becomes` master, but it's more exact to say that next `is merged into` master. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-30 13:25:10 +01:00
Jan Ciolek	28cf8a18de	docs/dev: mention that new backport instructions are for the contributor Previously the section was called: "How to backport a patch", which could be interpreted as instructions for the maintainer. The new title clearly states that these instructions are for the contributor in case the maintainer couldn't backport the patch by themselves. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-30 13:23:15 +01:00
Takuya ASADA	4ecc08c4fe	docker: switch default locale to C.UTF-8 Since we switched scylla-machine-image locale to C.UTF-8 because ubuntu-minimal image does not have en_US.UTF-8 by default, we should do the same on our docker image to reduce image size. Verified #9570 does not occur on new image, since it is still UTF-8 locale. Closes #12122	2022-11-30 13:58:43 +02:00
Anna Stuchlik	15cc3ecf64	doc: update the releases in the KB about updating the mode after upgrade	2022-11-30 12:53:13 +01:00
Anna Stuchlik	242a3916f0	doc: fix the broken link in the 5.1 upgrade guide	2022-11-30 12:49:20 +01:00
Alejo Sanchez	f7aa08ef25	test.py: don't stop cluster's site if not started The site member is created in ScyllaCluster.start(), for startup failure this might not be initialized, so check it's present before stop()ing it. And delete it as it's not running and proper initialization should call ScyllaCluster.start(). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11939	2022-11-30 13:47:18 +02:00
Anna Stuchlik	1575d96856	doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide	2022-11-30 12:40:49 +01:00
Nadav Har'El	ce347f4b67	test/cql-pytest: add test for meaning of fetch_size with filtering A question was raised on what fetch_size (the requested page size in a paged scan) counts when there is a filter: does it count the rows before filtering (as scanned from disk) or after filter (as will be returned to the client)? This patch adds a test which demonstrates that Cassandra and Scylla behave differently in this respect: Cassandra counts post-filtering - so fetch_size results are actually returned, while Scylla currently counts pre-filtering. It is arguable which behavior is the "correct" one - we discuss this in issue #12102. But we have already had several users (such as #11340) who complained about Scylla's behavior and expected Cassandra's behavior, so if we decide to keep Scylla's behavior we should at least explain and justify this decision in our documentation. Until then, let's have this test which reminds us of this incompatibility. This test currently passes on Cassandra and fails (xfail) on Scylla. Refs #11340 Refs #12102 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12103	2022-11-30 12:27:06 +02:00
Nadav Har'El	8bd8ef3d03	test/cql-pytest: add regression test for old issue This patch adds a regression test for the old issue #65 which is about a multi-column (tuple) clustering-column relation in a SELECT when one of these columns has reversed order. It turns out that we didn't notice, but this issue was already solved - but we didn't have a regression test for it. So this patch adds just a regression test. The test confirms that Scylla now behaves as was desired when that issue was opened. The test also passes on Cassandra, confirming that Scylla and Cassandra behave the same for such requests. Fixes #65 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12130	2022-11-30 12:22:21 +02:00
Michał Jadwiszczak	8e64e18b80	forward_service: add debug logs Adds a few debug logs to see what is happening in https://github.com/scylladb/scylladb/issues/11684 Wrapped `forward_result::printer` into `seastar::value_of` to lazy evaluate the printer Closes #12113	2022-11-30 12:15:26 +02:00
Yaniv Kaul	b66ca3407a	doc: Typo - then -> than Fix a typo. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes #12140	2022-11-30 12:03:56 +02:00
Botond Dénes	50aea9884b	Merge 'Improve the Raft upgrade procedure' from Kamil Braun Better logging, less code, a minor fix. Closes #12135 * github.com:scylladb/scylladb: service/raft: raft_group0: less repetitive logging calls service/raft: raft_group0: fix sleep_with_exponential_backoff	2022-11-30 11:24:20 +02:00
Avi Kivity	6a5d9ff261	treewide: use non-experimental std::source_location Now that we use libstdc++ 12, we can use the standardized source_location. Closes #12137	2022-11-30 11:06:43 +02:00
Jan Ciolek	56a802c979	docs/dev: Add backport instructions for contributors Add instructions on how to backport a feature to an older version of Scylla. It contains a detailed step-by-step instruction so that people unfamiliar with intricacies of Scylla's repository organization can easily get the hang of it. This is the guide I wish I had when I had to do my first backport. I put it in backport.md because that looks like the file responsible for this sort of information. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-29 22:10:27 +01:00
Konstantin Osipov	fbe7886cc0	raft: (rpc) do not throw in case of a missing IP address in RPC Remove raft_address_map::get_inet_address() While at it, coroutinize some rpc methods. To propagate up the event of missing IP address, use coroutine::exception( with a proper type (raft::transport_error) and a proper error message. This is a building block from removing raft_address_map::get_inet_address() which is too generic, and shifting the responsibility of handling missing addresses to the address map clients. E.g. one-way RPC shouldn't throw if an address is missing, but just drop the message. PS An attempt to use a single template function rendered to be too complex: - some functions require a gate, some don't - some return void, some future<> and some future<raft::data_type>	2022-11-29 19:55:48 +03:00
Konstantin Osipov	73e5298273	raft: (address map) actively maintain ip <-> raft server id map 1) make address map API flexible Before this patch: - having a mapping without an actual IP address was an internal error - not having a mapping for an IP address was an internal error - re-mapping to a new IP address wasn't allowed After this patch: - the address map may contain a mapping without an actual IP address, and the caller must be prepared for it: find() will return a nullopt. This happens when we first add an entry to Raft configuration and only later learn its IP address, e.g. via gossip. - it is allowed to re-map an existing entry to a new address; 2) subscribe to gossip notifications Learning IP addresses from gossip allows us to adjust the address map whenever a node IP address changes. Gossiper is also the only valid source of re-mapping, other sources (RPC) should not re-map, since otherwise a packet from a removed server can remap the id to a wrong address and impact liveness of a Raft cluster. 3) prompt address map state with app state Initialize the raft address map with initial gossip application state, specifically IPs of members of the cluster. With this, we no longer need to store these IPs in Raft configuration (and update them when they change). The obvious drawback of this approach is that a node may join Raft config before it propagates its IP address to the cluster via gossip - so the boot process has to wait until it happens. Gossip also doesn't tell us which IPs are members of Raft configuration, so we subscribe to Group0 configuration changes to mark the members of Raft config "non-expiring" in the address translation map. Thanks to the changes above, Raft configuration no longer stores IP addresses. We still keep the 'server_info' column in the raft_config system table, in case we change our mind or decide to store something else in there.	2022-11-29 19:55:43 +03:00
Kamil Braun	3dbcff435f	service/raft: raft_group0: less repetitive logging calls Some log messages in retry loops in the Raft upgrade procedure included a sentence like "sleeping before retrying..."; but not all of them. With the recently added `sleep_with_exponential_backoff` abstraction we can put this "sleeping..." message in a single place, and it's also easy to say how long we're going to sleep. I also enjoy using this `source_location` thing.	2022-11-29 17:42:43 +01:00
Nadav Har'El	c5121cf273	cql: fix column-name aliases in SELECT JSON The SELECT JSON statement, just like SELECT, allows the user to rename selected columns using an "AS" specification. E.g., "SELECT JSON v AS foo". This specification was not honored: We simply forgot to look at the alias in SELECT JSON's implementation (we did it correctly in regular SELECT). So this patch fixes this bug. We had two tests in cassandra_tests/validation/entities/json_test.py that reproduced this bug. The checks in those tests now pass, but these two tests still continue to fail after this patch because of two other unrelated bugs that were discovered by the same tests. So in this patch I also add a new test just for this specific issue - to serve as a regression test. Fixes #8078 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12123	2022-11-29 18:16:19 +02:00
Avi Kivity	faf11587fa	Update seastar submodule * seastar 4f4cc00660...3a5db04197 (16): > tls: add missing include <map> > Merge 'util/process: use then_unpack to help automatically unpack tuple.' from Jianyong Chen > HTTP: define formatter for status_type to fix build. > fsnotifier: move it into namespace experimental and add docs. > Move fsnotify.hh to the 'include' directory for public use. > Merge 'reactor: define make_pipe() and use make_pipe() in reactor::spawn()' from Kefu Chai > Merge 'Fix: error when compiling http_client_demo' from Amossss > util/process: using `data_sink_impl::put` > Merge 'dns: serialize UDP sends.' from Calle Wilund > build: use correct version when finding liburing > Merge 'Add simple http client' from Pavel Emelyanov > future: use invoke_result instead of nested requirements > Merge 'reactor: use separate calls in reactor and reactor_backend for read/write/sendmsg/recvmsg' from Kefu Chai > util, core: add spawn_process() helper > parallel utils: add note about shard-local parallelism > shared_mutex: return typed exceptional future in with_* error handlers Closes #12131	2022-11-29 18:10:06 +02:00
Kamil Braun	580bdec875	service/raft: raft_group0: fix sleep_with_exponential_backoff It was immediately jumping to _max_retry_period.	2022-11-29 16:27:59 +01:00
Nadav Har'El	6bc3075bbd	test/alternator: increase timeout on TTL tests Some of the tests in test/alternator/test_ttl.py need an expiration scan pass to complete and expire items. In development builds on developer machines, this usually takes less than a second (our scanning period is set to half a second). However, in debug builds on Jenkins each scan often takes up to 100 (!) seconds (this is the record we've seen so far). This is why we set the tests' timeout to 120. But recently we saw another test run failing. I think the problem is that in some cases, we need not one, but two scanning passes to complete before the timeout: It is possible that the test writes an item right after the current scan passed it, so it doesn't get expired, and then we start a second scan at a random position, possibly making that item we mention one of the last items to be considered - so in total we need to wait for two scanning periods, not one, for the item to expire. So this patch increases the timeout from 120 seconds to 240 seconds - more than twice the highest scanning time we ever saw (100 seconds). Note that this timeout is just a timeout, it's not the typical test run time: The test can finish much more quickly, as little as one second, if items expire quickly on a fast build and machine. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12106	2022-11-29 16:37:54 +03:00
Nadav Har'El	1f8adda4b2	Merge 'treewide: improve compatibility with gcc 12' from Avi Kivity Fix some issues found with gcc 12. Note we can't fully compile with gcc yet, due to [1]. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056 Closes #12121 * github.com:scylladb/scylladb: utils: observer: qualify seastar::noncopyable_function sstables: generation_type: forgo constexpr on hash of generation_type logalloc: disambiguate types and non-type members task_manager: disambiguate types and non-type members direct_failure_detector: don't change meaning of endpoint_liveness schema: abort on illegal per column computation kind database: abort on illegal per partition rate limit operation mutation_fragment: abort on illegal fragment type per_partition_rate_limit_options: abort on illegal operation type schema: drop unused lambda mutation_partition: drop unused lambda cql3: create_index_statement: remove unused lambda transport: prevent signed and unsigned comparison database: don't compare signed and unsigned types raft: don't compare signed and unsigned types compaction: don't compare signed and unsigned compaction counts bytes_ostream: don't take reference to packed variable	2022-11-29 13:57:24 +02:00
Avi Kivity	ea99750de7	test: give tests less-unique identifiers Test identifiers are very unique, but this makes them less useful in Jenkins Test Result Analyzer view. For example, counter_test can be counter_test.432 in one run and counter_test.442 in another. Jenkins considers them different and so we don't see a trend. Limit the id uniqueness within a test case, so that we'll have counter_test.{1, 2, 3} consistently. Those test will be grouped together so we can see pass/fail trends. Closes #11946	2022-11-29 13:14:14 +02:00
Yaniv Kaul	fef8e43163	doc: cluster management: Replace a misplaced period with a bulleted list of items Signed-Off-By: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes #12125	2022-11-29 12:42:24 +02:00
Botond Dénes	e9fec761a2	Merge 'doc: document the procedure for updating the mode after upgrade' from Anna Stuchlik Fix https://github.com/scylladb/scylla-docs/issues/4126 Closes #11122 * github.com:scylladb/scylladb: doc: add info about the time-consuming step due to resharding doc: add the new KB to the toctree doc: doc: add a KB about updating the mode in perftune.yaml after upgrade	2022-11-29 12:41:46 +02:00
Avi Kivity	ea901fdb9d	cql3: expr: fold `null` into untyped_constant/constant Our `null` expression, after the prepare stage, is redundant with a `constant` expression containing the value NULL. Remove it. Its role in the unprepared stage is taken over by untyped_constant, which gains a new type_class enumeration to represent it. Some subtleties: - Usually, handling of null and untyped_constant, or null and constant was the same, so they are just folded into each other - LWT "like" operator now has to discriminate between a literal string and a literal NULL - prepare and test_assignment were folded into the corresponding untyped_constant functions. Some care had to be taken to preserve error messages. Closes #12118	2022-11-29 11:02:18 +02:00
Aleksandra Martyniuk	8bc0af9e34	repair: fix double start of data sync repair task Currently, each data sync repair task is started (and hence run) twice. Thus, when two running operations happen within a time frame long enough, the following situation may occur: - the first run finishes - after some time (ttl) the task is unregistered from the task manager - the second run finishes and attempts to finish the task which does not exist anymore - memory access causes a segfault. The second call to start is deleted. A check is added to the start method to ensure that each task is started at most once. Fixes: #12089 Closes #12090	2022-11-29 00:00:10 +02:00
Avi Kivity	9765b2e3bc	cql3: expr: drop remnants of `bool` component from expression In `ad3d2ee47d`, we replaced `bool` as an expression element (representing a boolean constant) with `constant`. But a comment and a concept continue to mention it. Remove the comment and the concept fragment. Closes #12119	2022-11-28 23:18:26 +02:00
Pavel Emelyanov	ae79669fd2	topology: Be less restrictive about missing endpoints Recent changes in topology restricted the get_dc/get_rack calls. Older code was trying to locate the endpoint in gossiper, then in system keyspace cache and if the endpoint was not found in both -- returned "default" location. New code generates internal error in this case. This approach already helped to spot several BUGs in code that had been eventually fixed, but echoes of that change still pop up. This patch relaxes the "missing endpoint" case by printing a warning in logs and returning back the "default" location like old code did. tests: update_cluster_layout_tests.py::* hintedhandoff_additional_test.py::TestHintedHandoff::test_hintedhandoff_rebalance bootstrap_test.py::TestBootstrap::test_decommissioned_wiped_node_can_join bootstrap_test.py::TestBootstrap::test_failed_bootstap_wiped_node_can_join materialized_views_test.py::TestMaterializedViews::test_decommission_node_during_mv_insert_4_nodes refs: #11900 refs: #12054 fixes: #11870 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12067	2022-11-28 22:01:09 +02:00
Avi Kivity	3a6eafa8c6	utils: observer: qualify seastar::noncopyable_function gcc checks name resolution eagerly, and can't find noncopyable_function as this header doesn't include "seastarx.hh". Qualify the name so it finds it.	2022-11-28 21:58:30 +02:00
Avi Kivity	5ae98ab3de	sstables: generation_type: forgo constexpr on hash of generation_type std::hash isn't constexpr, so gcc refuses to make hash of generation_type constexpr. It's pointless anyway since we never have a compile-time sstable generation.	2022-11-28 21:58:30 +02:00
Avi Kivity	a2d43bb851	logalloc: disambiguate types and non-type members logalloc::tracker has some members with the same names as types from namespace scope. gcc (rightfully) complains that this changes the meaning of the name. Qualify the types to disambiguate.	2022-11-28 21:58:30 +02:00
Avi Kivity	ed5da87930	task_manager: disambiguate types and non-type members task_manager has some members with the same names as types from namespace scope. gcc (rightfully) complains that this changes the meaning of the name. Qualify the types to disambiguate.	2022-11-28 21:58:30 +02:00
Avi Kivity	27be1670d1	direct_failure_detector: don't change meaning of endpoint_liveness It's used both as a type and as a member. Qualify the type so they have different names.	2022-11-28 21:58:30 +02:00
Avi Kivity	735c46cb63	schema: abort on illegal per column computation kind Without memory corruption it's not possible for the switch to fall through, and the compiler will error if we forget to add a case. The compiler however is obliged to consider that we might store some other value in the variable.	2022-11-28 21:58:30 +02:00
Avi Kivity	f73a51250c	database: abort on illegal per partition rate limit operation Without memory corruption it's not possible for the switch to fall through, and the compiler will error if we forget to add a case. The compiler however is obliged to consider that we might store some other value in the variable.	2022-11-28 21:58:30 +02:00
Avi Kivity	f469885b41	mutation_fragment: abort on illegal fragment type Without memory corruption it's not possible for the switch to fall through, and the compiler will error if we forget to add a case. The compiler however is obliged to consider that we might store some other value in the variable.	2022-11-28 21:58:30 +02:00
Avi Kivity	a3c89cedbd	per_partition_rate_limit_options: abort on illegal operation type Without memory corruption it's not possible for the switch to fall through, and the compiler will error if we forget to add a case. The compiler however is obliged to consider that we might store some other value in the variable.	2022-11-28 21:58:30 +02:00
Avi Kivity	7ec28a81bf	schema: drop unused lambda get_cell is defined but not used.	2022-11-28 21:58:30 +02:00
Avi Kivity	c493a2379a	mutation_partition: drop unused lambda should_purge_row_tombstone is defined but not used.	2022-11-28 21:58:30 +02:00
Avi Kivity	e25bf62871	cql3: create_index_statement: remove unused lambda throw_exception is defined but not used.	2022-11-28 21:58:30 +02:00
Avi Kivity	5dedf85288	transport: prevent signed and unsigned comparison This can lead to undefined behavior. Cast to unsigned, after we've verified the value is indeed positive.	2022-11-28 21:58:30 +02:00
Avi Kivity	77be69b600	database: don't compare signed and unsigned types gcc warns it can lead to undefined behavior, though 2G entries in a list of mutations are unlikely. Use the correct type for iteration.	2022-11-28 21:58:30 +02:00
Avi Kivity	fb6804e7a4	raft: don't compare signed and unsigned types gcc warns it can lead to undefined behavior, though 2G entries in a list of mutations are unlikely. Use the correct type for iteration.	2022-11-28 21:58:30 +02:00
Avi Kivity	f565db75ce	compaction: don't compare signed and unsigned compaction counts gcc warns as this can lead to incorrect results. Cast the threshold to an unsigned type (we know it's positive at this point) to avoid the warning.	2022-11-28 21:41:56 +02:00
Avi Kivity	23b94ac391	bytes_ostream: don't take reference to packed variable bytes_ostream is packed, so its _begin member is packed as well. gcc (correctly) disallows taking a reference to an unaligned variable in an aligned reference, and complains. Make it happy by open-coding the exchange operation.	2022-11-28 21:40:18 +02:00
Nadav Har'El	5480211061	Merge 'test.py: support node replace operation' from Kamil Braun The `add_server` function now takes an optional `ReplaceConfig` struct (implemented using `NamedTuple`), which specifies the ID of the replaced server and whether to reuse the IP address. If we want to reuse the IP address, we don't allocate one using the host registry. This required certain refactors: moving the code responsible for allocation of IPs outside `ScyllaServer`, into `ScyllaCluster`. Add two tests, but they are now skipped: one of them is failing (inability of the new node to join group 0) and both suffer from a hardcoded 60-second sleep in Scylla. Closes #12032 * github.com:scylladb/scylladb: test/topology: simple node replace tests (currently disabled) test/pylib: scylla_cluster: support node replace operation test/pylib: scylla_cluster: move members initialization to constructor test/pylib: scylla_cluster: (re)lease IP addr outside ScyllaServer test/pylib: scylla_cluster: refactor create_server parameters to a struct test.py: stop/uninstall clusters instead of servers when cleaning up test/pylib: artifact_registry: replace `Awaitable` type with `Coroutine` test.py: prepare for adding extra config from test when creating servers test/pylib: manager_client: convert `add_server` to use `put_json` test/pylib: rest_client: allow returning JSON data from `put_json` test/pylib: scylla_cluster: don't import from manager_client	2022-11-28 16:06:39 +02:00
Takuya ASADA	4d8fb569a1	install.sh: drop locale workaround from python3 thunk Since #7408 does not occur on current python3 version (3.11.0), let's drop the workaround. Closes #12097	2022-11-28 13:07:03 +02:00
Anna Stuchlik	452915cef6	doc: set the documentation version 5.1 as default (latest) Closes #12105	2022-11-28 12:02:13 +01:00
Avi Kivity	380da0586c	Update tools/python3 submodule (drop locale workaround) * tools/python3 773070e...548e860 (1): > install.sh: drop locale workaround from python3 thunk	2022-11-28 12:24:13 +02:00
Avi Kivity	0da66371a5	storage_proxy: coroutinize inner continuation of create_hint_sync_point() It is part of a coroutine::parallel_for_each(), which is safe for lambda coroutines. Closes #12057	2022-11-28 11:30:00 +02:00
Avi Kivity	d12d42d1a6	Revert "configure: temporarily disable wasm support for aarch64" This reverts commit `e2fe8559ca`. I ran all the release mode tests on aarch64 with it reverted, and it passes. So it looks like whatever problems we had with it were fixed. Closes #12072	2022-11-28 11:30:00 +02:00
Nadav Har'El	99a72a9676	Merge 'cql3: expr: make it possible to evaluate expr::binary_operator' from Jan Ciołek As a part of CQL rewrite we want to be able to perform filtering by calling `evaluate()` on an expression and checking if it evaluates to `true`. Currently trying to do that for a binary operator would result in an error. Right now checking if a binary operation like `col1 = 123` is true is done using `is_satisfied_by`, which is able to check if a binary operation evaluates to true for a small set of predefined cases. Eventually once the grammar is relaxed we will be able to write expressions like: `(col1 < col2) = (1 > ?)`, which doesn't fit with what `is_satisfied_by` is supposed to do. Additionally expressions like `1 = NULL` should evaluate to `NULL`, not `true` or `false`. `is_satisfied_by` is not able to express that properly. The proper way to go is implementing `evaluate(binary_operator)`, which takes a binary operation and returns what the result of it would be. Implementing `prepare_expression` for `binary_operator` requires us to be able to evaluate it first. In the next PR I will add support for `prepare_expression`. 
Closes #12052 * github.com:scylladb/scylladb: cql-pytest: enable two unset value tests that pass now cql-pytest: reduce unset value error message cql3: expr: change unset value error messages to lowercase cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected cql3: expr: remove needless braces around switch cases cql3: move evaluation IS_NOT NULL to a separate function expr_test: test evaluating LIKE binary_operator expr_test: test evaluating IS_NOT binary_operator expr_test: test evaluating CONTAINS_KEY binary_operator expr_test: test evaluating CONTAINS binary_operator expr_test: test evaluating IN binary_operator expr_test: test evaluating GTE binary_operator expr_test: test evaluating GT binary_operator expr_test: test evaluating LTE binary_operator expr_test: test evaluating LT binary_operator expr_test: test evaluating NEQ binary_operator expr_test: test evaluating EQ binary_operator cql3: expr properly handle null in is_one_of() cql3: expr properly handle null in like() cql3: expr properly handle null in contains_key() cql3: expr properly handle null in contains() cql3: expr: properly handle null in limits() cql3: expr: remove unneeded overload of limits() cql3: expr: properly handle null in equality operators cql3: expr: remove unneeded overload of equal() cql3: expr: use evaluate(binary_operator) in is_satisfied_by cql3: expr: handle IS NOT NULL when evaluating binary_operator cql3: expr: make it possible to evaluate binary_operator cql3: expr: accept expression as lhs argument to like() cql3: expr: accept expression as lhs in contains_key cql3: expr: accept expression as lhs argument to contains()	2022-11-28 11:30:00 +02:00
Nadav Har'El	1e59c3f9ef	alternator: if TTL scan times out, continue immediately The Alternator TTL expiration scanner scans an entire table using many small pages. If any of those pages time out for some reason (e.g., an overload situation), we currently consider the entire scan to have failed and wait for the next scan period (which by default is 24 hours) when we start the scan from scratch (at a random position). There is a risk that if these timeouts are common enough to occur once or more per scan, the result is that we double or more the effective expiration lag. A better solution, done in this patch, is to retry from the same position if a single page timed out - immediately (or almost immediately, we add a one-second sleep). Fixes #11737 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12092	2022-11-28 11:30:00 +02:00
Avi Kivity	45a57bf22d	Update tools/java submodule (revert scylla-driver) scylla-driver causes dtests to fail randomly (likely due to incorrect handling of the USE statement). Revert it. * tools/java 73422ee114...1c06006447 (2): > Revert "Add Scylla Cloud serverless support" > Revert "Switch cqlsh to use scylla-driver"	2022-11-28 11:29:08 +02:00
Benny Halevy	8f584a9a80	storage_service: handle_state_normal: always update_topology before update_normal_tokens update_normal_tokens checks that the endpoint is in topology. Currently we call update_topology on this path only if it's not a normal_token_owner, but there are paths when the endpoint could be a normal token owner but still be pending in topology so always update it, just in case. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-28 11:25:36 +02:00
Benny Halevy	6b13fd108a	storage_service: handle_state_normal: delete outdated comment regarding update pending ranges race asias@scylladb.com said: > This comments was moved up to the wrong place when tmptr->update_topology was added. > There is no race now since we use the copy-update-replace method to update token_metadada. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-28 11:25:36 +02:00
Kefu Chai	af011aaba1	utils/variant_element: simplify is_variant_element with right fold for better readability than the recursive approach. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes #12091	2022-11-27 16:34:34 +02:00
Avi Kivity	78222ea171	Update tools/java submodule (cqlsh system_distributed_everywhere is a system keyspace) * tools/java 874e2d529b...73422ee114 (1): > Mark "system_distributed_everywhere" as system ks	2022-11-27 15:37:57 +02:00
Aleksandra Martyniuk	9a3d114349	tasks: move methods from task_manager to source file Methods from tasks::task_manager and nested classes are moved to source file. Closes #12064	2022-11-27 15:09:28 +02:00
Piotr Dulikowski	22fbf2567c	utils/abi: don't use the deprecated std::unexpected_handler Recently, clang started complaining about std::unexpected_handler being deprecated: ``` In file included from utils/exceptions.cc:18: ./utils/abi/eh_ia64.hh:26:10: warning: 'unexpected_handler' is deprecated [-Wdeprecated-declarations] std::unexpected_handler unexpectedHandler; ^ /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/exception:84:18: note: 'unexpected_handler' has been explicitly marked deprecated here typedef void (*_GLIBCXX11_DEPRECATED unexpected_handler) (); ^ /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2343:32: note: expanded from macro '_GLIBCXX11_DEPRECATED' ^ /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2334:46: note: expanded from macro '_GLIBCXX_DEPRECATED' ^ 1 warning generated. ``` According to cppreference.com, it was deprecated in C++11 and removed in C++17 (!). This commit gets rid of the warning by inlining the std::unexpected_handler typedef, which is defined as a pointer to a function with 0 arguments, returning void. Fixes: #12022 Closes #12074	2022-11-27 12:25:20 +02:00
Alejo Sanchez	5ff4b8b5f8	pytest: catch rare exception for random tables test On rare occasions a SELECT on a DROPped table throws cassandra.ReadFailure instead of cassandra.InvalidRequest. This could not be reproduced locally. Catch both exceptions as the table is not present anyway and it's correctly marked as a failure. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12027	2022-11-27 10:26:55 +02:00
Michał Chojnowski	a75e4e1b23	db: config: disable global index page caching by default Global index page caching, as introduced in 4.6 (`078a6e422b` and `9f957f1cf9`) has proven to be misdesigned, because it poses a risk of catastrophic performance regressions in common workloads by flooding the cache with useless index entries. Because of that risk, it should be disabled by default. Refs #11202 Fixes #11889 Closes #11890	2022-11-26 14:27:26 +02:00
Aleksandra Martyniuk	c2ea3f49e6	repair: rename methods of repair_module Methods of repair_module connected with repair_module::_repairs are renamed to match repair_module::_repairs type.	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	13dbd75ba8	repair: change type of repair_module::_repairs As a preparation to replacing repair_info with shard_repair_task_impl, type of _repairs in repair module is changed from std::unordered_map<int, lw_shared_ptr<repair_info>> to std::unordered_map<int, tasks::task_id>.	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	55c01a1beb	repair: keep a reference to shard_repair_task_impl in row_level_repair As a part of replacing repair_info with shard_repair_task_impl, instead of a reference to repair_info, row_level_repair keeps a reference to shard_repair_task_impl.	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	9b664570f0	repair: move repair_range method to shard_repair_task_impl	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	3ac5ba7b28	repair: make do_repair_ranges a method of shard_repair_task_impl Function do_repair_ranges is directly connected to shard repair tasks. Turning it into shard_repair_task_impl method enables an access to tasks' members with no additional intermediate layers.	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	a09dfcdacd	repair: copy repair_info methods to shard_repair_task_impl Methods of repair_info are copied to shard_repair_task_impl. They are not used yet, it's a preparation for replacing repair_info with shard_repair_task_impl.	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	a4b1bdb56c	repair: coroutinize shard task creation	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	996c0f3476	repair: define run for shard_repair_task_impl Operations performed as a part of shard repair are moved to shard_repair_task_impl run method.	2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk	ba9770ea02	repair: add shard_repair_task_impl Create a task spanning over a repair performed on a given shard.	2022-11-25 16:40:49 +01:00
Anna Stuchlik	d5f676106e	doc: remove the LWT page from the index of Enterprise features Closes #12076	2022-11-24 21:59:05 +02:00
Aleksandra Martyniuk	dcc17037c7	repair: fix bad cast in tasks::task_id parsing In system_keyspace::get_repair_history value of repair_uuid is got from row as tasks::task_id. tasks::task_id is represented by an abstract_type specific for utils::UUID. Thus, since their typeids differ, bad_cast is thrown. repair_uuid is got from row as utils::UUID and then cast. Since no longer needed, data_type_for<tasks::task_id> is deleted. Fixes: #11966 Closes #12062	2022-11-24 19:37:44 +02:00
Jan Ciolek	77c7d8b8f6	cql-pytest: enable two unset value tests that pass now While implementing evaluate(binary_operator) missing checks for unset value were added for comparisons in filtering code. Because of that some tests for unset value started passing. There are still other tests for unset value that are failing because Scylla doesn't have all the checks that it should. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-24 17:07:17 +01:00
Jan Ciolek	5bc0bc6531	cql-pytest: reduce unset value error message When unset value appears in an invalid place both Cassandra and Scylla throw an error. The tests were written with Cassandra and thus the expected error messages were exactly the same as produced by Cassandra. Scylla produces different error messages, but both databases return messages with the text 'unset value'. Reduce the expected message text from the whole message to something that contains 'unset value'. It would be hard to mimic Cassandra's error messages in Scylla. There is no point in spending time on that. Instead it's better to modify the tests so that they are able to work with both Cassandra and Scylla. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-24 17:04:07 +01:00
Jan Ciolek	08f40a116d	cql3: expr: change unset value error messages to lowercase The messages used to contain UNSET_VALUE in capital letters, but the tests expect messages with 'unset value'. Change the message so that it can match the expected error text in tests. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-24 17:02:44 +01:00
Kamil Braun	fda6403b29	test/topology: simple node replace tests (currently disabled) Add two node replace tests using the freshly added infrastructure. One test replaces a node while using a different IP. It is disabled because the replace operation has an unconditional 60-seconds sleep (it doesn't depend on the ring_delay setting for some reason). The sleep needs to be fixed before we can enable this test. The other test replaces while reusing the replaced node's IP. Additionally to the sleep, the test fails because the node cannot join group 0; it's stuck in an infinite loop of trying to join: ``` INFO 2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found no local group 0. Discovering... INFO 2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader b7047f7e-03e6-4797-a723-24054201f91d INFO 2022-11-18 15:56:19,934 [shard 0] raft_group0 - Server 8de951fd-a528-4a82-ac54-592ea269537f is starting group 0 with id 25d2b050-6751-11ed-b534-c3c40c275dd3 WARN 2022-11-18 15:56:20,935 [shard 0] raft_group0 - failed to modify config at peer b7047f7e-03e6-4797-a723-24054201f91d: seastar::rpc::timeout_error (rpc call timed out). Retrying. INFO 2022-11-18 15:56:21,937 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408 WARN 2022-11-18 15:56:22,937 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying. 
INFO 2022-11-18 15:56:23,938 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408 WARN 2022-11-18 15:56:24,939 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying. ``` and so on.	2022-11-24 16:26:23 +01:00
Kamil Braun	2f60550ff3	test/pylib: scylla_cluster: support node replace operation The `add_server` function now takes an optional `ReplaceConfig` struct (implemented using `NamedTuple`), which specifies the ID of the replaced server and whether to reuse the IP address. If we want to reuse the IP address, we don't allocate one using the host registry. Since now multiple servers can have the same IP, introduce a `leased_ips` set to `ScyllaCluster` which is used when `uninstall`ing the cluster - to make sure we don't `release_host` the same host twice.	2022-11-24 16:26:23 +01:00
Kamil Braun	d80247f912	test/pylib: scylla_cluster: move members initialization to constructor Previously some members had to be initialized in `install` because that's when we first knew the IP address. Now we know the IP address during construction, which allows us to make the code a bit shorter and simpler, and establish invariants: some members (such as `self.config`) are now valid for the entire lifetime of the server object. `install()` is reduced to performing only side effects (creating directories, writing config files), all calculation is done inside the constructor.	2022-11-24 16:26:23 +01:00
Kamil Braun	3934eefd20	test/pylib: scylla_cluster: (re)lease IP addr outside ScyllaServer `ScyllaServer`s were constructed without IP addresses. They leased an IP address from `HostRegistry` and released them in `uninstall`. This responsibility was now moved into `ScyllaCluster`, which leases an IP address for a server before constructing it, and passes it to the constructor. It releases the addresses of its servers when uninstalling itself. This will allow the cluster to reuse the IP address of an existing server in that cluster when adding a new server which wants to replace the existing one. Instead of leasing a new address, it will pass the existing IP address to the new server's constructor. The refactor is also nice in that it establishes an invariant for `ScyllaServer`, simplifying reasoning about the class: now it has an `ip_addr` field at all times. `host_registry` was moved from `ScyllaServer` to `ScyllaCluster`.	2022-11-24 16:26:23 +01:00
Kamil Braun	9d5e1191da	test/pylib: scylla_cluster: refactor create_server parameters to a struct `ScyllaCluster` constructor takes a function `create_server` which itself takes 3 parameters now. Soon it will take a 4th. The list of parameters is repeated at the constructor definition and the call site of the constructor; with many parameters this becomes tiresome. Refactor the list of parameters to a `NamedTuple`.	2022-11-24 16:26:23 +01:00
Kamil Braun	d582666293	test.py: stop/uninstall clusters instead of servers when cleaning up `self.artifacts` was calling `ScyllaServer.stop` and `ScyllaServer.uninstall`. Now it calls `ScyllaCluster.stop` and `ScyllaCluster.uninstall`, which underneath stops/uninstalls servers in this cluster. We must be a bit more careful now in case installing/starting a server inside a cluster fails: there are no server cleanup artifacts, and a server is added to cluster's `running` map only after `install_and_start` finishes (until that happens, `ScyllaCluster.stop/uninstall` won't catch this server). So handle failures explicitly in `install_and_start`. This commit does not logically change how the tests are running - every started server belongs to some cluster, so it will be cleaned up - but it's an important refactor. It will allow us to move IP address (de)allocation code outside `ScyllaServer`, into `ScyllaCluster`, which in turn will allow us to implement node replace operation for the case where we want to reuse the replaced node's IP. Also, `ScyllaCluster.uninstall` was unused before this change, now it's used.	2022-11-24 16:26:17 +01:00
Avi Kivity	29a4b662f8	Merge 'doc: document the Alternator TTL feature as GA' from Anna Stuchlik Currently, TTL is listed as one of the experimental features: https://docs.scylladb.com/stable/alternator/compatibility.html#experimental-api-features This PR moves the feature description from the Experimental Features section to a separate section. I've also added some links and improved the formatting. @tzach I've relied on your release notes for RC1. Refs: https://github.com/scylladb/scylladb/issues/5060 Closes #11997 * github.com:scylladb/scylladb: Update docs/alternator/compatibility.md doc: update the link to Enabling Experimental Features doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph doc: update the links to the Enabling Experimental Features section doc: add the link to the Enabling Experimental Features section doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section	2022-11-24 17:22:05 +02:00
Nadav Har'El	2dedb5ea75	alternator: make Alternator TTL feature no longer "experimental" Until now, the Alternator TTL feature was considered "experimental", and had to be manually enabled on all nodes of the cluster to be usable. This patch removes this requirement and in essence GAs this feature. Even after this patch, Alternator TTL is still a "cluster feature", i.e., for this feature to be usable every node in the cluster needs to support it. If any of the nodes is old and does not yet support this feature, the UpdateTimeToLive request will not be accepted, so although the expiration-scanning threads may exist on the newer nodes, they will not do anything because none of the tables can be marked as having expiration enabled. This patch does not contain documentation fixes - the documentation still suggests that the Alternator TTL feature is experimental. The documentation patch will come separately. Fixes #12037 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12049	2022-11-24 17:21:39 +02:00
Tzach Livyatan	e96d31d654	docs: Add Authentication and Authorization as a prerequisite for Auditing. Closes #12058	2022-11-24 17:21:23 +02:00
Kamil Braun	df731a5b0c	test/pylib: artifact_registry: replace `Awaitable` type with `Coroutine` The `cleanup_before_exit` method of `ArtifactRegistry` calls `close()` on artifacts. mypy complains that `Awaitable` has no such method. In fact, the `artifact` objects that we pass to `ArtifactRegistry` (obtained by calling `async def` functions) do have a `close()` method, and they are a particular case of `Awaitable`s, but in general not all `Awaitable`s have `close()`. Replace `Awaitable` with one of its subtypes: `Coroutine`. `Coroutine`s have a `close()` method, and `async def` functions return objects of this type. mypy no longer complains.	2022-11-24 16:17:05 +01:00
Nadav Har'El	c6bb64ab0e	Merge 'Fix LWT insert crash if clustering key is null' from Gusev Petr [PR](https://github.com/scylladb/scylladb/pull/9314) fixed a similar issue with regular insert statements but missed the LWT code path. It's expected behaviour of `modification_statement::create_clustering_ranges` to return an empty range in this case, since `possible_lhs_values` it uses explicitly returns `empty_value_set` if it evaluates `rhs` to null, and it has a comment about it (All NULL comparisons fail; no column values match.) On the other hand, all components of the primary key are required to be set, this is checked at the prepare phase, in `modification_statement::process_where_clause`. So the only problem was `modification_statement::execute_with_condition` was not expecting an empty `clustering_range` in case of a null clustering key. Also this patch contains a fix for the problem with wrong column name in Scylla error messages. If `INSERT` or `DELETE` statement is missing a non-last element of the primary key, the error message generated contains an invalid column name. The problem occurs if the query contains a column with the list type, otherwise `statement_restrictions::process_clustering_columns_restrictions` checks that all the components of the key are specified. Closes #12047 * github.com:scylladb/scylladb: cql: refactor, inline modification_statement::validate_primary_key_restrictions cql: DELETE with null value for IN parameter should be forbidden cql: add column name to the error message in case of null primary key component cql: batch statement, inserting a row with a null key column should be forbidden cql: wrong column name in error messages modification_statement: fix LWT insert crash if clustering key is null	2022-11-24 16:15:27 +02:00
Nadav Har'El	6e9f739f19	Merge 'doc: add the links to the per-partition rate limit extension ' from Anna Stuchlik Release 5.1 introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE should include links to that page and section. Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution. Related: https://github.com/scylladb/scylladb/pull/9810 Closes #12070 * github.com:scylladb/scylladb: doc: move the info about per-partition rate limit for the ALTER TABLE statement from the paragraph to the list doc: add the links to the per-partition rate limit extension to the CREATE TABLE and ALTER TABLE sections	2022-11-24 16:03:30 +02:00
Anna Stuchlik	8049670772	doc: move the info about per-partition rate limit for the ALTER TABLE statement from the paragraph to the list	2022-11-24 14:42:11 +01:00
Anna Stuchlik	57a58b17a8	doc: enable publishing the documentation for version 5.1 Closes #12059	2022-11-24 13:55:25 +02:00
Benny Halevy	243dc2efce	hints: host_filter: check topology::has_endpoint if enabled_selectively Don't call get_datacenter(ep) without checking first has_endpoint(ep) since the former may abort on internal error if the endpoint is not listed in topology. Refs #11870 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12054	2022-11-24 14:33:06 +03:00
Anna Stuchlik	f158d31e24	doc: add the links to the per-partition rate limit extension to the CREATE TABLE and ALTER TABLE sections	2022-11-24 11:26:33 +01:00
Petr Gusev	b95305ae2b	cql: refactor, inline modification_statement::validate_primary_key_restrictions The function didn't add much value, just forwarded to _restrictions. Removed it and called _restrictions->validate_primary_key directly.	2022-11-23 21:56:12 +04:00
Petr Gusev	f9936bb0cb	cql: DELETE with null value for IN parameter should be forbidden If a DELETE statement contains an IN operator and the parameter value for it is NULL, this should also trigger an error. This is in line with how Cassandra behaves in this case.	2022-11-23 21:39:23 +04:00
Petr Gusev	c123f94110	cql: add column name to the error message in case of null primary key component It's more user-friendly and the error message corresponds to what Cassandra provides in this case.	2022-11-23 21:39:23 +04:00
Petr Gusev	7730c4718e	cql: batch statement, inserting a row with a null key column should be forbidden Regular INSERT statements with null values for primary key components are rejected by Scylla since #9286 and #9314. Batch statements missed a similar check, this patch fixes it. Fixes: #12060	2022-11-23 21:39:23 +04:00
Petr Gusev	89a5397d7c	cql: wrong column name in error messages If INSERT or DELETE statement is missing a non-last element of the primary key, the error message generated contains an invalid column name. The problem occurs if the query contains a column with the list type, otherwise statement_restrictions::process_clustering_columns_restrictions checks that all the components of the key are specified. Fixes: #12046	2022-11-23 21:39:16 +04:00
Benny Halevy	996eac9569	topology: add get_datacenters Returns an unordered set of datacenter names to be used by network_topology_replication_strategy and for ks_prop_defs. The set is kept in sync with _dc_endpoints. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12023	2022-11-23 18:39:36 +02:00
Takuya ASADA	9acdd3af23	dist: drop deprecated AMI parameters on setup scripts Since we moved all IaaS code to scylla-machine-image, we no longer need the AMI variable in the sysconfig file or the --ami parameter on setup scripts, and we also never used /etc/scylla/ami_disabled. So let's drop all of them from Scylla core. Related with scylladb/scylla-machine-image#61 Closes #12043	2022-11-23 17:56:13 +02:00
Avi Kivity	7c66fdcad1	Merge 'Simplify sstable_directory configuration' from Pavel Emelyanov When started the sstable_directory is constructed with a bunch of booleans that control the way its process_sstable_dir method works. It's shorter and simpler to pass these booleans into method directly, all the more so there's another flag that's already passed like this. Closes #12005 * github.com:scylladb/scylladb: sstable_directory: Move all RAII booleans onto flags sstable_directory: Convert sort-sstables argument to flags struct sstable_directory: Drop default filter	2022-11-23 16:16:04 +02:00
Avi Kivity	70bfa708f5	storage_proxy: coroutinize change_hints_host_filter() Trivial straight-line code, no performance implications. Closes #12056	2022-11-23 15:34:24 +02:00
Jan Ciolek	84501851eb	cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected Scylla doesn't support combining restrictions on token with other restrictions on partition key columns. Some pieces of code depend on the assumption that such combinations are allowed. In case they were allowed in the future these functions would silently start returning wrong results, and we would return invalid rows. Add a test that will start failing once this restriction is removed. It will warn the developer to change the functions that used to depend on the assumption. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 13:09:22 +01:00
Botond Dénes	602dfdaf98	Merge 'Task manager top level repair tasks' from Aleksandra Martyniuk The PR introduces top level repair tasks representing repair and node operations performed with repair. The actions performed as a part of these operations are moved to corresponding tasks' run methods. Also a small change to repair module is added. Closes #11869 * github.com:scylladb/scylladb: repair: define run for data_sync_repair_task_impl repair: add data_sync_repair_task_impl tasks: repair: add noexcept to task impl constructor repair: define run for user_requested_repair_task_impl repair: add user_requested_repair_task_impl repair: allow direct access to max_repair_memory_per_range	2022-11-23 14:02:30 +02:00
Jan Ciolek	338af848a8	cql3: expr: remove needless braces around switch cases Originally put braces around the cases because there were local variables that I didn't want to be shadowed. Now there are no variables so the braces can be removed without any problems. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:30 +01:00
Jan Ciolek	e8a46d34c2	cql3: move evaluation IS_NOT NULL to a separate function When evaluating a binary operation with operations like EQUAL, LESS_THAN, IN the logic of the operation is put in a separate function to keep things clean. IS_NOT NULL is the only exception, it has its evaluate implementation right in the evaluate(binary_operator) function. It would be cleaner to have it in a separate dedicated function, so it's moved to one. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:30 +01:00
Jan Ciolek	b6cf6e6777	expr_test: test evaluating LIKE binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:29 +01:00
Jan Ciolek	6774272fd6	expr_test: test evaluating IS_NOT binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:29 +01:00
Jan Ciolek	e6c78bb6c2	expr_test: test evaluating CONTAINS_KEY binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:29 +01:00
Jan Ciolek	4f250609ab	expr_test: test evaluating CONTAINS binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:29 +01:00
Jan Ciolek	3ca04cfcc2	expr_test: test evaluating IN binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:28 +01:00
Jan Ciolek	41f452b73f	expr_test: test evaluating GTE binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:28 +01:00
Jan Ciolek	1fe9a9ce2a	expr_test: test evaluating GT binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:28 +01:00
Jan Ciolek	ef2a77a3e0	expr_test: test evaluating LTE binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:28 +01:00
Jan Ciolek	3cbb2d44e8	expr_test: test evaluating LT binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:27 +01:00
Jan Ciolek	9feee70710	expr_test: test evaluating NEQ binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:27 +01:00
Jan Ciolek	e77dba0b0b	expr_test: test evaluating EQ binary_operator Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:27 +01:00
Jan Ciolek	63a89776a1	cql3: expr properly handle null in is_one_of() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:27 +01:00
Jan Ciolek	214dab9c77	cql3: expr properly handle null in like() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:26 +01:00
Jan Ciolek	2ce9c95a9d	cql3: expr properly handle null in contains_key() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:26 +01:00
Jan Ciolek	336ad61aa3	cql3: expr properly handle null in contains() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:26 +01:00
Jan Ciolek	e2223be1ec	cql3: expr: properly handle null in limits() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:26 +01:00
Jan Ciolek	d1abf2e168	cql3: expr: remove unneeded overload of limits() There is a more general version of limits() which takes expressions as both the lhs and rhs arguments. There is no need for a specialized overload. This specialized overload takes a tuple_constructor as lhs, but we call evaluate() on both sides of a binary operator before checking equality, so this won't be useful at all. Having multiple functions increases the risk that one of them has a bug, while giving dubious benefit. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:25 +01:00
Jan Ciolek	0609a425e6	cql3: expr: properly handle null in equality operators Expressions like: 123 = NULL NULL = 123 NULL = NULL NULL != 123 should be tolerated, but evaluate to NULL. The current code assumes that a binary operator can only evaluate to a boolean - true or false. Now a binary operator can also evaluate to NULL. This should happen in cases when one of the operator's sides is NULL. A special class is introduced to represent a value that can be one of three things: true, false or null. It's better than using std::optional<bool>, because optional has implicit conversions to bool that could cause confusion and bugs. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-23 12:44:22 +01:00
Aleksandra Martyniuk	a3016e652f	repair: define run for data_sync_repair_task_impl Operations performed as a part of data sync repair are moved to data_sync_repair_task_impl run method.	2022-11-23 10:44:19 +01:00
Aleksandra Martyniuk	42239c8fed	repair: add data_sync_repair_task_impl Create a task spanning over whole node operation. Tasks of that type are stored on shard 0.	2022-11-23 10:19:53 +01:00
Aleksandra Martyniuk	9e108a2490	tasks: repair: add noexcept to task impl constructor Add noexcept to constructor of tasks::task_manager::task::impl and inheriting classes.	2022-11-23 10:19:53 +01:00
Aleksandra Martyniuk	4a4e9c12df	repair: define run for user_requested_repair_task_impl Operations performed as a part of user requested repair are moved to user_requested_repair_task_impl run method.	2022-11-23 10:19:51 +01:00
Aleksandra Martyniuk	3800b771fc	repair: add user_requested_repair_task_impl Create a task spanning over whole user requested repair. Tasks of that type are stored on shard 0.	2022-11-23 10:11:09 +01:00
Aleksandra Martyniuk	0256ede089	repair: allow direct access to max_repair_memory_per_range Access specifier of constexpr value max_repair_memory_per_range in repair_module is changed to public and its getter is deleted.	2022-11-23 10:11:09 +01:00
Anna Stuchlik	16e2b9acd4	Update docs/alternator/compatibility.md Co-authored-by: Daniel Lohse <info@asapdesign.de>	2022-11-23 09:51:04 +01:00
Avi Kivity	d7310fd083	gdb: messaging: print tls servers too Many systems have most traffic on tls servers, so print them. Closes #12053	2022-11-23 07:59:02 +02:00
Avi Kivity	aec9faddb1	Merge 'storage_proxy: use erm topology' from Benny Halevy When processing a query, we keep a pointer to an effective_replication_map. In a couple places we used the latest topology instead of the one held by the effective_replication_map that the query uses and that might lead to inconsistencies if, for example, a node is removed from topology after decommission that happens concurrently to the query. This change gets the topology& from the e_r_m in those cases. Fixes #12050 Closes #12051 * github.com:scylladb/scylladb: storage_proxy: pass topology& to sort_endpoints_by_proximity storage_proxy: pass topology& to is_worth_merging_for_range_query	2022-11-22 20:04:41 +02:00
Botond Dénes	49ec7caf27	mutation_fragment_stream_validator: avoid allocation when stream is correct Currently the ctor of said class always allocates as it copies the provided name string and it creates a new name via format(). We want to avoid this, now that the validator is used on the read path. So defer creating the formatted name to when we actually want to log something, which is either when log level is debug or when an error is found. We don't care about performance in either case, but we do care about it on the happy path. Further to the above, provide a constructor for string literal names and when this is used, don't copy the name string, just save a view to it. Refs: #11174 Closes #12042	2022-11-22 19:19:18 +02:00
Nadav Har'El	ce7c1a6c52	Merge 'alternator: fix wrong 'where' condition for GSI range key' from Marcin Maliszkiewicz Contains fixes requested in the issue (and some tiny extras), together with analysis why they don't affect the users (see commit messages). Fixes [ #11800](https://github.com/scylladb/scylladb/issues/11800) Closes #11926 * github.com:scylladb/scylladb: alternator: add maybe_quote to secondary indexes 'where' condition test/alternator: correct xfail reason for test_gsi_backfill_empty_string test/alternator: correct indentation in test_lsi_describe alternator: fix wrong 'where' condition for GSI range key	2022-11-22 17:46:52 +02:00
Pavel Emelyanov	22133a3949	sstable_directory: Move all RAII booleans onto flags There's a bunch of booleans that control the behavior of sstable directory scanning. Currently they are described as verbose bool_class<>-es and are put into sstable_directory construction time. However, these are not used outside of .process_sstable_dir() method and moving them onto recently added flags struct makes the code much shorter (29 insertions(+), 121 deletions(-)) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-22 18:30:00 +03:00
Pavel Emelyanov	7ca5e143d7	sstable_directory: Convert sort-sstables argument to flags struct The sstable_directory::process_sstable_dir() accepts a boolean to control its behavior when collecting sstables. Turn this boolean into a structure of flags. The intention is to extend this flags set in the future (next patch). This boolean is true all the time, but one place sets it to true in a "verbose" manner, like this: bool sort_sstables_according_to_owner = false; process_sstable_dir(directory, sort_sstables_according_to_owner).get(); the local variable is not used anymore. Using designated initializers solves the verbosity in a nicer manner. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-22 18:19:23 +03:00
Pavel Emelyanov	7c7017d726	sstable_directory: Drop default filter It's used as default argument for .reshape() method, but callers specify it explicitly. At the same time the filter is simple enough and is only used in one place so that the caller can just use explicit lambda. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-22 18:19:23 +03:00
Jan Ciolek	6be142e3a0	cql3: expr: remove unneeded overload of equal() There is a more general version of equal() which takes expressions as both the lhs and rhs arguments. There is no need for a specialized overload. This specialized overload takes a tuple_constructor as lhs, but we call evaluate() on both sides of a binary operator before checking equality, so this won't be useful at all. Having multiple functions increases the risk that one of them has a bug, while giving dubious benefit. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-22 14:28:10 +01:00
Benny Halevy	731a74c71f	storage_proxy: pass topology& to sort_endpoints_by_proximity It mustn't use the latest topology that may differ from the one used by the query as it may be missing nodes (e.g. after concurrent decommission). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-22 15:02:40 +02:00
Benny Halevy	ab3fc1e069	storage_proxy: pass topology& to is_worth_merging_for_range_query It mustn't use the latest topology that may differ from the one used by the query as it may be missing nodes (e.g. after concurrent decommission). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-22 15:01:58 +02:00
Petr Gusev	0d443dfd16	modification_statement: fix LWT insert crash if clustering key is null PR #9314 fixed a similar issue with regular insert statements but missed the LWT code path. It's expected behaviour of modification_statement::create_clustering_ranges to return an empty range in this case, since possible_lhs_values it uses explicitly returns empty_value_set if it evaluates rhs to null, and it has a comment about it (All NULL comparisons fail; no column values match.) On the other hand, all components of the primary key are required to be set, this is checked at the prepare phase, in modification_statement::process_where_clause. So the only problem was modification_statement::execute_with_condition was not expecting an empty clustering_range in case of a null clustering key. Fixes: #11954	2022-11-22 16:45:16 +04:00
Marcin Maliszkiewicz	2bf2ffd3ed	alternator: add maybe_quote to secondary indexes 'where' condition This bug doesn't affect anything, the reason is described in the commit: 'alternator: fix wrong 'where' condition for GSI range key'. But it's theoretically correct to escape those key names and the difference can be observed via CQL's describe table. Before the patch 'where' condition is missing one double quote in variable name making it mismatched with corresponding column name.	2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz	4389baf0d9	test/alternator: correct xfail reason for test_gsi_backfill_empty_string Previously cited issue is closed already.	2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz	59eca20af1	test/alternator: correct indentation in test_lsi_describe Otherwise I think assert is not executed in a loop. And I am not sure why lsi variable can be bound to anything. As I tested it was pointing to the last element in lsis...	2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz	d6d20134de	alternator: fix wrong 'where' condition for GSI range key This bug doesn't manifest in a visible way to the user. Adding the index to an existing table via GlobalSecondaryIndexUpdates is not supported so we don't need to consider what could happen for empty values of index range key. After the index is added the only interesting value user can set is omitting the value (null or empty are not allowed, see test_gsi_empty_value and test_gsi_null_value). In practice, regardless of the 'where' condition, the underlying materialized view code is skipping row updates with missing keys as per this comment: 'If one of the key columns is missing, set has_new_row = false meaning that after the update there will be no view row'. That's why the added test passes both before and after the patch. But it's still useful to include it to exercise those code paths. Fixes #11800	2022-11-22 11:08:23 +01:00
Nadav Har'El	ff617c6950	cql-pytest: translate a few small Cassandra tests This patch includes a translation of several additional small test files from Cassandra's CQL unit test directory cql3/validation/operations. All tests included here pass on both Cassandra and Scylla, so they did not discover any new Scylla bugs, but can be useful in the future as regression tests. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12045	2022-11-22 07:54:13 +02:00
Botond Dénes	f3eecb47f6	Merge 'Optimize cleanup compaction get ranges for invalidation' from Benny Halevy Take advantage of the facts that both the owned ranges and the initial non_owned_ranges (derived from the set of sstables) are deoverlapped and sorted by start token to turn the calculation of the final non_owned_ranges from quadratic to linear. Fixes #11922 Closes #11903 * github.com:scylladb/scylladb: dht: optimize subtract_ranges compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys	2022-11-22 06:45:01 +02:00
Jan Ciolek	a1407ef576	cql3: expr: use evaluate(binary_operator) in is_satisfied_by is_satisfied_by has to check if a binary_operator is satisfied by some values. It used to be impossible to evaluate a binary_operator, so is_satisfied_by had code to check if it's satisfied for a limited number of cases occurring when filtering queries. Now evaluate(binary_operator) has been implemented and is_satisfied_by can use it to check if a binary_operator evaluates to true. This is cleaner and reduces code duplication. Additionally cql tests will test the new evaluate() implementation. There is one special case with token(). When is_satisfied_by sees a restriction on token it assumes that it's satisfied because it's sure that these token restrictions were used to generate partition ranges. I had to leave this special case in because it's impossible to evaluate(token). Once this is implemented I will remove the special case because it's risky and prone to cause bugs. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-21 20:40:06 +01:00
Jan Ciolek	9c4889ecc3	cql3: expr: handle IS NOT NULL when evaluating binary_operator The code to evaluate binary operators was copied from is_satisfied_by. is_satisfied_by wasn't able to evaluate IS NOT NULL restrictions, so when such restriction is encountered it throws an exception. Implement proper handling for IS NOT NULL binary operators. The switch ensures that all variants of oper_t are handled, otherwise there would be a compilation error. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-21 20:40:00 +01:00
Avi Kivity	bf2e54ff85	Merge 'Move deletion log code to sstable_directory.cc' from Pavel Emelyanov In order to support different storage kinds for sstable files (e.g. -- s3) it's needed to localize all the places that manipulate files on a POSIX filesystem so that custom storage could implement them in its own way. This set moves the deletion log manipulations to the sstable_directory.cc, which already "knows" that it works over a directory. Closes #12020 * github.com:scylladb/scylladb: sstables: Delete log file in replay_pending_delete_log() sstables: Move deletion log manipulations to sstable_directory.cc sstables: Open-code delete_sstables() call sstables: Use fs::path in replay_pending_delete_log() sstables: Indentation fix after previous patch sstables: Coroutinize replay_pending_delete_log sstables: Read pending delete log with one line helper sstables: Dont write pending log with file_writer	2022-11-21 21:22:59 +02:00
Jan Ciolek	b4cc92216b	cql3: expr: make it possible to evaluate binary_operator evaluate() takes an expression and evaluates it to a constant value. It wasn't possible to evaluate binary operators before, so it's added. The code is based on is_satisfied_by, which is currently used to check whether a binary operator evaluates to true or false. It looks like is_satisfied_by and evaluate() do pretty much the same thing, one could be implemented using the other. In the future they might get merged into a single function. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-21 17:48:23 +01:00
Jan Ciolek	8d81eaa68f	cql3: expr: accept expression as lhs argument to like() like() used to only accept column_value as the lhs to evaluate. Changed it to accept any generic expression. This will allow to evaluate a more diverse set of binary operators. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-21 16:33:18 +01:00
Jan Ciolek	b1a12686dc	cql3: expr: accept expression as lhs in contains_key contains_key() used to only accept column_value as the lhs to evaluate. Changed it to accept any generic expression. This will allow to evaluate a more diverse set of binary operators. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-21 16:33:02 +01:00
Jan Ciolek	79cd9cd956	cql3: expr: accept expression as lhs argument to contains() contains() used to only accept column_value as the lhs to evaluate. Changed it to accept any generic expression. This will allow to evaluate a more diverse set of binary operators. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-21 16:32:44 +01:00
Benny Halevy	57ff3f240f	dht: optimize subtract_ranges Take advantage of the fact that both ranges and ranges_to_subtract are deoverlapped and sorted by start token to reduce the calculation complexity from quadratic to linear. Fixes #11922 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-21 15:48:28 +02:00
Benny Halevy	8b81635d95	compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation The algorithm is generic and can be used elsewhere. Add a unit test for the function before it gets optimized in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-21 15:48:26 +02:00
Benny Halevy	7c6f60ae72	compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys Currently, the function is inefficient in two ways: 1. unnecessary copy of first/last keys to automatic variables 2. redecorating the partition keys with the schema passed to needs_cleanup. We can just use the tokens from the sstable first/last decorated keys. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-21 15:44:32 +02:00
Pavel Emelyanov	2f9b7931af	sstables: Delete log file in replay_pending_delete_log() It's natural that the replayer cleans up after itself Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:16:22 +03:00
Pavel Emelyanov	bdc47b7717	sstables: Move deletion log manipulations to sstable_directory.cc The deletion log concept uses the fact that files are on a POSIX filesystem. Support for another storage type will have to reimplement this place, so keep the FS-specific code in _directory.cc file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:16:21 +03:00
Pavel Emelyanov	865c51c6cf	sstables: Open-code delete_sstables() call It's not used by any other code, but to be used it requires the caller to transform TOC file names by prepending sstable directory to them. Things get shorter and simpler if merging the helper code into the caller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	a61c96a627	sstables: Use fs::path in replay_pending_delete_log() It's called by a code that has fs::path at hand and internally uses helpers that need fs::path too, so no need to convert it back and forth. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	f5684bcaf0	sstables: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	85a73ca9c6	sstables: Coroutinize replay_pending_delete_log Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	6f3fd94162	sstables: Read pending delete log with one line helper There's one in seastar since recently Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	2dedf4d03a	sstables: Dont write pending log with file_writer It's a wrapper over output_stream with offset tracking and the tracking is not needed to generate a log file. As a bonus of switching back we get a stream.write(sstring) sugar. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:24 +03:00
Botond Dénes	2d4439a739	Merge 'doc: add a troubleshooting article about the missing configuration files' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11598 This PR adds the troubleshooting article submitted by @syuu1228 in the deprecated _scylla-docs_ repo, with https://github.com/scylladb/scylla-docs/pull/4152. I copied and reorganized the content and rewrote it a little according to the RST guidelines so that the page renders correctly. @syuu1228 Could you review this PR to make sure that my changes didn't distort the original meaning? Closes #11626 * github.com:scylladb/scylladb: doc: apply the feedback to improve clarity doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB doc: add the new page to the toctree doc: add a troubleshooting article about the missing configuration files	2022-11-21 12:02:31 +02:00
Kamil Braun	135eb4a041	test.py: prepare for adding extra config from test when creating servers We will use this for replace operations to pass the IP of replaced node.	2022-11-21 10:57:03 +01:00
Kamil Braun	ac91e9d8be	test/pylib: manager_client: convert `add_server` to use `put_json` We shall soon pass some JSON data into these requests.	2022-11-21 10:57:03 +01:00
Kamil Braun	82eb9af80d	test/pylib: rest_client: allow returning JSON data from `put_json` We'll use `put_json` for requests which want to pass JSON data into the call and also return JSON.	2022-11-21 10:57:03 +01:00
Kamil Braun	4fef2d099b	test/pylib: scylla_cluster: don't import from manager_client There's a logical dependency from `manager_client` to `scylla_cluster` (`ManagerClient` defined in `manager_client` talks to `ScyllaClusterManager` defined in `scylla_cluster` over RPC). There is no such dependency in the other way. Do not introduce it accidentally. We can import these types from the `internal_types` module.	2022-11-21 10:57:03 +01:00
Nadav Har'El	757d2a4c02	test/alternator: un-xfail a test which passes on modern Python We had an xfailing test that reproduced a case where Alternator tried to report an error when the request was too long, but the boto library didn't see this error and threw a "Broken Pipe" error instead. It turns out that this wasn't a Scylla bug but rather a bug in urllib3, which overzealously reported a "Broken Pipe" instead of trying to read the server's response. It turns out this issue was already fixed in https://github.com/urllib3/urllib3/pull/1524 and now, on modern installations, the test that used to fail now passes and reports "XPASS". So in this patch we remove the "xfail" tag, and skip the test if running an old version of urllib3. Fixes #8195 Closes #12038	2022-11-21 08:10:10 +02:00
Botond Dénes	ffc3697f2f	Merge 'storage_service api: handle dropped tables' from Benny Halevy Gracefully skip tables that were removed in the background. Fixes #12007 Closes #12013 * github.com:scylladb/scylladb: api: storage_service: fixup indentation api: storage_service: add run_on_existing_tables api: storage_service: add parse_table_infos api: storage_service: log errors from compaction related handlers api: storage_service: coroutinize compaction related handlers	2022-11-21 07:56:27 +02:00
Avi Kivity	994603171b	Merge 'Add validator to the mutation compactor' from Botond Dénes Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written. This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too. This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends is depends entirely on how the consumer reacts to it, so no such bug is actually benign. Fixes: https://github.com/scylladb/scylladb/issues/11174 Closes #11532 * github.com:scylladb/scylladb: mutation_compactor: add validator mutation_fragment_stream_validator: add a 'none' validation level test/boost/mutation_query_test: test_partition_limit: sort input data querier: consume_page(): use partition_start as the sentinel value treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{} treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} position_in_partition: add for_partition_{start,end}()	2022-11-20 20:33:26 +02:00
Avi Kivity	779b01106d	Merge 'cql3: expr: add unit tests for prepare_expression' from Jan Ciołek Adds unit tests for the function `expr::prepare_expression`. Three minor bugs were found by these tests, all fixed in this PR. 1. When preparing a map, the type for tuple constructor was taken from an unprepared tuple, which has `nullptr` as its type. 2. Preparing an empty nonfrozen list or set resulted in `null`, but preparing a map didn't. Fixed this inconsistency. 3. Preparing a `bind_variable` with `nullptr` receiver was allowed. The `bind_variable` ended up with a `nullptr` type, which is incorrect. Changed it to throw an exception. Closes #11941 * github.com:scylladb/scylladb: test preparing expr::usertype_constructor expr_test: test that prepare_expression checks style_type of collection_constructor expr_test: test preparing expr::collection_constructor for map prepare_expr: make preparing nonfrozen empty maps return null prepare_expr: fix a bug in map_prepare_expression expr_test: test preparing expr::collection_constructor for set expr_test: test preparing expr::collection_constructor for list expr_test: test preparing expr::tuple_constructor expr_test: test preparing expr::untyped_constant expr_test_utils: add make_bigint_raw/const expr_test_utils: add make_tinyint_raw/const expr_test: test preparing expr::bind_variable cql3: prepare_expr: forbid preparing bind_variable without a receiver expr_test: test preparing expr::null expr_test: test preparing expr::cast expr_test_utils: add make_receiver expr_test_utils: add make_smallint_raw/const expr_test: test preparing expr::token expr_test: test preparing expr::subscript expr_test: test preparing expr::column_value expr_test: test preparing expr::unresolved_identifier expr_test_utils: mock data_dictionary::database	2022-11-20 20:03:54 +02:00
Nadav Har'El	2ba8b8d625	test/cql-pytest: remove "xfail" from passing test testIndexOnFrozenCollectionOfUDT We had a test that used to fail because of issue #8745. But this issue was already fixed, and we forgot to remove the "xfail" marker. The test now passes, so let's remove the xfail marker. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12039	2022-11-20 19:54:59 +02:00
Avi Kivity	40f61db120	Merge 'docs: describe the Raft upgrade and recovery procedures' from Kamil Braun Add new guide for upgrading 5.1 to 5.2. In this new upgrade doc, include additional steps for enabling Raft using the `consistent_cluster_management` flag. Note that we don't have this flag yet but it's planned to replace the experimental flag in 5.2. In the "Raft in ScyllaDB" document, add sections about: - enabling Raft in existing clusters in Scylla 5.2, - verifying that the internal Raft upgrade procedure finishes successfully, - recovering from a stuck Raft upgrade procedure or from a majority loss situation. Fix some problems in the documentation, e.g. it is not possible to enable Raft in an existing cluster in 5.0, but the documentation claimed that it is. Follow-up items: - if we decide for a different name for `consistent_cluster_management`, use that name in the docs instead - update the warnings in Scylla to link to the Raft doc - mention Enterprise versions once we know the numbers - update the appropriate upgrade docs for Enterprise versions once they exist Closes #11910 * github.com:scylladb/scylladb: docs: describe the Raft upgrade and recovery procedures docs: add upgrade guide 5.1 -> 5.2	2022-11-20 19:00:23 +02:00
Avi Kivity	15ee8cfc05	Merge 'reader_concurrency_semaphore: fix waiter/inactive race' from Botond Dénes We recently (in `7fbad8de87`) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism however had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added, without the fiber acting on it. Since it still held onto the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate. This series fixes this by ensuring the lock is released in the instant the fiber decides there is no more work to do. It also fixes the assert failure on recursive eviction and adds a detection to the inactive/waiter contradiction. Fixes: #11923 Refs: #11770 Closes #12026 * github.com:scylladb/scylladb: reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach	2022-11-20 18:51:34 +02:00
Avi Kivity	895d721d5e	Merge 'scylla-sstable: data-dump improvements' from Botond Dénes This series contains a mixed bag of improvements to `scylla sstable dump-data`. These improvements are mostly aimed at making the json output clearer, getting rid of any ambiguities. Closes #12030 * github.com:scylladb/scylladb: tools/scylla-sstable: traverse sstables in argument order tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>" tools/scylla-sstable: dump-data/json: use more uniform format for collections tools/scylla-sstable: dump-data/json: make cells easier to parse	2022-11-20 17:02:27 +02:00
Avi Kivity	2f9c53fbe4	Merge 'test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address' from Kamil Braun Since recently the framework uses a separate set of unique IDs to identify servers, but the log file and workdir is still named using the last part of the IP address. This is confusing: the test logs sometimes don't provide the IP addr (only the ID), and even if they do, the reader of the test log may not know that they need to look at the last part of the IP to find the node's log/workdir. Also using ID will be necessary if we want to reuse IP addresses (e.g. during node replace, or simply not to run out of IP addresses during testing). So use the ID instead to name the workdir and log file. Also, when starting a test case, print the used cluster. This will make it easier to map server IDs to their IP addresses when browsing through the test logs. Closes #12018 * github.com:scylladb/scylladb: test/pylib: manager_client: print used cluster when starting test case test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address	2022-11-20 16:56:19 +02:00
Avi Kivity	14218d82d6	Update tools/java submodule (serverless) * tools/java caf754f243...874e2d529b (2): > Add Scylla Cloud serverless support > Switch cqlsh to use scylla-driver	2022-11-20 16:41:36 +02:00
Tomasz Grabiec	c8e983b4aa	test: flat_mutation_reader_assertions: Use fatal BOOST_REQUIRE_EQUAL instead of BOOST_CHECK_EQUAL BOOST_CHECK_EQUAL is a weaker form of assertion, it reports an error and will cause the test case to fail but continues. This makes the test harder to debug because there's no obvious way to catch the failure in GDB and the test output is also flooded with things which happen after the failed assertion. Message-Id: <20221119171855.2240225-1-tgrabiec@scylladb.com>	2022-11-20 16:14:26 +02:00
Nadav Har'El	2d2034ea28	Merge 'cql3: don't ignore other restrictions when a multi column restriction is present during filtering' from Jan Ciołek When filtering with multi column restriction present all other restrictions were ignored. So a query like: `SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;` would ignore the restriction `regular_col = 0`. This was caused by a bug in the filtering code: `2779a171fc/cql3/selection/selection.cc (L433-L449)` When multi column restrictions were detected, the code checked if they are satisfied and returned immediately. This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied. This code was introduced back in 2019, when fixing #3574. Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct. Fixes: #6200 Fixes: #12014 Closes #12031 * github.com:scylladb/scylladb: cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works boost/restrictions-test: uncomment part of the test that passes now cql-pytest: enable test for filtering combined multi column and regular column restrictions cql3: don't ignore other restrictions when a multi column restriction is present during filtering	2022-11-20 11:50:38 +02:00
Benny Halevy	ec5707a4a8	api: storage_service: fixup indentation	2022-11-20 09:14:45 +02:00
Benny Halevy	cc63719782	api: storage_service: add run_on_existing_tables Gracefully skip tables that were removed in the background. Fixes #12007 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-20 09:14:29 +02:00
Benny Halevy	9ef9b9d1d9	api: storage_service: add parse_table_infos The table UUIDs are the same on all shards so we might as well get them on shard 0 (as we already do) and reuse them on other shards. It is more efficient and accurate to lookup the table eventually on the shard using its uuid rather than its name. If the table was dropped and recreated using the same name in the background, the new table will have a new uuid and so the api function does not apply to it anymore. A following change will handle the no_such_column_family cases. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-20 09:14:21 +02:00
Benny Halevy	9b4a9b2772	api: storage_service: log errors from compaction related handlers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-20 09:03:25 +02:00
Benny Halevy	a47f96bc05	api: storage_service: coroutinize compaction related handlers Before we improve parsing tables lists and handling of no_such_column_family errors. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-20 09:03:25 +02:00
Jan Ciolek	286f182a8c	cql-pytest: add a reproducer for #12014 , verify that filtering multi column and regular restrictions works In issue #12014 a user has encountered an instance of #6200. When filtering a WHERE clause which contained both multi-column and regular restrictions, the regular restrictions were ignored. Add a test which reproduces the issue using a reproducer provided by the user. This problem is tested in another similar test, but this one reproduces the issue in the exact way it was found by the user. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-18 15:27:42 +01:00
Jan Ciolek	63fb2612c3	boost/restrictions-test: uncomment part of the test that passes now A part of the test was commented out due to #6200. Now #6200 has been fixed and it can be uncommented. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-18 15:14:32 +01:00
Jan Ciolek	99e1032e34	cql-pytest: enable test for filtering combined multi column and regular column restrictions The test test_multi_column_restrictions_and_filtering was marked as xfail, because issue #6200 wasn't fixed. Now that filtering multi column and other restrictions together has been fixed the test passes. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-18 15:14:32 +01:00
Jan Ciolek	b974d4adfb	cql3: don't ignore other restrictions when a multi column restriction is present during filtering When filtering with multi column restriction present all other restrictions were ignored. So a query like: `SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;` would ignore the restriction `regular_col = 0`. This was caused by a bug in the filtering code: `2779a171fc/cql3/selection/selection.cc (L433-L449)` When multi column restrictions were detected, the code checked if they are satisfied and returned immediately. This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied. This code was introduced back in 2019, when fixing #3574. Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct. Fixes: #6200 Fixes: #12014 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-18 15:14:16 +01:00
Botond Dénes	30597f17ed	tools/scylla-sstable: traverse sstables in argument order In the order the user passed them on the command-line.	2022-11-18 15:58:37 +02:00
Botond Dénes	e337b25aa9	tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements The usage of clustering_fragments is a typo, the output contains clustering_elements.	2022-11-18 15:58:36 +02:00
Botond Dénes	c39408b394	tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>" The currently used "<unknown>" marker for invalid values/types is indistinguishable from a normal value in some cases. Use the much more distinct and unique json Null instead.	2022-11-18 15:58:36 +02:00
Botond Dénes	1dfceb5716	tools/scylla-sstable: dump-data/json: use more uniform format for collections Instead of trying to be clever and switching the output on the type of collection, use the same format always: a list of objects, where the object has a key and value attribute, containing the respective collection item key and value. This makes processing much easier for machines (and humans too since the previous system wasn't working well).	2022-11-18 15:58:36 +02:00
Botond Dénes	f89acc8df7	tools/scylla-sstable: dump-data/json: make cells easier to parse There are several slightly different cell types in scylla: regular cells, collection cells (frozen and non-frozen) and counter cells (update and shards). In C++ code the type of the cell is always available for code wishing to make out exactly what kind of cell a cell is. In the JSON output of the dump-data this is currently really hard to do as there is not enough information to disambiguate all the different cell types. We wish to make the JSON output self-sufficient so in this patch we introduce a "type" field which contains one of: * regular * counter-update * counter-shards * frozen-collection * collection Furthermore, we bring the different types closer by also printing the counter shards under the 'value' key, not under the 'shards' key as before. The separate 'shards' is no longer needed to disambiguate. The documentation and the write operation is also updated to reflect the changes.	2022-11-18 15:58:36 +02:00
Petr Gusev	41629e97de	test.py: handle --markers parameter Some tests may take longer than a few seconds to run. We want to mark such tests in some way, so that we can run them selectively. This patch proposes to use pytest markers for this. The markers from the test.py command line are passed to pytest as is via the -m parameter. By default, the marker filter is not applied and all tests will be run without exception. To exclude e.g. slow tests you can write --markers 'not slow'. The --markers parameter is currently only supported by Python tests, other tests ignore it. We intend to support this parameter for other types of tests in the future. Another possible improvement is not to run suites for which all tests have been filtered out by markers. The markers are currently handled by pytest, which means that the logic in test.py (e.g., running a scylla test cluster) will be run for such suites. Closes #11713	2022-11-18 12:36:20 +01:00
Avi Kivity	7da12c64bc	Revert "Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity"" This reverts commit `22f13e7ca3`, and reinstates commit `df8e1da8b2` ("Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity"). The original commit was reverted due to failures in debug mode on aarch64, but after commit `224a2877b9` ("build: disable -Og in debug mode to avoid coroutine asan breakage"), it works again. Closes #12021	2022-11-18 12:44:00 +02:00
Kamil Braun	d7649a86c4	Merge 'Build up to support of dynamic IP address changes in Raft' from Konstantin Osipov We plan to stop storing IP addresses in Raft configuration, and instead use the information disseminated through gossip to locate Raft peers. Implement patches that are building up to that: * improve Raft API of configuration change notifications * disseminate raft host id in Gossip * avoid using Raft addresses from Raft configuration, and instead consistently use the translation layer between raft server id <-> IP address Closes #11953 * github.com:scylladb/scylladb: raft: persist the initial raft address map raft: (upgrade) do not use IP addresses from Raft config raft: (and gossip) begin gossiping raft server ids raft: change the API of conf change notifications	2022-11-18 11:38:19 +01:00
Botond Dénes	437fcdeeda	Merge 'Make use of enum_set in directory lister' from Pavel Emelyanov The lister accepts sort of a filter -- what kind of entries to list, regular, directories or both. It currently uses unordered_set, but enum_set is shorter and better describes the intent. Closes #12017 * github.com:scylladb/scylladb: lister: Make lister::dir_entry_types an enum_set database: Avoid useless local variable	2022-11-18 12:15:26 +02:00
Botond Dénes	b39ca29b3c	reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly The semaphore should admit readers as soon as it can. So at any point in time there should be either no waiters, or the semaphore shouldn't be able to admit new reads. Otherwise something went wrong. Detect this when queuing up reads and dump the diagnostics if detected. Even though tests should ensure this should never happen, recently we've seen a race between eviction and enqueuing producing such situations. This is very hard to write tests for, so add built-in detection and protection instead. Detecting this is very cheap anyway.	2022-11-18 11:35:47 +02:00
Botond Dénes	ca7014ddb8	reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot Said method has a protection against concurrent (recursive more like) calls to itself, by setting a flag `_evicting` and returning early if this flag is set. The evicting loop however has at least one preemption point between deciding there is nothing more to evict and resetting said flag. This window provides opportunity for new inactive reads or waiters to be queued without this loop noticing, while denying any other concurrent invocations at that time from reacting too. Eliminate this by using repeat() instead of do_until() and setting `_evicting = false` the moment the loop's run condition becomes false.	2022-11-18 11:35:47 +02:00
Botond Dénes	892f52c683	reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach Currently this method detaches the inactive read from the handle and notifies the permit, calls the notify handler if any and does some stat bookkeeping. Extend it to do a complete detach: unlink the entry from the inactive reads list and also cancel the ttl timer. After this, all that is left to the caller is to destroy the entry. This will prevent any recursive eviction from causing assertion failure. Although recursive eviction shouldn't happen, it shouldn't trigger an assert.	2022-11-18 11:35:43 +02:00
Pavel Emelyanov	a44ca06906	Merge 'token_metadata: Do not use topology info for is_member check' from Asias He Since commit `a980f94` (token_metadata: impl: keep the set of normal token owners as a member), we have a set, _normal_token_owners, which contains all the nodes in the ring. We can use _normal_token_owners to check if a node is part of the ring directly instead of going through the _topology indirectly. Fixes #11935 Closes #11936 * github.com:scylladb/scylladb: token_metadata: Rename is_member to is_normal_token_owner token_metadata: Add docs for is_member token_metadata: Do not use topology info for is_member check token_metadata: Check node is part of the topology instead of the ring	2022-11-18 11:54:07 +03:00
Asias He	4571fcf9e7	token_metadata: Rename is_member to is_normal_token_owner The name is_normal_token_owner is more clear than is_member. The is_normal_token_owner reflects what it really checks.	2022-11-18 09:29:20 +08:00
Asias He	965097cde5	token_metadata: Add docs for is_member Make it clear, is_member checks if a node is part of the token ring and checks nothing else.	2022-11-18 09:28:56 +08:00
Asias He	a495b71858	token_metadata: Do not use topology info for is_member check Since commit `a980f94` (token_metadata: impl: keep the set of normal token owners as a member), we have a set, _normal_token_owners, which contains all the nodes in the ring. We can use _normal_token_owners to check if a node is part of the ring directly instead of going through the _topology indirectly. Fixes #11935	2022-11-18 09:28:56 +08:00
Asias He	f2ca790883	token_metadata: Check node is part of the topology instead of the ring update_normal_tokens is the way to add a new node into the ring. We should not require a new node to already be in the ring to be able to add it to the ring. The current code works accidentally because is_member is checking if a node is in the topology. We should use _topology.has_endpoint to check if a node is part of the topology explicitly.	2022-11-18 09:28:56 +08:00
Jan Ciolek	77d68153f1	test preparing expr::usertype_constructor Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:41:10 +01:00
Jan Ciolek	eb92fb4289	expr_test: test that prepare_expression checks style_type of collection_constructor Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:41:10 +01:00
Jan Ciolek	77c63a6b92	expr_test: test preparing expr::collection_constructor for map Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:41:09 +01:00
Jan Ciolek	db67ade778	prepare_expr: make preparing nonfrozen empty maps return null In Scylla and Cassandra inserting an empty collection that is not frozen, is interpreted as inserting a null value. list_prepare_expression and set_prepare_expression have an if which handles this behavior, but there wasn't one in map_prepare_expression. As a result preparing empty list or set would result in null, but preparing an empty map wouldn't. This is inconsistent, it's better to return null in all cases of empty nonfrozen collections. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:41:09 +01:00
Jan Ciolek	da71f9b50b	prepare_expr: fix a bug in map_prepare_expression map_prepare_expression takes a collection_constructor of unprepared items and prepares it. Elements of a map collection_constructor are tuples (key and value). map_prepare_expression creates a prepared collection_constructor by preparing each tuple and adding it to the result. During this preparation it needs to set the type of the tuple. There was a bug here - it took the type from unprepared tuple_constructor and assigned it to the prepared one. An unprepared tuple_constructor doesn't have a type so it ended up assigning nullptr. Instead of that it should create a tuple_type_impl instance by looking at the types of map key and values, and use this tuple_type_impl as the type of the prepared tuples. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:35:04 +01:00
Jan Ciolek	a656fdfe9a	expr_test: test preparing expr::collection_constructor for set Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:37 +01:00
Jan Ciolek	76f587cfe7	expr_test: test preparing expr::collection_constructor for list Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:37 +01:00
Jan Ciolek	44b55e6caf	expr_test: test preparing expr::tuple_constructor Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:37 +01:00
Jan Ciolek	265100a638	expr_test: test preparing expr::untyped_constant Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:37 +01:00
Jan Ciolek	f6b9100cd2	expr_test_utils: add make_bigint_raw/const Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:37 +01:00
Jan Ciolek	f9ff131f86	expr_test_utils: add make_tinyint_raw/const Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:36 +01:00
Jan Ciolek	76b6161386	expr_test: test preparing expr::bind_variable Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:36 +01:00
Jan Ciolek	4882724066	cql3: prepare_expr: forbid preparing bind_variable without a receiver prepare_expression treats receiver as an optional argument, it can be set to nullptr and the preparation should still succeed when it's possible to infer the type of an expression. preparing a bind_variable requires the receiver to be present, because it doesn't contain any information about the type of the bound value. Added a check that the receiver is present. Allowing to prepare a bind_variable without the receiver present was a bug. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 20:22:36 +01:00
Avi Kivity	2779a171fc	Merge 'Do not run aborted tasks' from Aleksandra Martyniuk task_manager::task::impl contains an abort source which can be used to check whether it is aborted and an abort method which aborts the task (request_abort on abort_source) and all its descendants recursively. When the start method is called after the task was aborted, then its state is set to failed and the task does not run. Fixes: #11995 Closes #11996 * github.com:scylladb/scylladb: tasks: do not run tasks that are aborted tasks: delete unused variable tasks: add abort_source to task_manager::task::impl	2022-11-17 19:42:46 +02:00
Pavel Emelyanov	a396c27efc	Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun `get_rpc_client` calculates a `topology_ignored` field when creating a client which says whether the client's endpoint had topology information when this client was created. This is later used to check if that client needs to be dropped and replaced with a new client which uses the correct topology information. The `topology_ignored` field was incorrectly calculated as `true` for pending endpoints even though we had topology information for them. This would lead to unnecessary drops of RPC clients later. Fix this. Remove the default parameter for `with_pending` from `topology::has_endpoint` to avoid similar bugs in the future. Apparently this fixes #11780. The verbs used by decommission operation use RPC client index 1 (see `do_get_rpc_client_idx` in message/messaging_service.cc). From local testing with additional logging I found that by the time this client is created (i.e. the first verb in this group is used), we already know the topology. The node is pending at that point - hence the bug would cause us to assume we don't know the topology, leading us to dropping the RPC client later, possibly in the middle of a decommission operation. Fixes: #11780 Closes #11942 * github.com:scylladb/scylladb: message: messaging_service: check for known topology before calling is_same_dc/rack test: reenable test_topology::test_decommission_node_add_column test/pylib: util: configurable period in wait_for message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client message: messaging_service: topology independent connection settings for GOSSIP verbs	2022-11-17 20:14:32 +03:00
Jan Ciolek	42e01cc67f	expr_test: test preparing expr::null Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:05 +01:00
Jan Ciolek	45b3fca71c	expr_test: test preparing expr::cast Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:05 +01:00
Jan Ciolek	498c9bfa0d	expr_test_utils: add make_receiver Add a convenience function which creates receivers. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:04 +01:00
Jan Ciolek	6873a21fbd	expr_test_utils: add make_smallint_raw/const Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:04 +01:00
Jan Ciolek	488056acb7	expr_test: test preparing expr::token Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:04 +01:00
Jan Ciolek	7958f77a40	expr_test: test preparing expr::subscript Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:04 +01:00
Jan Ciolek	569bd61c6c	expr_test: test preparing expr::column_value Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:04 +01:00
Jan Ciolek	26174e29c6	expr_test: test preparing expr::unresolved_identifier It's interesting that prepare_expression for column identifiers doesn't require a receiver. I hope this won't break validation in the future. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:04 +01:00
Jan Ciolek	c719a923bb	expr_test_utils: mock data_dictionary::database Add a function which creates a mock instance of data_dictionary::database. prepare_expression requires a data_dictionary::database as an argument, so unit tests for it need something to pass there. make_data_dictionary_database can be used to create an instance that is sufficient for tests. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-11-17 17:30:00 +01:00
Kamil Braun	8e8c32befe	test/pylib: manager_client: print used cluster when starting test case It will be easier to map server IDs to their IP addresses when browsing through the test logs.	2022-11-17 17:14:23 +01:00
Pavel Emelyanov	bc62ca46d4	lister: Make lister::dir_entry_types an enum_set This type is currently an unordered_set, but only consists of at most two elements. Making it an enum_set renders it into a size_t variable and better describes the intention. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-17 19:01:45 +03:00
Pavel Emelyanov	c6021b57a1	database: Avoid useless local variable It's used to run lister::scan_dir() with directory_entry_type::directory only, but for that is copied around on lambda captures. It's simpler just to use the value directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-17 19:00:49 +03:00
Kamil Braun	b83234d8aa	test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address Since recently the framework uses a separate set of unique IDs to identify servers, but the log file and workdir is still named using the last part of the IP address. This is confusing: the test logs sometimes don't provide the IP addr (only the ID), and even if they do, the reader of the test log may not know that they need to look at the last part of the IP to find the node's log/workdir. Also using ID will be necessary if we want to reuse IP addresses (e.g. during node replace, or simply not to run out of IP addresses during testing).	2022-11-17 16:55:12 +01:00
Anna Stuchlik	f7f03e38ee	doc: update the link to Enabling Experimental Features	2022-11-17 15:44:46 +01:00
Anna Stuchlik	02cea98f55	doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph	2022-11-17 15:05:00 +01:00
Anna Stuchlik	ce88c61785	doc: update the links to the Enabling Experimental Features section	2022-11-17 14:59:34 +01:00
Avi Kivity	76be6402ed	Merge 'repair: harden effective replication map' from Benny Halevy As described in #11993 per-shard repair_info instances get the effective_replication_map on their own with no centralized synchronization. This series ensures that the effective replication maps used by repair (and other associated structures like the token metadata and topology) are all in sync with the one used to initiate the repair operation. While at it, the series includes other cleanups in this area in repair and view that are not fixes as the calls happen in synchronous functions that do not yield. Fixes #11993 Closes #11994 * github.com:scylladb/scylladb: repair: pass erm down to get_hosts_participating_in_repair and get_neighbors repair: pass effective_replication_map down to repair_info repair: coroutinize sync_data_using_repair repair: futurize do_repair_start effective_replication_map: add global_effective_replication_map shared_token_metadata: get_lock is const repair: sync_data_using_repair: require to run on shard 0 repair: require all node operations to be called on shard 0 repair: repair_info: keep effective_replication_map repair: do_repair_start: use keyspace erm to get keyspace local ranges repair: do_repair_start: use keyspace erm for get_primary_ranges repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc repair: do_repair_start: check_in_shutdown first repair: get_db().local() where needed repair: get topology from erm/token_metdata_ptr view: get_view_natural_endpoint: get topology from erm	2022-11-17 13:29:02 +02:00
Konstantin Osipov	262566216b	raft: persist the initial raft address map	2022-11-17 14:26:36 +03:00
Konstantin Osipov	b35af73fdf	raft: (upgrade) do not use IP addresses from Raft config Always use raft address map to obtain the IP addresses of upgrade peers. Right now the map is populated from Raft configuration, so it's an equivalent transformation, but in the future raft address map will be populated from other sources: discovery and gossip, hence the logic of upgrade will change as well. Do not proceed with the upgrade if an address is missing from the map, since it means we failed to contact a raft member.	2022-11-17 14:26:31 +03:00
Pavel Emelyanov	2add9ba292	Merge 'Refactor topology out of token_metadata' from Benny Halevy This series moves the topology code from locator/token_metadata.{cc,hh} out to locator/topology.{cc,hh} and introduces a shared header file: locator/types.hh contains shared, low level definitions, in anticipation of https://github.com/scylladb/scylladb/pull/11987 While at it, the token_metadata functions are turned into coroutines and topology copy constructor is deleted. The copy functionality is moved into an async `clone_gently` function that allows yielding while copying the topology. Closes #12001 * github.com:scylladb/scylladb: locator: refactor topology out of token_metadata locator: add types.hh topology: delete copy constructor token_metadata: coroutinize clone functions	2022-11-17 13:55:34 +03:00
Aleksandra Martyniuk	7ead1a7857	compaction: request abort only once in compaction_data::stop compaction_manager::task (and thus compaction_data) can be stopped because of many different reasons. Thus, abort can be requested more than once on compaction_data abort source causing a crash. To prevent this before each request_abort() we check whether an abort was requested before. Closes #12004	2022-11-17 12:44:59 +02:00
Benny Halevy	1e2741d2fe	abstract_replication_strategy: recognized_options: return unordered_set An unordered_set is more efficient and there is no need to return an ordered set for this purpose. This change facilitates a follow-up change of adding topology::get_datacenters(), returning an unordered_set of datacenter names. Refs #11987 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12003	2022-11-17 11:27:05 +02:00
Botond Dénes	e925c41f02	utils/gs/barrett.hh: aarch64: s/brarett/barrett/ Fix a typo introduced by the recent patch fixing the spelling of Barrett. The patch introduced a typo in the aarch64 version of the code, which wasn't found by promotion, as that only builds on X86_64. Closes #12006	2022-11-17 11:09:59 +02:00
Konstantin Osipov	051dceeaff	raft: (and gossip) begin gossiping raft server ids We plan to use gossip data to educate Raft RPC about IP addresses of raft peers. Add raft server ids to application state, so that when we get a notification about a gossip peer we can identify which raft server id this notification is for, specifically, we can find what IP address stands for this server id, and, whenever the IP address changes, we can update Raft address map with the new address. On the same token, at boot time, we now have to start Gossip before Raft, since Raft won't be able to send any messages without gossip data about IP addresses.	2022-11-17 12:07:31 +03:00
Konstantin Osipov	990c7a209f	raft: change the API of conf change notifications Pass a change diff into the notification callback, rather than add or remove servers one by one, so that if we need to persist the state, we can do it once per configuration change, not for every added or removed server. For now still pass added and removed entries in two separate calls per a single configuration change. This is done mainly to fulfill the library contract that it never sends messages to servers outside the current configuration. The group0 RPC implementation doesn't need the two calls, since it simply marks the removed servers as expired: they are not removed immediately anyway, and messages can still be delivered to them. However, there may be test/mock implementations of RPC which could benefit from this contract, so we decided to keep it.	2022-11-17 12:07:31 +03:00
Benny Halevy	53fdf75cf9	repair: pass erm down to get_hosts_participating_in_repair and get_neighbors Now that it is available in repair_info. Fixes #11993 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 08:07:30 +02:00
Benny Halevy	b69be61f41	repair: pass effective_replication_map down to repair_info And make sure the token_metadata ring version is same as the reference one (from the erm on shard 0), when starting the repair on each shard. Refs #11993 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 08:07:29 +02:00
Benny Halevy	c47d36b53d	repair: coroutinize sync_data_using_repair Prepare for the next patch that will co_await make_global_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 08:07:04 +02:00
Benny Halevy	58b1c17f5d	repair: futurize do_repair_start Turn it into a coroutine to prepare for the next patch that will co_await make_global_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 08:07:04 +02:00
Benny Halevy	4b9269b7e2	effective_replication_map: add global_effective_replication_map Class to hold a coherent view of a keyspace effective replication map on all shards. To be used in a following patch to pass the sharded keyspace e_r_m:s to repair. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 08:07:01 +02:00
Avi Kivity	b8b78959fb	build: switch to packaged libdeflate rather than a submodule Now that our toolchain is based on Fedora 37, we can rely on its libdeflate rather than have to carry our own in a submodule. Frozen toolchain is regenerated. As a side effect clang is updated from 15.0.0 to 15.0.4. Closes #12000	2022-11-17 08:01:00 +02:00
Benny Halevy	2c677e294b	shared_token_metadata: get_lock is const The lock is acquired using a function that doesn't modify the shared_token_metadata object. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:58:21 +02:00
Benny Halevy	d6b2124903	repair: sync_data_using_repair: require to run on shard 0 And with that do_sync_data_using_repair can be folded into sync_data_using_repair. This will simplify using the effective_replication_map throughout the operation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:58:21 +02:00
Benny Halevy	0c56c75cf8	repair: require all node operations to be called on shard 0 To simplify use of the effective_replication_map / token_metadata_ptr throughout the operation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:58:21 +02:00
Benny Halevy	64b0756adc	repair: repair_info: keep effective_replication_map Sampled when repair info is constructed. To be used throughout the repair process. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:58:21 +02:00
Benny Halevy	c7d753cd44	repair: do_repair_start: use keyspace erm to get keyspace local ranges Rather than calling db.get_keyspace_local_ranges that looks up the keyspace and its erm again. We want all the information derived from the erm to be based on the same source. The function is synchronous so this change doesn't fix anything, just cleans up the code. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:58:21 +02:00
Benny Halevy	aaf74776c2	repair: do_repair_start: use keyspace erm for get_primary_ranges Ensure that the primary ranges are in sync with the keyspace erm. The function is synchronous so this change doesn't fix anything, it just cleans up the code. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:58:21 +02:00
Benny Halevy	9200e6b005	repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc Ensure the erm and topology are in sync. The function is synchronous so this change doesn't fix anything, just cleans up the code. Fix mistake in comment while at it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:57:56 +02:00
Benny Halevy	59dc2567fd	repair: do_repair_start: check_in_shutdown first Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:56:34 +02:00
Benny Halevy	881eb0df83	repair: get_db().local() where needed In several places we get the sharded database using get_db() and then we only use db.local(). Simplify the code by keeping reference only to the local database upfront. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:56:34 +02:00
Benny Halevy	c22c4c8527	repair: get topology from erm/token_metdata_ptr We want the topology to be synchronized with the respective effective_replication_map / token_metadata. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:56:34 +02:00
Benny Halevy	94f2e95a2f	view: get_view_natural_endpoint: get topology from erm Get the topology for the effective replication map rather than from the storage_proxy to ensure it's synchronized with the natural endpoints. Since there's no preemption between the two calls currently there is no issue, so this is merely a clean up of the code and not supposed to fix anything. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-17 07:56:34 +02:00
Nadav Har'El	e393639114	test/cql-pytest: reproducer for crash in LWT with null key This patch adds a reproducer for issue #11954: Attempting an "IF NOT EXISTS" (LWT) write with a null key crashes Scylla, instead of producing a simple error message (like happens without the "IF NOT EXISTS" after #7852 was fixed). The test passed on Cassandra, but crashes Scylla. Because of this crash, we can't just mark the test "xfail" and it's temporarily marked "skip" instead. Refs #11954. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11982	2022-11-17 07:31:13 +02:00
Benny Halevy	d0bd305d16	locator: refactor topology out of token_metadata Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-16 21:55:54 +02:00
Benny Halevy	297a4de4e4	locator: add types.hh To export low-level types that are used by other modules for the locator interfaces. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-16 21:53:05 +02:00
Kamil Braun	0c9cb5c5bf	Merge 'raft: wait for the next tick before retrying' from Gusev Petr When `modify_config` or `add_entry` is forwarded to the leader, it may reach the node at "inappropriate" time and result in an exception. There are two reasons for it - the leader is changing and, in case of `modify_config`, other `modify_config` is currently in progress. In both cases the command is retried, but before this patch there was no delay before retrying, which could lead to a tight loop. The patch adds a new exception type `transient_error`. When the client receives it, it is obliged to retry the request after some delay. Previously leader-side exceptions were converted to `not_a_leader`, which is strange, especially for `conf_change_in_progress`. Fixes: #11564 Closes #11769 * github.com:scylladb/scylladb: raft: rafactor: remove duplicate code on retries delays raft: use wait_for_next_tick in read_barrier raft: wait for the next tick before retrying	2022-11-16 18:20:54 +01:00
Aleksandra Martyniuk	4250bd9458	tasks: do not run tasks that are aborted Currently in start() method a task is run even if it was already aborted. When start() is called on an aborted task, its state is set to task_manager::task_state::failed and it doesn't run.	2022-11-16 18:09:41 +01:00
Aleksandra Martyniuk	ebffca7ea5	tasks: delete unused variable	2022-11-16 18:07:57 +01:00
Aleksandra Martyniuk	752edc2205	tasks: add abort_source to task_manager::task::impl task_manager::task can be aborted with impl's abort_source. By default abort request is propagated to all task's descendants.	2022-11-16 18:07:11 +01:00
Avi Kivity	c4f069c6fc	Update seastar submodule * seastar 153223a188...4f4cc00660 (10): > Merge 'Avoid using namespace internal' from Pavel Emelyanov > Merge 'De-futurize IO class update calls' from Pavel Emelyanov > abort_source: subscribe(): remove noexcept qualifier > Merge 'Add Prometheus filtering capabilities by label' from Amnon Heiman > fsqual: stop causing memory leak error on LeakSanitizer > metrics.cc: Do not merge empty histogram > Update tutorial.md > README-DPDK.md: document --cflags option > build: install liburing.pc using stow > core/polymorphic_temporary_buffer: include <seastar/core/memory.hh> Closes #11991	2022-11-16 17:59:33 +02:00
Avi Kivity	3497891cf9	utils: spell "barrett" correctly As P. T. Barnoom famously said, "write what you like but spell my name correctly". Following that, we correct the spelling of Barrett's name in the source tree. Closes #11989	2022-11-16 16:30:38 +02:00
Benny Halevy	0c94ffcc85	topology: delete copy constructor Topology is copied only from token_metadata_impl::clone_only_token_map which copies the token_metadata_impl with yielding to prevent reactor stalls. This should apply to topology as well, so add a clone_gently function for cloning the topology from token_metadata_impl::clone_only_token_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-16 15:27:28 +02:00
Benny Halevy	4f4fc7fe22	token_metadata: coroutinize clone functions Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-16 15:27:28 +02:00
Kamil Braun	a83789160d	message: messaging_service: check for known topology before calling is_same_dc/rack `is_same_dc` and `is_same_rack` assume that the peer's topology is known. If it's unknown, `on_internal_error` will be called inside topology. When these functions are used in `get_rpc_client`, they are already protected by an earlier check for knowing the peer's topology (the `has_topology()` lambda). Another use is in `do_start_listen()`, where we create a filter for RPC module to check if it should accept incoming connections. If cross-dc or cross-rack encryption is enabled, we will reject connection attempts to the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`. However, it might happen that something (other Scylla node or otherwise) tries to contact us on the regular port and we don't know that thing's topology, which would result in `on_internal_error`. But this is not a fatal error; we simply want to reject that connection. So protect these calls as well. Finally, there's `get_preferred_ip` with an unprotected `is_same_dc` call which, for a given peer, may return a different IP from preferred IP cache if the endpoint resides in the same DC. If there is no entry in the preferred IP cache, we return the original (external) IP of the peer. We can do the same if we don't know the peer's topology. It's interesting that we didn't see this particular place blowing up. Perhaps the preferred IP cache is always populated after we know the topology.	2022-11-16 14:01:50 +01:00
Kamil Braun	9b2449d3ea	test: reenable test_topology::test_decommission_node_add_column Also improve the test to increase the probability of reproducing #11780 by injecting sleeps in appropriate places. Without the fix for #11780 from the earlier commit, the test reproduces the issue in roughly half of all runs in dev build on my laptop.	2022-11-16 14:01:50 +01:00
Kamil Braun	0f49813312	test/pylib: util: configurable period in wait_for	2022-11-16 14:01:50 +01:00
Kamil Braun	1bd2471c19	message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client `get_rpc_client` calculates a `topology_ignored` field when creating a client which says whether the client's endpoint had topology information when topology was created. This is later used to check if that client needs to be dropped and replaced with a new client which uses the correct topology information. The `topology_ignored` field was incorrectly calculated as `true` for pending endpoints even though we had topology information for them. This would lead to unnecessary drops of RPC clients later. Fix this. Remove the default parameter for `with_pending` from `topology::has_endpoint` to avoid similar bugs in the future. Apparently this fixes #11780. The verbs used by decommission operation use RPC client index 1 (see `do_get_rpc_client_idx` in message/messaging_service.cc). From local testing with additional logging I found that by the time this client is created (i.e. the first verb in this group is used), we already know the topology. The node is pending at that point - hence the bug would cause us to assume we don't know the topology, leading us to dropping the RPC client later, possibly in the middle of a decommission operation. Fixes: #11780	2022-11-16 14:01:50 +01:00
Kamil Braun	840be34b5f	message: messaging_service: topology independent connection settings for GOSSIP verbs The gossip verbs are used to learn about topology of other nodes. If inter-dc/rack encryption is enabled, the knowledge of topology is necessary to decide whether it's safe to send unencrypted messages to nodes (i.e., whether the destination lies in the same dc/rack). The logic in `messaging_service::get_rpc_client`, which decided whether a connection must be encrypted, was this (given that encryption is enabled): if the topology of the peer is known, and the peer is in the same dc/rack, don't encrypt. Otherwise encrypt. However, it may happen that node A knows node B's topology, but B doesn't know A's topology. A deduces that B is in the same DC and rack and tries sending B an unencrypted message. As the code currently stands, this would cause B to call `on_internal_error`. This is what I encountered when attempting to fix #11780. To guarantee that it's always possible to deliver gossiper verbs (even if one or both sides don't know each other's topology), and to simplify reasoning about the system in general, choose connection settings that are independent of the topology - for the connection used by gossiper verbs (other connections are still topology-dependent and use complex logic to handle the situation of unknown-and-later-known topology). This connection only contains 'rare' and 'cheap' verbs, so it's not a performance problem to always encrypt it (given that encryption is configured). And this is what already was happening in the past; it was at some point removed during topology knowledge management refactors. We just bring this logic back. Fixes #11992. Inspired by xemul/scylla@45d48f3d02.	2022-11-16 13:58:07 +01:00
Anna Stuchlik	01c9846bb6	doc: add the link to the Enabling Experimental Features section	2022-11-16 13:24:45 +01:00
Anna Stuchlik	f1b2f44aad	doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section	2022-11-16 13:23:07 +01:00
Nadav Har'El	2f2f01b045	materialized views: fix view writes after base table schema change When we write to a materialized view, we need to know some information defined in the base table such as the columns in its schema. We have a "view_info" object that tracks each view and its base. This view_info object has a couple of mutable attributes which are used to lazily-calculate and cache the SELECT statement needed to read from the base table. If the base-table schema ever changes - and the code calls set_base_info() at that point - we need to forget this cached statement. If we don't (as before this patch), the SELECT will use the wrong schema and writes will no longer work. This patch also includes a reproducing test that failed before this patch, and passes afterwards. The test creates a base table with a view that has a non-trivial SELECT (it has a filter on one of the base-regular columns), makes a benign modification to the base table (just a silly addition of a comment), and then tries to write to the view - and before this patch it fails. Fixes #10026 Fixes #11542	2022-11-16 13:58:21 +02:00
Nadav Har'El	7cbb0b98bb	Merge 'doc: document user defined functions (UDFs)' from Anna Stuchlik This PR is V2 of the [PR created by @psarna](https://github.com/scylladb/scylladb/pull/11560). I have: - copied the content. - applied the suggestions left by @nyh. - made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax. Fixes https://github.com/scylladb/scylladb/issues/11378 Closes #11984 * github.com:scylladb/scylladb: doc: label user-defined functions as Experimental doc: restore the note for the Count function (removed by mistake) doc: document user defined functions (UDFs)	2022-11-16 13:09:47 +02:00
Botond Dénes	cbf9be9715	Merge 'Avoid 0.0.0.0 (and :0) as preferred IP' from Pavel Emelyanov Although the docs discourage using INADDR_ANY as the listen address, this is not disabled in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents this from happening and also adds a sanity check not to use this value if it somehow sneaks in. Closes #11846 * github.com:scylladb/scylladb: messaging_service: Deny putting INADDR_ANY as preferred ip messaging_service: Toss preferred ip cache management gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP gossiping_property_file_snitch: Make _listen_address optional	2022-11-16 08:30:42 +02:00
Avi Kivity	43d3e91e56	tools: toolchain: prepare: use real bash associative array When we translate from docker/go arch names to the kernel arch names, we use an associative array hack using computed variable names "${!variable_name}". But it turns out bash has real associative arrays, introduced with "declare -A". Use them to make the code a little clearer. Closes #11985	2022-11-16 08:17:47 +02:00
Botond Dénes	e90d0811d0	Merge 'doc: update ScyllaDB requirements - supported CPUs and AWS i4g instances' from Anna Stuchlik Fix https://github.com/scylladb/scylla-docs/issues/4144 Closes #11226 * github.com:scylladb/scylladb: Update docs/getting-started/system-requirements.rst doc: specify the recommended AWS instance types doc: replace the tables with a generic description of support for Im4gn and Is4gen instances doc: add support for AWS i4g instances doc: extend the list of supported CPUs	2022-11-16 08:15:00 +02:00
Botond Dénes	bd1fcbc38f	Merge 'Introduce reverse vector_deserializer.' from Michał Radwański As indicated in #11816, we'd like to enable deserializing vectors in reverse. The forward deserialization is achieved by reading from an input_stream. The input stream internally is a singly linked list with complicated logic. In order to allow for going through it in reverse, instead when creating the reverse vector initializer, we scan the stream and store substreams to all the places that are a starting point for a next element. The iterator itself just deserializes elements from the remembered substreams, this time in reverse. Fixes #11816 Closes #11956 * github.com:scylladb/scylladb: test/boost/serialization_test.cc: add test for reverse vector deserializer serializer_impl.hh: add reverse vector serializer serializer_impl: remove unneeded generic parameter	2022-11-16 07:37:24 +02:00
Anna Stuchlik	cdb6557f23	doc: label user-defined functions as Experimental	2022-11-15 21:22:01 +01:00
Avi Kivity	d85f731478	build: update toolchain to Fedora 37 with clang 15 'cargo' instantiation now overrides internal git client with cli client due to unbounded memory usage [1]. [1] https://github.com/rust-lang/cargo/issues/10583#issuecomment-1129997984	2022-11-15 16:48:09 +00:00
Anna Stuchlik	1f1d88d04e	doc: restore the note for the Count function (removed by mistake)	2022-11-15 17:41:22 +01:00
Anna Stuchlik	dbb19f55fb	doc: document user defined functions (UDFs)	2022-11-15 17:33:05 +01:00
Nadav Har'El	e4dba6a830	test/cql-pytest: add test for when MV requires IS NOT NULL As noted in issue #11979, Scylla inconsistently (and unlike Cassandra) requires "IS NOT NULL" on some but not all materialized-view key columns. Specifically, Scylla does not require "IS NOT NULL" on the base's partition key, while Cassandra does. This patch is a test which demonstrates this inconsistency. It currently passes on Cassandra and fails on Scylla, so is marked xfail. Refs #11979 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11980	2022-11-15 14:21:48 +01:00
Asias He	16bd9ec8b1	gossip: Improve get_live_token_owners and get_unreachable_token_owners The get_live_token_owners returns the nodes that are part of the ring and are alive. The get_unreachable_token_owners returns the nodes that are part of the ring and are not alive. The token_metadata::get_all_endpoints returns nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring and call is_alive to check if the node is up or down, so that the correctness does not depend on any derived information. This patch fixes a truncate issue in storage_proxy::truncate_blocking where it calls get_live_token_owners and get_unreachable_token_owners to decide the nodes to talk with for truncate operation. The truncate failed because incorrect nodes were returned. Fixes #10296 Fixes #11928 Closes #11952	2022-11-15 14:21:48 +01:00
Botond Dénes	21489c9f9c	Merge 'doc: add the "Scylladb Enterprise" label to the Enterprise-only features' from Anna Stuchlik This PR is a follow-up to https://github.com/scylladb/scylladb/pull/11918. With this PR: - The "ScyllaDB Enterprise" label is added to all the features that are only available in ScyllaDB Enterprise. - The previous Enterprise-only note is removed (it was included in multiple files as _/rst_include/enterprise-only-note.rst_ - this file is removed as it is no longer used anywhere in the docs). - "Scylla Enterprise" was removed from `versionadded` because now it's clear that the feature was added for Enterprise. Closes #11975 * github.com:scylladb/scylladb: doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features	2022-11-15 14:21:48 +01:00
Botond Dénes	34f29c8d67	Merge 'Use with_sstable_directory() helper in tests' from Pavel Emelyanov The helper is already widely used; one (last) test case can benefit from using it too. Closes #11978 * github.com:scylladb/scylladb: test: Indentation fix after previous patch test: Use with_sstable_directory() helper	2022-11-15 14:21:48 +01:00
Nadav Har'El	8a4ab87e44	Merge 'utils: crc: generate crc barrett fold tables at compile time' from Avi Kivity We use Barrett tables (misspelled in the code unfortunately) to fold crc computations of multiple buffers into a single crc. This is important because it turns out to be faster to compute crc of three different buffers in parallel rather than compute the crc of one large buffer, since the crc instruction has latency 3. Currently, we have a separate code generation step to compute the fold tables. The step generates a new C++ source files with the tables. But modern C++ allows us to do this computation at compile time, avoiding the code generation step. This simplifies the build. This series does that. There is some complication in that the code uses compiler intrinsics for the computation, and these are not constexpr friendly. So we first introduce constexpr-friendly alternatives and use them. To prove the transformation is correct, I compared the generated code from before the series and from just before the last step (where we use constexpr evaluation but still retain the generated file) and saw no difference in the values. Note that constexpr is not strictly needed - we could have run the code in the global variables' initializer. But that would cause a crash if we run on a pre-clmul machine, and is not as fun. 
Closes #11957 * github.com:scylladb/scylladb: test: crc: add unit tests for constexpr clmul and barrett fold utils: crc combine table: generate at compile time utils: barrett: inline functions in header utils: crc combine table: generate tables at compile time utils: crc combine table: extract table generation into a constexpr function utils: crc combine table: extract "pow table" code into constexpr function utils: crc combine table: store tables in std::array rather than C array utils: barrett: make the barrett reduction constexpr friendly utils: clmul: add 64-bit constexpr clmul utils: barrett: extract barrett reduction constants utils: barrett: reorder functions utils: make clmul() constexpr	2022-11-15 14:21:48 +01:00
Petr Gusev	ae3e0e3627	raft: refactor: remove duplicate code on retries delays Introduce a templated function do_on_leader_with_retries, use it in add_entries/modify_config/read_barrier. The function implements the basic logic of retries with aborts and leader changes handling, and adds a delay between iterations to protect against tight loops.	2022-11-15 13:18:53 +04:00
Petr Gusev	15cc1667d0	raft: use wait_for_next_tick in read_barrier Replaced the yield on transport_error with wait_for_next_tick. Added delays for retries, similar to add_entry/modify_config: we postpone the next call attempt if we haven't received new information about the current leader.	2022-11-15 12:31:49 +04:00
Petr Gusev	5e15c3c9bd	raft: wait for the next tick before retrying When modify_config or add_entry is forwarded to the leader, it may reach the node at an "inappropriate" time and result in an exception. There are two reasons for it - the leader is changing and, in case of modify_config, another modify_config is currently in progress. In both cases the command is retried, but before this patch there was no delay before retrying, which could lead to a tight loop. The patch adds a new exception type transient_error. When the client node receives it, it is obliged to retry the request, possibly after some delay. Previously, leader-side exceptions were converted to not_a_leader exception, which is strange, especially for conf_change_in_progress. We add a delay before retrying in modify_config and add_entry if the client hasn't received any new information about the leader since the last attempt. This can happen if the server responds with a transient_error with an empty leader and the current node has not yet learned the new leader. We neglect an excessive delay if the newly elected leader is the same as the previous one; this is supposed to be rare. Fixes: #11564	2022-11-15 11:49:26 +04:00
Pavel Emelyanov	8dcd9d98d6	test: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-14 20:11:01 +03:00
Pavel Emelyanov	c9128e9791	test: Use with_sstable_directory() helper It's already used everywhere, but one test case wires up the sstable_directory by hand. Fix it too, but keep in mind that the caller fn stops the directory early. (indentation is deliberately left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-14 20:11:01 +03:00
Michał Radwański	32c60b44c5	test/boost/serialization_test.cc: add test for reverse vector deserializer This test is just a copy-pasted version of forward serializer test.	2022-11-14 16:06:24 +01:00
Michał Radwański	dce67f42f8	serializer_impl.hh: add reverse vector serializer Currently when we want to deserialize mutation in reverse, we unfreeze it and consume from the end. This new reverse vector deserializer goes through input stream remembering substreams that contain a given output range member, and while traversing from the back, deserialize each substream.	2022-11-14 16:06:24 +01:00
Anna Stuchlik	e36bd208cc	doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore	2022-11-14 15:20:51 +01:00
Anna Stuchlik	36324fe748	doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features	2022-11-14 15:16:51 +01:00
Takuya ASADA	da6c472db9	install.sh: Skip systemd existence check when --without-systemd When --without-systemd is specified, install.sh should skip the systemd existence check. Fixes #11898 Closes #11934	2022-11-14 14:07:46 +02:00
Benny Halevy	ff5527deb1	topology: copy _sort_by_proximity in copy constructor Fixes #11962 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11965	2022-11-14 13:59:56 +03:00
Pavel Emelyanov	bd48fdaad5	Merge 'handle_state_normal: do not update topology of removed endpoint' from Benny Halevy Currently, when replacing a node ip, keeping the old host, we might end up with the old endpoint in system.peers if it is inserted back into the topology by `handle_state_normal` when on_join is called with the old endpoint. Then, later on, on_change sees that: ``` if (get_token_metadata().is_member(endpoint)) { co_await do_update_system_peers_table(endpoint, state, value); ``` As described in #11925. Fixes #11925 Closes #11930 * github.com:scylladb/scylladb: storage_service, system_keyspace: add debugging around system.peers update storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed	2022-11-14 13:58:28 +03:00
Botond Dénes	8e38551d93	Merge 'Allow each compaction group to have its own compaction backlog tracker' from Raphael "Raph" Carvalho Today, compaction_backlog_tracker is managed in each compaction_strategy implementation. So every compaction strategy is managing its own tracker and providing a reference to it through get_backlog_tracker(). But this prevents each group from having its own tracker, because there's only a single compaction_strategy instance per table. To remove this limitation, compaction_strategy impl will no longer manage trackers but will instead provide an interface for trackers to be created, such that each compaction_group will be allowed to create its own tracker and manage it by itself. Now table's backlog will be the sum of all compaction_group backlogs. The normalization factor is applied on the sum, so we don't have to adjust each individual backlog to any factor. Closes #11762 * github.com:scylladb/scylladb: replica: Allow one compaction_backlog_tracker for each compaction_group compaction: Make compaction_state available for compaction tasks being stopped compaction: Implement move assignment for compaction_backlog_tracker compaction: Fix compaction_backlog_tracker move ctor compaction: Use table_state's backlog tracker in compaction_read_monitor_generator compaction: kill undefined get_unimplemented_backlog_tracker() replica: Refactor table::set_compaction_strategy for multiple groups Fix exception safety when transferring ongoing charges to new backlog tracker replica: move_sstables_from_staging: Use tracker from group owning the SSTable replica: Move table::backlog_tracker_adjust_charges() to compaction_group replica: table::discard_sstables: Use compaction_group's backlog tracker replica: Disable backlog tracker in compaction_group::stop() replica: database_sstable_write_monitor: use compaction_group's backlog tracker replica: Move table::do_add_sstable() to compaction_group test/sstable_compaction_test: Switch to 
table_state::get_backlog_tracker() compaction/table_state: Introduce get_backlog_tracker()	2022-11-14 07:05:28 +02:00
Avi Kivity	b8cb34b928	test: crc: add unit tests for constexpr clmul and barrett fold Check that the constexpr variants indeed match the runtime variants. I verified manually that exactly one computation in each test is executed at run time (and is compared against a constant).	2022-11-13 16:22:29 +02:00
Avi Kivity	70217b5109	utils: crc combine table: generate at compile time By now the crc combine tables are generated at compile time, but still in a separate code generation step. We now eliminate the code generation step and instead link the global variables directly into the main executable. The global variables have been conveniently named exactly as the code generation step names them, so we don't need to touch any users.	2022-11-12 17:26:45 +02:00
Avi Kivity	164e991181	utils: barrett: inline functions in header Avoid duplicate definitions if the same header is used from more than one place, as it will soon be.	2022-11-12 17:26:08 +02:00
Avi Kivity	a4f06773da	utils: crc combine table: generate tables at compile time Move the tables into global constinit variables that are generated at compile time. Note the code that creates the generated crc32_combine_table.cc is still called; it transforms compile-time generated tables into a C++ source that contains the same values, as literals. If we generate a diff between gen/utils/gz/crc_combine_table.cc before this series and after this patch, we see the only change in the file is the type of the variable (which changed to std::array), proving our constexpr code is correct.	2022-11-12 17:16:59 +02:00
Avi Kivity	a229fdc41e	utils: crc combine table: extract table generation into a constexpr function Move the code to a constexpr function, so we can later generate the tables at compile time. Note that although the function is constexpr, it is still evaluated at runtime, since the calling function (main()) isn't constexpr itself.	2022-11-12 17:13:52 +02:00
Avi Kivity	d42bec59bb	utils: crc combine table: extract "pow table" code into constexpr function A "pow table" is used to generate the Barrett fold tables. Extract its code into a constexpr function so we can later generate the fold tables at compile time.	2022-11-12 17:11:44 +02:00
Avi Kivity	6e34014b64	utils: crc combine table: store tables in std::array rather than C array C arrays cannot be returned from functions and therefore aren't suitable for constexpr processing. std::array<> is a regular value and so is constexpr friendly.	2022-11-12 17:09:02 +02:00
Avi Kivity	1e9252f79a	utils: barrett: make the barrett reduction constexpr friendly Dispatch to intrinsics or constexpr based on evaluation context.	2022-11-12 17:04:44 +02:00
Avi Kivity	0bd90b5465	utils: clmul: add 64-bit constexpr clmul This is used when generating the Barrett reduction tables, and also when applying the Barrett reduction at runtime, so we need it to be constexpr friendly.	2022-11-12 17:04:05 +02:00
Avi Kivity	c376c539b8	utils: barrett: extract barrett reduction constants The constants are repeated across x86_64 and aarch64, so extract them into a common definition.	2022-11-12 17:00:17 +02:00
Avi Kivity	2fdf81af7b	utils: barrett: reorder functions Reorder functions in dependency order rather than forward declaring them. This makes them more constexpr-friendly.	2022-11-12 16:52:41 +02:00
Avi Kivity	8aa59a897e	utils: make clmul() constexpr clmul() is a pure function and so should already be constexpr, but it uses intrinsics that aren't defined as constexpr and so the compiler can't really compute it at compile time. Fix by defining a constexpr variant and dispatching based on whether we're being constant-evaluated or not. The implementation is simple, but in any case proof that it is correct will be provided later on.	2022-11-12 16:49:43 +02:00
Raphael S. Carvalho	b88acffd66	replica: Allow one compaction_backlog_tracker for each compaction_group Today, compaction_backlog_tracker is managed in each compaction_strategy implementation. So every compaction strategy is managing its own tracker and providing a reference to it through get_backlog_tracker(). But this prevents each group from having its own tracker, because there's only a single compaction_strategy instance per table. To remove this limitation, compaction_strategy impl will no longer manage trackers but will instead provide an interface for trackers to be created, such that each compaction group will be allowed to have its own tracker, which will be managed by compaction manager. On compaction strategy change, table will update each group with the new tracker, which is created using the previously introduced compaction_group_sstable_set_updater. Now table's backlog will be the sum of all compaction_group backlogs. The normalization factor is applied on the sum, so we don't have to adjust each individual backlog to any factor. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:22:51 -03:00
Raphael S. Carvalho	d862dd815c	compaction: Make compaction_state available for compaction tasks being stopped compaction_backlog_tracker will be managed by compaction_manager, in the per table state. As compaction tasks can access the tracker throughout its lifetime, remove() can only deregister the state once we're done stopping all tasks which map to that state. remove() extracted the state upfront, then performed the stop, to prevent new tasks from being registered and left behind. But we can avoid the leak of new tasks by only closing the gate, which waits for all tasks (which are stopped a step earlier) and once closed, prevents new tasks from being registered. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:22:51 -03:00
Raphael S. Carvalho	0a152a2670	compaction: Implement move assignment for compaction_backlog_tracker That's needed for std::optional to work on its behalf. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:22:49 -03:00
Raphael S. Carvalho	fe305cefd0	compaction: Fix compaction_backlog_tracker move ctor Luckily it's not used anywhere. Default move ctor was picked but it won't clear _manager of old object, meaning that its destructor will incorrectly deregister the tracker from compaction_backlog_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	8e1e30842d	compaction: Use table_state's backlog tracker in compaction_read_monitor_generator A step closer towards a separate backlog tracker for each compaction group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	fedafd76eb	compaction: kill undefined get_unimplemented_backlog_tracker() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	90991bda69	replica: Refactor table::set_compaction_strategy for multiple groups Refactor the function so it can accommodate multiple compaction groups. To still provide strong exception guarantees, preparation and execution of changes will be separated. Once multiple groups are supported, each group will be prepared first, and the noexcept execution will be done as a last step. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	244efddb22	Fix exception safety when transferring ongoing charges to new backlog tracker When setting a new strategy, the charges of the old tracker are transferred to the new one. The problem is that we're not reverting changes if an exception is triggered before the new strategy is successfully set. To fix this exception safety issue, let's copy the charges instead of moving them. If an exception is triggered, the old tracker is still the one used and remains intact. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	d1e2dbc592	replica: move_sstables_from_staging: Use tracker from group owning the SSTable When moving SSTables from staging directory, we'll conditionally add them to backlog tracker. As each group has its own tracker, a given sstable will be added to the tracker of the group that owns it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	9031dc3199	replica: Move table::backlog_tracker_adjust_charges() to compaction_group Procedures that call this function happen to be in compaction_group, so let's move it to group. Simplifies the change where the procedure retrieves tracker from the group itself. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	116459b69e	replica: table::discard_sstables: Use compaction_group's backlog tracker Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	b2d8545b15	replica: Disable backlog tracker in compaction_group::stop() As we're moving backlog tracker to compaction group, we need to stop the tracker there too. We're moving it a step earlier in table::stop(), before sstables are cleared, but that's okay because it's still done after the group was deregistered from compaction manager, meaning no compactions are running. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	91b0d772e2	replica: database_sstable_write_monitor: use compaction_group's backlog tracker Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	f37a05b559	replica: Move table::do_add_sstable() to compaction_group All callers of do_add_sstable() live in compaction_group, so it should be moved into compaction_group too. It also makes easier for the function to retrieve the backlog tracker from the group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	835927a2ad	test/sstable_compaction_test: Switch to table_state::get_backlog_tracker() Important for decoupling backlog tracker from table's compaction strategy. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Raphael S. Carvalho	1ec0ef18a5	compaction/table_state: Introduce get_backlog_tracker() This interface will be helpful for allowing replica::table, unit tests and sstables::compaction to access the compaction group's tracker which will be managed by the compaction manager, once we complete the decoupling work. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Nadav Har'El	ff87624fb4	test/cql-pytest: add another regression test for reversed-type bug In commit `544ef2caf3` we fixed a bug where a reversed clustering-key order caused problems using a secondary index because of incorrect type comparison. That commit also included a regression test for this fix. However, that fix was incomplete, and improved later in commit `c8653d1321`. That later fix was labeled "better safe than sorry", and did not include a test demonstrating any actual bug, so unsurprisingly we never backported that second fix to any older branches. Recently we discovered that missing the second patch does cause real problems, and this patch includes a test which fails when the first patch is in, but the second patch isn't (and passes when both patches are in, and also passes on Cassandra). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11943	2022-11-11 11:01:22 +02:00
Botond Dénes	302917f63d	mutation_compactor: add validator The mutation compactor is used on most read-paths we have, so adding a validator to it gives us good coverage; in particular it gives us full coverage of queries and compaction. The validator validates mutation token (and mutation fragment kind) monotonicity as that is quite cheap, while it is enough to catch the most common problems. As we already have a validator on the compaction path (in the sstable writer), the validator is disabled when the mutation compactor is instantiated for compaction. We should probably make this configurable at some point. The addition of this validator should prevent the worst of the fragment reordering bugs from affecting reads.	2022-11-11 10:26:05 +02:00
Botond Dénes	5c245b4a5e	mutation_fragment_stream_validator: add a 'none' validation level Which, as its name suggests, makes the validating filter not validate anything at all. This validation level can be used effectively to make it so as if the validator was not there at all.	2022-11-11 09:58:44 +02:00
Botond Dénes	a4b58f5261	test/boost/mutation_query_test: test_partition_limit: sort input data The test's input data is currently out-of-order, violating a fundamental invariant of data always being sorted. This doesn't cause any problems right now, but soon it will. Sort it to avoid it.	2022-11-11 09:58:44 +02:00
Botond Dénes	2c551bb7ce	querier: consume_page(): use partition_start as the sentinel value Said method calls `compact_mutation_state::start_new_page()` which requires the kind of the next fragment in the reader. When there is no fragment (reader is at EOS), we use partition-end. This was a poor choice: if the reader is at EOS, partition-end was the last fragment kind; if the stream were to continue, the next fragment would be a partition-start.	2022-11-11 09:58:18 +02:00
Botond Dénes	0bcfc9d522	treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{} We just added a convenience static factory method for partition end, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Botond Dénes	f1a039fc2b	treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} We just added a convenience static factory method for partition start, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Botond Dénes	6a002953e9	position_in_partition: add for_partition_{start,end}()	2022-11-11 09:58:18 +02:00
Kamil Braun	4a2ec888d5	Merge 'test.py: use internal id to manage servers' from Alecco Instead of using assigned IP addresses, use a local integer ID for managing servers. IP address can be reused by a different server. While there, get host ID (UUID). This can also be reused with `node replace` so it's not good enough for tracking. Closes #11747 * github.com:scylladb/scylladb: test.py: use internal id to manage servers test.py: rename hostname to ip_addr test.py: get host id test.py: use REST api client in ScyllaCluster test.py: remove unnecessary reference to web app test.py: requests without aiohttp ClientSession	2022-11-10 17:12:16 +01:00
Kamil Braun	1cc68b262e	docs: describe the Raft upgrade and recovery procedures In the 5.1 -> 5.2 upgrade doc, include additional steps for enabling Raft using the `consistent_cluster_management` flag. Note that we don't have this flag yet but it's planned to replace the experimental flag in 5.2. In the "Raft in ScyllaDB" document, add sections about: - enabling Raft in existing clusters in Scylla 5.2, - verifying that the internal Raft upgrade procedure finishes successfully, - recovering from a stuck Raft upgrade procedure or from a majority loss situation. Fix some problems in the documentation, e.g. it is not possible to enable Raft in an existing cluster in 5.0, but the documentation claimed that it is. Follow-up items: - if we decide for a different name for `consistent_cluster_management`, use that name in the docs instead - update the warnings in Scylla to link to the Raft doc - mention Enterprise versions once we know the numbers - update the appropriate upgrade docs for Enterprise versions once they exist	2022-11-10 17:08:57 +01:00
Kamil Braun	3dab07ec11	docs: add upgrade guide 5.1 -> 5.2 It's a copy-paste from the 5.0 -> 5.1 guide with substitutions: s/5.1/5.2, s/5.0/5.1 The metric update guide is not written, I left a TODO. Also I didn't include the guide in docs/upgrade/upgrade-opensource/index.rst, since 5.2 is not released yet. The guide can be accessed by manually following the link: /upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/	2022-11-10 16:49:14 +01:00
Alejo Sanchez	700054abee	test.py: use internal id to manage servers Instead of using assigned IP addresses, use an internal server id. Define types to distinguish local server id, host ID (UUID), and IP address. This is needed to test servers changing IP address and for node replace (host UUID). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
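The distinction between the three kinds of identifiers described in this commit can be sketched with `typing.NewType`. This is a minimal illustration, not the actual test.py code: the names `ServerNum`, `HostID`, `IPAddress`, `ServerInfo`, and `ServerRegistry` are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, NewType, Optional

# Hypothetical type names, for illustration only.
ServerNum = NewType("ServerNum", int)   # local integer ID, handed out once per server
HostID = NewType("HostID", str)         # host UUID; may be reused by node replace
IPAddress = NewType("IPAddress", str)   # may be reused by a different server


@dataclass
class ServerInfo:
    server_id: ServerNum
    ip_addr: IPAddress
    host_id: Optional[HostID] = None    # unknown until the server reports it


class ServerRegistry:
    """Tracks servers by a local ID that is never handed out twice."""

    def __init__(self) -> None:
        self._next_id = 1
        self._servers: Dict[ServerNum, ServerInfo] = {}

    def add(self, ip_addr: IPAddress) -> ServerNum:
        server_id = ServerNum(self._next_id)
        self._next_id += 1
        self._servers[server_id] = ServerInfo(server_id, ip_addr)
        return server_id

    def remove(self, server_id: ServerNum) -> None:
        del self._servers[server_id]

    def get(self, server_id: ServerNum) -> ServerInfo:
        return self._servers[server_id]
```

Because the registry key is the local ID, a new server that takes over the IP address of a removed one still gets a distinct entry.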
Alejo Sanchez	1e38f5478c	test.py: rename hostname to ip_addr The code explicitly manages an IP as string, make it explicit in the variable name. Define its type and test for set in the instance instead of using an empty string as placeholder. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Alejo Sanchez	f478eb52a3	test.py: get host id When initializing a ScyllaServer, try to get the host id instead of only checking the REST API is up. Use the existing aiohttp session from ScyllaCluster. In case of HTTP error check the status was not an internal error (500+). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Alejo Sanchez	78663dda72	test.py: use REST api client in ScyllaCluster Move the REST api client to ScyllaCluster. This will allow the cluster to query its own servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Alejo Sanchez	75ea345611	test.py: remove unnecessary reference to web app The aiohttp.web.Application only needs to be passed, so don't store a reference in ScyllaCluster object. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Alejo Sanchez	a5316b0c6b	test.py: requests without aiohttp ClientSession Simplify REST helper by doing requests without a session. Reusing an aiohttp.ClientSession causes knock-on effects on `rest_api/test_task_manager` due to handling exceptions outside of an async with block. Requests for cluster management and Scylla REST API don't need session, anyway. Raise HTTPError with status code, text reason, params, and json. In ScyllaCluster.install_and_start() instead of adding one more custom exception, just catch all exceptions as they will be re-raised later. While there avoid code duplication and improve sanity, type checking, and lint score. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Botond Dénes	21bc37603a	Merge 'utils: config_src: add set_value_on_all_shards functions' from Benny Halevy Currently when we set a single value we need to call broadcast_to_all_shards to let observers on all shards get notified of the new value. However, the latter broadcasts all values to all shards so it's terribly inefficient. Instead, add async set_value_on_all_shards functions to broadcast a value to all shards. Use those in system_keyspace for db_config_table virtual table and in task_manager_test to update the task_manager ttl. Refs #7316 Closes #11893 * github.com:scylladb/scylladb: tests: check ttl on different shards utils: config_src: add set_value_on_all_shards functions utils: config_file: add config_source::API	2022-11-10 07:16:39 +02:00
Botond Dénes	3aff59f189	Merge 'staging sstables: filter tokens for view update generation' from Benny Halevy This mini-series introduces dht::tokens_filter and uses it for consuming staging sstable in the view_update_generator. The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges. Refs #9559 Closes #11932 * github.com:scylladb/scylladb: db: view_update_generator: always clean up staging sstables compaction: extract incremental_owned_ranges_checker out to dht	2022-11-10 07:00:51 +02:00
Avi Kivity	9b6ab5db4a	Update seastar submodule * seastar e0dabb361f...153223a188 (8): > build: compile dpdk with -fpie (position independent executable) > Merge 'io_request: remove ctor overloads of io_request and s/io_request/const io_request/' from Kefu Chai > iostream: remove unused function > smp: destroy_smp_service_group: verify smp_service_group id > core/circular_buffer: refactor loop in circular_buffer::erase() > Merge 'Outline reactor::add_task() and sanitize reactor::shuffle() methods' from Pavel Emelyanov > Add NOLINT for cert-err58-cpp > tests: Fix false-positive use-after-free detection Closes #11940	2022-11-09 23:36:50 +02:00
Aleksandra Martyniuk	b0ed4d1f0f	tests: check ttl on different shards Test checking if ttl is properly set is extended to check whether the ttl value is changed on non-zero shard.	2022-11-09 16:58:46 +02:00
Botond Dénes	725e5b119d	Revert "replica: Pick new generation for SSTables being moved from staging dir" This reverts commit `ba6186a47f`. Said commit violates the widely held assumption that sstables generations can be used as sstable identity. One known problem caused by this is potential OOO partition emitted when reading from sstables (#11843). We now also have a better fix for #11789 (the bug this commit was meant to fix): `4aa0b16852`. So we can revert without regressions. Fixes: #11843 Closes #11886	2022-11-09 16:35:31 +02:00
Eliran Sinvani	ab7429b77d	cql: Fix crash upon use of the word empty for service level name Wrong access to an uninitialized token instead of the actual generated string caused the parser to crash; this wasn't detected by the ANTLR3 compiler because all the temporary variables defined in the ANTLR3 statements are global in the generated code. This essentially caused a null dereference. Tests: 1. The fixed issue scenario from github. 2. Unit tests in release mode. Fixes #11774 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Message-Id: <20190612133151.20609-1-eliransin@scylladb.com> Closes #11777	2022-11-09 15:58:57 +02:00
Anna Stuchlik	d2e54f7097	Merge branch 'master' into anna-requirements-arm-aws	2022-11-09 14:39:00 +01:00
Anna Stuchlik	8375304d9b	Update docs/getting-started/system-requirements.rst Co-authored-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2022-11-09 14:37:34 +01:00
Benny Halevy	38d8777d42	storage_service, system_keyspace: add debugging around system.peers update Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 14:45:47 +02:00
Benny Halevy	5401b6055c	storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed Currently, when replacing a node ip, keeping the old host, we might end up with the old endpoint in system.peers if it is inserted back into the topology by `handle_state_normal` when on_join is called with the old endpoint. Then, later on, on_change sees that: ``` if (get_token_metadata().is_member(endpoint)) { co_await do_update_system_peers_table(endpoint, state, value); ``` As described in #11925. Fixes #11925 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 14:45:22 +02:00
Benny Halevy	1a183047c0	utils: config_src: add set_value_on_all_shards functions Currently when we set a single value we need to call broadcast_to_all_shards to let observers on all shards get notified of the new value. However, the latter broadcasts all values to all shards so it's terribly inefficient. Instead, add async set_value_on_all_shards functions to broadcast a value to all shards. Use those in system_keyspace for db_config_table virtual table and in task_manager_test to update the task_manager ttl. Refs #7316 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 11:55:14 +02:00
Benny Halevy	e83f42ec70	utils: config_file: add config_source::API For task_manager test api. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 11:53:20 +02:00
Botond Dénes	94db2123b9	Update tools/java submodule * tools/java 583261fc0e...caf754f243 (1): > build: remove JavaScript snippets in ant build file	2022-11-09 07:59:04 +02:00
Benny Halevy	10f8f13b90	db: view_update_generator: always clean up staging sstables Since they are currently not cleaned up by cleanup compaction, filter their tokens, processing only tokens owned by the current node (based on the keyspace replication strategy). Refs #9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 07:38:22 +02:00
Benny Halevy	fd3e66b0cc	compaction: extract incremental_owned_ranges_checker out to dht It is currently used by cleanup_compaction partition filter. Factor it out so it can be used to filter staging sstables in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 07:32:56 +02:00
Gleb Natapov' via ScyllaDB development	2100a8f4ca	service: raft: demote configuration change error to warning since it is retried anyway Message-Id: <Y2ohbFtljmd5MNw0@scylladb.com>	2022-11-09 00:09:39 +01:00
Avi Kivity	04ecf4ee18	Update tools/java submodule (cassandra-stress fails with node down) * tools/java 87672be28e...583261fc0e (1): > cassandra-stress: pass all hosts stright to the driver	2022-11-08 14:58:14 +02:00
Botond Dénes	7f69cccbdf	scylla-gdb.py: $downcast_vptr(): add multiple inheritance support When a class inherits from multiple virtual base classes, pointers to instances of this class via one of its base classes, might point to somewhere into the object, not at its beginning. Therefore, the simple method employed currently by $downcast_vptr() of casting the provided pointer to the type extracted from the vtable name fails. Instead when this situation is detected (detectable by observing that the symbol name of the partial vtable is not to an offset of +16, but larger), $downcast_vptr() will iterate over the base classes, adjusting the pointer with their offsets, hoping to find the true start of the object. In the one instance I tested this with, this method worked well. At the very least, the method will now yield a null pointer when it fails, instead of a badly casted object with corrupt content (which the developer might or might not attribute to the bad cast). Closes #11892	2022-11-08 14:51:26 +02:00
Michał Chojnowski	3e0c7a6e9f	test: sstable_datafile_test: eliminate a use of std::regex to prevent stack overflow This usage of std::regex overflows the seastar::thread stack size (128 KiB), causing memory corruption. Fix that. Closes #11911	2022-11-08 14:41:34 +02:00
Botond Dénes	2037d7f9cd	Merge 'doc: add the "ScyllaDB Enterprise" label to highlight the Enterprise-only features' from Anna Stuchlik This PR adds the "ScyllaDB Enterprise" label to highlight the Enterprise-only features on the following pages: - Encryption at Rest - the label indicates that the entire page is about an Enterprise-only feature. - Compaction - the labels indicate the sections that are Enterprise-only. There are more occurrences across the docs that require a similar update. I'll update them in another PR if this PR is approved. Closes #11918 * github.com:scylladb/scylladb: doc: fix the links to resolve the warnings doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box	2022-11-08 09:53:48 +02:00
Raphael S. Carvalho	a57724e711	Make off-strategy compaction wait for view building completion Prior to off-strategy compaction, streaming / repair would place staging files into main sstable set, and wait for view building completion before they could be selected for regular compaction. The reason for that is that view building relies on table providing a mutation source without data in staging files. Had regular compaction mixed staging data with non-staging one, table would have a hard time providing the required mutation source. After off-strategy compaction, staging files can be compacted in parallel to view building. If off-strategy completes first, it will place the output into the main sstable set. So a parallel view building (on sstables used for off-strategy) may potentially get a mutation source containing staging data from the off-strategy output. That will mislead view builder as it won't be able to detect changes to data in main directory. To fix it, we'll do what we did before. Filter out staging files from compaction, and trigger the operation only after we're done with view building. We're piggybacking on off-strategy timer for still allowing the off-strategy to only run at the end of the node operation, to reduce the amount of compaction rounds on the data introduced by repair / streaming. Fixes #11882. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11919	2022-11-08 08:53:58 +02:00
Botond Dénes	243fcb96f0	Update tools/python3 submodule * tools/python3 bf6e892...773070e (1): > create-relocatable-package: harden against missing files	2022-11-08 08:43:30 +02:00
Avi Kivity	46690bcb32	build: harden create-relocatable-package.py against changes in libthread-db.so name create-relocatable-package.py collects shared libraries used by executables for packaging. It also adds libthread-db.so to make debugging possible. However, the name it uses has changed in glibc, so packaging fails in Fedora 37. Switch to the version-agnostic names, libthread-db.so. This happens to be a symlink, so resolve it. Closes #11917	2022-11-08 08:41:22 +02:00
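The "resolve the symlink" step in this commit can be illustrated with `os.path.realpath`. This is only a sketch of the general technique, not the code in create-relocatable-package.py; the helper names and the file names created in the demo directory are made up for the example.

```python
import os
import tempfile


def resolve_library(path: str) -> str:
    """Return the real file behind a possibly symlinked library path."""
    return os.path.realpath(path)


def demo() -> str:
    # Build a throwaway directory with a versioned file and an
    # unversioned symlink pointing at it (hypothetical names).
    with tempfile.TemporaryDirectory() as d:
        real = os.path.join(d, "libthread-db.so.1")
        link = os.path.join(d, "libthread-db.so")
        open(real, "w").close()
        os.symlink("libthread-db.so.1", link)
        # Looking up the version-agnostic name yields the versioned file.
        return os.path.basename(resolve_library(link))
```

Looking up the unversioned name and resolving it keeps the script independent of whatever versioned file name the installed glibc happens to use.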
Takuya ASADA	acc408c976	scylla_setup: fix incorrect type definition on --online-discard option The --online-discard option is defined as a string parameter since it doesn't specify "action=", but it has a boolean default value (default=True). This breaks "provisioning in a similar environment", since the code expects a boolean value, which would require "action='store_true'". We should change the type of the option to int, and also specify "choices=[0, 1]" just like --io-setup does. Fixes #11700 Closes #11831	2022-11-08 08:40:44 +02:00
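The argparse behavior behind this fix can be reproduced in a few lines. The sketch below is not the scylla_setup source: the parser-building functions are hypothetical, and `default=1` in the fixed variant is an assumption, since the commit message only names `type` and `choices`.

```python
import argparse


def build_buggy_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser()
    # No type= or action=: a value passed on the command line stays a
    # string, while the default is a bool.
    p.add_argument("--online-discard", default=True)
    return p


def build_fixed_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser()
    # Integer option restricted to 0 or 1 (default value assumed here).
    p.add_argument("--online-discard", type=int, choices=[0, 1], default=1)
    return p


buggy = build_buggy_parser().parse_args(["--online-discard", "0"])
fixed = build_fixed_parser().parse_args(["--online-discard", "0"])

# The buggy parser yields the string "0", which is truthy;
# the fixed parser yields the integer 0, which is falsy.
print(repr(buggy.online_discard), bool(buggy.online_discard))
print(repr(fixed.online_discard), bool(fixed.online_discard))
```

With the string-typed option, `--online-discard 0` is still treated as enabled by any `if args.online_discard:` check, which is the kind of mismatch the commit describes.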
Avi Kivity	3d345609d8	config: disable "mc" format sstables for new data "md" format was introduced in 4.3, in `3530e80ce1`, two years ago. Disable the option to create new sstables with the "mc" format. Closes #11265	2022-11-08 08:36:27 +02:00
Anna Stuchlik	0eaafced9d	doc: fix the links to resolve the warnings	2022-11-07 19:15:21 +01:00
Anna Stuchlik	b57e0cfb7c	doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box	2022-11-07 18:54:35 +01:00
Anna Stuchlik	9f3fcb3fa0	doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box	2022-11-07 18:48:37 +01:00
Tomasz Grabiec	a9063f9582	Merge 'service/raft: failure detector: ping `raft::server_id`s, not `gms::inet_address`es' from Kamil Braun Whenever a Raft configuration change is performed, `raft::server` calls `raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc` implementation has a function, `_on_server_update`, passed in the constructor, which it called in `add_server`/`remove_server`; that function would update the set of endpoints detected by the direct failure detector. `_on_server_update` was passed an IP address and that address was added to / removed from the failure detector set (there's another translation layer between the IP addresses and internal failure detector 'endpoint ID's; but we can ignore it for the purposes of this commit). Therefore: the failure detector was pinging a certain set of IP addresses. These IP addresses were updated during Raft configuration changes. To implement the `is_alive(raft::server_id)` function (required by `raft::failure_detector` interface), we would translate the ID using the Raft address map, which is currently also updated during configuration changes, to an IP address, and check if that IP address is alive according to the direct failure detector (which maintained an `_alive_set` of type `unordered_set<gms::inet_address>`). This all works well but it assumes that servers can be identified using IP addresses - it doesn't play well with the fact that servers may change their IP addresses. The only immutable identifier we have for a server is `raft::server_id`. In the future, Raft configurations will not associate IP addresses with Raft servers; instead we will assume that IP addresses can change at any time, and there will be a different mechanism that eventually updates the Raft address map with the latest IP address for each `raft::server_id`. To prepare us for that future, in this commit we no longer operate in terms of IP addresses in the failure detector, but in terms of `raft::server_id`s. 
Most of the commit is boilerplate, changing `gms::inet_address` to `raft::server_id` and function/variable names. The interesting changes are: - in `is_alive`, we no longer need to translate the `raft::server_id` to an IP address, because now the stored `_alive_set` already contains `raft::server_id`s instead of `gms::inet_address`es. - the `ping` function now takes a `raft::server_id` instead of `gms::inet_address`. To send the ping message, we need to translate this to IP address; we do it by the `raft_address_map` pointer introduced in an earlier commit. Thus, there is still a point where we have to translate between `raft::server_id` and `gms::inet_address`; but observe we now do it at the last possible moment - just before sending the message. If we have no translation, we consider the `ping` to have failed - it's equivalent to a network failure where no route to a given address was found. Closes #11759 * github.com:scylladb/scylladb: direct_failure_detector: get rid of complex `endpoint_id` translations service/raft: ping `raft::server_id`s, not `gms::inet_address`es service/raft: store `raft_address_map` reference in `direct_fd_pinger` gms: gossiper: move `direct_fd_pinger` out to a separate service gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class	2022-11-07 16:42:35 +01:00
Botond Dénes	2b572d94f5	Merge 'doc: improve the documentation landing page ' from Anna Stuchlik This PR introduces the following changes to the documentation landing page: - The " New to ScyllaDB? Start here!" box is added. - The "Connect your application to Scylla" box is removed. - Some wording has been improved. - "Scylla" has been replaced with "ScyllaDB". Closes #11896 * github.com:scylladb/scylladb: Update docs/index.rst doc: replace Scylla with ScyllaDB on the landing page doc: improve the wording on the landing page doc: add the link to the ScyllaDB Basics page to the documentation landing page	2022-11-07 16:18:59 +02:00
Avi Kivity	91f2cd5ac4	test: lib: exception_predicate: use boost::regex instead of std::regex std::regex was observed to overflow stack on aarch64 in debug mode. Use boost::regex until the libstdc++ bug[1] is fixed. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61582 Closes #11888	2022-11-07 14:03:25 +02:00
Kamil Braun	0c7ff0d2cb	docs: a single 5.0 -> 5.1 upgrade guide There were 4 different pages for upgrading Scylla 5.0 to 5.1 (and the same is true for other version pairs, but I digress) for different environments: - "ScyllaDB Image for EC2, GCP, and Azure" - Ubuntu - Debian - RHEL/CentOS The Ubuntu and Debian pages used a common template: ``` .. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst .. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst ``` with different variable substitutions. The "Image" page used a similar template, with some extra content in the middle: ``` .. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst .. include:: /upgrade/_common/upgrade-image-opensource.rst .. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst ``` The RHEL/CentOS page used a different template: ``` .. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst ``` This was an unmaintainable mess. Most of the content was "the same" for each of these options. The only content that must actually be different is the part with package installation instructions (e.g. calls to `yum` vs `apt-get`). The rest of the content was logically the same - the differences were mistakes, typos, and updates/fixes to the text that were made in some of these docs but not others. In this commit I prepare a single page that covers the upgrade and rollback procedures for each of these options. The section dependent on the system was implemented using Sphinx Tabs. I also fixed and changed some parts: - In the "Gracefully stop the node" section: Ubuntu/Debian/Images pages had: ```rst .. code:: sh sudo service scylla-server stop ``` RHEL/CentOS pages had: ```rst .. code:: sh .. include:: /rst_include/scylla-commands-stop-index.rst ``` the stop-index file contained this: ```rst .. tabs:: .. group-tab:: Supported OS .. code-block:: shell sudo systemctl stop scylla-server .. group-tab:: Docker ..
code-block:: shell docker exec -it some-scylla supervisorctl stop scylla (without stopping some-scylla container) ``` So the RHEL/CentOS version had two tabs: one for Scylla installed directly on the system, one for Scylla running in Docker - which is interesting, because nothing anywhere else in the upgrade documents mentions Docker. Furthermore, the RHEL/CentOS version used `systemctl` while the ubuntu/debian/images version used `service` to stop/start scylla-server. Both work on modern systems. The Docker option is completely out of place - the rest of the upgrade procedure does not mention Docker. So I decided it doesn't make sense to include it. Docker documentation could be added later if we actually decide to write upgrade documentation when using Docker... Between `systemctl` and `service` I went with `service` as it's a bit higher-level. - Similar change for "Start the node" section, and corresponding stop/start sections in the Rollback procedure. - To reuse text for Ubuntu and Debian, when referencing "ScyllaDB deb repo" in the Debian/Ubuntu tabs, I provide two separate links: to Debian and Ubuntu repos. - the link to rollback procedure in the RPM guide (in 'Download and install the new release' section) pointed to rollback procedure from 3.0 to 3.1 guide... Fixed to point to the current page's rollback procedure. - in the rollback procedure steps summary, the RPM version missed the "Restore system tables" step. - in the rollback procedure, the repository links were pointing to the new versions, while they should point to the old versions. There are some other pre-existing problems I noticed that need fixing: - EC2/GCP/Azure option has no corresponding coverage in the rollback section (Download and install the old release) as it has in the upgrade section. There is no guide for rolling back 3rd party and OS packages, only Scylla. I left a TODO in a comment. 
- the repository links assume certain Debian and Ubuntu versions (Debian 10 and Ubuntu 20), but there are more available options (e.g. Ubuntu 22). Not sure how to deal with this problem. Maybe a separate section with links? Or just a generic link without choice of platform/version? Closes #11891	2022-11-07 14:02:08 +02:00
Avi Kivity	9fa1783892	Merge 'cleanup compaction: flush memtable' from Benny Halevy Flush the memtable before cleaning up the table so as not to leave any disowned tokens in the memtable as they might be resurrected if left in the memtable. Fixes #1239 Closes #11902 * github.com:scylladb/scylladb: table: perform_cleanup_compaction: flush memtable table: add perform_cleanup_compaction api: storage_service: add logging for compaction operations et al	2022-11-07 13:18:12 +02:00
Anna Stuchlik	c8455abb71	Update docs/index.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-11-07 10:25:24 +01:00
AdamStawarz	6bc455ebea	Update tombstones-flush.rst change syntax: nodetool compact <keyspace>.<mytable>; to nodetool compact <keyspace> <mytable>; Closes #11904	2022-11-07 11:19:26 +02:00
Avi Kivity	224a2877b9	build: disable -Og in debug mode to avoid coroutine asan breakage Coroutines and asan don't mix well on aarch64. This was seen in `22f13e7ca3` (" Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity"") where a routine coroutinization was reverted due to failures on aarch64 debug mode. In clang 15 this is even worse, the existing code starts failing. However, if we disable optimization (-O0 rather than -Og), things begin to work again. In fact we can reinstate the patch reverted above even with clang 12. Fix (or rather workaround) the problem by avoiding -Og on aarch64 debug mode. There's the lingering fear that release mode is miscompiled too, but all the tests pass on clang 15 in release mode so it appears related to asan. Closes #11894	2022-11-07 10:55:13 +02:00
Benny Halevy	eb3a94e2bc	table: perform_cleanup_compaction: flush memtable We don't explicitly clean up the memtable, while it might hold tokens disowned by the current node. Flush the memtable before performing cleanup compaction to make sure all tokens in the memtable are cleaned up. Note that non-owned ranges are invalidated in the cache in compaction_group::update_main_sstable_list_on_compaction_completion using desc.ranges_for_cache_invalidation. Fixes #1239 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-06 19:41:40 +02:00
Benny Halevy	fc278be6c4	table: add perform_cleanup_compaction Move the integration with compaction_manager from the api layer to the table class so it can also make sure the memtable is cleaned up in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-06 19:41:33 +02:00
Benny Halevy	85523c45c0	api: storage_service: add logging for compaction operations et al Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-06 19:41:31 +02:00
Petr Gusev	44f48bea0f	raft: test_remove_node_with_concurrent_ddl The test runs remove_node command with background ddl workload. It was written in an attempt to reproduce scylladb#11228 but seems to have value on its own. The if_exists parameter has been added to the add_table and drop_table functions, since the driver could retry the request sent to a removed node, but that request might have already been completed. Function wait_for_host_known waits until the information about the node reaches the destination node. Since we add new nodes at each iteration in main, this can take some time. A number of abort-related options were added to SCYLLA_CMDLINE_OPTIONS as it simplifies nailing down problems. Closes #11734	2022-11-04 17:16:35 +01:00
David Garcia	26bc53771c	docs: automatic previews configuration Closes #11591	2022-11-04 15:44:22 +02:00
Kamil Braun	e086521c1a	direct_failure_detector: get rid of complex `endpoint_id` translations The direct failure detector operates on abstract `endpoint_id`s for pinging. The `pinger` interface is responsible for translating these IDs to 'real' addresses. Earlier we used two types of addresses: IP addresses in 'production' code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test code (in `randomized_nemesis_test`). For each of these use cases we would maintain mappings between `endpoint_id`s and the address type. In recent commits we switched the 'production' code to also operate on Raft server IDs, which are UUIDs underneath. In this commit we switch `endpoint_id`s from `unsigned` type to `utils::UUID`. Because each use case operates in Raft server IDs, we can perform a simple translation: `raft_id.uuid()` to get an `endpoint_id` from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from an `endpoint_id`. We no longer have to maintain complex sharded data structures to store the mappings.	2022-11-04 09:38:08 +01:00
Kamil Braun	bdeef77f20	service/raft: ping `raft::server_id`s, not `gms::inet_address`es Whenever a Raft configuration change is performed, `raft::server` calls `raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc` implementation has a function, `_on_server_update`, passed in the constructor, which it called in `add_server`/`remove_server`; that function would update the set of endpoints detected by the direct failure detector. `_on_server_update` was passed an IP address and that address was added to / removed from the failure detector set (there's another translation layer between the IP addresses and internal failure detector 'endpoint ID's; but we can ignore it for the purposes of this commit). Therefore: the failure detector was pinging a certain set of IP addresses. These IP addresses were updated during Raft configuration changes. To implement the `is_alive(raft::server_id)` function (required by `raft::failure_detector` interface), we would translate the ID using the Raft address map, which is currently also updated during configuration changes, to an IP address, and check if that IP address is alive according to the direct failure detector (which maintained an `_alive_set` of type `unordered_set<gms::inet_address>`). This all works well but it assumes that servers can be identified using IP addresses - it doesn't play well with the fact that servers may change their IP addresses. The only immutable identifier we have for a server is `raft::server_id`. In the future, Raft configurations will not associate IP addresses with Raft servers; instead we will assume that IP addresses can change at any time, and there will be a different mechanism that eventually updates the Raft address map with the latest IP address for each `raft::server_id`. To prepare us for that future, in this commit we no longer operate in terms of IP addresses in the failure detector, but in terms of `raft::server_id`s. 
Most of the commit is boilerplate, changing `gms::inet_address` to `raft::server_id` and function/variable names. The interesting changes are: - in `is_alive`, we no longer need to translate the `raft::server_id` to an IP address, because now the stored `_alive_set` already contains `raft::server_id`s instead of `gms::inet_address`es. - the `ping` function now takes a `raft::server_id` instead of `gms::inet_address`. To send the ping message, we need to translate this to IP address; we do it by the `raft_address_map` pointer introduced in an earlier commit. Thus, there is still a point where we have to translate between `raft::server_id` and `gms::inet_address`; but observe we now do it at the last possible moment - just before sending the message. If we have no translation, we consider the `ping` to have failed - it's equivalent to a network failure where no route to a given address was found.	2022-11-04 09:38:08 +01:00
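The "translate at the last possible moment" idea from this commit can be modeled outside of C++. The following Python sketch is a toy model under stated assumptions, not the ScyllaDB implementation: `AddressMap`, `Pinger`, and the `send` callback are hypothetical stand-ins for `raft_address_map`, the pinger, and the RPC layer.

```python
import uuid
from typing import Callable, Dict, Optional


class AddressMap:
    """Toy stand-in for a server ID -> IP address map."""

    def __init__(self) -> None:
        self._map: Dict[uuid.UUID, str] = {}

    def set(self, server_id: uuid.UUID, ip: str) -> None:
        self._map[server_id] = ip

    def find(self, server_id: uuid.UUID) -> Optional[str]:
        return self._map.get(server_id)


class Pinger:
    """Pings servers by ID, resolving the IP only right before sending."""

    def __init__(self, address_map: AddressMap,
                 send: Callable[[str], bool]) -> None:
        self._address_map = address_map
        self._send = send  # hypothetical transport: returns True on reply

    def ping(self, server_id: uuid.UUID) -> bool:
        ip = self._address_map.find(server_id)
        if ip is None:
            # No translation available: treat it like a network failure
            # where no route to the address was found.
            return False
        return self._send(ip)
```

Since callers only ever hold server IDs, an IP change is picked up by the next `ping` as soon as the map is updated, without touching the set of monitored servers.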
Kamil Braun	ac70a05c7e	service/raft: store `raft_address_map` reference in `direct_fd_pinger` The pinger will use the map to translate `raft::server_id`s to `gms::inet_address`es when pinging.	2022-11-04 09:38:08 +01:00
Kamil Braun	2c20f2ab9d	gms: gossiper: move `direct_fd_pinger` out to a separate service In later commit `direct_fd_pinger` will operate in terms of `raft::server_id`s. Decouple it from `gossiper` since we don't want to entangle `gossiper` with Raft-specific stuff.	2022-11-04 09:38:08 +01:00
Kamil Braun	e9a4263e14	gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class `gms::gossiper::direct_fd_pinger` serves multiple purposes: one of them is to maintain a mapping between `gms::inet_address`es and `direct_failure_detector::pinger::endpoint_id`s, another is to cache the last known gossiper's generation number to use it for sending gossip echo messages. The latter is the only gossiper-specific thing in this class. We want to move `direct_fd_pinger` outside `gossiper`. To do that, split the gossiper-specific thing -- the generation number management -- to a smaller class, `echo_pinger`. `echo_pinger` is a top-level class (not a nested one like `direct_fd_pinger` was) so we can forward-declare it and pass references to it without including gms/gossiper.hh header.	2022-11-04 09:38:08 +01:00
Avi Kivity	768d77d31b	Update seastar submodule * seastar f32ed00954...e0dabb361f (12): > sstring: define formatter > file: Dont violate API layering > Add compile_commands.json to gitignore > Merge 'Add an allocation failure metric' from Travis Downs > Use const test objects > Ragel chunk parser: compilation err, unused var > build: do not expose Valgrind in SeastarTargets.cmake > defer: mark deferred_* with [[nodiscard]] > Log selected reactor backend during startup > http: mark str with [[maybe_unused]] > Merge 'reactor: open fd without O_NONBLOCK when using io_uring backend' from Kefu Chai > reactor: add accept and connect to io_uring backend Closes #11895	2022-11-04 09:27:56 +04:00
Anna Stuchlik	fb01565a15	doc: replace Scylla with ScyllaDB on the landing page	2022-11-03 17:42:49 +01:00
Anna Stuchlik	7410ab0132	doc: improve the wording on the landing page	2022-11-03 17:38:14 +01:00
Anna Stuchlik	ab5e48261b	doc: add the link to the ScyllaDB Basics page to the documentation landing page	2022-11-03 17:31:03 +01:00
Pavel Emelyanov	efbfcdb97e	Merge 'Replicate `raft_address_map` non-expiring entries to other shards' from Kamil Braun Replicating `raft_address_map` entries is needed for the following use cases: - the direct failure detector - currently it assumes a static mapping of `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft group 0 configuration changes. To handle dynamic mappings we need to modify the failure detector so it pings `raft::server_id`s and obtains the `gms::inet_address` before sending the message from `raft_address_map`. The failure detector is sharded, so we need the mappings to be available on all shards. - in the future we'll have multiple Raft groups running on different shards. To send messages they'll need `raft_address_map`. Initially I tried to replicate all entries - expiring and non-expiring. The implementation turned out to be very complex - we need to handle dropping expired entries and refreshing expiring entries' timestamps across shards, and doing this correctly while accounting for possible races is quite problematic. Eventually I arrived at the conclusion that replicating only non-expiring entries, and furthermore allowing non-expiring entries to be added only on shard 0, is good enough for our use cases: - The direct failure detector is pinging group 0 members only; group 0 members correspond exactly to the non-expiring entries. - Group 0 configuration changes are handled on shard 0, so non-expiring entries are added/removed on shard 0. - When we have multiple Raft groups, we can reuse a single Raft server ID for all Raft servers running on a single node belonging to different groups; they are 'namespaced' by the group IDs. Furthermore, every node has a server that belongs to group 0. 
Thus for every Raft server in every group, it has a corresponding server in group 0 with the same ID, which has a non-expiring entry in `raft_address_map`, which is replicated to all shards; so every group will be able to deliver its messages. With these assumptions the implementation is short and simple. We can always complicate it in the future if we find that the assumptions are too strong. Closes #11791 * github.com:scylladb/scylladb: test/raft: raft_address_map_test: add replication test service/raft: raft_address_map: replicate non-expiring entries to other shards service/raft: raft_address_map: assert when entry is missing in drop_expired_entries service/raft: turn raft_address_map into a service	2022-11-03 18:34:42 +03:00
Avi Kivity	ca2010144e	test: loading_cache_test: fix use-after-free in test_loading_cache_remove_leaves_no_old_entries_behind We capture `key` by reference, but it is in another continuation. Capture it by value, and avoid the default capture specification. Found by clang 15 + asan + aarch64. Closes #11884	2022-11-03 17:23:40 +02:00
Avi Kivity	0c3967cf5e	Merge 'scylla-gdb.py: improve scylla-fiber' from Botond Dénes The main theme of this patchset is improving `scylla-fiber`, with some assorted unrelated improvements tagging along. In lieu of explicit support for mapping up continuation chains in memory from seastar (there is one but it uses function calls), scylla fiber uses a quite crude method to do this: it scans task objects for outbound references to other task objects to find waiter tasks and scans inbound references from other tasks to find waited-on tasks. This works well for most objects, but there are some problematic ones: * `seastar::thread_context`: the waited-on task (`seastar::(anonymous namespace)::thread_wake_task`) is allocated on the thread's stack which is not in the object itself. Scylla fiber now scans the stack bottom-up to find this task. * `seastar::smp_message_queue::async_work_item`: the waited-on task lives on another shard. Scylla fiber now digs out the remote shard from the work item and continues the search on the remote shard. * `seastar::when_all_state`: the waited-on task is a member of the same object, tripping loop detection and terminating the search. Scylla fiber now uses the `_continuation` member explicitly to look for the next links. Other minor improvements were also done, like including the shard of the task in the printout. Example demonstrating all the new additions: ``` (gdb) scylla fiber 0x000060002d650200 Stopping because loop is detected: task 0x000061c00385fb60 was seen before. 
[shard 28] #-13 (task) 0x000061c00385fba0 0x00000000003b5b00 vtable for seastar::internal::when_all_state_component<seastar::future<void> > + 16 [shard 28] #-12 (task) 0x000061c00385fb60 0x0000000000417010 vtable for seastar::internal::when_all_state<seastar::internal::identity_futures_tuple<seastar::future<void>, seastar::future<void> >, seastar::future<void>, seastar::future<void> > + 16 [shard 28] #-11 (task) 0x000061c009f16420 0x0000000000419830 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_6futureISt5tupleIJNS4_IvEES6_EEE14discard_resultEvEUlDpOT_E_ZNS8_14then_impl_nrvoISC_S6_EET0_OT_EUlOS3_RSC_ONS_12future_stateIS7_EEE_S7_EE + 16 [shard 28] #-10 (task) 0x000061c0098e9e00 0x0000000000447440 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>::run_and_dispose()::{lambda(auto:1)#1}, seastar::future<void>::then_wrapped_nrvo<void, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}> >(seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 [shard 0] #-9 (task) 0x000060000858dcd0 0x0000000000449d68 vtable for seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}> + 16 [shard 0] #-8 (task) 0x0000600050c39f60 0x00000000007abe98 vtable for 
seastar::parallel_for_each_state + 16 [shard 0] #-7 (task) 0x000060000a59c1c0 0x0000000000449f60 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::sharded<cql_transport::cql_server>::stop()::{lambda(seastar::future<void>)#2}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#2}>({lambda(seastar::future<void>)#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 [shard 0] #-6 (task) 0x000060000a59c400 0x0000000000449ea0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, cql_transport::controller::do_stop_server()::{lambda(std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >&)#1}::operator()(std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >&) const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 [shard 0] #-5 (task) 0x0000600009d86cc0 0x0000000000449c00 vtable for seastar::internal::do_with_state<std::tuple<std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > > >, seastar::future<void> > + 16 [shard 0] #-4 (task) 0x00006000019ffe20 0x00000000007ab368 vtable for seastar::(anonymous namespace)::thread_wake_task + 16 [shard 0] #-3 (task) 0x00006000085ad080 0x0000000000809e18 vtable for seastar::thread_context + 16 [shard 0] #-2 (task) 0x0000600009c04100 0x00000000006067f8 
_ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS6_E_clES7_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSC_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSD_DpOSG_EUlvE0_ZNS_6futureIvE14then_impl_nrvoIST_SV_EET0_SQ_EUlOS3_RST_ONS_12future_stateINS1_9monostateEEEE_vEE + 16 [shard 0] #-1 (task) 0x000060000a59c080 0x0000000000606ae8 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEENS_6futureIvE12finally_bodyIZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS9_E_clESA_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSF_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSG_DpOSJ_EUlvE1_Lb0EEEZNS5_17then_wrapped_nrvoIS5_SX_EENSD_ISG_E4typeEOT0_EUlOS3_RSX_ONS_12future_stateINS1_9monostateEEEE_vEE + 16 [shard 0] #0 (task) 0x000060002d650200 0x0000000000606378 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<service::storage_service::run_with_api_lock<service::storage_service::drain()::{lambda(service::storage_service&)#1}>(seastar::basic_sstring<char, unsigned int, 15u, true>, service::storage_service::drain()::{lambda(service::storage_service&)#1}&&)::{lambda(service::storage_service&)#1}::operator()(service::storage_service&)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(service::storage_service&)#1}>({lambda(service::storage_service&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(service::storage_service&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 [shard 0] #1 (task) 0x000060000bc40540 0x0000000000606d48 
_ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEENS_6futureIvE12finally_bodyIZNS_3smp9submit_toIZNS_7shardedIN7service15storage_serviceEE9invoke_onIZNSB_17run_with_api_lockIZNSB_5drainEvEUlRSB_E_EEDaNS_13basic_sstringIcjLj15ELb1EEEOT_EUlSF_E_JES5_EET1_jNS_21smp_submit_to_optionsESK_DpOT0_EUlvE_EENS_8futurizeINSt9result_ofIFSJ_vEE4typeEE4typeEjSN_SK_EUlvE_Lb0EEEZNS5_17then_wrapped_nrvoIS5_S10_EENSS_ISJ_E4typeEOT0_EUlOS3_RS10_ONS_12future_stateINS1_9monostateEEEE_vEE + 16 [shard 0] #2 (task) 0x000060000332afc0 0x00000000006cb1c8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<seastar::json::json_return_type>, api::set_storage_service(api::http_context&, seastar::httpd::routes&)::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}::operator()(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >) const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}, {lambda()#1}<seastar::json::json_return_type> >({lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::json::json_return_type>&&, {lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 [shard 0] #3 (task) 0x000060000a1af700 0x0000000000812208 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::httpd::function_handler::function_handler(std::function<seastar::future<seastar::json::json_return_type> (std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)> const&)::{lambda(std::unique_ptr<seastar::httpd::request, 
std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}::operator()(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >) const::{lambda(seastar::json::json_return_type&&)#1}, seastar::future<seastar::json::json_return_type>::then_impl_nrvo<seastar::json::json_return_type&&, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > >(seastar::json::json_return_type&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, seastar::json::json_return_type&, seastar::future_state<seastar::json::json_return_type>&&)#1}, seastar::json::json_return_type> + 16 [shard 0] #4 (task) 0x0000600009d86440 0x0000000000812228 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::httpd::function_handler::handle(seastar::basic_sstring<char, unsigned int, 15u, true> const&, std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future>({lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, 
{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16 [shard 0] #5 (task) 0x0000600009dba0c0 0x0000000000812f48 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::handle_exception<std::function<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > (std::__exception_ptr::exception_ptr)>&>(std::function<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > (std::__exception_ptr::exception_ptr)>&)::{lambda(auto:1&&)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_wrapped_nrvo<seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, {lambda(auto:1&&)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16 [shard 0] #6 (task) 0x0000600026783ae0 0x00000000008118b0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<bool>, seastar::httpd::connection::generate_reply(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, 
seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::httpd::connection::generate_reply(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}<bool> >({lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<bool>&&, {lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16 [shard 0] #7 (task) 0x000060000a4089c0 0x0000000000811790 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::httpd::connection::read_one()::{lambda()#1}::operator()()::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}::operator()(std::default_delete<std::unique_ptr>) const::{lambda(std::default_delete<std::unique_ptr>)#1}::operator()(std::default_delete<std::unique_ptr>) const::{lambda(bool)#2}, seastar::future<bool>::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}, {lambda(std::default_delete<std::unique_ptr>)#1}<void> >({lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}&, seastar::future_state<bool>&&)#1}, bool> + 16 [shard 0] #8 (task) 0x000060000a5b16e0 0x0000000000811430 vtable for 
seastar::internal::do_until_state<seastar::httpd::connection::read()::{lambda()#1}, seastar::httpd::connection::read()::{lambda()#2}> + 16 [shard 0] #9 (task) 0x000060000aec1080 0x00000000008116d0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::httpd::connection::read()::{lambda(seastar::future<void>)#3}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#3}>({lambda(seastar::future<void>)#3}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#3}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 [shard 0] #10 (task) 0x000060000b7d2900 0x0000000000811950 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::httpd::connection::read()::{lambda()#4}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::httpd::connection::read()::{lambda()#4}>(seastar::httpd::connection::read()::{lambda()#4}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::httpd::connection::read()::{lambda()#4}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16 Found no further pointers to task objects. If you think there should be more, run `scylla fiber 0x000060002d650200 --verbose` to learn more. Note that continuation across user-created seastar::promise<> objects are not detected by scylla-fiber. 
``` Closes #11822 * github.com:scylladb/scylladb: scylla-gdb.py: collection_element: add support for boost::intrusive::list scylla-gdb.py: optional_printer: eliminate infinite loop scylla-gdb.py: scylla-fiber: add note about user-instantiated promise objects scylla-gdb.py: scylla-fiber: reject self-references when probing pointers scylla-gdb.py: scylla-fiber: add starting task to known tasks scylla-gdb.py: scylla-fiber: add support for walking over when_all scylla-gdb.py: add when_all_state to task type whitelist scylla-gdb.py: scylla-fiber: also print shard of tasks scylla-gdb.py: scylla-fiber: unify task printing scylla-gdb.py: scylla fiber: add support for walking over shards scylla-gdb.py: scylla fiber: add support for walking over seastar threads scylla-gdb.py: scylla-ptr: keep current thread context scylla-gdb.py: improve scylla column_families scylla-gdb.py: scylla_sstables.filename(): fix generation formatting scylla-gdb.py: improve schema_ptr scylla-gdb.py: scylla memory: restore compatibility with <= 5.1	2022-11-03 13:52:31 +02:00
Kamil Braun	2049962e11	Fix version numbers in upgrade page title Closes #11878	2022-11-03 10:06:25 +02:00
Takuya ASADA	45789004a3	install-dependencies.sh: update node_exporter to 1.4.0 To fix CVE-2022-24675, we need a binary compiled with golang >= 1.18.1. The only released version compiled with golang >= 1.18.1 is node_exporter 1.4.0, so we need to update to it. See scylladb/scylla-enterprise#2317 Closes #11400 [avi: regenerated frozen toolchain] Closes #11879	2022-11-03 10:15:22 +04:00
Yaron Kaikov	20110bdab4	configure.py: remove unused tar files creation Starting from https://github.com/scylladb/scylla-pkg/pull/3035 we removed all old tar.gz prefixes from being uploaded to S3 or used by downstream jobs. Hence, there is no point in building those tar.gz files anymore. Closes #11865	2022-11-02 17:44:09 +02:00
Anna Stuchlik	d1f7cc99bc	doc: fix the external links to the ScyllaDB University lesson about TTL Closes #11876	2022-11-02 15:05:43 +02:00
Nadav Har'El	59fa8fe903	Merge 'doc: add the information about AArch64 support to Requirements' from Anna Stuchlik Fix https://github.com/scylladb/scylla-doc-issues/issues/864 This PR: - updates the introduction to add information about AArch64 and rewrite the content. - replaces "Scylla" with "ScyllaDB". Closes #11778 * github.com:scylladb/scylladb: Update docs/getting-started/system-requirements.rst doc: fix the link to the OS Support page doc: replace Scylla with ScyllaDB doc: update the info about supported architecture and rewrite the introduction	2022-11-02 11:18:20 +02:00
Anna Stuchlik	ea799ad8fd	Update docs/getting-started/system-requirements.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-11-02 09:56:56 +01:00
guy9	097a65df9f	adding top banner to the Docs website with a link to the ScyllaDB University fall LIVE event Closes #11873	2022-11-02 10:20:40 +02:00
Nadav Har'El	b9d88a3601	cql/pytest: add reproducer for timestamp column validation issue This patch adds a reproducing test for issue #11588, which is still open so the test is expected to fail on Scylla ("xfail"), and passes on Cassandra. The test shows that Scylla allows an out-of-range value to be written to a timestamp column, but then it can't be read back. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11864	2022-11-01 08:11:01 +02:00
Botond Dénes	dc46bfa783	Merge 'Prepare repair for task manager integration' from Aleksandra Martyniuk The PR prepares repair for task manager integration: - Creates repair_module - Keeps repair_module in repair_service - Moves tracker methods to repair_module - Changes UUID to task_id in repair module Closes #11851 * github.com:scylladb/scylladb: repair: check shutdown with abort source in repair module repair: use generic module gate for repair module operations repair: move tracker to repair module repair: move next_repair_command to repair_module repair: generate repair id in repair module repair: keep shard number in repair_uniq_id repair: change UUID to task_id repair: add task_manager::module to repair_service repair: create repair module and task	2022-11-01 08:05:14 +02:00
Aleksandra Martyniuk	f2fe586f03	repair: check shutdown with abort source in repair module In repair module the shutdown can be checked using abort_source. Thus, we can get rid of shutdown flag.	2022-10-31 10:57:29 +01:00
Aleksandra Martyniuk	2d878cc9b5	repair: use generic module gate for repair module operations Repair module uses a gate to prevent starting new tasks on shutdown. Generic module's gate serves the same purpose, thus we can use it also in repair specific context.	2022-10-31 10:56:36 +01:00
Aleksandra Martyniuk	4aae7e9026	repair: move tracker to repair module Since both tracker and repair_module serve a similar purpose, it is confusing where we should look for methods connected to them. Thus, to make it more transparent, the tracker class is deleted and all its attributes and methods are moved to repair_module.	2022-10-31 10:55:36 +01:00
Aleksandra Martyniuk	a5c05dcb60	repair: move next_repair_command to repair_module The number of the repair operation was counted both with next_repair_command from tracker and the sequence number from task_manager::module. To get rid of the redundancy, next_repair_command was deleted and all methods using its value were moved to repair_module.	2022-10-31 10:54:39 +01:00
Aleksandra Martyniuk	c81260fb8b	repair: generate repair id in repair module repair_uniq_id for repair task can be generated in repair module and accessed from the task.	2022-10-31 10:54:24 +01:00
Aleksandra Martyniuk	6432a26ccf	repair: keep shard number in repair_uniq_id Execution shard is one of the traits specific to repair tasks. Child task should freely access shard id of its parent. Thus, the shard number is kept in a repair_uniq_id struct.	2022-10-31 10:41:17 +01:00
guy9	276ec377c0	removed broken roadmap link Closes #11854	2022-10-31 11:33:03 +02:00
Aleksandra Martyniuk	e2c7c1495d	repair: change UUID to task_id Change type of repair id from utils::UUID to task_id to distinguish them from ids of other entities.	2022-10-31 10:07:08 +01:00
Aleksandra Martyniuk	dc80af33bc	repair: add task_manager::module to repair_service repair_service keeps a shared pointer to repair_module.	2022-10-31 10:04:50 +01:00
Aleksandra Martyniuk	576277384a	repair: create repair module and task Create repair_task_impl and repair_module inheriting from respectively task manager task_impl and module to integrate repair operations with task manager.	2022-10-31 10:04:48 +01:00
Takuya ASADA	159bc7c7ea	install-dependencies.sh: use binary distributions of PIP package We currently avoid compiling C code in "pip3 install scylla-driver", but we actually provide portable binary distributions of the package, so we should use them via "pip3 install --only-binary=:all: scylla-driver". The binary distribution contains the dependency libraries, so we won't have a problem loading it on relocatable python3. Closes #11852	2022-10-31 10:38:36 +02:00
Kamil Braun	db6cc035ed	test/raft: raft_address_map_test: add replication test	2022-10-31 09:17:12 +01:00
Kamil Braun	7d84007fd5	service/raft: raft_address_map: replicate non-expiring entries to other shards Replicating `raft_address_map` entries is needed for the following use cases: - the direct failure detector - currently it assumes a static mapping of `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft group 0 configuration changes. To handle dynamic mappings we need to modify the failure detector so it pings `raft::server_id`s and obtains the `gms::inet_address` before sending the message from `raft_address_map`. The failure detector is sharded, so we need the mappings to be available on all shards. - in the future we'll have multiple Raft groups running on different shards. To send messages they'll need `raft_address_map`. Initially I tried to replicate all entries - expiring and non-expiring. The implementation turned out to be very complex - we need to handle dropping expired entries and refreshing expiring entries' timestamps across shards, and doing this correctly while accounting for possible races is quite problematic. Eventually I arrived at the conclusion that replicating only non-expiring entries, and furthermore allowing non-expiring entries to be added only on shard 0, is good enough for our use cases: - The direct failure detector is pinging group 0 members only; group 0 members correspond exactly to the non-expiring entries. - Group 0 configuration changes are handled on shard 0, so non-expiring entries are added/removed on shard 0. - When we have multiple Raft groups, we can reuse a single Raft server ID for all Raft servers running on a single node belonging to different groups; they are 'namespaced' by the group IDs. Furthermore, every node has a server that belongs to group 0. Thus for every Raft server in every group, it has a corresponding server in group 0 with the same ID, which has a non-expiring entry in `raft_address_map`, which is replicated to all shards; so every group will be able to deliver its messages. 
With these assumptions the implementation is short and simple. We can always complicate it in the future if we find that the assumptions are too strong.	2022-10-31 09:17:12 +01:00
Kamil Braun	acacbad465	service/raft: raft_address_map: assert when entry is missing in drop_expired_entries	2022-10-31 09:17:12 +01:00
Kamil Braun	159bb32309	service/raft: turn raft_address_map into a service	2022-10-31 09:17:10 +01:00
Botond Dénes	139fbb466e	Merge 'Task manager extension' from Aleksandra Martyniuk The PR adds changes to task manager that allow more convenient integration with modules. Introduced changes: - adds internal flag in task::impl that allows user to filter too specific tasks - renames `parent_data` to more appropriate name `task_info` - creates `tasks/types.hh` which allows using some types connected with task manager without the necessity to include whole task manager - adds more flexible version of `make_task` method Closes #11821 * github.com:scylladb/scylladb: tasks: add alternative make_task method tasks: rename parent_data to task_info and move it tasks: move task_id to tasks/types.hh tasks: add internal flag for task_manager::task::impl	2022-10-31 09:57:10 +02:00
Botond Dénes	2c021affd1	Merge 'storage_service, repair: use per-shard abort_source' from Benny Halevy Prevent copying shared_ptr across shards in do_sync_data_using_repair by allocating a shared_ptr<abort_source> per shard in node_ops_meta_data and respectively in node_ops_info. Fixes #11826 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11827 * github.com:scylladb/scylladb: repair: use sharded abort_source to abort repair_info repair: node_ops_info: add start and stop methods storage_service: node_ops_abort_thread: abort all node ops on shutdown storage_service: node_ops_abort_thread: co_return only after printing log message storage_service: node_ops_meta_data: add start and stop methods repair: node_ops_info: prevent accidental copy	2022-10-31 09:43:34 +02:00
Botond Dénes	63a90cfb6c	scylla-gdb.py: collection_element: add support for boost::intrusive::list	2022-10-31 08:18:20 +02:00
Botond Dénes	2fa1864174	scylla-gdb.py: optional_printer: eliminate infinite loop Currently, to_string() recursively calls itself for engaged optionals. Eliminate it. Also, use the std_optional wrapper instead of accessing std::optional internals directly.	2022-10-31 08:18:20 +02:00
Botond Dénes	77b2555a04	scylla-gdb.py: scylla-fiber: add note about user-instantiated promise objects Scylla fiber uses a crude method of scanning inbound and outbound references to/from other task objects of recognized type. This method cannot detect user-instantiated promise<> objects. Add a note about this to the printout, so users are aware of this.	2022-10-31 08:18:20 +02:00
Botond Dénes	2276565a2e	scylla-gdb.py: scylla-fiber: reject self-references when probing pointers A self-reference is never the pointer we are looking for when looking for other tasks referencing us. Reject such references when scanning outright.	2022-10-31 08:18:20 +02:00
Botond Dénes	f4365dd7f5	scylla-gdb.py: scylla-fiber: add starting task to known tasks We collect already seen tasks in a set to be able to detect perceived task loops and stop when one is seen. Initialize this set with the starting task, so if it forms a loop, we won't repeat it in the trace before cutting the loop.	2022-10-31 08:18:20 +02:00
Botond Dénes	48bbf2e467	scylla-gdb.py: scylla-fiber: add support for walking over when_all	2022-10-31 08:18:20 +02:00
Botond Dénes	cb8f02e24b	scylla-gdb.py: add when_all_state to task type whitelist	2022-10-31 08:18:20 +02:00
Botond Dénes	62621abc44	scylla-gdb.py: scylla-fiber: also print shard of tasks Now that scylla-fiber can cross shards, it is important to display the shard each task in the chain lives on.	2022-10-31 08:18:19 +02:00
Botond Dénes	c21c80f711	scylla-gdb.py: scylla-fiber: unify task printing Currently there are two loops and a separate line printing the starting task, all duplicating the formatting logic. Define a method for it and use it in all 3 places instead.	2022-10-31 08:18:19 +02:00
Botond Dénes	c103280bfd	scylla-gdb.py: scylla fiber: add support for walking over shards Shard boundaries can be crossed in one direction currently: when looking for waiters on a task, but not in the other direction (looking for waited-on tasks). This patch fixes that.	2022-10-31 08:18:19 +02:00
Botond Dénes	437f888ba0	scylla-gdb.py: scylla fiber: add support for walking over seastar threads Currently seastar threads end any attempt to follow waited-on-futures. Seastar threads need special handling because they allocate the wake up task on their stack. This patch adds this special handling.	2022-10-31 08:18:19 +02:00
Botond Dénes	fcc63965ed	scylla-gdb.py: scylla-ptr: keep current thread context scylla_ptr.analyze() switches to the thread the analyzed object lives on, but forgets to switch back. This was very annoying as any commands using it (which is a bunch of them) were prone to suddenly and unexpectedly switching threads. This patch makes sure that the original thread context is switched back to after analyzing the pointer.	2022-10-31 08:18:19 +02:00
Botond Dénes	91516c1d68	scylla-gdb.py: improve scylla column_families Rename to scylla tables. Less typing and more up-to-date. By default it now only lists tables from local shard. Added flag -a which brings back old behaviour (lists on all shards). Added -u (only list user tables) and -k (list tables of provided keyspace only) filtering options.	2022-10-31 08:18:19 +02:00
Botond Dénes	1d3d613b76	scylla-gdb.py: scylla_sstables.filename(): fix generation formatting Generation was recently converted from an integer to an object. Update the filename formatting, while keeping backward compatibility.	2022-10-31 08:18:19 +02:00
Botond Dénes	c869f54742	scylla-gdb.py: improve schema_ptr Add __getitem__(), so members can be accessed. Strip " from ks_name and cf_name. Add is_system().	2022-10-31 08:18:19 +02:00
Botond Dénes	66832af233	scylla-gdb.py: scylla memory: restore compatibility with <= 5.1 Recent reworks around dirty memory manager broke backward compatibility of the scylla memory command (and possibly others). This patch restores it.	2022-10-31 08:18:19 +02:00
Tenghuan He	e0948ba199	Add directory change instruction Add directory change instruction while building scylla Closes #11717	2022-10-30 23:53:02 +02:00
Pavel Emelyanov	477e0c967a	scylla-gdb: Evaluate LSA object sizes dynamically The lsa-segment command tries to walk LSA segment objects by decoding their descriptors and (!) object sizes as well. Some objects in LSA have dynamic sizes, i.e. those depending on the object contents. The script tries to drill down the object internals to get this size, but bad news is that nowadays there are many dynamic objects that are not covered. Once stepped upon unsupported object, scylla-gdb likely stops because the "next" descriptor happens to be in the middle of the object and its parsing throws. This patch fixes this by taking advantage of the virtual size() call of the migrate_fn_type all LSA objects are linked with (indirectly). It gets the migrator object, the LSA object itself and calls ((migrate_fn_type)<migrator_ptr>)->size((const void)<object_ptr>) with gdb. The evaluated value is the live dynamic size of the object. fixes: #11792 refs: #2455 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11847	2022-10-28 14:11:30 +03:00
Botond Dénes	74c9aa3a3f	Merge 'removenode: allow specifying nodes to ignore using host_id' from Benny Halevy Currently, when specifying nodes to ignore for replace or removenode, we support specifying them only using their ip address. As discussed in https://github.com/scylladb/scylladb/issues/11839 for removenode, we intentionally require the host uuid for specifying the node to remove, so the nodes to ignore (that are also down, otherwise we need not ignore them), should be consistent with that and be specified using their host_id. The series extends the apis and allows either the nodes' ip address or their host_id to be specified, for backward compatibility. We should deprecate the ip address method over time and convert the tests and management software to use the ignored nodes' host_id:s instead. Closes #11841 * github.com:scylladb/scylladb: api: doc: remove_node: improve summary api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint locator: token_metadata: add parse_host_id_and_endpoint api: storage_service: remove_node: validate host_id	2022-10-28 13:35:04 +03:00
Benny Halevy	335a8cc362	api: doc: remove_node: improve summary The current summary of the operation is obscure. It refers to a token in the ring and the endpoint associated with it, while the operation uses a host_id to identify a whole node. Instead, clarify the summary to refer to a node in the cluster, consistent with the description for the host_id parameter. Also, describe the effect the call has on the data the removed node logically owned. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-28 07:52:37 +03:00
Benny Halevy	9ef2631ec2	api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s Currently the api is inconsistent: requiring a uuid for the host_id of the node to be removed, while the ignored nodes list is given as comma-separated ip addresses. Instead, support identifying the ignored_nodes either by their host_id (uuid) or ip address. Also, require all ignore_nodes to be of the same kind: either UUIDs or ip addresses, as a mix of the 2 is likely indicating a user error. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-28 07:49:03 +03:00
Benny Halevy	40cd685371	storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint Allow specifying the dead node to ignore either as host_id or ip address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-28 07:38:13 +03:00
Benny Halevy	b74807cb8a	locator: token_metadata: add parse_host_id_and_endpoint To be used for specifying nodes either by their host_id or ip address and using the token_metadata to resolve the mapping. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-28 07:38:13 +03:00
Benny Halevy	340a5a0c94	api: storage_service: remove_node: validate host_id The node to be removed must be identified by its host_id. Validate that at the api layer and pass the parsed host_id down to storage_service::removenode. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-28 07:38:13 +03:00
Takuya ASADA	464b5de99b	scylla_setup: allow symlink to --disks option Currently, --disks options does not allow symlinks such as /dev/disk/by-uuid/* or /dev/disk/azure/*. To allow using them, is_unused_disk() should resolve symlink to realpath, before evaluating the disk path. Fixes #11634 Closes #11646	2022-10-28 07:24:11 +03:00
Botond Dénes	b744036840	Merge 'scylla_util.py: on sysconfig_parser, don't use double quote when it's possible' from Takuya ASADA It seems that the distributions' original sysconfig files do not use double quotes to set a parameter when the value does not contain spaces. Add a function to detect spaces in the value, and don't use double quotes when none are detected. Fixes #9149 Closes #9153 * github.com:scylladb/scylladb: scylla_util.py: adding unescape for sysconfig_parser scylla_util.py: on sysconfig_parser, don't use double quote when it's possible	2022-10-28 07:19:13 +03:00
Benny Halevy	44e1058f63	docs: nodetool/removenode: fix host_id in examples removenode host_id must specify the host ID as a UUID, not an ip address. Fixes #11839 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11840	2022-10-27 14:29:36 +03:00
Pavel Emelyanov	7b193ab0a5	messaging_service: Deny putting INADDR_ANY as preferred ip Even though the previous patch makes scylla not gossip this as internal_ip, an extra sanity check may still be useful. E.g. older versions of scylla may still do it, or this address can be loaded from system_keyspace. refs: #11502 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-27 14:25:43 +03:00
Pavel Emelyanov	aa7a759ac9	messaging_service: Toss preferred ip cache management Make it call cache_preferred_ip() even when the cache is loaded from system_keyspace and move the connection reset there. This is mainly to prepare for the next patch, but also makes the code a bit shorter Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-27 14:25:43 +03:00
Pavel Emelyanov	91b460f1c4	gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP Gossiping 0.0.0.0 as preferred IP may break the peer as it will "interpret" this address as <myself> which is not what peer expects. However, g.p.f.s. uses --listen-address argument as the internal IP and it's not prohibited to configure it to be 0.0.0.0 It's better not to gossip the INTERNAL_IP property at all if the listen address is such. fixes: #11502 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-27 14:25:43 +03:00
Pavel Emelyanov	99579bd186	gossiping_property_file_snitch: Make _listen_address optional As the preparation for the next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-27 14:15:26 +03:00
Benny Halevy	0ea8250e83	repair: use sharded abort_source to abort repair_info Currently we use a single shared_ptr<abort_source> that can't be copied across shards. Instead, use a sharded<abort_source> in node_ops_info so that each repair_info instance will use an (optional) abort_source* on its own shard. Added respective start and stop methods, plus a local_abort_source getter to get the shard-local abort_source (if available). Fixes #11826 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-27 12:18:30 +03:00
Benny Halevy	88f993e5ed	repair: node_ops_info: add start and stop methods Prepare for adding a sharded<abort_source> member. Wire start/stop in storage_service::node_ops_meta_data. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-27 12:18:30 +03:00
Benny Halevy	c2f384093d	storage_service: node_ops_abort_thread: abort all node ops on shutdown A later patch adds a sharded<abort_source> to node_ops_info. On shutdown, we must orderly stop it, so use node_ops_abort_thread shutdown path (where node_ops_singal_abort is called with a nullopt) to abort (and stop) all outstanding node_ops by passing a null_uuid to node_ops_abort, and let it iterate over all node ops to abort and stop them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-27 12:14:06 +03:00
Benny Halevy	0efd290378	storage_service: node_ops_abort_thread: co_return only after printing log message Currently the function co_returns if (!uuid_opt) so the log info message indicating it's stopped is not printed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-27 12:14:03 +03:00
Benny Halevy	47e4761b4e	storage_service: node_ops_meta_data: add start and stop methods Prepare for starting and stopping repair node_ops_info Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-27 12:14:03 +03:00
Benny Halevy	5c25066ea7	repair: node_ops_info: prevent accidental copy Delete node_ops_info copy and move constructors before we add a sharded<abort_source> member for the per-shard repairs in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-27 12:14:03 +03:00
Takuya ASADA	cd6030d5df	scylla_util.py: adding unescape for sysconfig_parser Even though we have __escape() for escaping " in the middle of a value when writing the sysconfig file, we didn't unescape when reading from the sysconfig file. So add __unescape() and call it in get().	2022-10-27 16:39:47 +09:00
Takuya ASADA	de57433bcf	scylla_util.py: on sysconfig_parser, don't use double quote when it's possible It seems that the distributions' original sysconfig files do not use double quotes to set a parameter when the value does not contain spaces. Add a function to detect spaces in the value, and don't use double quotes when none are detected. Fixes #9149	2022-10-27 16:36:27 +09:00
Aleksandra Martyniuk	6494de9bb0	tasks: add alternative make_task method Task manager tasks should be created with make_task method since it properly sets information about child-parent relationship between tasks. Though, sometimes we may want to keep additional task data in classes inheriting from task_manager::task::impl. Doing it with existing make_task method makes it impossible since implementation objects are created internally. The commit adds a new make_task that allows to provide a task implementation pointer created by caller. All the fields except for the one connected with children and parent should be set before.	2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk	10d11a7baf	tasks: rename parent_data to task_info and move it parent_data struct contains info that is common for each task, not only in parent-child relationship context. To use it this way without confusion, its name is changed to task_info. In order to be able to widely and comfortably use task_info, it is moved from tasks/task_manager.hh to tasks/types.hh and slightly extended.	2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk	9ecc2047ac	tasks: move task_id to tasks/types.hh	2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk	e2e8a286cc	tasks: add internal flag for task_manager::task::impl It is convenient to create many different tasks implementations representing more and more specific parts of the operation in a module. Presenting all of them through the api makes it cumbersome for user to navigate and track, though. Flag internal is added to task_manager::task::impl so that the tasks could be filtered before they are sent to user.	2022-10-26 14:01:05 +02:00
Pavel Emelyanov	e245780d56	gossiper: Request topology states in shadow round When doing shadow round for replacement the bootstrapping node needs to know the dc/rack info about the node it replaces to configure it on topology. This topology info is later used by e.g. repair service. fixes: #11829 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11838	2022-10-25 13:21:20 +03:00
Pavel Emelyanov	64c9359443	storage_proxy: Don't use default-initialized endpoint in get_read_executor() After calling filter_for_query() the extra_replica to speculate to may be left default-initialized which is :0 ipv6 address. Later below this address is used as-is to check if it belongs to the same DC or not which is not nice, as :0 is not an address of any existing endpoint. Recent move of dc/rack data onto topology made this place reveal itself by emitting the internal error due to :0 not being present on the topology's collection of endpoints. Prior to this move the dc filter would count :0 as belonging to "default_dc" datacenter which may or may not match with the dc of the local node. The fix is to explicitly tell set extra_replica from unset one. fixes: #11825 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11833	2022-10-25 09:16:50 +03:00
Takuya ASADA	1a11a38add	unified: move unified package contents to sub-directory On most of the software distribution tar.gz, it has sub-directory to contain everything, to prevent extract contents to current directory. We should follow this style on our unified package too. To do this we need to increment relocatable package version to '3.0'. Fixes #8349 Closes #8867	2022-10-25 08:58:15 +03:00
Takuya ASADA	a938b009ca	scylla_raid_setup: run uuidpath existence check only after mount failed We added a UUID device file existence check in #11399. We expect the UUID device file to be created before checking, and we wait for its creation with "udevadm settle" after "mkfs.xfs". However, we are actually getting an error saying the UUID device file is missing, which probably means "udevadm settle" doesn't guarantee the device file is created under some conditions. To avoid the error, use var-lib-scylla.mount to wait until the UUID device file is ready, and run the file existence check only when the service has failed. Fixes #11617 Closes #11666	2022-10-25 08:54:21 +03:00
Yaniv Kaul	cec21d10ed	docs: Fix typo (patch -> batch) See subject. Closes #11837	2022-10-25 08:50:44 +03:00
Michał Radwański	36508bf5e9	serializer_impl: remove unneeded generic parameter Input stream used in vector_deserializer doesn't need to be generic, as there is only one implementation used.	2022-10-24 17:21:38 +02:00
Tomasz Grabiec	687df05e28	db: make_forwardable::reader: Do not emit range_tombstone_change with position past the range Since the end bound is exclusive, the end position should be before_key(), not after_key(). Affects only tests, as far as I know, only there we can get an end bound which is a clustering row position. Would cause failures once row cache is switched to v2 representation because of violated assumptions about positions. Introduced in `76ee3f029c` Closes #11823	2022-10-24 17:06:52 +03:00
Avi Kivity	9e34779c53	Update seastar submodule * seastar 601e0776c0...f32ed00954 (28): > Merge 'treewide: more fmt 9 adjustments' from Avi Kivity > rpc: Remove nested class friend declaration from connection > reactor: advance the head pointer in batch > Add git submodule instructions to HACKING.md, resolves #541 > dns: Handle TCP mode connect failure > future: s/make_exception_ptr/std::make_exception_ptr/ > reactor: implement read_some(fd, buffer, len) in io_uring > reactor: remove unneeded "protected" > Merge 'reactor: support more network ops in io_uring backend' from Kefu Chai > reactor: Indentation fix after previous patch > io: Remove --max-io-requests concept > future: add concept constraints to handle_exception() > future: improve the doxygen document > aio_general_context: flush: provide 1 second grace for retries > reactor: destroy_scheduling_group: make sure scheduling_group is valid > reactor: pass a plain pointer to io_uring_wait_cqes() > gate: add move ctor and move assignment operator for gate > reactor: drop stale comment > reactor_config: update stale doc comments > test: alloc_test: Actually prevent dead allocation elimination > util/closeable: hold _obj with reference_wrapper<> > memory: Fix off-by-one in large allocation detection > util/closeable: add move ctor for deferred_stop > reactor: Remove some unused friend declarations > core/sharded.hh: tweak on comment for better readability > Merge 'fmt 9 ostream fix' from longlene > program_options: allow configure switch-stytle option programmatically > inet_address: Add helper to check for address being lo/any Closes #11814	2022-10-21 21:30:07 +03:00
Botond Dénes	4aa0b16852	Merge 'distributed_loader: detect highest generation before populating column families' from Benny Halevy We should scan all sstables in the table directory and its subdirectories to determine the highest sstable version and generation before using it for creating new sstables (via reshard or reshape). Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789 Refs scylladb/scylladb#11789 Fixes scylladb/scylladb#11793 Closes #11795 * github.com:scylladb/scylladb: distributed_loader: populate_column_family: reindent distributed_loader: coroutinize populate_column_family distributed_loader: table_population_metadata: start: reindent distributed_loader: table_population_metadata: coroutinize start_subdir distributed_loader: table_population_metadata: start_subdir: reindent distributed_loader: pre-load all sstables metadata for table before populating it	2022-10-21 14:07:51 +03:00
Botond Dénes	e981bd4f21	Merge 'Alternator, MV: fix bug in some view updates which set the view key to its existing value' from Nadav Har'El As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted. In this series fix this bug (it was a bug in our materialized views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it). Fixes #11801 Closes #11808 * github.com:scylladb/scylladb: test/alternator: add test for issue 11801 MV: fix handling of view update which reassign the same key value materialized views: inline used-once and confusing function, replace_entry()	2022-10-21 10:49:28 +03:00
Botond Dénes	396d9e6a46	Merge 'Subscribe repair_info::abort on node_ops_meta_data::abort_source' from Pavel Emelyanov The storage_service::stop() calls repair_service::abort_repair_node_ops() but at that time the sharded<repair_service> is already stopped and call .local() on it just crashes. The suggested fix is to remove explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed on the node_ops_meta_data's abort source and abort themselves automatically. fixes: #10284 Closes #11797 * github.com:scylladb/scylladb: repair: Remove ops_uuid repair: Remove abort_repair_node_ops() altogether repair: Subscribe on node_ops_info::as abortion repair: Keep abort source on node_ops_info repair: Pass node_ops_info arg to do_sync_data_using_repair() repair: Mark repair_info::abort() noexcept node_ops: Remove _aborted bit node_ops: Simplify construction of node_ops_metadata main: Fix message about repair service starting	2022-10-21 10:08:43 +03:00
Avi Kivity	9ebac12e60	test: mutation-test: fix off-by-one in test_large_collection_allocation The test wants to see that no allocations larger than 128k are present, but sets the warning threshold to exactly 128k. Due to an off-by-one in Seastar, this went unnoticed. However, now that the off-by-one in Seastar is fixed [1], this test starts to fail. Fix by setting the warning threshold to 128k + 1. [1] `429efb5086` Closes #11817	2022-10-21 10:04:40 +03:00
Avi Kivity	f0643d1713	alternator: ttl: do not copy mutation while constructing a vector The vector(initializer_list<T>) constructor copies the T since initializer_list is read-only. Move the mutation instead. This happens to fix a use-after-return on clang 15 on aarch64. I'm fairly sure that's a miscompile, but the fix is worthwhile regardless. Closes #11818	2022-10-21 10:04:00 +03:00
Avi Kivity	db79f1eb60	Merge 'cql3: expr: Add unit tests for evaluate()' from Jan Ciołek This PR adds some unit tests for the `expr::evaluate()` function. At first I wanted to add the unit tests as part of #11658, but their size grew and grew, until I decided that they deserve their own pull request. I found a few places where I think it would be better to behave in a different way, but nothing serious. Closes #11815 * github.com:scylladb/scylladb: test/boost: move expr_test_utils.hh to .hh and .cc in test/lib cql3: expr: Add unit tests for bind_variable validation of collections cql3: expr: Add test for subscripted list and map cql3: expr: Add test for usertype_constructor cql3: expr: Add test for tuple_constructor cql3: expr: Add tests for evaluation of collection constructors cql3: expr: Add tests for evaluation of column_values and bind_variables cql3: expr: Add constant evaluation tests test/boost: Add expr_test_utils.hh cql3: Add ostream operator for raw_value cql3: add is_empty_value() to raw_value and raw_value_view	2022-10-20 22:55:34 +03:00
Jan Ciolek	4c4ed8e6df	test/boost: move expr_test_utils.hh to .hh and .cc in test/lib expr_test_utils.hh was a header file with helper methods for expression tests. All functions were inline, because I didn't know how to create and link a .cc file in test/boost. Now the header is split into expr_test_utils.hh and expr_test_utils.cc and moved to test/lib, which is designed to keep this kind of files. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-20 17:31:37 +02:00
Avi Kivity	6ce659be5b	Merge "Deglobalize snitch" from Pavel E " Snitch was the junction of several services' deps because it was the holder of endpoint->dc/rack mappings. Now this information is all on topology object, so snitch can be finally made main-local " * 'br-deglobalize-snitch' of https://github.com/xemul/scylla: code: Deglobalize snitch tests: Get local reference on global snitch instance once gossiper: Pass current snitch name into checker snitch: Add sharded<snitch_ptr> arg to reset_snitch() api: Move update_snitch endpoint api: Use local snitch reference api: Unset snitch endpoints on stop storage_service: Keep local snitch reference system_keyspace: Don't use global snitch instance snitch: Add const snitch_ptr::operator->()	2022-10-20 16:51:24 +03:00
Avi Kivity	dd0b571d7e	Update tools/java submodule (Scylla Cloud serverless config option) * tools/java 5f2b91d774...87672be28e (1): > Add serverless Scylla Cloud config file option	2022-10-20 16:15:28 +03:00
Konstantin Osipov	8c920add42	test: (pytest) fix the pytest wrapper to work on Ubuntu Ubuntu doesn't have python, only python2 and python3. Closes #11810	2022-10-20 15:53:24 +03:00
Botond Dénes	669b225c67	reader_permit: resources: remove operator bool and >= These cannot be meaningfully defined for a vector value like resources. To prevent instinctive misuse, remove them. Operator bool is replaced with `non_zero()` which hopefully better expresses what to expect. The comparison operator is just removed and inlined into its only user, which actually helps said user's readability. Closes #11813	2022-10-20 15:25:11 +03:00
Jan Ciolek	75b27cb61c	cql3: expr: Add unit tests for bind_variable validation of collections evaluating a bind variable should validate collection values. Test that bound collection values are validated, even in case of a nested collection. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-20 12:12:03 +02:00
Jan Ciolek	c4651e897f	cql3: expr: Add test for subscripted list and map Test that subscripting lists and maps works as expected. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-20 12:12:03 +02:00
Jan Ciolek	5a00c3dd76	cql3: expr: Add test for usertype_constructor Test that evaluate(usertype_constructor) works as expected. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-20 12:12:03 +02:00
Jan Ciolek	8f6309bd66	cql3: expr: Add test for tuple_constructor Test that evaluate(tuple_constructor) works as expected. It was necessary to implement a custom function for serializing tuples, because some tests require the tuple to contain unset_value or an empty value, which is impossible to express using the existing code. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-20 12:12:03 +02:00
Jan Ciolek	5ae719d51a	cql3: expr: Add tests for evaluation of collection constructors Test that evaluate(collection_constructor) works as expected. Added a bunch of utility methods for creating collection values to expr_test_utils.hh. I was forced to write custom serialization of collections. I tried to use data_value, but it doesn't allow expressing unset_value and empty values. The custom serialization isn't actually used in this specific commit, but it's needed in the following ones. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-20 12:12:02 +02:00
Pavel Emelyanov	01b1f56bd7	code: Deglobalize snitch All uses of snitch now have their own local reference. The global instance can now be replaced with the one living in main (and tests) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:33:41 +03:00
Pavel Emelyanov	8e4e3f7185	tests: Get local reference on global snitch instance once Some tests actively use global snitch instance. This patch makes each test get a local reference and use it everywhere. Next patch will replace global instance with local one Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:33:40 +03:00
Pavel Emelyanov	898579027d	gossiper: Pass current snitch name into checker Gossiper makes sure local snitch name is the same as the one of other nodes in the ring. It now gets global snitch to get the name, this patch passes the name as an argument, because the caller (storage_service) has snitch instance local reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:33:38 +03:00
Pavel Emelyanov	1674882220	snitch: Add sharded<snitch_ptr> arg to reset_snitch() The method replaces snitch instance on the existing sharded<snitch_ptr> and the "existing" is nowadays the global instance. This patch changes it to use local reference passed from API code Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:33:34 +03:00
Pavel Emelyanov	5fba0a7f65	api: Move update_snitch endpoint It's now living in storage_service.cc, but non-global snitch is available in endpoint_snitch.cc so move the endpoint handler there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:33:20 +03:00
Pavel Emelyanov	0d49b0e24a	api: Use local snitch reference The snitch/name endpoint needs snitch instance to get the name from. The storage_service/reset_snitch endpoint will also need snitch instance to call reset on. This patch carries local snitch reference all the way through API setup and patches the get_name() call. The reset_snitch() will come in the next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:31:45 +03:00
Pavel Emelyanov	c175ea33e2	api: Unset snitch endpoints on stop Some time soon snitch API handlers will operate on local snitch reference capture, so those need to be unset before the target local variable goes away Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:31:12 +03:00
Pavel Emelyanov	ea8bfc4844	storage_service: Keep local snitch reference Storage service uses snitch in several places: - boot - snitch-reconfigured subscription - preferred IP reconnection At this point it's worth adding storage_service->snitch explicit dependency and patch the above to use local reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:30:00 +03:00
Pavel Emelyanov	52d6e56a10	system_keyspace: Don't use global snitch instance There are two places to patch: .start() and .setup() and both only need snitch to get local dc/rack from, nothing more. Thus both can live with the explicit argument for now Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:29:26 +03:00
Pavel Emelyanov	f524a79fe9	snitch: Add const snitch_ptr::operator->() To call snitch->something() on const snitch_ptr& variable later Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:29:25 +03:00
Nadav Har'El	264f453b9d	Merge 'Associate alternator user with its service level configuration' from Piotr Sarna Until now, authentication in alternator served only two purposes: - refusing clients without proper credentials - printing user information with logs After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests. tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user Fixes #11379 Closes #11380 * github.com:scylladb/scylladb: alternator: propagate authenticated user in client state client_state: add internal constructor with auth_service alternator: pass auth_service and sl_controller to server	2022-10-19 23:27:48 +03:00
Avi Kivity	22f13e7ca3	Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity" This reverts commit `df8e1da8b2`, reversing changes made to `4ff204c028`. It causes a crash in debug mode on aarch64 (likely a coroutine miscompile). Fixes #11809.	2022-10-19 21:28:55 +03:00
Alexander Turetskiy	636e14cc77	Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs. The return from DescribeTable which describes GSIs and LSIs is missing the Projection field. We do not yet support all the settings Projection (see #5036), but the default which we support is ALL, and DescribeTable should return that in its description. Fixes #11470 Closes #11693	2022-10-19 19:01:08 +03:00
Avi Kivity	69199dbfba	Merge 'schema_tables: limit concurrency' from Benny Halevy To prevent stalls due to large number of tables. Fixes scylladb/scylladb#11574 Closes #11689 * github.com:scylladb/scylladb: schema_tables: merge_tables_and_views reindent schema_tables: limit paralellism	2022-10-19 18:40:45 +03:00
Tomasz Grabiec	a979bbf829	dbuild: Do not fail if .gdbinit is missing Closes #11811	2022-10-19 18:38:09 +03:00
Avi Kivity	6b0afb968d	Merge 'reader_concurrency_semaphore: add set_resources()' from Botond Dénes Allows changing the total or initial resources the semaphore has. After calling `set_resources()`, the semaphore will look as if it had been created with the specified amount of resources. Use the new method in `replica::database::revert_initial_system_read_concurrency_boost()` so it doesn't lead to strange semaphore diagnostics output. Currently the system semaphore has 90/100 count units when there are no reads against it, which has led to some confusion. I also plan on using the new facility in enterprise. Closes #11772 * github.com:scylladb/scylladb: replica/database: revert initial boost to system semaphore with set_resources() reader_concurrency_semaphore: add set_resources()	2022-10-19 18:04:20 +03:00
Raphael S. Carvalho	ba6186a47f	replica: Pick new generation for SSTables being moved from staging dir When moving a SSTable from staging to base dir, we reused the generation under the assumption that no SSTable in base dir uses that same generation. But that's not always true. When reshaping staging dir, reshape compaction can pick a generation taken by a SSTable in base dir. That's because staging dir is populated first and it doesn't have awareness of generations in base dir yet. When that happens, view building will fail to move the SSTable in staging which shares the same generation as another in base dir. We could have played with the order of population, populating base dir first and then staging dir, but the fragility wouldn't be gone. Not future proof at all. We can easily make this safe by picking a new generation for the SSTable being moved from staging, making sure no clash will ever happen. Fixes #11789. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11790	2022-10-19 15:33:30 +03:00
Nadav Har'El	2e439c9471	test/alternator: add test for issue 11801 This patch adds a test reproducing issue #11801, and confirming that the previous patch fixed it. Before the previous patch, the test passed on DynamoDB but failed on Alternator. The patch also adds four more passing tests which demonstrate that issue #11801 only happened in the very specific case where: 1. A GSI has two key attributes which weren't key attributes in the base, and 2. An update sets the second of those attributes to the same value which it already had. This bug was originally discovered and explained by @fee-mendes. Refs #11801. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-10-19 14:36:48 +03:00
Benny Halevy	4d7f0be929	distributed_loader: populate_column_family: reindent	2022-10-19 14:18:38 +03:00
Benny Halevy	030afaa934	distributed_loader: coroutinize populate_column_family Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-19 14:18:04 +03:00
Benny Halevy	0f23ee14c9	distributed_loader: table_population_metadata: start: reindent	2022-10-19 14:16:59 +03:00
Benny Halevy	39cec4f304	distributed_loader: table_population_metadata: coroutinize start_subdir Calling it in a seastar thread was done to reduce code churn and facilitate backporting. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-19 14:16:59 +03:00
Benny Halevy	5749a54cab	distributed_loader: table_population_metadata: start_subdir: reindent	2022-10-19 14:16:59 +03:00
Benny Halevy	119c0f3983	distributed_loader: pre-load all sstables metadata for table before populating it We should scan all sstables in the table directory and its subdirectories to determine the highest sstable version and generation before using it for creating new sstables (via reshard or reshape). Fixes scylladb/scylladb#11793 Note: table_population_metadata::start_subdir is called in a seastar thread to facilitate backporting to old versions that do not support coroutines yet. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-19 14:16:57 +03:00
Nadav Har'El	8f4243b875	MV: fix handling of view update which reassign the same key value When a materialized view has a key (in Alternator, this can be two keys) which was a regular column in the base table, and a base update modifies that regular column, there are two distinct cases: 1. If the old and new key values are different, we need to delete the old view row, and create a new view row (with the different key). 2. If the old and new key values are the same, we just need to update the pre-existing row. It's important not to confuse the two cases: If we try to delete and create the same view row in the same timestamp, the result will be that the row will be deleted (a tombstone wins over data if they have the same timestamp) instead of updated. This is what we saw in issue #11801. We had a bug that was seen when an update set the view key column to the old value it already had: To compare the old and new key values we used the function compare_atomic_cell_for_merge(), but this compared not just the values but also incorrectly compared the metadata such as the timestamp. Because setting a column to the same value changes its timestamp, we wrongly concluded that these were different view keys and used the delete-and-create code for this case, resulting in the view row being deleted (as explained above). The simple fix is to compare just the key values - not looking at the metadata. See tests reproducing this bug and confirming its fix in the next patch. Fixes #11801 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-10-19 13:43:12 +03:00
Nadav Har'El	e1f8cb6521	materialized views: inline used-once and confusing function, replace_entry() The replace_entry() function is nothing more than a convenience for calling delete_old_entry() and then create_entry(). But it is only used once in the code, and we can just open-code the two calls instead of the one. The reason I want to change it now is that the shortcut replace_entry() helped hide a bug (#11801) - replace_entry() works incorrectly if the old and new row have the same key, because if they do we get a deletion and creation of the same row with the same timestamp - and the deletion wins. Having the two calls not hidden by a convenience function makes this potential problem more apparent. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-10-19 13:25:34 +03:00
Benny Halevy	ce22dd4329	schema_tables: merge_tables_and_views reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-19 13:05:41 +03:00
Benny Halevy	7ccb0e70f0	schema_tables: limit paralellism To prevent stalls due to large number of tables. Fixes scylladb/scylladb#11574 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-19 13:05:38 +03:00
Anna Stuchlik	7ec750fc63	docs: add the list of new metrics in 5.1 Closes #11703	2022-10-19 12:06:25 +03:00
Jan Ciolek	1b7acc758e	cql3: expr: Add tests for evaluation of column_values and bind_variables Add tests which test that evaluate(column_value) and evaluate(bind_variable) work as expected. values of columns and bind variables are kept in evaluation_inputs, so we need to mock them in order for evaluate() to work. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-19 10:30:51 +02:00
Jan Ciolek	0f29015d9f	cql3: expr: Add constant evaluation tests Add unit test for evaluating expr::constant values. evaluate(constant) just returns constant.value, so there is no point in trying all the possible combinations. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-19 10:30:42 +02:00
Anna Stuchlik	a066396cd3	doc: fix the command to create and sign a certificate so that the trusted certificate SHA256 is created Closes #11758	2022-10-19 11:30:20 +03:00
Botond Dénes	37ebbc819a	Merge 'Scylla-gdb lsa polishing' from Pavel Emelyanov It was supposed to be a fix for #2455, but it turned out that #11792 blocks this progress and takes more effort. So for now only a couple of small improvements (not to lose them by chance) Closes #11794 * github.com:scylladb/scylladb: scylla-gdb: Make regions iterable object scylla-gdb: Dont print 0x0x	2022-10-19 06:54:49 +03:00
Botond Dénes	2d581e9e8f	Merge "Maintain dc/rack by topology" from Pavel Emelyanov " There's an ongoing effort to move the endpoint -> {dc/rack} mappings from snitch onto topology object and this set finalizes it. After it the snitch service stops depending on gossiper and system keyspace and is ready for de-globalization. As a nice side-effect the system keyspace no longer needs to maintain the dc/rack info cache and its starting code gets relaxed. refs: #2737 refs: #2795 " * 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits) system_keyspace: Dont maintain dc/rack cache system_keyspace: Indentation fix after previous patch system_keyspace: Coroutinuze build_dc_rack_info() topology: Move all post-configuration to topology::config snitch: Start early gossiper: Do not export system keyspace snitch: Remove gossiper reference snitch: Mark get_datacenter/_rack methods const snitch: Drop some dead dependency knots snitch, code: Make get_datacenter() report local dc only snitch, code: Make get_rack() report local rack only storage_service: Populate pending endpoint in on_alive() code: Populate pending locations topology: Put local dc/rack on topology early topology: Add pending locations collection topology: Make get_location() errors more verbose token_metadata: Add config, spread everywhere token_metadata: Hide token_metadata_impl copy constructor gosspier: Remove messaging service getter snitch: Get local address to gossip via config ...	2022-10-19 06:50:21 +03:00
Jan Ciolek	429600a957	test/boost: Add expr_test_utils.hh Add a header file which will contain utilities for writing expression tests. For now it contains simple functions like make_int_constant(), but there are many more to come. I feel like it's cleaner to put all these functions in a separate file instead of having them spread randomly between tests. It also enables code reuse so that future expression tests can reuse these functions instead of writing them from scratch. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-18 22:48:33 +02:00
Jan Ciolek	855db49306	cql3: Add ostream operator for raw_value It's possible to print raw_value_view, but not raw_value. It would be useful to be able to print both. Implement printing raw_value by creating raw_value_view from it and printing the view. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-18 22:48:25 +02:00
Jan Ciolek	096c65d27f	cql3: add is_empty_value() to raw_value and raw_value_view An empty value is a value that is neither null nor unset, but has 0 bytes of data. Such values can be created by the user using certain CQL functions, for example an empty int value can be inserted using blobasint(0x). Add a method to raw_value and raw_value_view, which allows to check whether the value is empty. This will be used in many places in which we need to validate that a value isn't empty. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-18 22:47:48 +02:00
Pavel Emelyanov	3dc7c33847	repair: Remove ops_uuid It used to be used to abort repair_info by the corresponding node-ops uuid, but this code is no longer there, so it's good to drop the uuid as well Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:23 +03:00
Pavel Emelyanov	b835c3573c	repair: Remove abort_repair_node_ops() altogether This code is dead after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:23 +03:00
Pavel Emelyanov	8231b4ec1b	repair: Subscribe on node_ops_info::as abortion When node_ops_meta_data aborts it also kicks repair to find and abort all relevant repair_infos. Now it can be simplified by subscribing repair_meta on the abort source and aborting it without explicit kick Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:23 +03:00
Pavel Emelyanov	bf5825daac	repair: Keep abort source on node_ops_info Next patches will need to subscribe on node_ops_meta_data's abort source inside repair code, so keep the pointer on node_ops_info too. At the same time, the node_ops_info::abort becomes obsolete, because the same check can be performed via the abort_source->abort_requested() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:23 +03:00
Pavel Emelyanov	bbb7fca09c	repair: Pass node_ops_info arg to do_sync_data_using_repair() Next patches will need to know more than the ops_uuid. The needed info is (well -- will be) sitting on node_ops_info, so pass it along Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:23 +03:00
Pavel Emelyanov	5e9c3c65b5	repair: Mark repair_info::abort() noexcept Next patch will call it inside abort_source subscription callback which requires the calling code to be noexcept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:23 +03:00
Pavel Emelyanov	34458ec2c5	node_ops: Remove _aborted bit A short cleanup "while at it" -- the node_ops_meta_data doesn't need to carry dedicated _aborted boolean -- the abort source that sets it is available instantly Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:04:22 +03:00
Pavel Emelyanov	96f0695731	node_ops: Simplify construction of node_ops_metadata It always constructs node_ops_info the same way Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 20:03:53 +03:00
Pavel Emelyanov	2fa58632b3	main: Fix message about repair service starting Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 17:23:17 +03:00
Botond Dénes	7fbad8de87	reader_concurrency_semaphore: unify admission logic across all paths The semaphore currently has two admission paths: the obtain_permit()/with_permit() methods which admits permits on user request (the front door) and the maybe_admit_waiters() which admits permits based on internal events like memory resource being returned (the back door). The two paths used their own admission conditions and naturally this means that they diverged in time. Notably, maybe_admit_waiters() did not look at inactive readers assuming that if there are waiters there cannot be inactive readers. This is not true however since we merged the execution-stage into the semaphore. Waiters can queue up even when there are inactive reads and thus maybe_admit_waiters() has to consider evicting some of them to see if this would allow for admitting new reads. To avoid such divergence in the future, the admission logic was moved into a new method can_admit_read() which is now shared between the two method families. This method now checks for the possibility of evicting inactive readers as well. The admission logic was tuned slightly to only consider evicting inactive readers if there is a real possibility that this will result in admissions: notably, before this patch, resource availability was checked before stalls were (used permits == blocked permits), so we could evict readers even if this couldn't help. Because now eviction can be started from maybe_admit_waiters(), which is also downstream from eviction, we added a flag to avoid recursive evict -> maybe admit -> evict ... loops. Fixes: #11770 Closes #11784	2022-10-18 17:07:43 +03:00
Pavel Emelyanov	b5fd65af61	scylla-gdb: Make regions iterable object This makes it re-usable across different commands (not there yet) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 16:09:47 +03:00
Pavel Emelyanov	0b6b0bd8d2	scylla-gdb: Dont print 0x0x Formatting pointer adds 0x automatically, no need in adding it explicitly Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-18 16:09:09 +03:00
Botond Dénes	df8e1da8b2	Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity indexed_table_select_statement::do_execute_base_query() is fairly complicated and becomes a little simpler with coroutines. Closes #11297 * github.com:scylladb/scylladb: cql3: indexed_table_select_statement: fix indentation cql3: indexed_table_select_statement: clarify loop termination cql3: indexed_table_select_statement: get rid of internal base_query_state struct cql3: indexed_table_select_statement: coroutinize do_execute_base_query() cql3: indexed_table_select_statement: de-result_wrap() do_execute_base_query()	2022-10-18 08:24:21 +03:00
Avi Kivity	50b1fd4cd2	cql3: indexed_table_select_statement: fix indentation Restore normal indentation after coroutinization, no code changes.	2022-10-17 22:03:11 +03:00
Avi Kivity	3ad956ca2d	cql3: indexed_table_select_statement: clarify loop termination The loop terminates when we run out of keys. There are extra conditions such as for short read or page limit, but these are truly discovered during the loop and qualify as special conditions, if you squint enough.	2022-10-17 22:03:11 +03:00
Avi Kivity	ec183d4673	cql3: indexed_table_select_statement: get rid of internal base_query_state struct It was just a crutch for do_with(), and now can be replaced with ordinary coroutine-protected variables. The member names were renamed to the final names they were assigned within the do_with().	2022-10-17 22:03:11 +03:00
Avi Kivity	75e1321b08	cql3: indexed_table_select_statement: coroutinize do_execute_base_query() Indentation and "infinite" for-loop left for later cleanup. Note the last check for a utils::result<> failure is no longer needed, since the previous checks for failure resulted in an immediate co_return rather than propagating the failure into a variable as with continuations. The lambda coroutine is stabilized with the new seastar::coroutine::lambda facility.	2022-10-17 22:03:11 +03:00
Avi Kivity	8b019841d8	cql3: indexed_table_select_statement: de-result_wrap() do_execute_base_query() It's an obstacle to coroutinization as it introduces more lambdas.	2022-10-17 22:03:11 +03:00
Tomasz Grabiec	4ff204c028	Merge 'cache: make all removals of cache items explicit' from Michał Chojnowski This series is a step towards non-LRU cache algorithms. Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor. However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`. This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction. Closes #11716 * github.com:scylladb/scylladb: utils: lru: assert that evictables are unlinked before destruction utils: lru: remove unlink_from_lru() cache: make all cache unlinks explicit	2022-10-17 12:47:02 +02:00
Michał Chojnowski	a96433d3a4	utils: lru: assert that evictables are unlinked before destruction Previous patches introduce the assumption that evictables are manually unlinked before destruction, to allow for correct bookkeeping within the cache. This assert ensures that this assumption is correct. This is particularly important because the switch from automatic to explicit unlinking had to be done manually. Destructor calls are invisible, so it's possible that we have missed some automatic destruction site.	2022-10-17 12:07:27 +02:00
Michał Chojnowski	f340c9cca5	utils: lru: remove unlink_from_lru() unlink_from_lru() allows for unlinking elements from cache without notifying the cache. This messes up any potential cache bookkeeping. Improved that by replacing all uses of unlink_from_lru() with calls to lru::remove(), which does have access to cache's metadata.	2022-10-17 12:07:27 +02:00
Michał Chojnowski	d785364375	cache: make all cache unlinks explicit Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning that elements of the list unlink themselves from the list automatically on destruction. Some parts of the code rely on that, and don't unlink them manually. However, this precludes accurate bookkeeping about the cache. Elements only have access to themselves and their neighbours, not to any bookkeeping context. Therefore, a destructor cannot update the relevant metadata. In this patch, we fix this by adding explicit unlink calls to places where it would be done by a destructor. In a following patch, we will add an assert to the destructor to check that every element is unlinked before destruction.	2022-10-17 12:07:27 +02:00
Nadav Har'El	c31bf4184f	test/cql-pytest: two reproducers for SI returning oversized pages This patch has two reproducing tests for issue #7432, which are cases where a paged query with a restriction backed by a secondary-index returns pages larger than the desired page size. Because these tests reproduce a still-open bug, they are both marked "xfail". Both tests pass on Cassandra. The two tests involve quite dissimilar cases - one involves requesting an entire partition (and Scylla forgetting to page through it), and the other involves GROUP BY - so I am not sure these two bugs even have the same underlying cause. But they were both reported in #7432, so let's have reproducers for both. Refs #7432 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11586	2022-10-17 11:36:05 +03:00
Botond Dénes	d85208a574	replica/database: revert initial boost to system semaphore with set_resources() Unlike the current method (which uses consume()), this will also adjust the initial resources, adjusting the semaphore as if it was created with the reduced amount of resources in the first place. This fixes the confusing 90/100 count resources seen in diagnostics dump outputs.	2022-10-17 07:39:20 +03:00
Botond Dénes	ecc7c72acd	reader_concurrency_semaphore: add set_resources() Allows changing the total or initial resources the semaphore has. After calling `set_resources()`, the semaphore will look as if it had been created with the specified amount of resources.	2022-10-17 07:39:20 +03:00
Avi Kivity	e5e7780f32	test: work around modern pytest rejecting site-packages Modern (as of Fedora 37) pytest has the "-sP" flags in the Python command line, as found in /usr/bin/pytest. This means it will reject the site-packages directory, where we install the Scylla Python driver. This causes all the tests to fail. Work around it by supplying an alternative pytest script that does not have this change. Closes #11764	2022-10-17 07:18:33 +03:00
Nadav Har'El	9f02431064	test/cql-pytest: fix test_permissions.py when running with "--ssl" The tests in test_permissions.py use the new_session() utility function to create a new connection with a different logged-in user. It models the new connection on the existing one, but incorrectly assumed that the connection is NOT ssl. This made the test fail when cql-pytest/run is passed the "--ssl" option. In this patch we correctly infer the is_ssl state from the existing cql fixture, instead of assuming it is false. After this patch, "cql-pytest/run --ssl" works as expected for this test. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11742	2022-10-17 06:46:46 +03:00
Tomasz Grabiec	c8a372ae7f	test: db: Add test for row merging involving many versions The test verifies that a row which participated in earlier merge, and its cells lost on the timestamp check, behaves exactly like an empty row and can accept any mutation. This wasn't the case in versions prior to `f006acc`. Closes #11787	2022-10-16 14:29:49 +03:00
Tomasz Grabiec	5d7e40af99	mvcc: Add snapshot details to the printout of partition_entry Useful for debugging. Closes #11788	2022-10-16 14:22:14 +03:00
Nadav Har'El	d2cd9b71b3	Merge 'Make tracing test run again, simplify backend registry and few related cleanups' from Pavel Emelyanov It turned out that boost/tracing test is not run because its name doesn't match the _test.cc pattern. While fixing it, it turned out that the test cannot even start, because it uses future<>.get() calls outside of seastar::thread context. While patching this place the trace-backend registry was removed for simplicity. Also, a few more cleanups "while at it" Closes #11779 github.com:scylladb/scylladb: tracing: Wire tracing test back tracing: Indentation fix after previous patch tracing: Move test into thread tracing: Dismantle trace-backend registry tracing: Use class-registrator for backends tracing: Add constraint to trace_state::begin() tracing: Remove copy-n-paste comments from test tracing: Outline may_create_new_session	2022-10-16 12:32:17 +03:00
Nadav Har'El	1f936838ba	Merge 'doc: fix the notes on the OS Support by Platform and Version page' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11773 This PR fixes the notes by removing repetition and improving the clarity of the notes on the OS Support page. In addition, "Scylla" was replaced with "ScyllaDB" on related pages. Closes #11783 * github.com:scylladb/scylladb: doc: replace Scylla with ScyllaDB doc: add a comment to remove in future versions any information that refers to previous releases doc: rewrite the notes to improve clarity doc: remove the reperitions from the notes	2022-10-16 10:13:50 +03:00
Tomasz Grabiec	87b7e7ff9c	Merge 'storage_proxy: prepare for fencing, complex ops' from Avi Kivity Following up on `69aea59d97`, which added fencing support for simple reads and writes, this series does the same for the complex ops: - partition scan - counter mutation - paxos With this done, the coordinator knows about all in-flight requests and can delay topology changes until they are retired. Closes #11296 * github.com:scylladb/scylladb: storage_proxy: hold effective_replication_map for the duration of a paxos transaction storage_proxy: move paxos_response_handler class to .cc file storage_proxy: deinline paxos_response_handler constructor/destructor storage_proxy: use consistent effective_replication_map for counter coordinator storage_proxy: improve consistency in query_partition_key_range{,_concurrent} storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency storage_proxy: query_singular: use fewer smart pointers storage_proxy: query_singular: simplify lambda captures locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata storage_proxy: use consistent token_metadata with rest of singular read	2022-10-14 15:44:35 +02:00
Pavel Emelyanov	6150214da3	Add rust/Cargo.lock to .gitignore The file appears after build Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11776	2022-10-14 13:54:50 +03:00
Anna Stuchlik	09b0e3f63e	doc: replace Scylla with ScyllaDB	2022-10-14 11:06:27 +02:00
Anna Stuchlik	9e2b7e81d3	doc: add a comment to remove in future versions any information that refers to previous releases	2022-10-14 10:53:17 +02:00
Anna Stuchlik	fc0308fe30	doc: rewrite the notes to improve clarity	2022-10-14 10:48:59 +02:00
Anna Stuchlik	1bd0bc00b3	doc: remove the reperitions from the notes	2022-10-14 10:32:52 +02:00
Botond Dénes	621e43a0c8	Merge 'dirty_memory_manager: tidy up' from Avi Kivity A collection of small cleanups, and a bug fix. Closes #11750 * github.com:scylladb/scylladb: dirty_memory_manager: move region_group data members to top-of-class dirty_memory_manager: update region_group comment dirty_memory_manager: remove outdated friend dirty_memory_manager: fold region_group::push_back() into its caller dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available dirty_memory_manager: tidy up region_group::execution_permitted() dirty_memory_manager: reindent region_group::release_queued_allocations() dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine dirty_memory_manager: move region_group::_releaser after _shutdown_requested dirty_memory_manager: move region_group queued allocation releasing into a function dirty_memory_manager: fold allocation_queue into region_group dirty_memory_manager: don't ignore timeout in allocation_queue::push_back()	2022-10-14 06:56:42 +03:00
Avi Kivity	1feaa2dfb4	storage_proxy: handle_write: use coroutine::all() instead of when_all() coroutine::all() saves an allocation. Since it's safe for lambda coroutines, remove a coroutine::lambda wrapper. Closes #11749	2022-10-14 06:56:16 +03:00
Tomasz Grabiec	ee2398960c	Merge 'service/raft: simplify `raft_address_map`' from Kamil Braun The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code. In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were not clear and enforced in runtime are now encoded in the type system. The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC. Closes #11763 * github.com:scylladb/scylladb: service/raft: raft_address_map: get rid of `is_linked` checks service/raft: raft_address_map: get rid of `to_list_iterator` service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr` service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr service/raft: raft_address_map: don't use intrusive set for timestamped entries service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`	2022-10-13 18:08:49 +02:00
Kamil Braun	954849799d	test/topology: disable flaky `test_decommission_add_column` Flaky due to #11780, causes next promotion failures. We can reenable it after the issue is fixed or a workaround is found.	2022-10-13 17:45:46 +02:00
Pavel Emelyanov	707efb6dfb	tracing: Wire tracing test back The boost/tracing test is not run, because test.py boost suite collects tests that match *_test.cc pattern. The tracing one apparently doesn't Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:59:13 +03:00
Pavel Emelyanov	5b67a2a876	tracing: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:59:08 +03:00
Pavel Emelyanov	53ac8536f1	tracing: Move test into thread The test calls future<>.get()'s in its lambda which is only allowed in seastar threads. It's not stepped upon because (surprise, surprise) this test is not run at all. Next patch fixes it. Meanwhile, the fix is in using cql_env_thread thing for the whole lambda, which runs it in seastar::async() context Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:57:35 +03:00
Pavel Emelyanov	5c8a61ace2	tracing: Dismantle trace-backend registry It's not used any longer Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:57:24 +03:00
Pavel Emelyanov	fe7d38661c	tracing: Use class-registrator for backends Currently the code uses its own class registration engine, but there's a generic one in utils/ that applies here too. In fact, the tracing backend registry is just a transparent wrapper over the generic one :\ Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:56:24 +03:00
Pavel Emelyanov	1adb2c8cc3	tracing: Add constraint to trace_state::begin() It expects that the function is (void) and returns back a string Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:56:08 +03:00
Pavel Emelyanov	0a6a5a242e	tracing: Remove copy-n-paste comments from test Tests don't have supervisor, so there's no sense in keeping these bits Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:55:40 +03:00
Pavel Emelyanov	79820c2006	tracing: Outline may_create_new_session It's a private method used purely in tracing.cc, no need in compiling it every time the header is met somewhere else. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 17:55:14 +03:00
Anna Stuchlik	9f7536d549	doc: fix the link to the OS Support page	2022-10-13 15:36:51 +02:00
Anna Stuchlik	1fd1ce042a	doc: replace Scylla with ScyllaDB	2022-10-13 15:21:46 +02:00
Anna Stuchlik	81ce7a88de	doc: update the info about supported architecture and rewrite the introduction	2022-10-13 15:18:29 +02:00
Kamil Braun	5a9371bcb0	service/raft: raft_address_map: get rid of `is_linked` checks Being linked is an invariant of `expiring_entry_ptr`. Make it explicit by moving the `_expiring_list.push_front` call into the constructor.	2022-10-13 15:17:07 +02:00
Kamil Braun	cdf3367c05	service/raft: raft_address_map: get rid of `to_list_iterator` Unnecessary.	2022-10-13 15:17:06 +02:00
Kamil Braun	0e29495c38	service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr` The owner of `expiring_entry_ptr` was almost uniquely its corresponding `timestamped_entry`; it would delete the expiring entry when it itself got destroyed. There was one call to explicit `unlink_and_dispose`, which made the picture unclear. Make the picture clear: `timestamped_entry` now contains a `unique_ptr` to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with `_lru_entry = nullptr`. We can also get rid of the back-reference from `expiring_entry_ptr` to `timestamped_entry`. The code becomes shorter and simpler.	2022-10-13 15:16:40 +02:00
Petr Gusev	c76cf5956d	removenode: don't stream data from the leaving node If a removenode is run for a recently stopped node, the gossiper may not yet know that the node is down, and the removenode will fail with a Stream failed error trying to stream data from that node. In this patch we explicitly reject removenode operation if the gossiper considers the leaving node up. Closes #11704	2022-10-13 15:11:32 +02:00
Takuya ASADA	49d5e51d76	reloc: add support stripped binary installation for relocatable package This adds support for stripped binary installation for the relocatable package. After this change, the scylla and unified packages contain only stripped binaries, and a new "scylla-debuginfo" package is introduced for debug symbols. For the scylla-debuginfo package, the install.sh script will extract debug symbols to /opt/scylladb/<dir>/.debug. Note that we need to keep an unstripped version of the relocatable package for rpm/deb, otherwise rpmbuild/debuild fails to create the debug symbol package. This version is renamed to scylla-unstripped-$version-$release.$arch.tar.gz. See #8918 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #9005	2022-10-13 15:11:32 +02:00
Asias He	6134fe4d1f	storage_service: Prevent removed node to rejoin in handle_state_normal - Start n1, n2, n3 (127.0.0.3) - Stop n3 - Change ip address of n3 to 127.0.0.33 and restart n3 - Decommission n3 - Start new node n4 The node n4 will learn from the gossip entry for 127.0.0.3 that node 127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of the ring. This patch prevents this by checking the status for the host id on all the entries. If any of the entries shows the node with the host id is in LEFT status, reject to put the node in NORMAL status. Fixes #11355 Closes #11361	2022-10-13 15:11:32 +02:00
Jan Ciolek	52bbc1065c	cql3: allow lists of IN elements to be NULL Requests like `col IN NULL` used to cause an error - Invalid null value for colum col. We would like to allow NULLs everywhere. When a NULL occurs on either side of a binary operator, the whole operation should just evaluate to NULL. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes #11775	2022-10-13 15:11:32 +02:00
Avi Kivity	19e62d4704	commitlog: delete unused "num_deleted" variable Since `d478896d46` we update the variable, but never read it. Clang 15 notices and complains. Remove the variable to make it happy. Closes #11765	2022-10-13 15:11:32 +02:00
Avi Kivity	a2da08f9f9	storage_proxy: hold effective_replication_map for the duration of a paxos transaction Luckily, all topology calculations are done in get_paxos_participants(), so all we have to do is hold the effective_replication_map for the duration of the transaction, and pass it to get_paxos_participants(). This ensures that the coordinator knows about all in-flight requests and can fence them from topology changes.	2022-10-13 14:27:26 +03:00
Avi Kivity	69aaa5e131	storage_proxy: move paxos_response_handler class to .cc file It's not used elsewhere.	2022-10-13 14:27:26 +03:00
Avi Kivity	b2f3934e95	storage_proxy: deinline paxos_response_handler constructor/destructor They have no business being inline as it's a heavyweight object.	2022-10-13 14:27:26 +03:00
Avi Kivity	94e4ff11be	storage_proxy: use consistent effective_replication_map for counter coordinator Hold the effective_replication_map while talking to the counter leader, to allow for fencing in the future. The code is somewhat awkward because the API allows for multiple keyspaces to be in use. The error code generation, already broken as it doesn't use the correct table, continues to be broken in that it doesn't use the correct effective_replication_map, for the same reason.	2022-10-13 14:27:23 +03:00
Avi Kivity	406a046974	storage_proxy: improve consistency in query_partition_key_range{,_concurrent} query_partition_key_range captures a token_metadata_ptr and uses it consistently in sequential calls to query_partition_key_range_concurrent (via tail recursion), but each invocation of query_partition_key_range_concurrent captures its own effective_replication_map_ptr. Since these are captured at different times, they can be inconsistent after the first iteration. Fix by capturing it once in the caller and propagating it everywhere.	2022-10-13 13:56:52 +03:00
Avi Kivity	5d320e95d5	storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use Capture token_metadata by reference rather than smart pointer, since our effective_replication_map_ptr protects it.	2022-10-13 13:56:52 +03:00
Avi Kivity	f75efa965f	storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency Derive the token_metadata from the effective_replication_map rather than getting it independently. Not a real bug since these were in the same continuation, but safer this way.	2022-10-13 13:56:52 +03:00
Avi Kivity	161ce4b34f	storage_proxy: query_singular: use fewer smart pointers Capture token_metadata by reference since we're protecting it with the mighty effective_replication_map_ptr. This saves a few instructions to manage smart pointers.	2022-10-13 13:56:33 +03:00
Avi Kivity	efd89c1890	storage_proxy: query_singular: simplify lambda captures The lambdas in query_singular do not outlive the enclosing coroutine, so they can capture everything by reference. This simplifies life for a future update of the lambda, since there's one thing less to worry about.	2022-10-13 13:52:54 +03:00
Avi Kivity	d9955ab35b	locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata token_metadata is protected by holders of an effective_replication_map_ptr, so it's just as safe and less expensive for them to obtain a reference to token_metadata rather than a smart pointer, so give them that option with a new accessor.	2022-10-13 13:46:04 +03:00
Avi Kivity	86a48cf12f	storage_proxy: use consistent token_metadata with rest of singular read query_singular() uses get_token_metadata_ptr() and later, in get_read_executor(), captures the effective_replication_map(). This isn't a bug, since the two are captured in the same continuation and are therefore consistent, but a way to ensure it stays so is to capture the effective_replication_map earlier and derive the token_metadata from it.	2022-10-13 13:46:04 +03:00
Avi Kivity	720fc733f0	dirty_memory_manager: move region_group data members to top-of-class Rather than have them spread out throughout the class.	2022-10-13 13:12:01 +03:00
Avi Kivity	61b780ae63	dirty_memory_manager: update region_group comment It's still named region_group. I may merge the whole thing into dirty_memory_manager to retire the name.	2022-10-13 13:09:01 +03:00
Avi Kivity	7a5fa1497c	dirty_memory_manager: remove outdated friend That friend no longer exists.	2022-10-13 13:03:43 +03:00
Avi Kivity	02b7697051	dirty_memory_manager: fold region_group::push_back() into its caller It is too trivial to live.	2022-10-13 13:03:43 +03:00
Avi Kivity	d403ecbed9	dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available - apply De Morgan's law - merge if block into boolean calculation	2022-10-13 13:03:43 +03:00
Avi Kivity	cb6c7023c1	dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available	2022-10-13 13:03:43 +03:00
Avi Kivity	39668d5ae2	dirty_memory_manager: tidy up region_group::execution_permitted() - remove excess parentheses - apply De Morgan's law - remove unneeded this-> - whitespace cleanups	2022-10-13 13:03:43 +03:00
Avi Kivity	02706e78f9	dirty_memory_manager: reindent region_group::release_queued_allocations()	2022-10-13 13:03:43 +03:00
Avi Kivity	128f1c8c21	dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine Nicer and faster. We have a rare case where we hold a lock for the duration of a call but we don't want to hold it until the future it returns is resolved, so we have to resort to a minor trick.	2022-10-13 13:03:29 +03:00
Avi Kivity	aad4c1c5e9	dirty_memory_manager: move region_group::_releaser after _shutdown_requested The function that is attached to _releaser depends on _shutdown_requested. There is currently no use-before-init, since the function (release_queued_allocations) starts with a yield(), moving the first use to until after the initialization. Since I want to get rid of the yield, reorder the fields so that they are initialized in the right order.	2022-10-13 13:00:50 +03:00
Raphael S. Carvalho	ec79ac46c9	db/view: Add visibility to view updating of Staging SSTables Today, we're completely blind about the progress of view updating on Staging files. We don't know how long it will take, nor how much progress we've made. This patch adds visibility with a new metric that reports the number of bytes still to be processed from Staging files. Before any work is done, the metric tells us the total size to be processed. As view updating progresses, the metric value is expected to decrease, unless work is being produced faster than we can consume it. We're piggybacking on sstables::read_monitor, which allows the progress metric to be updated whenever the SSTable reader makes progress. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11751	2022-10-12 16:57:37 +03:00
Avi Kivity	2e79bb431c	tools: change source_location location std::experimental::source_location is provided by <experimental/source_location>, not <source_location>. libstdc++ 12 insists, so change the header. Closes #11766	2022-10-12 15:29:14 +03:00
Takuya ASADA	6b246dc119	locator::ec2_snitch: Retry HTTP request to EC2 instance metadata service EC2 instance metadata service can be busy, let's retry connecting at intervals, just like we do in scylla-machine-image. Fixes #10250 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #11688	2022-10-12 13:59:06 +03:00
Kamil Braun	92dd1f7307	service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr `timestamped_entry` had two fields: ``` optional<clock_time_point> _last_accessed expiring_entry_ptr* _lru_entry ``` The `raft_address_map` data structure maintained an invariant: `_last_accessed` is set if and only if `_lru_entry` is not null. This invariant could be broken for a while when constructing an expiring `timestamped_entry`: the constructor was given an `expiring = true` flag, which set the `_last_accessed` field; this was redundant, because immediately after a corresponding `expiring_entry_ptr` was constructed which again reset the `_last_accessed` field and set `_lru_entry`. The code becomes simpler and shorter when we move `_last_accessed` field into `expiring_entry_ptr`. The invariant is now guaranteed by the type system: `_last_accessed` is no longer `optional`.	2022-10-12 12:22:57 +02:00
Kamil Braun	262b9473d5	service/raft: raft_address_map: don't use intrusive set for timestamped entries Intrusive data structures are harder to reason about. In `raft_address_map` there's a good reason to use an intrusive list for storing `expiring_entry_ptr`s: we move the entries around in the list (when their expiration times change) but we want for the objects to stay in place because `timestamped_entry`s may point to them (although we could simply update the pointers using the existing back-reference...) However, there's not much reason to store `timestamped_entry` in an intrusive set. It was basically used in one place: when dropping expired entries, we iterate over the list of `expiring_entry_ptr`s and we want to drop the corresponding `timestamped_entry` as well, which is easy when we have a pointer to the entry and it's a member of an intrusive container. But we can deal with it when using non-intrusive containers: just `find` the element in the container to erase it. The code becomes shorter with this change. I also use a map instead of a set because we need to modify the `timestamped_entry` which wouldn't be possible if it was used as an `unordered_set` key. In fact using map here makes more sense: we were using the intrusive set similarly to a map anyway because all lookups were performed using the `_id` field of `timestamped_entry` (now the field was moved outside the struct, it's used as the map's key).	2022-10-12 12:22:50 +02:00
Kamil Braun	3e84b1f69c	Merge 'test.py: topology fix ssl var and improve pylint score' from Alecco When code was moved to the new directory, a bug was reintroduced with `ssl` local hiding `ssl` module. Fix again. Closes #11755 * github.com:scylladb/scylladb: test.py: improve pylint score for conftest test.py: fix variable name collision with ssl	2022-10-12 11:41:11 +02:00
Avi Kivity	f673d0abbe	build: support fmt 9 ostream formatter deprecation fmt 9 deprecates automatic fallback to std::ostream formatting. We should migrate, but in order to do so incrementally, first enable the deprecated fallback so the code continues to compile. Closes #11768	2022-10-12 09:27:36 +03:00
Avi Kivity	0952cecfc9	build: mark abseil as a system header Abseil is not under our control, so if a header generates a warning, we can do nothing about it. So far this wasn't a problem, but under clang 15 it spews a harmless deprecation warning. Silence the warning by treating the header as a system header (which it is, for us). Closes #11767	2022-10-12 09:27:36 +03:00
Kamil Braun	0c13c85752	service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr` The class was storing a pointer which couldn't be null. A reference is a better fit in this case.	2022-10-11 17:21:01 +02:00
Asias He	810b424a8c	storage_service: Reject to bootstrap new node when node has unknown gossip status - Start a cluster with n1, n2, n3 - Full cluster shutdown n1, n2, n3 - Start n1, n2 and keep n3 as shutdown - Add n4 Node n4 will learn the ip and uuid of n3 but it does not know the gossip status of n3 since gossip status is published only by the node itself. After full cluster shutdown, gossip status of n3 will not be present until n3 is restarted again. So n4 will not think n3 is part of the ring. In this case, it is better to reject the bootstrap. With this patch, one would see the following when adding n4: ``` ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed: std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it before adding new node to the cluster.) ``` The user needs to perform either of the following before adding a new node: 1) Run nodetool removenode to remove n3 2) Restart n3 to get it back to the cluster Fixes #6088 Closes #11425	2022-10-11 15:47:34 +03:00
Botond Dénes	378c6aeebd	Merge 'More Raft upgrade tests' from Kamil Braun Refactor the existing upgrade tests, extracting some common functionality to helper functions. Add more tests. They are checking the upgrade procedure and recovery from failure in scenarios like when a node fails causing the procedure to get stuck or when we lose a majority in a fully upgraded cluster. Add some new functionalities to `ScyllaRESTAPIClient` like injecting errors and obtaining gossip generation numbers. Extend the removenode function to allow ignoring dead nodes. Improve checking for CQL availability when starting nodes to speed up testing. Closes #11725 * github.com:scylladb/scylladb: test/topology_raft_disabled: more Raft upgrade tests test/topology_raft_disabled: refactor `test_raft_upgrade` test/pylib: scylla_cluster: pass a list of ignored nodes to removenode test/pylib: rest_client: propagate errors from put_json test/pylib: fix some type hints test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up	2022-10-11 15:30:00 +03:00
Kamil Braun	08e654abf5	Merge 'raft: (service) cleanups on the path for dynamic IP address support' from Konstantin Osipov In preparation for supporting IP address changes of Raft Group 0: 1) Always use start_server_for_group0() to start a server for group 0. This will provide a single extension point when it's necessary to prompt raft_address_map with gossip data. 2) Don't use raft::server_address in discovery, since going forward discovery won't store raft::server_address. On the same token stop using discovery::peer_set anywhere outside discovery (for persistence), use a peer_list instead, which is easier to marshal. Closes #11676 * github.com:scylladb/scylladb: raft: (discovery) do not use raft::server_address to carry IP data raft: (group0) API refactoring to avoid raft::server_address raft: rename group0_upgrade.hh to group0_fwd.hh raft: (group0) move the code around raft: (discovery) persist a list of discovered peers, not a set raft: (group0) always start group0 using start_server_for_group0()	2022-10-11 13:43:41 +02:00
Asias He	58c65954b8	storage_service: Reject decommission if nodes are down - Start n1, n2, n3 - Apply network nemesis as below: + Block gossip traffic going from nodes 1 and 2 to node 3. + All the other rpc traffic flows normally, including gossip traffic from node 3 to nodes 1 and 2 and responses to node_ops commands from nodes 1 and 2 to node 3. - Decommission n3 Currently, the decommission will be successful because all the network traffic is ok. But n3 could not advertise status STATUS_LEFT to the rest of the cluster due to the network nemesis applied. As a result, n1 and n3 could not move the n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will stay in DL forever. I know why the node stays DL forever. The problem is that with node_ops_cmd based node operation, we still rely on the gossip status of STATUS_LEFT from the node being decommissioned to notify other nodes this node has finished decommission and can be moved from STATUS_LEAVING to STATUS_LEFT. This patch fixes by checking gossip liveness before running decommission. Reject if required peer nodes are down. With the fix, the decommission of n3 will fail like this: $ nodetool decommission -p 7300 nodetool: Scylla API server HTTP POST to URL '/storage_service/decommission' failed: std::runtime_error (decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2, 127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1}) Fixes #11302 Closes #11362	2022-10-11 14:09:28 +03:00
Botond Dénes	917fdb9e53	Merge "Cut database-system_keyspace circular dependency" from Pavel Emelyanov " There's one via the database's compaction manager and large data handler sub-services. Both need system keyspace to put their info into, but the latter needs database naturally via query_processor->storage_proxy link. The solution is to make c.m. \| l.d.h. -> sys.ks. dependency be weak with the help of shared_from_this(), described in detail in patch #2 commit message. As a (not-that-)side effect this set removes a bunch of global qctx calls. refs: #11684 (this set seems to increase the chance of stepping on it) " * 'br-sysks-async-users' of https://github.com/xemul/scylla: large_data_handler: Use local system_keyspace to update entries system_keyspace: De-static compaction history update compaction_manager: Relax history paths database: Plug/unplug system_keyspace system_keyspace: Add .shutdown() method	2022-10-11 08:52:04 +03:00
Nadav Har'El	ef0da14d6f	test/cql-pytest: add simple tests for USE statement This patch adds a couple of simple tests for the USE statement: that without USE one cannot create a table without explicitly specifying a keyspace name, and with USE, it is possible. Beyond testing this specific feature, this patch also serves as an example of how to write more tests that need to control the effective USE setting. Specifically, it adds a "new_cql" function that can be used to create a new connection with a fresh USE setting. This is necessary in such tests, because if multiple tests use the same cql fixture and its single connection, they will share their USE setting and there is no way to undo or reset it after being set. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11741	2022-10-11 08:20:19 +03:00
Kamil Braun	df2fb21972	test/topology: reenable test_remove_node_add_column After #11691 was merged the test should no longer be flaky. Reenable it. Closes #11754	2022-10-11 08:18:20 +03:00
Pavel Emelyanov	8b8b37cdda	system_keyspace: Dont maintain dc/rack cache Some good news finally. The saved dc/rack info about the ring is now only loaded once on start. So the whole cache is not needed and the loading code in storage_service can be greatly simplified Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:18:31 +03:00
Pavel Emelyanov	775f42c8d1	system_keyspace: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:18:31 +03:00
Pavel Emelyanov	8f1df240c7	system_keyspace: Coroutinize build_dc_rack_info() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:18:31 +03:00
Pavel Emelyanov	b6061bb97d	topology: Move all post-configuration to topology::config Because of snitch ex-dependencies some bits on topology were initialized with nasty post-start calls. Now it all can be removed and the initial topology information can be provided by topology::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:18:31 +03:00
Pavel Emelyanov	56d4863eb6	snitch: Start early Snitch code doesn't need anything to start working, but it is needed by the low-level token-metadata, so move the snitch to start early (and to stop late) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:18:31 +03:00
Pavel Emelyanov	16188a261e	gossiper: Do not export system keyspace No users of it left. Although the gossiper->system_keyspace dependency is not needed either, keep it alive because gossiper still updates system keyspace with feature masks, so chances are it will be reactivated some time later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	2bb354b2e7	snitch: Remove gossiper reference It doesn't need gossiper any longer. This change will allow starting snitch early by the next patch, and eventually improving the token-metadata start-up sequence Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	26f9472f21	snitch: Mark get_datacenter/_rack methods const They are in fact such, but weren't marked as const before because they wanted to talk to non-const gossiper and system_keyspaces methods and updated snitch internal caches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	e9bd912f79	snitch: Drop some dead dependency knots After previous patches and merged branches snitch no longer needs its method that gets dc/rack for endpoints from gossiper, system keyspace and its internal caches. This cuts the last but the biggest snitch->gossiper dependency. Also this removes implicit snitch->system_keyspace dependency loop Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	4206b1f98f	snitch, code: Make get_datacenter() report local dc only The continuation of the previous patch -- all the code uses topology::get_datacenter(endpoint) to get peers' dc string. The topology still uses snitch for that, but it already contains the needed data. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	6c6711404f	snitch, code: Make get_rack() report local rack only All the code out there now calls snitch::get_rack() to get rack for the local node. For other nodes the topology::get_rack(endpoint) is used. Since now the topology is properly populated with endpoints, it can finally be patched to stop using snitch and get rack from its internal collections Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	bc813771e8	storage_service: Populate pending endpoint in on_alive() A special-purpose add-on to the previous patch. When messaging service accepts a new connection it sometimes may want to drop it early based on whether the client is from the same dc/rack or not. However, at this stage the information might have not yet had chances to be spread via storage service pending-tokens updating paths, so here's one more place -- the on_alive() callback Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	1be97a0a76	code: Populate pending locations Previous patches added the concept of pending endpoints in the topology, this patch populates endpoints in this state. Also, the set_pending_ranges() is patched to make sure that the tokens added for the endpoint(s) are added for something that's known by the topology. Same check exists in update_normal_tokens() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	b61bd6cf56	topology: Put local dc/rack on topology early Startup code needs to know the dc/rack of the local node early, way before nodes starts any communication with the ring. This information is available when snitch activates, but it starts _after_ token-metadata, so the only way to put local dc/rack in topology is via a startup-time special API call. This new init_local_endpoint() is temporary and will be removed later in this set Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	da75552e1f	topology: Add pending locations collection Nowadays the topology object only keeps info about nodes that are normal members of the ring. Nodes that are joining or bootstrapping or leaving are out of it. However, one of the goals of this patchset is to make topology object provide dc/rack info for _all_ nodes, even those in transitive state. The introduced _pending_locations is about to hold the dc/rack info for transitive endpoints. When a node becomes member of the ring it is moved from pending (if it's there) to current locations, when it leaves the ring it's moved back to pending. For now the new collection is just added and the add/remove/get API is extended to maintain it, but it's not really populated. It will come in the next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	fa613285e7	topology: Make get_location() errors more verbose Currently if topology.get_location() doesn't find an entry in its collection(s) it throws standard out-of-range exception which's very hard to debug. Also, next patches will extend this method, the introduced here if (_current_locations.contains()) makes this future change look nicer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	d60ebc5ace	token_metadata: Add config, spread everywhere Next patches will need to provide some early-start data for topology. The standard way of doing it is via service config, so this patch adds one. The new config is empty in this patch, to be filled later Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	7c211e8e50	token_metadata: Hide token_metadata_impl copy constructor Copying of token_metadata_impl is a heavy operation and it's performed internally with the help of the dedicated clone_async() method. This method, in turn, doesn't copy the whole object in its copy-ctor, but rather default-initializes it and carries the remaining fields later. Having said that, the standard copy-ctor is better to be made private and, for the sake of being more explicit, marked as shallow-copy-ctor Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	072ef88ed1	gossiper: Remove messaging service getter No code needs to borrow messaging from gossiper, which is nice Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	66bc84d217	snitch: Get local address to gossip via config The property-file snitch gossips listen_address as internal-IP state. To get this value it gets it from snitch->gossiper->messaging_service chain. This change provides the needed value via config thus cutting yet another snitch->gossiper dependency and allowing gossiper not to export messaging service in the future Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	77bde21024	storage_service: Shuffle on_alive() callback No functional changes, just keep some conditions from if()s as local variables. This is the churn-reducing preparation for one of the next patches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	583204972e	api: Don't report dc/rack for endpoints not in ring When an endpoint is not in ring the snitch/get_{rack\|datacenter} API still return back some value. The value is, in fact, the default one, because this is how snitch resolves it -- when it cannot find a node in gossiper and system keyspace it just returns defaults. When this happens the API should better return some error (bad param?) but there's a bug in nodetool -- when the 'status' command collects info about the ring it first collects the endpoints, then gets status for each. If between getting an endpoint and getting its status the endpoint disappears, the API would fail, but nodetool doesn't handle it. Next patches will make .get_rack/_dc calls use in-topology collections that don't fall-back to default values if the entry is not found in it, so prepare the API in advance to return back defaults. refs: #11706 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:12:47 +03:00
Konstantin Osipov	3e46c32d7b	raft: (discovery) do not use raft::server_address to carry IP data We plan to remove IP information from Raft addresses. raft::server_address is used in Raft configuration and also in discovery, which is a separate algorithm, as a handy data structure, to avoid having new entities in RPC. Since we plan to remove IP addresses from Raft configuration, using raft::server_address in discovery and still storing IPs in it would create ambiguity: in some uses raft::server_address would store an IP, and in others - would not. So switch to an own data structure for the purposes of discovery, discovery_peer, which contains a pair ip, raft server id. Note to reviewers: ideally we should switch to URIs in discovery_peer right away. Otherwise we may have to deal with incompatible changes in discovery when adding URI support to Scylla.	2022-10-10 16:24:33 +03:00
Pavel Emelyanov	b1f4273f0d	large_data_handler: Use local system_keyspace to update entries The l._d._h.'s way to update system keyspace is not like in other code. Instead of a dedicated helper on the system_keyspace's side it executes the insertion query directly with the help of qctx. Now when the l._d._h. has the weak system keyspace reference it can execute queries on _it_ rather than on the qctx. Just like in previous patch, it needs to keep the sys._k.s. weak reference alive until the query's future resolves. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Pavel Emelyanov	907fd2d355	system_keyspace: De-static compaction history update Compaction manager now has the weak reference on the system keyspace object and can use it to update its stats. It only needs to take care and keep the shared pointer until the respective future resolves. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Pavel Emelyanov	3e0b61d707	compaction_manager: Relax history paths There's a virtual method on table_state to update the entry in system keyspace. It's an overkill to facilitate tests that don't want this. With new system_keyspace weak referencing it can be made simpler by moving the updating call to the compaction_manager itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Pavel Emelyanov	f9b57df471	database: Plug/unplug system_keyspace There's a circular dependency between system_keyspace and database. The former needs the latter because it needs to execute local requests via query_processor. The latter needs the former via compaction manager and large data handler, database depends on both and these too need to insert their entries into system keyspace. To cut this loop the compaction manager and large data handler both get a weak reference on the system keyspace. Once system keyspace starts it activates this reference via the database call. When system keyspace is shut down on stop, it deactivates the reference. Technically the weak reference is implemented by marking the system_k.s. object as async_sharded_service, and the "reference" in question is the shared_from_this() pointer. When compaction manager or large data handler need to update a system keyspace's table, they both hold an extra reference on the system keyspace until the entry is committed, thus making sure that sys._k.s. doesn't stop from under their feet. At the same time, unplugging the reference on shutdown makes sure that no new entry updates will appear and the system_k.s. will eventually be released. It's not a C++ classical reference, because system_keyspace starts after and stops before database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Konstantin Osipov	8857e017c7	raft: (group0) API refactoring to avoid raft::server_address Replace raft::server_address in a few raft_group0 API calls with raft::server_id. These API calls do not need raft::server_address, i.e. the address part, anyway, and since going forward raft::server_address will not contain the IP address, stop using it in these calls. This is a beginning of a multi-patch series to reduce raft::server_address usage to core raft only.	2022-10-10 15:58:48 +03:00
Konstantin Osipov	224dd9ce1e	raft: rename group0_upgrade.hh to group0_fwd.hh The plan is to add other group-0-related forward declarations to this file, not just the ones for upgrade.	2022-10-10 15:58:48 +03:00
Konstantin Osipov	e226624daf	raft: (group0) move the code around Move load/store functions for discovered peers up, since going forward they'll be used in start_server_for_group0(), to extend the address map prior to start (and thus speed up bootstrap).	2022-10-10 15:58:48 +03:00
Konstantin Osipov	199b6d6705	raft: (discovery) persist a list of discovered peers, not a set We plan to reuse the discovery table to store the peers after discovery is over, so load/store API must be generalized to use outside discovery. This includes sending the list of persisted peers over to a new member of the cluster.	2022-10-10 15:58:48 +03:00
Konstantin Osipov	746322b740	raft: (group0) always start group0 using start_server_for_group0() When IP addresses are removed from raft::configuration, it's key to initialize raft_address_map with IP addresses before we start group 0. Best place to put this initialization is start_server_for_group0(), so make sure all paths which create group 0 use start_server_for_group0().	2022-10-10 15:58:48 +03:00
Kamil Braun	4974a31510	test/topology_raft_disabled: more Raft upgrade tests The tests are checking the upgrade procedure and recovery from failure in scenarios like when a node fails causing the procedure to get stuck or when we lose a majority in a fully upgraded cluster. Added some new functionalities to `ScyllaRESTAPIClient` like injecting errors and obtaining gossip generation numbers.	2022-10-10 14:32:10 +02:00
Pavel Emelyanov	caed12c8f2	system_keyspace: Add .shutdown() method Many services out there have one (sometimes called .drain()) that's called early on stop and that's responsible for preparing the service for stop -- aborting pending/in-flight fibers and alike. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 15:29:33 +03:00
Kamil Braun	4460b4e63c	test/topology_raft_disabled: refactor `test_raft_upgrade` Take reusable parts out of the test to helper functions.	2022-10-10 12:59:12 +02:00
Kamil Braun	fa8dcb0d54	test/pylib: scylla_cluster: pass a list of ignored nodes to removenode The `removenode` operation normally requires the removing node to contact every node in the cluster except the one that is being removed. But if more than 1 node is down it's possible to specify a list of nodes to ignore for the operation; the `/storage_service/remove_node` endpoint accepts an `ignore_nodes` param which is a comma-separated list of IPs. Extend `ScyllaRESTAPIClient`, `ScyllaClusterManager` and `ManagerClient` so it's possible to pass the list of ignored nodes. We also modify the `/cluster/remove-node` Manager endpoint to use `put_json` instead of `get_text` and pass all parameters except the initiator IP (the IP of the node who coordinates the `removenode` operation) through JSON. This simplifies the URL greatly (it was already messy with 3 parameters) and more closely resembles Scylla's endpoint.	2022-10-10 12:59:12 +02:00
Kamil Braun	130ab1d312	test/pylib: rest_client: propagate errors from put_json	2022-10-10 12:59:12 +02:00
Kamil Braun	63892326d5	test/pylib: fix some type hints	2022-10-10 12:59:12 +02:00
Kamil Braun	6e3fe13fcf	test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up Do a simple `SELECT` instead. This speeds up tests - creating and dropping keyspaces is relatively expensive, and we did this on every server restart.	2022-10-10 12:59:12 +02:00
Alejo Sanchez	7e2a3f2040	test.py: improve pylint score for conftest Remove unused imports, fix long lines, add ignore flags. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-10 12:07:41 +02:00
Alejo Sanchez	aa1f4a321c	test.py: fix variable name collision with ssl Change variable name to avoid collision with module ssl. This bug was reintroduced when moving code. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-10 11:59:13 +02:00
Pavel Emelyanov	53bad617c0	virtual_tables: Use token_metadata.is_member() This method just jumps into topology.has_endpoint(). The change is for consistency with other users of it and as a preparation for topology.has_endpoint() future enhancements Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 12:16:19 +03:00
Tomasz Grabiec	fcf0628bc5	dbuild: Use .gdbinit from the host Useful when starting gdb inside the dbuild container. Message-Id: <20221007154230.1936584-1-tgrabiec@scylladb.com>	2022-10-09 11:14:33 +03:00
Petr Gusev	0923cb435f	raft: mark removed servers as expiring instead of dropping them There is a flaw in how the raft rpc endpoints are currently managed. The io_fiber in raft::server is supposed to first add new servers to rpc, then send all the messages and then remove the servers which have been excluded from the configuration. The problem is that the send_messages function isn't synchronous, it schedules send_append_entries to run after all the current requests to the target server, which can happen after we have already removed the server from address_map. In this patch the remove_server function is changed to mark the server_id as expiring rather than synchronously dropping it. This means all currently scheduled requests to that server will still be able to resolve the ip address for that server_id. Fixes: #11228 Closes #11748	2022-10-07 19:08:34 +02:00
Avi Kivity	55606a51cb	dirty_memory_manager: move region_group queued allocation releasing into a function It's nicer to see a function release_queued_allocations() in a stack trace rather than start_releaser(), which has done its work during initialization.	2022-10-07 17:27:43 +03:00
Avi Kivity	3e60d6c243	dirty_memory_manager: fold allocation_queue into region_group allocation_queue was extracted out of region_group in `71493c253` and `34d532236`. But now that region_group refactoring is mostly done, we can move them back in. allocation_queue has just one user and is not useful standalone.	2022-10-07 17:27:40 +03:00
Avi Kivity	01368830b5	dirty_memory_manager: don't ignore timeout in allocation_queue::push_back() In `34d5322368` ("dirty_memory_manager: move more allocation_queue functions out of region_group") we accidentally started ignoring the timeout parameter. Fix that. No release branch has the breakage.	2022-10-07 17:19:56 +03:00
Kamil Braun	06b87869ba	Merge 'Raft transport error' from Gusev Petr The `add_entry` and `modify_config` methods sometimes do an rpc to execute the request on the current leader. If the tcp connection was broken, a `seastar::rpc::closed_error` would be thrown to the client. This exception was not documented in the method comments and the client could have missed handling it. For example, this exception was not handled when calling `modify_config` in `raft_group0`, which sometimes broke the `removenode` command. An `intermittent_connection_error` exception was added earlier to solve a similar problem with the `read_barrier` method. In this patch it is renamed to `transport_error`, as it seems to better describe the situation, and an explicit specification for this exception was added - the rpc implementation can throw it if it is not known whether the call reached the destination and whether any mutations were made. In case of `read_barrier` it does not matter and we just retry, in case of `add_entry` and `modify_config` we cannot retry because of possible mutations, so we convert this exception to `commit_status_unknown`, which the client has to handle. Explicit comments have also been added to `raft::server` methods describing all possible exceptions. Closes #11691 * github.com:scylladb/scylladb: raft_group0: retry modify_config on commit_status_unknown raft: convert raft::transport_error to raft::commit_status_unknown	2022-10-07 15:53:22 +02:00
Petr Gusev	12bb8b7c8d	raft_group0: retry modify_config on commit_status_unknown modify_config can throw commit_status_unknown in case of a leader change or when the leader is unavailable, but the information about it has not yet reached the current node. In this patch modify_config is run again after some time in this case.	2022-10-07 13:34:23 +04:00
Petr Gusev	d79fbab682	raft: convert raft::transport_error to raft::commit_status_unknown The add_entry and modify_config methods sometimes do an rpc to execute the request on the current leader. If the tcp connection was broken, a seastar::rpc::closed_error would be thrown to the client. This exception was not documented in the method comments and the client could have missed handling it. For example, this exception was not handled when calling modify_config in raft_group0, which sometimes broke the removenode command. An intermittent_connection_error exception was added earlier to solve a similar problem with the read_barrier method. In this patch it is renamed to transport_error, as it seems to better describe the situation, and an explicit specification for this exception was added - the rpc implementation can throw it if it is not known whether the call reached the target node and whether any actions were performed on it. In case of read_barrier it does not matter and we just retry. In case of add_entry and modify_config we cannot retry because the rpc calls are not idempotent, so we convert this exception to commit_status_unknown, which the client has to handle. Explicit comments have also been added to raft::server methods describing all possible exceptions.	2022-10-07 13:34:16 +04:00
Botond Dénes	b247f29881	Merge 'De-static system_keyspace::get_{saved\|local}_tokens()' from Pavel Emelyanov Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env Closes #11738 * github.com:scylladb/scylladb: system_keyspace: Make get_{local\|saved}_tokens non static size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges() cql_test_env: Keep sharded<system_keyspace> reference size_estimate_virtual_reader: Keep system_keyspace reference system_keyspace: Pass sys_ks argument to install_virtual_readers() system_keyspace: Make make() non-static distributed_loader: Pass sys_ks argument to init_system_keyspace() system_keyspace: Remove dangling forward declaration	2022-10-07 11:28:32 +03:00
Botond Dénes	992afc5b8c	Merge 'storage_proxy: coroutinize some functions with do_with' from Avi Kivity do_with() is a sure indicator for coroutinization, since it adds an allocation (like the coroutine does with its frame). Therefore translating a function with do_with is at least a break-even, and usually a win since other continuations no longer allocate. This series converts most of storage_proxy's functions that have do_with to coroutines. Two remain, since they are not simple to convert (the do_with() is kept running in the background and its future is discarded). Individual patches favor minimal changes over final readability, and there is a final patch that restores indentation. The patches leave some moves from coroutine reference parameters to the coroutine frame, this will be cleaned up in a follow-up. I wanted this series not to touch headers to reduce rebuild times. Closes #11683 * github.com:scylladb/scylladb: storage_proxy: reindent after coroutinization storage_proxy: convert handle_read_digest() to a coroutine storage_proxy: convert handle_read_mutation_data() to a coroutine storage_proxy: convert handle_read_data() to a coroutine storage_proxy: convert handle_write() to a coroutine storage_proxy: convert handle_counter_mutation() to a coroutine storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine	2022-10-07 07:37:37 +03:00
Nadav Har'El	72dbce8d46	docs, alternator: mention S3 Import feature in compatibility.md In August 2022, DynamoDB added a "S3 Import" feature, which we don't yet support - so let's document this missing feature in the compatibility document. Refs #11739. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11740	2022-10-06 19:50:16 +03:00
Avi Kivity	20bad62562	Merge 'Detect and record large collections' from Benny Halevy This series adds support for detecting collections that have too many items and recording them in `system.large_cells`. A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000. Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly. A new column was added to system.large_cells: `collection_items`. Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses that threshold. Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred. Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it. The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled. In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes #11449 Closes #11674 * github.com:scylladb/scylladb: sstables: mx/writer: optimize large data stats members order sstables: mx/writer: keep large data stats entry as members db: large_data_handler: dynamically update config thresholds utils/updateable_value: add transforming_value_updater db/large_data_handler: cql_table_large_data_handler: record large_collections db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler db/large_data_handler: cql_table_large_data_handler: move ctor out of line docs: large-rows-large-cells-tables: fix typos db/system_keyspace: add collection_elements column to system.large_cells gms/feature_service: add large_collection_detection cluster feature test: sstable_3_x_test: add test_sstable_too_many_collection_elements test: lib: simple_schema: add support for optional collection column test: lib: simple_schema: build schema in ctor body test: lib: simple_schema: cql: define s1 as static only if built this way db/large_data_handler: maybe_record_large_cells: consider collection_elements db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells sstables: mx/writer: add large_data_type::elements_in_collection db/large_data_handler: get the collection_elements_count_threshold db/config: add compaction_collection_elements_count_warning_threshold test: sstable_3_x_test: add test_sstable_write_large_cell test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random	2022-10-06 18:28:21 +03:00
Avi Kivity	62a4d2d92b	Merge 'Preliminary changes for multiple Compaction Groups' from Raphael "Raph" Carvalho What's contained in this series: - Refactored compaction tests (and utilities) for integration with multiple groups - The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group. - Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue) - Many changes in replica::table for allowing integration with multiple groups Next: - Introduce for_each_compaction_group() for iterating over groups wherever needed. - Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc). - Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups - Introduce static option for defining number of compaction groups and implement function to map a token to its respective group. - Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting). 
Closes #11592 * github.com:scylladb/scylladb: sstable_resharding_test: Switch to table_for_tests replica: Move compacted_undeleted_sstables into compaction group replica: Use correct compaction_group in try_flush_memtable_to_sstable() replica: Make move_sstables_from_staging() robust and compaction group friendly test: Rename column_family_for_tests to table_for_tests sstable_compaction_test: Use column_family_for_tests::as_table_state() instead test: Don't expose compound set in column_family_for_tests test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user() sstable_compaction_test: Merge table_state_for_test into column_family_for_tests sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables() sstable_compaction_test: Switch to table_state in compact_sstables() sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests	2022-10-06 18:23:47 +03:00
Kamil Braun	f94d547719	test.py: include modes in log file name Instead of `test.py.log`, use: `test.py.dev.log` when running with `--mode dev`, `test.py.dev-release.log` when running with `--mode dev --mode release`, and so on. This is useful in Jenkins which is running test.py multiple times in different modes; a later run would overwrite a previous run's test.py file. With this change we can preserve the test.py files of all of these runs. Closes #11678	2022-10-06 18:20:39 +03:00
Kamil Braun	3af68052c4	test/topology: disable flaky `test_remove_node_add_column` test The test was added recently and since then causes CI failures. We suspect that it happens if the node being removed was the Raft group 0 leader. The removenode coordinator tries to send to it the `remove_from_group0` request and fails. A potential fix is in review: #11691.	2022-10-06 17:04:42 +02:00
Pavel Emelyanov	59da903054	system_keyspace: Make get_{local\|saved}_tokens non static Now all callers have system_keyspace reference at hand. This removes one more user of the global qctx object Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 18:02:09 +03:00
Pavel Emelyanov	b03f1e7b17	size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges() This method static calls system_keyspace::get_local_tokens(). Having the system_keyspace reference will make this method non-static Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 18:00:09 +03:00
Pavel Emelyanov	4c099bb3ed	cql_test_env: Keep sharded<system_keyspace> reference There's a test_get_local_ranges() call in size-estimate reader which will need system keyspace reference. There's no other place for tests to get it from but the cql_test_env thing Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:59:21 +03:00
Pavel Emelyanov	34e8e5959f	size_estimate_virtual_reader: Keep system_keyspace reference The s._e._v._reader::fill_buffer() method needs system keyspace to get node's local tokens. Now it's a static method, having system_keyspace reference will make it non-static Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:58:07 +03:00
Pavel Emelyanov	04552f2d58	system_keyspace: Pass sys_ks argument to install_virtual_readers() The size-estimate-virtual-reader will need it, now it's available as "this" from system_keyspace::make() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:57:13 +03:00
Pavel Emelyanov	1938412d7a	system_keyspace: Make make() non-static This helper needs system_keyspace reference and using "this" as this looks natural. Also this de-static-ification makes it possible to put some sense into the invoke_on_all() call from init_system_keyspace() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:56:11 +03:00
Pavel Emelyanov	9f79525f8e	distributed_loader: Pass sys_ks argument to init_system_keyspace() Its final destination is the virtual tables registration code eventually called from init_system_keyspace() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:55:03 +03:00
Pavel Emelyanov	e996503f0d	system_keyspace: Remove dangling forward declaration It doesn't match the real system_keyspace_make() definition and is in fact not needed, as there's another "real" one in database.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:54:22 +03:00
Vlad Zolotarov	8195dab92a	scylla_prepare: correctly handle a former 'MQ' mode Fixes a regression introduced in `80917a1054`: "scylla_prepare: stop generating 'mode' value in perftune.yaml" When cpuset.conf contains a "full" CPU set the negation of it from the "full" CPU set is going to generate a zero mask as an irq_cpu_mask. This is an illegal value that will eventually end up in the generated perftune.yaml, which in turn will make the scylla service fail to start until the issue is resolved. In such a case an irq_cpu_mask must represent a "full" CPU set mimicking a former 'MQ' mode. Fixes #11701 Tested: - Manually on a 2 vCPU VM in an 'auto-selection' mode. - Manually on a large VM (48 vCPUs) with an 'MQ' manually enforced. Message-Id: <20221004004237.2961246-1-vladz@scylladb.com>	2022-10-06 17:43:37 +03:00
Avi Kivity	9932c4bd62	Merge 'cql3: Make CONTAINS NULL and CONTAINS KEY NULL return false' from Jan Ciołek Currently doing `CONTAINS NULL` or `CONTAINS KEY NULL` on a collection evaluates to `true`. This is a really weird behaviour. Collections can't contain `NULL`, even if they wanted to. Any operation that has a NULL on either side should evaluate to `NULL`, which is interpreted as `false`. In Cassandra trying to do `CONTAINS NULL` causes an error. Fixes: #10359 The only problem is that this change is not backwards compatible. Some existing code might break. Closes #11730 * github.com:scylladb/scylladb: cql3: Make CONTAINS KEY NULL return false cql3: Make CONTAINS NULL return false	2022-10-06 17:08:56 +03:00
Petr Gusev	40bd9137f8	removenode: add warning in case of exception The removenode_abort logic that follows the warning may throw, in which case information about the original exception was lost. Fixes: #11722 Closes #11735	2022-10-06 13:49:26 +02:00
Benny Halevy	480b4759a9	idl: streaming: include stream_fwd.hh To keep the idl definition of plan_id from getting out of sync with the one in stream_fwd.hh. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11720	2022-10-06 13:49:26 +02:00
Kamil Braun	962ee9ba7b	Merge 'Make raft_group0 -> system_keyspace dependency explicit' from Pavel Emelyanov The raft_group0 code needs system_keyspace and now it gets one from gossiper. This gossiper->system_keyspace dependency is in fact artificial, gossiper doesn't need system ks, it's there only to let raft and snitch call gossiper.get_system_keyspace(). This makes raft use system ks directly, snitch is patched by another branch Closes #11729 * github.com:scylladb/scylladb: raft_group0: Use local reference raft_group0: Add system keyspace reference	2022-10-06 13:49:26 +02:00
Tomasz Grabiec	023f78d6ae	test: lib: random_mutation_generator: Introduce a switch for generating simpler mutations for easier debugging Closes #11731	2022-10-06 13:49:26 +02:00
Raphael S. Carvalho	14d6459efc	compaction: Make compaction_manager stop more robust Commit `aba475fe1d` accidentally fixed a race, which happens in the following sequence of events: 1) storage service starts drain() via API for example 2) main's abort source is triggered, calling compaction_manager's do_stop() via subscription. 2.1) do_stop() initiates the stop but doesn't wait for it. 2.2) compaction_manager's state is set to stopped, such that compaction_manager::stop() called in defer_verbose_shutdown() will wait for the stop and not start a new one. 3) drain() calls compaction_manager::drain() changing the state from stopped to disabled. 4) main calls compaction_manager::stop() (as described in 2.2) and incorrectly tries to stop the manager again, because the state was changed in step 3. `aba475fe1d` accidentally fixed this problem because drain() will no longer take place if it detects the shutdown process was initiated (it does so by ignoring drain request if abort source's subscription was unlinked). This shows us that looking at the state to determine if stop should be performed is fragile, because once the state changes from A to B, manager doesn't know the state was A. To make it robust, we can instead check if the future that stores stop's promise is engaged, meaning that the stop was already initiated and we don't have to start a new one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11711	2022-10-06 13:49:26 +02:00
Botond Dénes	753f671eaa	Merge 'dirty_memory_manager: simplify, clarify, and document' from Avi Kivity This series undoes some recent damage to clarity, then goes further by renaming terms around dirty_memory_manager to be clearer. Documentation is added. Closes #11705 * github.com:scylladb/scylladb: dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty" dirty_memory_manager: rename _virtual_region_group api: column_family: fix memtable off-heap memory reporting dirty_memory_manager: unscramble terminology	2022-10-06 13:49:26 +02:00
Tomasz Grabiec	4c8dc41f75	Merge 'Handle storage_io_error's ENOSPC when flushing' from Pavel Emelyanov This is the continuation of the `a980510654` that tries to catch ENOSPCs reported via storage_io_error similarly to how defer_verbose_shutdown() does on stop Closes #11664 * github.com:scylladb/scylladb: table: Handle storage_io_error's ENOSPC when flushing table: Rewrap retry loop	2022-10-06 13:49:26 +02:00
Raphael S. Carvalho	fcdff50a35	sstable_resharding_test: Switch to table_for_tests Important step for multiple compaction groups. As a bonus, lots of boilerplate is removed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	cf3f93304e	replica: Move compacted_undeleted_sstables into compaction group Compacted undeleted sstables are relevant for avoiding data resurrection in the purge path. As token ranges of groups won't overlap, it's better to isolate this data, so to prevent one group from interfering with another. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	56ac62bbd6	replica: Use correct compaction_group in try_flush_memtable_to_sstable() We need to pass the compaction_group received as a param, not the one retrieved via as_table_state(). Needed for supporting multiple groups. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	707ebf9cf7	replica: Make move_sstables_from_staging() robust and compaction group friendly Off-strategy can happen in parallel to view building. A semaphore is used to ensure they don't step on each other's toes. If off-strategy completes first, then move_sstables_from_staging() won't find the SSTable alive and won't reach code to add the file to the backlog tracker. If view building completes first, the SSTable exists, but it's not reshaped yet (has repair origin) and shouldn't be added to the backlog tracker. Off-strategy completion code will make sure new sstables added to main set are accounted by the backlog tracker, so move_sstables_from_staging() only needs to add to tracker files which are certainly not going through a reshape compaction. So let's take these facts into account to make the procedure more robust and compaction group friendly. Very welcome change for when multiple groups are supported. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	7d82373e3a	test: Rename column_family_for_tests to table_for_tests To avoid confusion, as replica::column_family was already renamed to replica::table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	e56bfecd8d	sstable_compaction_test: Use column_family_for_tests::as_table_state() instead That's important for multiple compaction groups. Once replica::table supports multiple groups, there will be no table::as_table_state(), so for testing table with a single group, we'll be relying on column_family_for_tests::as_table_state(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	5a028ca4dc	test: Don't expose compound set in column_family_for_tests The compound set shouldn't be exposed in main_sstables() because once we complete the switch to column_family_for_tests::table_state, compaction may try to remove or add elements to its set snapshot, and the compound set allows neither operation. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	b16d6c55b1	test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user() Needed once we switch to column_family_for_tests::table_state, so unit tests relying on correct value will still work Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	a6d24a763a	sstable_compaction_test: Merge table_state_for_test into column_family_for_tests This change will make table_state_for_test the table_state of column_family_for_tests. Today, a unit test has to keep a reference to both and logically couple them, which is error-prone. This change is also important when replica::table supports multiple compaction groups, so unit tests won't have to directly reference the table_state of table, but rather use the one managed by column_family_for_tests. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	6a0eabd17a	sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	a6affea008	sstable_compaction_test: Switch to table_state in compact_sstables() The switch is important once we have multiple compaction groups, as a single table may own several groups. There will no longer be a replica::table::as_table_state(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	2aa6518486	sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests Lots of boilerplate is reduced, and will also help to complete the switch from replica::table to compaction::table_state in the unit tests. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:18 -03:00
Jan Ciolek	a2c359a741	cql3: Make CONTAINS KEY NULL return false A binary operator like this: {1: 2, 3: 4} CONTAINS KEY NULL used to evaluate to `true`. This is wrong, any operation involving null on either side of the operator should evaluate to NULL, which is interpreted as false. This change is not backwards compatible. Some existing code might break. partially fixes: #10359 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-05 18:15:44 +02:00
Jan Ciolek	bbfef4b510	cql3: Make CONTAINS NULL return false A binary operator like this: [1, 2, 3] CONTAINS NULL used to evaluate to `true`. This is wrong, any operation involving null on either side of the operator should evaluate to NULL, which is interpreted as false. This change is not backwards compatible. Some existing code might break. partially fixes: #10359 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-10-05 18:15:15 +02:00
Pavel Emelyanov	fb8ed684fa	raft_group0: Use local reference It now grabs one from gossiper which is weird. A bit later it will be possible to remove gossiper->system_keyspace dependency Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-05 17:35:58 +03:00
Pavel Emelyanov	8570fe3c30	raft_group0: Add system keyspace reference The sharded<system_keyspace> is already started by the time raft_group0 is created Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-05 17:35:13 +03:00
Michał Chojnowski	a0204c17c5	treewide: remove mentions of seastar::thread::should_yield() thread_scheduling_group was retired many years ago. Remove the leftovers, they are confusing. Closes #11714	2022-10-05 12:26:37 +03:00
Michał Chojnowski	8aa24194b7	row_cache: remove a dead try...catch block in eviction All calls in the try block have been noexcept for some time. Remove the try...catch and the associated misleading comment to avoid confusing source code readers. Closes #11715	2022-10-05 12:23:47 +03:00
Benny Halevy	7286f5d314	sstables: mx/writer: optimize large data stats members order Since `_partition_size_entry` and `_rows_in_partition_entry` are accessed at the same time when updated, and similarly `_cell_size_entry` and `_elements_in_collection_entry`, place the member pairs closely together to improve data cache locality. Follow the same order when preparing the `scylla_metadata::large_data_stats` map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:54:04 +03:00
Benny Halevy	8c8a0adb40	sstables: mx/writer: keep large data stats entry as members To save the map lookup on the hot write path, keep each large data stats entry as a member in the writer object and build a map for storing the disk_hash in the scylla metadata only when finalizing it in consume_end_of_stream. Fixes #11686 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:54:04 +03:00
Benny Halevy	2c4ff71d2b	db: large_data_handler: dynamically update config thresholds make the various large data thresholds live-updateable and construct the observers and updaters in cql_table_large_data_handler to dynamically update the base large_data_handler class threshold members. Fixes #11685 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:53:40 +03:00
Benny Halevy	6d582054c0	utils/updateable_value: add transforming_value_updater Automatically updates a value from a utils::updateable_value, where the two can be of different types. An optional transform function can provide an additional transformation when updating the value, such as multiplying it by a factor for unit conversion. To be used for auto-updating the large data thresholds from the db::config. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:52:49 +03:00
Botond Dénes	4c13328788	Merge 'Return all sstables in table::get_sstable_set()' from Raphael "Raph" Carvalho This fixes a regression introduced by `1e7a444`, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes users of the interface, such as get_sstables_by_partition_key() (used by the API to return the list of sstable names that contain a particular key), to miss files in the maintenance set. Fixes https://github.com/scylladb/scylladb/issues/11681. Closes #11682 * github.com:scylladb/scylladb: replica: Return all sstables in table::get_sstable_set() sstables: Fix cloning of compound_sstable_set	2022-10-05 06:55:50 +03:00
Pavel Emelyanov	2c1ef0d2b7	sstables.hh: Remove unused headers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11709	2022-10-04 23:37:07 +02:00
Raphael S. Carvalho	827750c142	replica: Return all sstables in table::get_sstable_set() get_sstable_set() as its name implies is not confined to the main or maintenance set, nor to a specific compaction group, so let's make it return the compound set which spans all groups, meaning all sstables tracked by a table will be returned. This is a regression introduced in `1e7a444`. It affects the API to return sstable list containing a partition key, as sstables in maintenance would be missed, fooling users of the API like tools that could trust the output. Each compaction group is returning the main and maintenance set in table_state's main_sstable_set() and maintenance_sstable_set(), respectively. Fixes #11681. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-04 10:43:27 -03:00
Raphael S. Carvalho	eddf32b94c	sstables: Fix cloning of compound_sstable_set The intention was that its clone() would actually clone the content of an existing set into a new one, but the current impl is actually moving the sets instead of copying them. So the original set becomes invalid. Luckily, this problem isn't triggered as we're not exposing the compound set in the table's interface, so the compound_sstable_set::clone() method isn't being called. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-04 10:43:25 -03:00
Felipe Mendes	f67bb43a7a	locator: ec2_snitch: IMDSv2 support Access to AWS Metadata may be configured in three distinct ways: 1 - Optional HTTP tokens and HTTP endpoint enabled: The default as it works today 2 - Required HTTP tokens and HTTP endpoint enabled: Support for which is entirely missing today 3 - HTTP endpoint disabled: Which effectively forbids one to use Ec2Snitch or Ec2MultiRegionSnitch This commit makes the 2nd option the default, which is not only the AWS-recommended option, but is also entirely compatible with the 1st option. In addition, we now validate the HTTP response when querying the IMDS server. Therefore - should an HTTP 403 be received - Scylla will properly notify users on what they are trying to do incorrectly in their setup. The commit was tested under the following circumstances (covering all 3 variants): - Ec2Snitch: IMDSv2 optional & required, and HTTP server disabled. - Ec2MultiRegionSnitch: IMDSv2 optional & required, and HTTP server disabled. Refs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html https://github.com/scylladb/scylladb/issues/9987 Fixes: https://github.com/scylladb/scylladb/issues/10490 Closes: https://github.com/scylladb/scylladb/issues/10490 Closes #11636	2022-10-04 15:48:42 +03:00
Avi Kivity	37c6b46d26	dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty" The "virtual dirty" term is not very informative. "Virtual" means "not real", but it doesn't say in which way it isn't real. In this case, virtual dirty refers to real dirty memory, minus the portion of memtables that has been written to disk (but not yet sealed - in that case it would not be dirty in the first place). I chose to call "the portion of memtables that has been written to disk" as "spooled memory". At least the unique term will cause people to look it up and may be easier to remember. From that we have "unspooled memory". I plan to further change the accounting to account for spooled memory rather than unspooled, as that is a more natural term, but that is left for later. The documentation, config item, and metrics are adjusted. The config item is practically unused so it isn't worth keeping compatibility here.	2022-10-04 14:03:59 +03:00
Avi Kivity	d02c407769	dirty_memory_manager: rename _virtual_region_group Since we folded _real_region_group into _virtual_region_group, the "virtual" tag makes no sense any more, so remove it.	2022-10-04 14:01:45 +03:00
Avi Kivity	b0814bdd42	api: column_family: fix memtable off-heap memory reporting We report virtual memory used, but that's not a real accounting of the actual memory used. Use the correct real_memory_used() instead. Note that this isn't a recent regression and was probably broken forever. However nobody looks at this measure (and it's usually close to the correct value) so nobody noticed. Since it's so minor, I didn't bother filing an issue.	2022-10-04 13:56:29 +03:00
Avi Kivity	bc2fcf5187	dirty_memory_manager: unscramble terminology Before `95f31f37c1` ("Merge 'dirty_memory_manager: simplify region_group' from Avi Kivity"), we had two region_group objects, one _real_region_group and another _virtual_region_group, each with a set of "soft" and "hard" limits and related functions and members. In `95f31f37c1`, we merged _real_region_group into _virtual_region_group, but unfortunately the _real_region_group members received the "hard" prefix when they got merged. This overloads the meaning of "hard" - is it related to soft/hard limit or is it related to the real/virtual distinction? This patch applies some renaming to restore consistency. Anything that came from _virtual_region_group now has "virtual" in its name. Anything that came from _real_region_group now has "real" in its name. The terms are still pretty bad but at least they are consistent.	2022-10-04 13:56:28 +03:00
Kamil Braun	c200ae2228	Merge 'test.py topology Scylla REST API client' from Alecco - Separate `aiohttp` client code - Helper to access Scylla server REST API - Use helper both in `ScyllaClusterManager` (test.py process) and `ManagerClient` (pytest process) - Add `removenode` and `decommission` operations. Closes #11653 * github.com:scylladb/scylladb: test.py: Scylla REST methods for topology tests test.py: rename server_id to server_ip test.py: HTTP client helper test.py: topology pass ManagerClient instead of... test.py: delete unimplemented remove server test.py: fix variable name ssl name clash	2022-10-04 11:50:18 +02:00
Botond Dénes	169a8a66f2	compatible_ring_position_or_view: make it cheap to copy This class exists for one purpose only: to serve as glue code between dht::ring_position and boost::icl::interval_map. The latter requires that keys in its intervals are: * default constructible * copyable * have standalone compare operations For this reason we have to wrap `dht::ring_position` in a class, together with a schema to provide all this. This is `compatible_ring_position`. There is one further requirement by code using the interval map: it wants to do lookups without copying the lookup key(s). To solve this, we came up with `compatible_ring_position_or_view` which is a union of a key or a key view + schema. As we recently found out, boost::icl copies its keys a lot. It seems to assume these keys are cheap to copy and carelessly copies them around even when iterating over the map. But `compatible_ring_position_or_view` is not cheap to copy as it copies a `dht::ring_position` which allocates, and it does that via an `std::optional` and `std::variant` to add insult to injury. This patch makes said class cheap to copy, by getting rid of the variant and storing the `dht::ring_position` via a shared pointer. The view is stored separately and either points to the ring position stored in the shared pointer or to an outside ring position (for lookups). Fixes: #11669 Closes #11670	2022-10-04 12:00:21 +03:00
Piotr Dulikowski	51f813d89b	storage_proxy: update rate limited reads metric when coordinator rejects The decision to reject a read operation can either be made by replicas, or by the coordinator. In the second case, the scylla_storage_proxy_coordinator_read_rate_limited metric was not incremented, but it should have been. This commit fixes the issue. Fixes: #11651 Closes #11694	2022-10-04 10:33:58 +03:00
Pavel Emelyanov	9cd1f777a5	database.hh: Remove unused headers Use forward declarations when needed Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11667	2022-10-04 09:01:38 +03:00
Botond Dénes	5fd4b1274e	Merge 'compaction_manager: Don't let ENOSPC throw out of ::stop() method' from Pavel Emelyanov The seastar defer_stop() helper is cool, but it forwards any exception from the .stop() towards the caller. In case the caller is main() the exception causes Scylla to abort(). This fires, for example, in compaction_manager::stop() when it steps on ENOSPC Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11662 * github.com:scylladb/scylladb: compaction_manager: Swallow ENOSPCs in ::stop() exceptions: Mark storage_io_error::code() with noexcept	2022-10-04 08:54:22 +03:00
Nadav Har'El	3a30fbd56c	test/alternator: fix timeout in flaky test test_ttl_stats The test `test_metrics.py::test_ttl_stats` tests the metrics associated with Alternator TTL expiration events. It normally finishes in less than a second (the TTL scanning is configured to run every 0.5 seconds), so we arbitrarily set a 60 second timeout for this test to allow for extremely slow test machines. But in some extreme cases even this was not enough - in one case we measured the TTL scan to take 63 seconds. So in this patch we increase the timeout in this test from 60 seconds to 120 seconds. We already did the same change in other Alternator TTL tests in the past - in commit `746c4bd`. Fixes #11695 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11696	2022-10-04 08:50:51 +03:00
Benny Halevy	46ebffcc93	db/large_data_handler: cql_table_large_data_handler: record large_collections When the large_collection_detection cluster feature is enabled, select the internal_record_large_cells_and_collections method to record the large collection cell, storing also the collection_elements column. We want to do that only when the cluster feature is enabled to facilitate rollback in case rolling upgrade is aborted, otherwise system.large_cells won't be backward compatible and will have to be deleted manually. Delete the sstable from system.large_cells if it contains elements_in_collection above threshold. Closes #11449 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:10 +03:00
Benny Halevy	3f8bba202f	db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler For recording collection_elements of large_collections when the large_collection_detection feature is enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:10 +03:00
Benny Halevy	dc4e7d8e01	db/large_data_handler: cql_table_large_data_handler: move ctor out of line Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:09 +03:00
Benny Halevy	f4c3070002	docs: large-rows-large-cells-tables: fix typos Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:09 +03:00
Benny Halevy	2f49eebb04	db/system_keyspace: add collection_elements column to system.large_cells And bump the schema version offset since the new schema should be distinguishable from the previous one. Refs scylladb/scylladb#11660 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:08 +03:00
Benny Halevy	9ad41c700e	gms/feature_service: add large_collection_detection cluster feature And a corresponding db::schema_feature::SCYLLA_LARGE_COLLECTIONS We want to enable the schema change supporting collection_elements only when all nodes are upgraded so that we can roll back if the rolling upgrade process is aborted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:07 +03:00
Benny Halevy	9eeb8f2971	test: sstable_3_x_test: add test_sstable_too_many_collection_elements Test that collections with too many elements are detected properly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:07 +03:00
Benny Halevy	3c11937b00	test: lib: simple_schema: add support for optional collection column Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:06 +03:00
Benny Halevy	7b5f2d2e53	test: lib: simple_schema: build schema in ctor body Rather than when initializing _s. Prepare for adding an optional collection column. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:06 +03:00
Benny Halevy	db01641a44	test: lib: simple_schema: cql: define s1 as static only if built this way Keep the with_static ctor parameter as private member to be used by the cql() method to define s1 either as static or not. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:05 +03:00
Benny Halevy	6dadca2648	db/large_data_handler: maybe_record_large_cells: consider collection_elements Detect large_collections when the number of collection_elements is above the configured threshold. Next step would be to record the number of collection_elements in the system.large_cells table, when the respective cluster feature is enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:05 +03:00
Benny Halevy	27ee75c54e	db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries Log in debug level when deleting large data entry from system table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:04 +03:00
Benny Halevy	7dead10742	sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells And update the sstable elements_in_collection stats entry. Next step would be to forward it to large_data_handler().maybe_record_large_cells(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:41:58 +03:00
Benny Halevy	54ab038825	sstables: mx/writer: add large_data_type::elements_in_collection Add a new large_data_stats type and entry for keeping the collection_elements_count_threshold and the maximum value of collection_elements. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:41:56 +03:00
Benny Halevy	a107f583fd	db/large_data_handler: get the collection_elements_count_threshold Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:11 +03:00
Benny Halevy	167ec84eeb	db/config: add compaction_collection_elements_count_warning_threshold Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:10 +03:00
Benny Halevy	5e88e6267e	test: sstable_3_x_test: add test_sstable_write_large_cell based on cell size threshold. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:09 +03:00
Benny Halevy	3980415d97	test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:09 +03:00
Benny Halevy	3eb4cda8ea	test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:08 +03:00
Benny Halevy	0a9d3f24e6	test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T This way, the boost infrastructure prints the offending values if the test assertion fails. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:07 +03:00
Benny Halevy	9668dd0e2d	test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random So it would be reproducible based on the test random-seed Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:30:51 +03:00
Kamil Braun	114419d6ab	service/raft: raft_group0_client: read on-disk and in-memory group0 upgrade atomically `set_group0_upgrade_state` writes the on-disk state first, then in-memory state second, both under a write lock. `get_group0_upgrade_state` would only take the lock if the in-memory state was `use_pre_raft_procedures`. If there's an external observer who watches the on-disk state to decide whether Raft upgrade finished yet, the following could happen: 1. The node wrote `use_post_raft_procedures` to disk but didn't update the in-memory state yet, which is still `synchronize`. 2. The external client reads the table and sees that the state is `use_post_raft_procedures`, and deduces that upgrade has finished. 3. The external client immediately tries to perform a schema change. The schema change code calls `get_group0_upgrade_state` which does not take the read lock and returns `synchronize`. The schema change gets denied because schema changes are not allowed in `synchronize`. Make sure that `get_group0_upgrade_state` cannot execute in-between writing to disk and updating the in-memory state by always taking the read lock before reading the in-memory state. As it was before, it will immediately drop the lock if the state is not `use_pre_raft_procedures`. This is useful for upgrade tests, which read the on-disk state to decide whether upgrade has finished and often try to perform a schema change immediately afterwards. Closes #11672	2022-10-03 19:04:16 +02:00
Alejo Sanchez	abf1425ad4	test.py: Scylla REST methods for topology tests Provide a helper client for Scylla REST requests. Use it on both ScyllaClusterManager (e.g. remove node, test.py process) and ManagerClient (e.g. get uuid, pytest process). For now keep using IPs as key in ScyllaCluster, but this will be changed to UUID -> IP in the future. So, for now, pass both independently. Note the UUID must be obtained from the server before stopping it. Refresh client driver connection when decommissioning or removing a node. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-03 19:01:03 +02:00
Alejo Sanchez	86c752c2a0	test.py: rename server_id to server_ip In ScyllaCluster currently servers are tracked by the host IP. This is not the host id (UUID). Fix the variable name accordingly Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-03 19:01:03 +02:00
Alejo Sanchez	a7a0b446f0	test.py: HTTP client helper Split aiohttp client to a shared helper file. While there, move aiohttp session setup back to constructors. When there were teardown issues, it looked like they could be caused by the aiohttp session being created outside a coroutine. But this was proven not to be the case after recent fixes. So move it back to the ManagerClient constructor. On the other hand, create a close() coroutine to stop the aiohttp session. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-03 19:01:03 +02:00
Alejo Sanchez	41dbdf0f70	test.py: topology pass ManagerClient instead of... cql connection When there are topology changes, the driver needs to be updated. Instead of passing the CassandraCluster.Connection, pass the ManagerClient instance which manages the driver connection inside of it. Remove workaround for test_raft_upgrade. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-03 19:00:47 +02:00
Alejo Sanchez	0c3a06d0d7	test.py: delete unimplemented remove server Delete the unused, unimplemented, and broken version of remove server. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-03 18:57:38 +02:00
Alejo Sanchez	98bc4c198f	test.py: fix variable name ssl name clash Change variable ssl to use_ssl to avoid clash with ssl module. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-10-03 18:57:38 +02:00
Avi Kivity	7626fd573a	storage_proxy: reindent after coroutinization	2022-10-03 19:33:39 +03:00
Avi Kivity	019b18b232	storage_proxy: convert handle_read_digest() to a coroutine The do_with() makes it at least a break-even, but there's some allocating continuations that make it a win. A variable named cmd had two different definitions (a value and a lw_shared_ptr) that lived in different scopes. I renamed one to cmd1 to disambiguate. We should probably move that to the caller, but that is not done here.	2022-10-03 19:33:39 +03:00
Avi Kivity	aa5f4bf1f3	storage_proxy: convert handle_read_mutation_data() to a coroutine The do_with() makes it at least a break-even, but there's some allocating continuations that make it a win. A variable named cmd had two different definitions (a value and a lw_shared_ptr) that lived in different scopes. I renamed one to cmd1 to disambiguate. We should probably move that to the caller, but that is not done here.	2022-10-03 19:33:39 +03:00
Avi Kivity	bcd134e9b8	storage_proxy: convert handle_read_data() to a coroutine The do_with() makes it at least a break-even, but there's some allocating continuations that make it a win. A variable named cmd had two different definitions (a value and a lw_shared_ptr) that lived in different scopes. I renamed one to cmd1 to disambiguate. We should probably move that to the caller, but that is not done here.	2022-10-03 19:33:39 +03:00
Avi Kivity	167c8b1b5e	storage_proxy: convert handle_write() to a coroutine A do_with() makes this at least a break-even. Some internal lambdas were not converted since they commonly do not allocate or block. A finally() continuation is converted to seastar::defer().	2022-10-03 19:33:39 +03:00
Avi Kivity	741d6609a5	storage_proxy: convert handle_counter_mutation() to a coroutine The do_with means the coroutine conversion is free, and conversion of parallel_for_each to coroutine::parallel_for_each saves a possible allocation (though it would not usually have been allocated). An inner continuation is not converted since it usually doesn't block, and therefore doesn't allocate.	2022-10-03 19:33:39 +03:00
Avi Kivity	ac5fae4b93	storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine It's simpler, and the do_with() allocation + task cancels out the coroutine allocation + task.	2022-10-03 19:33:29 +03:00
Pavel Emelyanov	d22b130af1	compaction_manager: Swallow ENOSPCs in ::stop() When being stopped, the compaction manager may step on ENOSPC. This is not a reason to fail the stopping process with an abort; it is better to log a warning and proceed as if nothing happened. refs: #11245 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-03 18:54:48 +03:00
Pavel Emelyanov	7ba1f551f3	exceptions: Mark storage_io_error::code() with noexcept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-03 18:50:06 +03:00
Kamil Braun	67ee6500e3	service/raft: raft_group_registry: pass `direct_fd_pinger` by reference It was passed to `raft_group_registry::direct_fd_proxy` by value. That is a bug, we want to pass a reference to the instance that is living inside `gossiper`. Fortunately this bug didn't cause problems, because the pinger is only used for one function, `get_address`, which looks up an address in a map and if it doesn't find it, accesses the map that lives inside `gossiper` on shard 0 (and then caches it in the local copy). Explicitly delete the copy constructor of `direct_fd_pinger` so this doesn't happen again. Closes #11661	2022-10-03 16:40:35 +02:00
Tomasz Grabiec	9dae2b9c02	Merge 'mutation_fragment_stream_validator: various API improvements' from Botond Dénes The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had. Active tombstone validation is pushed down to the low level validator. The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore. Closes #11614 * github.com:scylladb/scylladb: mutation_fragment_stream_validator: make interface more robust mutation_fragment_stream_validator: add reset() to validating filter mutation_fragment_stream_validator: move active tombstone validation into low level validator	2022-10-03 16:23:46 +02:00
Botond Dénes	95f31f37c1	Merge 'dirty_memory_manager: simplify region_group' from Avi Kivity region_group evolved as a tree, each node of which contains some regions (memtables). Each node has some constraints on memory, and can start flushing and/or stop allocation into its memtables and those below it when those constraints are violated. Today, the tree has exactly two nodes, only one of which can hold memtables. However, all the complexity of the tree remains. This series applies some mechanical code transformations that remove the tree structure and all the excess functionality, leaving a much simpler structure behind. Before: - a tree of region_group objects - each with two parameters: soft limit and hard limit - but only two instances ever instantiated After: - a single region_group object - with three parameters - two from the bottom instance, one from the top instance Closes #11570 * github.com:scylladb/scylladb: dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config dirty_memory_manager: simplify region_group::update() dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief() dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group dirty_memory_manager: remove accessors around region_group::_under_hard_pressure dirty_memory_manager: merge memory_hard_limit into region_group dirty_memory_manager: rename members in memory_hard_limit dirty_memory_manager: fold do_update() into region_group::update() dirty_memory_manager: simplify memory_hard_limit's do_update dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit) dirty_memory_manager: adjust soft_limit threshold check dirty_memory_manager: drop memory_hard_limit::_name dirty_memory_manager: simplify memory_hard_limit configuration dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group} dirty_memory_manager: stop inheriting from region_group_reclaimer dirty_memory_manager: test: unwrap region_group_reclaimer dirty_memory_manager: change region_group_reclaimer configuration to a struct dirty_memory_manager: convert region_group_reclaimer to callbacks dirty_memory_manager: consolidate region_group_reclaimer constructors dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief dirty_memory_manager: drop unused parameter to memory_hard_limit constructor dirty_memory_manager: drop memory_hard_limit::shutdown() dirty_memory_manager: split region_group hierarchy into separate classes dirty_memory_manager: extract code block from region_group::update dirty_memory_manager: move more allocation_queue functions out of region_group dirty_memory_manager: move some allocation queue related function definitions outside class scope dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue dirty_memory_manager: remove support for multiple subgroups	2022-10-03 13:22:47 +03:00
Anna Stuchlik	3950a1cac8	doc: apply the feedback to improve clarity	2022-10-03 11:14:51 +02:00
Botond Dénes	5621cdd7f9	db/view/view_builder: don't drop partition and range tombstones when resuming The view builder builds the views from a given base table in batches of view_builder::batch_size rows. After processing this many rows, it suspends so the view builder can switch to building views for other base tables in the name of fairness. When resuming the build step for a given base table, it reuses the reader used previously (also serving the role of a snapshot, pinning sstables read from). The compactor however is created anew. As the reader can be in the middle of a partition, the view builder injects a partition start into the compactor to prime it for continuing the partition. This however only included the partition-key, crucially missing any active tombstones: partition tombstone or -- since the v2 transition -- active range tombstone. This can result in base rows covered by either of these being resurrected and the view builder generating view updates for them. This patch solves this by using the detach-state mechanism of the compactor which was explicitly developed for situations like this (in the range scan code) -- resuming a read with the readers kept but the compactor recreated. Also included are two test cases reproducing the problem, one with a range tombstone, the other with a partition tombstone. Fixes: #11668 Closes #11671	2022-10-03 11:28:22 +03:00
Avi Kivity	2c744628ae	Update abseil submodule * abseil 9e408e05...7f3c0d78 (193): > Allows absl::StrCat to accept types that implement AbslStringify() > Merge pull request #1283 from pateldeev:any_inovcable_rename_true > Cleanup: SmallMemmove nullify should also be limited to 15 bytes > Cleanup: implement PrependArray and PrependPrecise in terms of InlineData > Cleanup: Move BitwiseCompare() to InlineData, and make it layout independent. > Change kPower10Table bounds to be half-open > Cleanup some InlineData internal layout specific details from cord.h > Improve the comments on the implementation of format hooks adl tricks. > Expand LogEntry method docs. > Documentation: Remove an obsolete note about the implementation of `Cord`. > `absl::base_internal::ReadLongFromFile` should use `O_CLOEXEC` and handle interrupts to `read` > Allows absl::StrFormat to accept types which implement AbslStringify() > Add common_policy_traits - a subset of hash_policy_traits that can be shared between raw_hash_set and btree. > Split configuration related to cycle clock into separate headers > Fix -Wimplicit-int-conversion and -Wsign-conversion warnings in btree. > Implement Eisel-Lemire for from_chars<float> > Import of CCTZ from GitHub. > Adds support for "%v" in absl::StrFormat and related functions for bool values. Note that %v prints bool values as "true" and "false" rather than "1" and "0". > De-pointerize LogStreamer::stream_, and fix move ctor/assign preservation of flags and other stream properties. > Explicitly disallows modifiers for use with %v. > Change the macro ABSL_IS_TRIVIALLY_RELOCATABLE into a type trait - absl::is_trivially_relocatable - and move it from optimization.h to type_traits.h. > Add sparse and string copy constructor benchmarks for hash table. > Make BTrees work with custom allocators that recycle memory. > Update the readme, and (internally) fix some export processes to better keep it up-to-date going forward. 
> Add the fact that CHECK_OK exits the program to the comment of CHECK_OK. > Adds support for "%v" in absl::StrFormat and related functions for numeric types, including integer and floating point values. Users may now specify %v and have the format specifier deduced. Integer values will print according to %d specifications, unsigned values will use %u, and floating point values will use %g. Note that %v does not work for `char` due to ambiguity regarding the intended output. Please continue to use %c for `char`. > Implement correct move constructor and assignment for absl::strings_internal::OStringStream, and mark that class final. > Add more options for `BM_iteration` in order to see better picture for choosing trade off for iteration optimizations. > Change `EndComparison` benchmark to not measure iteration. Also added `BM_Iteration` separately. > Implement Eisel-Lemire for from_chars<double> > Add `-llog` to linker options when building log_sink_set in logging internals. > Apply clang-format to btree.h. > Improve failure message: tell the values we don't like. > Increase the number of per-ObjFile program headers we can expect. > Fix "unsafe narrowing" warnings in absl, 8/n. > Fix format string error with an explicit cast > Add a case to detect when the Bazel compiler string is explicitly set to "gcc", instead of just detecting Bazel's default "compiler" string. > Fix "unsafe narrowing" warnings in absl, 10/n. > Fix "unsafe narrowing" warnings in absl, 9/n. > Fix stacktrace header includes > Add a missing dependency on :raw_logging_internal > CMake: Require at least CMake 3.10 > CMake: install artifacts reflect the compiled ABI > Fixes bug so that `%v` with modifiers doesn't compile. `%v` is not intended to work with modifiers because the meaning of modifiers is type-dependent and `%v` is intended to be used in situations where the type is not important. Please continue using if `%s` if you require format modifiers. 
> Convert algorithm and container benchmarks to cc_binary > Merge pull request #1269 from isuruf:patch-1 > InlinedVector: Small improvement to the max_size() calculation > CMake: Mark hash_testing as a public testonly library, as it is with Bazel > Remove the ABSL_HAVE_INTRINSIC_INT128 test from pcg_engine.h > Fix ClangTidy warnings in btree.h and btree_test.cc. > Fix log StrippingTest on windows when TCHAR = WCHAR > Refactors checker.h and replaces recursive functions with iterative functions for readability purposes. > Refactors checker.h to use if statements instead of ternary operators for better readability. > Import of CCTZ from GitHub. > Workaround for ASAN stack safety analysis problem with FixedArray container annotations. > Rollback of fix "unsafe narrowing" warnings in absl, 8/n. > Fix "unsafe narrowing" warnings in absl, 8/n. > Changes mutex profiling > InlinedVector: Correct the computation of max_size() > Adds support for "%v" in absl::StrFormat and related functions for string-like types (support for other builtin types will follow in future changes). Rather than specifying %s for strings, users may specify %v and have the format specifier deduced. Notably, %v does not work for `const char` because we cannot be certain if %s or %p was intended (nor can we be certain if the `const char` was properly null-terminated). If you have a `const char` you know is null-terminated and would like to work with %v, please wrap it in a `string_view` before using it. > Fixed header guards to match style guide conventions. > Typo fix > Added some more no_test.. tags to build targets for controlling testing. > Remove includes which are not used directly. > CMake: Add an option to build the libraries that are used for writing tests without requiring Abseil's tests be built (default=OFF) > Fix "unsafe narrowing" warnings in absl, 7/n. > Fix "unsafe narrowing" warnings in absl, 6/n. 
> Release the Abseil Logging library > Switch time_state to explicit default initialization instead of value initialization. > spinlock.h: Clean up includes > Fix minor typo in absl/time/time.h comment: "ToDoubleNanoSeconds" -> "ToDoubleNanoseconds" > Support compilers that are unknown to CMake > Import of CCTZ from GitHub. > Change bit_width(T) to return int rather than T. > Import of CCTZ from GitHub. > Merge pull request #1252 from jwest591:conan-fix > Don't try to enable use of ARM NEON intrinsics when compiling in CUDA device mode. They are not available in that configuration, even if the host supports them. > Fix "unsafe narrowing" warnings in absl, 5/n. > Fix "unsafe narrowing" warnings in absl, 4/n. > Import of CCTZ from GitHub. > Update Abseil platform support policy to point to the Foundational C++ Support Policy > Import of CCTZ from GitHub. > Add --features=external_include_paths to Bazel CI to ignore warnings from dependencies > Merge pull request #1250 from jonathan-conder-sm:gcc_72 > Merge pull request #1249 from evanacox:master > Import of CCTZ from GitHub. > Merge pull request #1246 from wxilas21:master > remove unused includes and add missing std includes for absl/status/status.h > Sort INTERNAL_DLL_TARGETS for easier maintenance. > Disable ABSL_HAVE_STD_IS_TRIVIALLY_ASSIGNABLE for clang-cl. > Map the absl::is_trivially_ functions to their std impl > Add more SimpleAtod / SimpleAtof test coverage > debugging: handle alternate signal stacks better on RISCV > Revert change "Fix "unsafe narrowing" warnings in absl, 4/n.". > Fix "unsafe narrowing" warnings in absl, 3/n. > Fix "unsafe narrowing" warnings in absl, 4/n. > Fix "unsafe narrowing" warnings in absl, 2/n. > debugging: honour `STRICT_UNWINDING` in RISCV path > Fix "unsafe narrowing" warnings in absl, 1/n. > Add ABSL_IS_TRIVIALLY_RELOCATABLE and ABSL_ATTRIBUTE_TRIVIAL_ABI macros for use with clang's __is_trivially_relocatable and [[clang::trivial_abi]]. 
> Merge pull request #1223 from ElijahPepe:fix/implement-snprintf-safely > Fix frame pointer alignment check. > Fixed sign-conversion warning in code. > Import of CCTZ from GitHub. > Add missing include for std::unique_ptr > Do not re-close files on EINTR > Renamespace absl::raw_logging_internal to absl::raw_log_internal to match (upcoming) non-raw logging namespace. > Check for negative return values from ReadFromOffset > Use HTTPS RFC URLs, which work regardless of the browser's locale. > Avoid signedness change when casting off_t > Internal Cleanup: removing unused internal function declaration. > Make Span complain if constructed with a parameter that won't outlive it, except if that parameter is also a span or appears to be a view type. > any_invocable_test: Re-enable the two conversion tests that used to fail under MSVC > Add GetCustomAppendBuffer method to absl::Cord > debugging: add hooks for checking stack ranges > Minor clang-tidy cleanups > Support [[gnu::abi_tag("xyz")]] demangling. > Fix -Warray-parameter warning > Merge pull request #1217 from anpol:macos-sigaltstack > Undo documentation change on erase. > Improve documentation on erase. > Merge pull request #1216 from brjsp:master > string_view: conditional constexpr is no longer needed for C++14 > Make exponential_distribution_test a bigger test (timeout small -> moderate). > Move Abseil to C++14 minimum > Revert commit f4988f5bd4176345aad2a525e24d5fd11b3c97ea > Disable C++11 testing, enable C++14 and C++20 in some configurations where it wasn't enabled > debugging: account for differences in alternate signal stacks > Import of CCTZ from GitHub. > Run flaky test in fewer configurations > AnyInvocable: Move credits to the top of the file > Extend visibility of :examine_stack to an upcoming Abseil Log. > Merge contiguous mappings from the same file. 
> Update versions of WORKSPACE dependencies > Use ABSL_INTERNAL_HAS_SSE2 instead of __SSE2__ > PR #1200: absl/debugging/CMakeLists.txt: link with libexecinfo if needed > Update GCC floor container to use Bazel 5.2.0 > Update GoogleTest version used by Abseil > Release absl::AnyInvocable > PR #1197: absl/base/internal/direct_mmap.h: fix musl build on mips > absl/base/internal/invoke: Ignore bogus warnings on GCC >= 11 > Revert GoogleTest version used by Abseil to commit 28e1da21d8d677bc98f12ccc7fc159ff19e8e817 > Update GoogleTest version used by Abseil > explicit_seed_seq_test: work around/disable bogus warnings in GCC 12 > any_test: expand the any emplace bug suppression, since it has gotten worse in GCC 12 > absl::Time: work around bogus GCC 12 -Wrestrict warning > Make absl::StdSeedSeq an alias for std::seed_seq > absl::Optional: suppress bogus -Wmaybe-uninitialized GCC 12 warning > algorithm_test: suppress bogus -Wnonnull warning in GCC 12 > flags/marshalling_test: work around bogus GCC 12 -Wmaybe-uninitialized warning > counting_allocator: suppress bogus -Wuse-after-free warning in GCC 12 > Prefer to fallback to UTC when the embedded zoneinfo data does not contain the requested zone. > Minor wording fix in the comment for ConsumeSuffix() > Tweak the signature of status_internal::MakeCheckFailString as part of an upcoming change > Fix several typos in comments. > Reformulate documentation of ABSL_LOCKS_EXCLUDED. > absl/base/internal/invoke.h: Use ABSL_INTERNAL_CPLUSPLUS_LANG for language version guard > Fix C++17 constexpr storage deprecation warnings > Optimize SwissMap iteration by another 5-10% for ARM > Add documentation on optional flags to the flags library overview. 
> absl: correct the stack trace path on RISCV > Merge pull request #1194 from jwnimmer-tri:default-linkopts > Remove unintended defines from config.h > Ignore invalid TZ settings in tests > Add ABSL_HARDENING_ASSERTs to CordBuffer::SetLength() and CordBuffer::IncreaseLengthBy() > Fix comment typo about absl::Status<T*> > In b-tree, support unassignable value types. > Optimize SwissMap for ARM by 3-8% for all operations > Release absl::CordBuffer > InlinedVector: Limit the scope of the maybe-uninitialized warning suppression > Improve the compiler error by removing some noise from it. The "deleted" overload error is useless to users. By passing some dummy string to the base class constructor we use a valid constructor and remove the unintended use of the deleted default constructor. > Merge pull request #714 from kgotlinux:patch-2 > Include proper #includes for POSIX thread identity implementation when using that implementation on MinGW. > Rework NonsecureURBGBase seed sequence. > Disable tests on some platforms where they currently fail. > Fixed typo in a comment. > Rollforward of commit ea78ded7a5f999f19a12b71f5a4988f6f819f64f. > Add an internal helper for logging (upcoming). > Merge pull request #1187 from trofi:fix-gcc-13-build > Merge pull request #1189 from renau:master > Allow for using b-tree with `value_type`s that can only be constructed by the allocator (ignoring copy/move constructors). > Stop using sleep timeouts for Linux futex-based SpinLock > Automated rollback of commit f2463433d6c073381df2d9ca8c3d8f53e5ae1362. > time.h: Use uint32_t literals for calls to overloaded MakeDuration > Fix typos. > Clarify the behaviour of `AssertHeld` and `AssertReaderHeld` when the calling thread doesn't hold the mutex. 
> Enable __thread on Asylo > Add implementation of is_invocable_r to absl::base_internal for C++ < 17, define it as alias of std::is_invocable_r when C++ >= 17 > Optimize SwissMap iteration for aarch64 by 5-6% > Fix detection of ABSL_HAVE_ELF_MEM_IMAGE on Haiku > Don’t use generator expression to build .pc Libs lines > Update Bazel used on MacOS CI > Import of CCTZ from GitHub. Closes #11687	2022-10-03 11:06:37 +03:00
Botond Dénes	f4540ef0d6	Merge 'Upgrade nix devenv' from Michael Livshin To recap: the Nix devenv ({default,shell,flake}.nix and friends) in Scylla is a nicer (for those who consider it so, that is) alternative to dbuild: a completely deterministic build environment without Docker. In theory we could support much more (creating installable packages, container images, various deployment affordances, etc. -- Nix is, among other things, a kind of parallel-to-everything-else devops realm) but there is clearly no demand and besides duplicating the work the release team is already doing (and doing just fine, needless to say) would be pointless and wasteful. This PR reflects the accumulated changes that I have been carrying locally for the past year or so. The version currently in master _probably_ can still build Scylla, but that Scylla certainly would not pass unit tests. What the previous paragraph seems to mean is, apparently I'm the only active user of Nix devenv for Scylla. Which, in turn, presents some obvious questions for the maintainers: - Does this need to live in the Scylla source at all? (The changes to non-Nix-specific parts are minimal and unobtrusive, but they are still changes) - If it's left in, who is going to maintain it going forward, should more users somehow appear? (I'm perfectly willing to fix things up when alerted, but no timeliness guarantees) Closes #9557 * github.com:scylladb/scylladb: nix: add README.md build: improvements & upgrades to Nix dev environment build: allow setting SCYLLA_RELEASE from outside	2022-10-03 09:40:09 +03:00
Botond Dénes	2041744132	Merge 'readers/multishard: don't mix coroutines and continuations in the do_fill_buffer()' from Avi Kivity The combination is hard to read and modify. Closes #11665 * github.com:scylladb/scylladb: readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine	2022-10-03 06:51:20 +03:00
Nadav Har'El	b8f8eb8710	Merge 'Improve test.py logging' from Kamil Braun Include the unique test name (the unique name distinguishes between different test repeats) and the test case name where possible. Improve printing of clusters: include the cluster name and stopped servers. Fix some logging calls and add new ones. Examples: ``` ------ Starting test test_topology ------ ``` became this: ``` ------ Starting test test_topology.1::test_add_server_add_column ------ ``` This: ``` INFO> Leasing Scylla cluster {127.191.142.1, 127.191.142.2, 127.191.142.3} for test test_add_server_add_column ``` became this: ``` INFO> Leasing Scylla cluster ScyllaCluster(name: 02cdd180-40d1-11ed-8803-3c2c30d32d96, running: {127.144.164.1, 127.144.164.2, 127.144.164.3}, stopped: {}) for test test_topology.1::test_add_server_add_column ``` Closes #11677 * github.com:scylladb/scylladb: test/pylib: scylla_cluster: improve cluster printing test/pylib: don't pass test_case_name to after-test endpoint test/pylib: scylla_cluster: track current test case name and print it test.py: pass the unique test name (e.g. `test_topology.1`) to cluster manager test/pylib: scylla_cluster: pass the test case name to `before_test` test/pylib: use "test_case_name" variable name when talking about test cases	2022-10-02 20:48:50 +03:00
Pavel Emelyanov	2b8636a2a9	storage_proxy.hh: Remove unused headers Add needed forward declarations and fix indirect inclusions in some .ccs Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11679	2022-10-02 20:48:50 +03:00
Michał Chojnowski	4563cbe595	logalloc: prevent false positives in reclaim_timer reclaim_timer uses a coarse clock, but does not account for the measurement error introduced by that -- it can falsely report reclaims as stalls, even if they are shorter by a full coarse clock tick than the requested threshold (blocked-reactor-notify-ms). Notably, if the stall threshold happens to be smaller than or equal to the coarse clock resolution, Scylla's log gets spammed with false stall reports. The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is typically equal to 1 ms or 4 ms, and stall thresholds of this order can occur in practice. Eliminate false positives by requiring the measured reclaim duration to be at least 1 clock tick longer than the configured threshold for it to be considered a stall. Fixes #10981 Closes #11680	2022-10-02 13:41:40 +03:00
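The rule described in the reclaim_timer commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the logalloc code: `is_stall`, `coarse_tick`, and the 4 ms tick value are hypothetical names and values chosen for the example.

```cpp
#include <chrono>

using namespace std::chrono_literals;
using duration = std::chrono::milliseconds;

// Hypothetical: resolution of the coarse clock (1/CONFIG_HZ), assumed to be 4 ms here.
constexpr duration coarse_tick = 4ms;

// Report a stall only when the measured duration is at least one full clock
// tick longer than the threshold, so that the coarse clock's measurement
// error cannot produce a false positive on its own.
constexpr bool is_stall(duration measured, duration threshold) {
    return measured >= threshold + coarse_tick;
}
```

With a 4 ms threshold and a 4 ms tick, a reclaim measured at 7 ms would not be reported under this sketch, while one measured at 8 ms would.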
Avi Kivity	372eadf542	Merge "perftune related improvements in scylla_* scripts" from Vlad Zolotarov " This series adds a long-awaited transition of our auto-generation code to irq_cpu_mask instead of 'mode' in perftune.yaml. And then it fixes a regression in scylla_prepare perftune.yaml auto-generation logic. " * 'scylla_prepare_fix_regression-v1' of https://github.com/vladzcloudius/scylla: scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions scylla_prepare: stop generating 'mode' value in perftune.yaml	2022-10-02 13:25:13 +03:00
Michael Livshin	d178ac17dc	nix: add README.md Signed-off-by: Michael Livshin <repo@cmm.kakpryg.net>	2022-10-02 12:26:02 +03:00
Michael Livshin	7bd13be3f2	build: improvements & upgrades to Nix dev environment * Add some more useful stuff to the shell environment, so it actually works for debugging & post-mortem analysis. * Wrap ccache & distcc transparently (distcc will be used unless NODISTCC is set to a non-empty value in the environment; ccache will be used if CCACHE_DIR is not empty). * Package the Scylla Python driver (instead of the C* one). * Catch up to misc build/test requirements (including optional) by requiring or custom-packaging: wasmtime 0.29.0, cxxbridge, pytest-asyncio, liburing. * Build statically-linked zstd in a saner and more idiomatic fashion. * In pure builds (where sources lack Git metadata), derive SCYLLA_RELEASE from source hash. * Refactor things for more parameterization. * Explicitly stub out installPhase (seeing that "nix build" succeeds up to installPhase means we didn't miss any dependencies). * Add flake support. * Add copious comments. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-10-02 11:47:16 +03:00
Michael Livshin	839d8f40e6	build: allow setting SCYLLA_RELEASE from outside The extant logic for deriving the value of SCYLLA_RELEASE from the source tree makes the following assumptions: * The tree being built includes Git metadata. * The value of `date` is trustworthy and interesting. * There are no uncommitted changes (those relevant to building, anyway). The above assumptions are either irrelevant or problematic in pure build environments (such as the sandbox set up by `nix-build`): * Pure builds use cleaned-up sources with all timestamps reset to Unix time 0. Those cleaned-up sources are saved (in the Nix store, for example) and content-hashed, so leaving the (possibly huge) Git metadata increases the time to copy the sources and wastes disk space (in fact, Nix in flake mode strips `.git` unconditionally). * Pure builds run in a sandbox where time is, likewise, reset to Unix time 0, so the output of `date` is neither informative nor useful. Now, the only build step that uses Git metadata in the first place is the SCYLLA_RELEASE value derivation logic. So, essentially, answering the question "is the Git metadata needed to build Scylla" is a matter of definition, and is up to us. If we elect to ignore Git metadata and current time, we can derive SCYLLA_RELEASE value from the content hash of the cleaned-up tree, regardless of the way that tree was arrived at. This change makes it possible to skip the derivation of SCYLLA_RELEASE value from Git metadata and current time by way of setting SCYLLA_RELEASE in the environment. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-10-02 11:47:16 +03:00
Avi Kivity	17b1cb4434	dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config Place it along the other parameters.	2022-09-30 22:17:37 +03:00
Avi Kivity	ecf30ee469	dirty_memory_manager: simplify region_group::update() We notice there are two separate conditions controlling a call to a single outcome, notify_pressure_relief(). Merge them into a single boolean variable.	2022-09-30 22:15:45 +03:00
Avi Kivity	230fff299a	dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers It is trivial.	2022-09-30 22:11:01 +03:00
Avi Kivity	12b81173b9	dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief() Remove synthetic "rg" local.	2022-09-30 22:09:09 +03:00
Avi Kivity	e1bad8e883	dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group It started life as something shared between memory_hard_limit and region_group, but now that they are back being the same thing, we can make it a member again.	2022-09-30 22:04:26 +03:00
Avi Kivity	6b21c10e9e	dirty_memory_manager: remove accessors around region_group::_under_hard_pressure It is now only accessed from within the class, so the accessors don't help anything.	2022-09-30 21:59:46 +03:00
Avi Kivity	6a02bb7c2b	dirty_memory_manager: merge memory_hard_limit into region_group The two classes always have a 1:1 or 0:1 relationship, and so we can just move all the members of memory_hard_limit into region_group, with the functions that track the relationship (memory_hard_limit::{add,del}()) removed. The 0:1 relationship is maintained by initializing the hard limit parameter with std::numeric_limits<size_t>::max(). The _hard_total_memory variable is always checked if it is greater than this parameter in order to do anything, and with this default it can never be.	2022-09-30 21:59:38 +03:00
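The 0:1 relationship described in the commit above, where the hard limit defaults to the largest `size_t`, can be sketched as follows. The struct and member names are hypothetical simplifications for illustration, not the actual region_group interface.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical, simplified stand-in for the merged class.
struct region_group_sketch {
    // Default: effectively no hard limit, since no size_t total can be
    // greater than numeric_limits<size_t>::max().
    std::size_t hard_limit = std::numeric_limits<std::size_t>::max();
    std::size_t hard_total_memory = 0;

    // The "greater than" check never fires with the default limit.
    constexpr bool under_hard_pressure() const {
        return hard_total_memory > hard_limit;
    }
};
```

Under this sketch, a default-constructed group never reports hard pressure, which is how a single class can cover both the "has a hard limit" and "has no hard limit" cases without a separate object.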
Avi Kivity	45ab24e43d	dirty_memory_manager: rename members in memory_hard_limit In preparation for merging memory_hard_limit into region_group, disambiguate similarly named members by adding the word "hard" in random places. memory_hard_limit and region_group are candidates for merging because they constantly reference each other, and memory_hard_limit does very little by itself.	2022-09-30 21:47:33 +03:00
Avi Kivity	aca96c4103	readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation	2022-09-30 19:19:51 +03:00
Avi Kivity	b08196f3b3	readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine do_fill_buffer() is an eclectic mix of coroutines and continuations. That makes it hard to follow what is running sequentially and concurrently. Convert it into a pure coroutine by changing internal continuations to lambda coroutines. These lambda coroutines are guarded with seastar::coroutine::lambda. Furthermore, a future that is co_awaited is converted to immediate co_await (without an intermediate future), since seastar::coroutine::lambda only works if the coroutine is awaited in the same statement it is defined on.	2022-09-30 19:19:48 +03:00
Kamil Braun	b2cf610567	test/pylib: scylla_cluster: improve cluster printing Print the cluster name and stopped servers in addition to the running servers. Fix a logging call which tried to print a server in place of a cluster and even at that it failed (the server didn't have a hostname yet so it printed as an empty string). Add another logging call.	2022-09-30 17:00:05 +02:00
Kamil Braun	05ed3769dd	test/pylib: don't pass test_case_name to after-test endpoint It's redundant now, the manager tracks the current test case using before-test endpoint calls.	2022-09-30 16:41:45 +02:00
Kamil Braun	dc6f37b7f7	test/pylib: scylla_cluster: track current test case name and print it Use `_before_test` calls to track the current test case name. Concatenate it with the unique test name like this: `test_topology.1::test_add_server_add_column`, and print it instead of the test case name.	2022-09-30 16:38:35 +02:00
Kamil Braun	5be818d73b	test.py: pass the unique test name (e.g. `test_topology.1`) to cluster manager This helps us distinguish the different repeats of a test in logs. Rename the variable accordingly in `ScyllaClusterManager`.	2022-09-30 16:24:10 +02:00
Kamil Braun	fde4642472	test/pylib: scylla_cluster: pass the test case name to `before_test` We pass the test case name to `after_test` - so make it consistent. Arguably, the test case name is more useful (as it's more precise) than the test name.	2022-09-30 16:17:59 +02:00
Kamil Braun	43d8b4a214	test/pylib: use "test_case_name" variable name when talking about test cases Distinguish "test name" (e.g. `test_topology`) from "test case name" (e.g. `test_add_server_add_column` - a test case inside `test_topology`).	2022-09-30 16:15:48 +02:00
Botond Dénes	060dda8e00	Merge 'Reduce dependencies on large data handler header' from Benny Halevy Reduce the false dependencies on db/large_data_handler.hh by not including it from commonly used header files, and rather including it only in the source files that actually need it. This is in preparation for https://github.com/scylladb/scylladb/issues/11449 Closes #11654 * github.com:scylladb/scylladb: test: lib: do not include db/large_data_handler.hh in test_service.hh test: lib: move sstable test_env::impl ctor out of line sstables: do not include db/large_data_handler.hh in sstables.hh api/column_family: add include db/system_keyspace.hh	2022-09-30 13:27:38 +03:00
Tomasz Grabiec	5268f0f837	test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone The generator first set the marker and then applied tombstones. The marker was set like this: row.marker() = random_row_marker(); Later, when shadowable tombstones were applied, they were compacted with the marker as expected. However, the key for the row was chosen randomly in each iteration and there are multiple keys set, so there was a possibility of a key clash with an earlier row. This could override the marker without applying any tombstones, which is conditional on random choice. This could generate rows with markers uncompacted with shadowable tombstones. This broke row_cache_test::test_concurrent_reads_and_eviction on comparison between expected and read mutations. The latter was compacted because it went through an extra merge path, which compacts the row. Fix by making sure there are no key clashes. Closes #11663	2022-09-30 11:27:01 +03:00
Kamil Braun	1793d43b15	test/pylib: scylla_cluster: mark `server_remove` as not implemented The `server_remove` function did a very weird thing: it shut down a server and made the framework 'forget' about it. From the point of view of the Scylla cluster and the driver the server was still there. Replace the function's body with `raise NotImplementedError`. In the future it can be replaced with an implementation that calls `removenode` on the Scylla cluster. Remove `test_remove_server_add_column` from `test_topology`. It effectively does the same thing as `test_stop_server_add_column`, except that the framework also 'forgets' about the stopped server. This could lead to weird situations because the forgotten server's IP could be reused in another test that was running concurrently with this test. Closes #11657	2022-09-29 21:03:18 +03:00
Pavel Emelyanov	6a5b0d6c70	table: Handle storage_io_error's ENOSPC when flushing Commit `a9805106` (table: seal_active_memtable: handle ENOSPC error) made memtable flushing code tolerate ENOSPC and continue flushing again in the hope that the node administrator would provide some free space. However, it looks like the IO code may report back ENOSPC with some exception type this code doesn't expect. This patch tries to fix it. refs: #11245 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-29 19:16:30 +03:00
Pavel Emelyanov	826244084e	table: Rewrap retry loop The existing loop is very branchy in its attempts to find out whether or not to abort. The "allowed_retries" count can be a good indicator of the decision taken. This makes the code notably shorter and easier to extend Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-29 19:14:46 +03:00
Benny Halevy	776b009c0f	test: lib: do not include db/large_data_handler.hh in test_service.hh It was needed for defining and referencing nop_lp_handler and in sstable_3_x_test for testing the large_data_handler. Remove the include from the commonly used header file to reduce the false dependencies on large_data_handler.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 18:36:16 +03:00
Benny Halevy	678d88576b	test: lib: move sstable test_env::impl ctor out of line To prepare for removing the include of db/large_data_handler.hh from test/lib/test_services.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 18:35:12 +03:00
Botond Dénes	ad04f200d3	Merge 'database: automatically take snapshot of base table views' from Benny Halevy The logic to reject explicit snapshot of views/indexes was improved in `aa127a2dbb`. However, we never implemented auto-snapshot of view/indexes when taking a snapshot of the base table. This is implemented in this patch. The implementation is built on top of `ba42852b0e` so it would be hard to backport to 5.1 or earlier releases. Fixes #11612 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11616 * github.com:scylladb/scylladb: database: automatically take snapshot of base table views api: storage_service: reject snapshot of views in api layer	2022-09-29 13:33:31 +03:00
Benny Halevy	ae7fd1c7b2	sstables: do not include db/large_data_handler.hh in sstables.hh Reduce dependencies by only forward-declaring class db::large_data_handler in sstables.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 12:42:58 +03:00
Benny Halevy	fb7e55b0a8	api/column_family: add include db/system_keyspace.hh For db::system_keyspace::load_view_build_progress that currently indirectly satisfied via sstables/sstables.hh -> db/large_data_handler.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 12:42:54 +03:00
Nadav Har'El	4a7794fb64	alternator: better error message when adding a GSI to an existing table Due to issue #11567, Alternator does not yet support adding a GSI to an existing table via UpdateTable with the GlobalSecondaryIndexUpdates parameter. However, currently, we print a misleading error message in this case, complaining about the AttributeDefinitions parameter. This parameter is also required with GlobalSecondaryIndexUpdates, but it's not the main problem, and the user is likely to be confused why the error message points to that specific parameter and what it means that this parameter is claimed to be "not supported" (while it is supported, in CreateTable). With this patch, we report that GlobalSecondaryIndexUpdates is not supported. This patch does not fix the unsupported feature - it just improves the error message saying that it's not supported. Refs #11567 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11650	2022-09-29 09:00:31 +03:00
Asias He	c194c811df	repair: Yield in repair_service::do_decommission_removenode_with_repair When walking through the ranges, we should yield to prevent stalls. We do similar yield in other node operations. Fix a stall in 5.1.dev.20220724.f46b207472a3 with build-id d947aaccafa94647f71c1c79326eb88840c5b6d2 ``` !INFO \| scylla[6551]: Reactor stalled for 10 ms on shard 0. Backtrace: 0x4bbb9d2 0x4bba630 0x4bbb8e0 0x7fd365262a1f 0x2face49 0x2f5caff 0x36ca29f 0x36c89c3 0x4e3a0e1 ``` Fixes #11146 Closes #11160	2022-09-28 18:21:35 +03:00
Avi Kivity	cf3830a249	Merge 'Add support for TRUNCATE USING TIMEOUT' from Benny Halevy Extend the cql3 truncate statement to accept attributes, similar to modification statements. To achieve that we define cql3::statements::raw::truncate_statement derived from raw::cf_statement, and implement its pure virtual prepare() method to make a prepared truncate_statement. The latter is no longer derived from raw::cf_statement, and just stores a schema_ptr to get to the keyspace and column_family. `test_truncate_using_timeout` cql-pytest was added to test the new USING TIMEOUT feature. Fixes #11408 Also, update docs/cql/ddl.rst truncate-statement section respectively. Closes #11409 * github.com:scylladb/scylladb: docs: cql-extensions: add TRUNCATE to USING TIMEOUT section. docs: cql: ddl: add support for TRUNCATE USING TIMEOUT cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT cql3: selectStatement: restrict to USING TIMEOUT in grammar cql3: deleteStatement: restrict to USING TIMEOUT\|TIMESTAMP in grammar	2022-09-28 18:19:03 +03:00
Avi Kivity	19374779bb	Merge 'Fix large data warning and docs' from Benny Halevy The series contains fixes for system.large_* log warning and respective documentation. This prepares the way for adding a new system.large_collections table (See #11449): Fixes #11620 Fixes #11621 Fixes #11622 the respective fixes should be backported to different release branches, based on the respective patches they depend on (mentioned in each issue). Closes #11623 * github.com:scylladb/scylladb: docs: adjust to sstable base name docs: large-partition-table: adjust for additional rows column docs: debugging-large-partition: update log warning example db/large_data_handler: print static cell/collection description in log warning db/large_data_handler: separate pk and ck strings in log warning with delimiter	2022-09-28 17:52:23 +03:00
Nadav Har'El	de1bc147bc	Merge 'test.py: cleanups in topology test suites' from Kamil Braun Fix the type of `create_server`, rename `topology_for_class` to `get_cluster_factory`, simplify the suite definitions and parameters passed to `get_cluster_factory` Closes #11590 * github.com:scylladb/scylladb: test.py: replace `topology` with `cluster_size` in Topology tests test.py: rename `topology_for_class` to `get_cluster_factory` test/pylib: ScyllaCluster: fix create_server parameter type	2022-09-28 15:19:54 +03:00
Kamil Braun	1bcc28b48b	test/topology_raft_disabled: reenable `test_raft_upgrade` The test was disabled due to a bug in the Python driver which caused the driver not to reconnect after a node was restarted (see scylladb/python-driver#170). Introduce a workaround for that bug: we simply create a new driver session after restarting the nodes. Reenable the test. Closes #11641	2022-09-28 15:13:42 +03:00
Mikołaj Grzebieluch	be8fcba8c1	raft: broadcast_tables: add support for bind variables Extended the queries language to support bind variables which are bound in the execution stage, before creating a raft command. Adjusted `test_broadcast_tables.py` to prepare statements at the beginning of the test. Fixed a small bug in `strongly_consistent_modification_statement::check_access`. Closes #11525	2022-09-28 09:54:59 +03:00
Alejo Sanchez	02933c9b82	test.py: close aiohttp session for topology tests Close the aiohttp ClientSession after pytest session finishes. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11648	2022-09-27 18:09:08 +02:00
Kamil Braun	82481ae31b	Merge 'raft server, log size limit in bytes' from Gusev Petr Before this patch we could get an OOM if we received several big commands. The number of commands was small, but their total size in bytes was large. snapshot_trailing_size is needed to guarantee progress. Without this limit the fsm could get stuck if the size of the next item is greater than max_log_size - (size of trailing entries). Closes #11397 * github.com:scylladb/scylladb: raft replication_test, make backpressure test to do actual backpressure raft server, shrink_to_fit on log truncation raft server, release memory if add_entry throws raft server, log size limit in bytes	2022-09-27 14:25:08 +02:00
Kamil Braun	ed67f0e267	Merge 'test.py: fix topology init error handling' from Alecco When there are errors starting the first cluster(s) the server logs are needed. So move `.start()` to the `try` block in `test.py` (out of `asynccontextmanager`). While there, make `ScyllaClusterManager.start()` idempotent. Closes #11594 * github.com:scylladb/scylladb: test.py: fix ScyllaClusterManager start/stop test.py: fix topology init error handling	2022-09-27 11:36:07 +02:00
Petr Gusev	bc50b7407f	raft replication_test, make backpressure test to do actual backpressure Before this patch this test didn't actually experience any backpressure since all the commands were executed sequentially.	2022-09-27 12:04:14 +04:00
Petr Gusev	cbfe033786	raft server, shrink_to_fit on log truncation We don't want to keep memory we don't use, shrink_to_fit guarantees that. In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.	2022-09-27 12:02:36 +04:00
Petr Gusev	b34dfed307	raft server, release memory if add_entry throws We consume memory from semaphore in add_entry_on_leader, but never release it if add_entry throws.	2022-09-27 12:02:34 +04:00
Benny Halevy	b178813cba	docs: cql-extensions: add TRUNCATE to USING TIMEOUT section. List the queries that support the TIMEOUT parameter. Mention the newly added support for TRUNCATE USING TIMEOUT. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 18:30:39 +03:00
Benny Halevy	b0bad0b153	docs: cql: ddl: add support for TRUNCATE USING TIMEOUT Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 18:30:39 +03:00
Benny Halevy	64140ccf05	cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT Extend the cql3 truncate statement to accept attributes, similar to modification statements. To achieve that we define cql3::statements::raw::truncate_statement derived from raw::cf_statement, and implement its pure virtual prepare() method to make a prepared truncate_statement. The latter, statements::truncate_statement, is no longer derived from raw::cf_statement, and just stores a schema_ptr to get to the keyspace and column_family names. `test_truncate_using_timeout` cql-pytest was added to test the new USING TIMEOUT feature. Fixes #11408 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 18:30:39 +03:00
Benny Halevy	27d3e48005	cql3: selectStatement: restrict to USING TIMEOUT in grammar It is preferred to reject USING TTL / TIMESTAMP at the grammar level rather than functionally validating the USING attributes. test_using_timeout was adjusted respectively to expect the `SyntaxException` error rather than `InvalidRequest`. Note that cql3::statements::raw::select_statement validate_attrs now asserts that the ttl or the timestamp attributes aren't set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 18:30:39 +03:00
Benny Halevy	0728d33d5f	cql3: deleteStatement: restrict to USING TIMEOUT\|TIMESTAMP in grammar It is preferred to reject USING TTL / TIMESTAMP at the grammar level rather than functionally validating the USING attributes. test_using_timeout was adjusted respectively to expect the `SyntaxException` error rather than `InvalidRequest`. Note that now delete_statement ctor asserts that the ttl attribute is not set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 18:30:39 +03:00
Kamil Braun	696bdb2de7	test.py: replace `topology` with `cluster_size` in Topology tests First, a reminder of a few basic concepts in Scylla: - "topology" is a mapping: for each node, its DC and Rack. - "replication strategy" is a method of calculating replica sets in a cluster. It is not a cluster-global property; each keyspace can have a different replication strategy. A cluster may have multiple keyspaces. - "cluster size" is the number of nodes in a cluster. Replication strategy is orthogonal to topology. Cluster size can be derived from topology and is also orthogonal to replication strategy. test.py was confusing the three concepts together. For some reason, Topology suites were specifying a "topology" parameter which contained replication strategy details - having nothing to do with topology. Also it's unclear why a test suite would specify anything to do with replication strategies - after all, a test may create keyspaces with different replication strategies, and a suite may contain multiple different tests. Get rid of the "topology" parameter, replace it with a simple "cluster_size". In the future we may re-introduce it when we actually implement the possibility to start clusters with custom topologies (which involves configuring the snitch etc.) Simplify the test.py code.	2022-09-26 15:17:50 +02:00
Botond Dénes	895522db23	mutation_fragment_stream_validator: make interface more robust The validator has several API families with increasing amount of detail. E.g. there is an `operator()(mutation_fragment_v2::kind)` and an overload also taking a position. These different API families currently cannot be mixed. If one uses one overload-set, one has to stick with it, not doing so will generate false-positive failures. This is hard to explain in documentation to users (provided they even read it). Instead, just make the validator robust enough such that the different API subsets can be mixed in any order. The validator will try to make most of the situation and validate as much as possible. Behind the scenes all the different validation methods are consolidated into just two: one for the partition level, the other for the intra-partition level. All the different overloads just call these methods passing as much information as they have. A test is also added to make sure this works.	2022-09-26 13:26:26 +03:00
Kamil Braun	0725ab3a3e	test.py: rename `topology_for_class` to `get_cluster_factory` The previous name had nothing to do with what the function calculated and returned (it returned a `create_cluster` function; the standard name for a function that constructs objects would be 'factory', so `get_cluster_factory` is an appropriate name for a function that returns cluster factories).	2022-09-26 11:45:44 +02:00
Kamil Braun	06cc4f9259	test/pylib: ScyllaCluster: fix create_server parameter type The only usage of `ScyllaCluster` constructor passed a `create_server` function which expected a `List[str]` for the second parameter, while the constructor specified that the function should expect an `Optional[List[str]]`. There was no reason for the latter, we can easily fix this type error. Also give a type hint for `create_cluster` function in `PythonTestSuite.topology_for_class`. This is actually what caught the type error.	2022-09-26 11:45:44 +02:00
Petr Gusev	27e60ecbf4	raft server, log size limit in bytes Before this patch we could get an OOM if we received several big commands. The number of commands was small, but their total size in bytes was large. snapshot_trailing_size is needed to guarantee progress. Without this limit the fsm could get stuck if the size of the next item is greater than max_log_size - (size of trailing entries).	2022-09-26 13:10:10 +04:00
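The progress condition described in the commit above can be illustrated with a small arithmetic sketch. This is not the Raft server code: `fits`, the byte counts, and the constant values are hypothetical, and only the names `max_log_size` and `snapshot_trailing_size` come from the commit message.

```python
def fits(next_entry_size, trailing_bytes, max_log_size):
    """Return True if the next entry still fits in the byte-limited log.

    Hypothetical helper: the entry fits when its size does not exceed
    max_log_size minus the bytes already held by trailing entries.
    """
    return next_entry_size <= max_log_size - trailing_bytes


MAX_LOG_SIZE = 100            # hypothetical byte budget for the log
SNAPSHOT_TRAILING_SIZE = 50   # hypothetical cap on trailing entries, in bytes

# Without a cap, 80 bytes of trailing entries leave room for only 20 bytes,
# so a 30-byte entry can never be appended and the state machine is stuck.
stuck_without_cap = not fits(30, trailing_bytes=80, max_log_size=MAX_LOG_SIZE)

# With trailing entries capped at SNAPSHOT_TRAILING_SIZE, the same entry fits.
capped = min(80, SNAPSHOT_TRAILING_SIZE)
fits_with_cap = fits(30, trailing_bytes=capped, max_log_size=MAX_LOG_SIZE)
```

Under these assumed numbers, any entry of up to `MAX_LOG_SIZE - SNAPSHOT_TRAILING_SIZE` bytes can always be appended once a snapshot has been taken.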
Benny Halevy	d32c497cd9	database: automatically take snapshot of base table views The logic to reject explicit snapshot of views/indexes was improved in `aa127a2dbb`. However, we never implemented auto-snapshot of view/indexes when taking a snapshot of the base table. This is implemented in this patch. The implementation is built on top of `ba42852b0e` so it would be hard to backport to 5.1 or earlier releases. Fixes #11612 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 11:02:54 +03:00
Benny Halevy	55b0b8fe2c	api: storage_service: reject snapshot of views in api layer Rather than pushing the check to `snapshot_ctl::take_column_family_snapshot`, just check that explicitly when taking a snapshot of a particular table by name over the api. Other paths that call snapshot_ctl::take_column_family_snapshot are internal and use it to snap views already. With that, we can get rid of the allow_view_snapshots flag that was introduced in `aab4cd850c`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-26 10:44:56 +03:00
Botond Dénes	4d017b6d7e	mutation_fragment_stream_validator: add reset() to validating filter Allow the high level filtering validator to be reset() to a certain position, so it can be used in situations where the consumption is not continuous (fast-forwarding or paging).	2022-09-26 10:17:28 +03:00
Botond Dénes	a8cbf66573	mutation_fragment_stream_validator: move active tombstone validation into low level validator Currently the active range tombstone change is validated in the high level `mutation_fragment_stream_validating_stream`, meaning that users of the low-level `mutation_fragment_stream_validator` don't benefit from checking that tombstones are properly closed. This patch moves the validation down to the low-level validator (which is what the high-level one uses under the hood too), and requires all users to pass information about changes to the active tombstone for each fragment.	2022-09-26 10:17:27 +03:00
Nadav Har'El	868a884b79	test/cql-pytest: add reproducer for ignored IS NOT NULL This test reproduces issue #10365: It shows that although "IS NOT NULL" is not allowed in regular SELECT filters, in a materialized view it is allowed, even for non-key columns - but then outright ignored and does not actually filter out anything - a fact which already surprised several users. The test also fails on Cassandra - it also wrongly allows IS NOT NULL on the non-key columns but then ignores this in the filter. So the test is marked with both xfail (known to fail on Scylla) and cassandra_bug (fails on Cassandra because of what we consider to be a Cassandra bug). Refs #10365 Refs #11606 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11615	2022-09-26 09:02:08 +03:00
Anna Stuchlik	c5285bcb14	doc: remove the section about updating OS packages during upgrade from upgrade guides for Ubuntu and Debian (from 4.5 to 4.6) Closes #11629	2022-09-26 08:04:02 +03:00
Avi Kivity	ad2f1dc704	Merge 'Avoid default initialization of token_metadata and topology when not needed' from Pavel Emelyanov The goal is not to default initialize an object when its fields are about to be immediately overwritten by the consecutive code. Closes #11619 * github.com:scylladb/scylladb: replication_strategy: Construct temp tokens in place topology: Define copy-constructor with init-lists	2022-09-25 18:08:42 +03:00
Jan Ciolek	ac152af88c	expression: Add for_each_boolean factor boolean_factors is a function that takes an expression and extracts all children of the top level conjunction. The problem is that it returns a vector<expression>, which is inefficient. Sometimes we would like to iterate over all boolean factors without allocations. for_each_boolean_factor is implemented for this purpose. boolean_factors() can be implemented using for_each_boolean_factor, so it's done to reduce code duplication. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-09-25 16:34:22 +03:00
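The relationship between the two functions in the commit above can be sketched as follows. This is a Python illustration of the visitor pattern, not the C++ code from the patch: the `Conjunction` class, the use of plain strings as leaf expressions, and the recursion into nested conjunctions are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Conjunction:
    """Hypothetical stand-in for an AND node; leaves are plain strings here."""
    children: list = field(default_factory=list)


def for_each_boolean_factor(expr, visit):
    """Call visit(factor) for each factor of expr without building a list."""
    if isinstance(expr, Conjunction):
        for child in expr.children:
            # Assumption: nested conjunctions are flattened recursively.
            for_each_boolean_factor(child, visit)
    else:
        visit(expr)


def boolean_factors(expr):
    """List-returning variant, implemented on top of the visitor."""
    factors = []
    for_each_boolean_factor(expr, factors.append)
    return factors


expr = Conjunction(["a = 1", Conjunction(["b = 2", "c = 3"])])
all_factors = boolean_factors(expr)
```

Callers that only need to inspect each factor can pass a callback to `for_each_boolean_factor` and skip the intermediate list, which is the motivation the commit message gives.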
Avi Kivity	2a538f5543	Merge 'Cut one of snitch->gossiper links' from Pavel Emelyanov Snitch uses gossiper for several reasons, one of which is to re-gossip the topology-related app states when property-file snitch config changes. This set cuts this link by moving re-gossiping into the existing storage_service::snitch_reconfigured() subscription. Since initial snitch state gossiping happens in storage service as well, this change is not unreasonable. Closes #11630 * github.com:scylladb/scylladb: storage_service: Re-gossiping snitch data in reconfiguration callback storage_service: Coroutinize snitch_reconfigured() storage_service: Indentation fix after previous patch storage_service: Reshard to shard-0 earlier storage_service: Refactor snitch reconfigured kick	2022-09-25 16:08:48 +03:00
Benny Halevy	a1adbf1f59	docs: adjust to sstable base name Since `244df07771` (scylla 5.1), only the sstable basename is kept in the large_* system tables. The base path can be determined from the keyspace and table name. Fixes #11621 Adjust the examples in documentation respectively. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-25 14:38:13 +03:00
Benny Halevy	33924201cc	docs: large-partition-table: adjust for additional rows column Since `a7511cf600` (scylla 5.0), sstables containing partitions with too many rows are recorded in system.large_partitions. Adjust the doc respectively. Fixes #11622 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-25 14:38:13 +03:00
Benny Halevy	92ff17c6e3	docs: debugging-large-partition: update log warning example The log warning format has changed since `f3089bf3d1` and was fixed in the previous patch to include a delimiter between the partition key, clustering key, and column name. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-25 14:38:13 +03:00
Benny Halevy	fcbbc3eb9c	db/large_data_handler: print static cell/collection description in log warning When warning about a large cell/collection in a static row, print that fact in the log warning to make it clearer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-25 14:37:42 +03:00
Benny Halevy	4670829502	db/large_data_handler: separate pk and ck strings in log warning with delimiter Currently (since `f3089bf3d1`), when printing a warning to the log about large rows and/or cells the clustering key string is concatenated to the partition key string, rendering the warning confusing and much less useful. This patch adds a '/' delimiter to separate the fields, and also uses one to separate the clustering key from the column name for large cells. In case of a static cell, the clustering key is null hence the warning will look like: `pk//column`. This patch does NOT change anything in the large_* system table schema or contents. It changes only the log warning format that need not be backward compatible. Fixes #11620 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-25 14:36:41 +03:00
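The delimiter rule described in the commit above can be sketched as a small formatting helper. `large_data_key` is a hypothetical name, and the real log warning contains more text around this key string; only the `/` delimiter and the `pk//column` shape for static cells come from the commit message.

```python
def large_data_key(pk, ck=None, column=None):
    """Join partition key, clustering key and optional column name with '/'.

    Hypothetical helper. A missing clustering key (static cell) is rendered
    as an empty field, which produces the 'pk//column' form.
    """
    parts = [pk, ck if ck is not None else ""]
    if column is not None:
        parts.append(column)
    return "/".join(parts)


row_key = large_data_key("pk", "ck")                   # large row
cell_key = large_data_key("pk", "ck", "column")        # large cell
static_key = large_data_key("pk", None, "column")      # large static cell
```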
Pavel Emelyanov	47958a4b37	storage_service: Re-gossiping snitch data in reconfiguration callback Nowadays it's done inside snitch, and snitch needs to carry gossiper reference for that. There's an ongoing effort in de-globalizing snitch and fixing its dependencies. This patch cuts this snitch->gossiper link to facilitate the mentioned effort. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-23 14:31:55 +03:00
Pavel Emelyanov	932566d448	storage_service: Coroutinize snitch_reconfigured() Next patch will add more sleeping code to it and it's simpler if the new call is co_await-ed rather than .then()-ed Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-23 14:31:55 +03:00
Pavel Emelyanov	7fee98cad0	storage_service: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-23 14:31:55 +03:00
Pavel Emelyanov	3d4ea2c628	storage_service: Reshard to shard-0 earlier It makes next patch shorter and nicer Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-23 14:31:55 +03:00
Pavel Emelyanov	11b79f9f80	storage_service: Refactor snitch reconfigured kick The snitch_reconfigured calls update_topology with local node bcast address argument. Things get simpler if the callee gets the address itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-23 14:31:43 +03:00
Anna Stuchlik	46f0e99884	doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB	2022-09-23 11:46:15 +02:00
Anna Stuchlik	af2a85b191	doc: add the new page to the toctree	2022-09-23 11:37:38 +02:00
Anna Stuchlik	b034e2856e	doc: add a troubleshooting article about the missing configuration files	2022-09-23 11:17:18 +02:00
Botond Dénes	b9d55ee02f	Merge 'Add cassandra functional - show warn/err when tombstone threshold reached.' from Taras Borodin Add cassandra functional - show warn/err when tombstone_warn_threshold/tombstone_failure_threshold reached on select, by partitions. Propagate raw query_string from coordinator to replicas. Closes #11356 * github.com:scylladb/scylladb: add utf8:validate to operator<< partition_key with_schema. Show warn message if `tombstone_warn_threshold` reached on querier.	2022-09-23 05:53:47 +03:00
Pavel Emelyanov	9e7407ff91	replication_strategy: Construct temp tokens in place Otherwise, the token_metadata object is default-initialized, then it's move-assigned from another object. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-22 19:19:32 +03:00
Pavel Emelyanov	d540af2cb0	topology: Define copy-constructor with init-lists Otherwise the topology is default-constructed, then its fields are copy-assigned with the data from the copy-from reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-22 19:18:58 +03:00
Tomasz Grabiec	ccbfe2ef0d	Merge 'Fix invalid mutation fragment stream issues' from Botond Dénes Found by a fragment stream validator added to the mutation-compactor (https://github.com/scylladb/scylladb/pull/11532). As that PR moves very slowly, the fixes for the issues found are split out into a PR of their own. The first two of these issues seems benign, but it is important to remember that how benign an invalid fragment stream is depends entirely on the consumer of said stream. The present consumer of said streams may swallow the invalid stream without problem now but any future change may cause it to enter into a corrupt state. The last one is a non-benign problem (again because the consumer reacts badly already) causing problems when building query results for range scans. Closes #11604 * github.com:scylladb/scylladb: shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied readers/mutation_readers: compacting_reader: remember injected partition-end db/view: view_builder::execute(): only inject partition-start if needed	2022-09-22 17:57:27 +02:00
Taras Borodin	c155ae1182	add utf8:validate to operator<< partition_key with_schema.	2022-09-22 16:42:31 +03:00
TarasBor	1f4a93da78	Show warn message if `tombstone_warn_threshold` reached on querier. When the querier reads a page containing more tombstones than the `tombstone_warn_threshold` limit, a warning message appears in the logs. If `tombstone_warn_threshold` is 0, the feature is disabled. Refs scylladb#11410	2022-09-22 16:42:31 +03:00
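The threshold rule in the commit above can be sketched as a single predicate. `should_warn` is a hypothetical name, and the strict comparison is an assumption read from the phrase "more than" in the commit message; only the option name `tombstone_warn_threshold` and the "0 disables the feature" rule come from the source.

```python
def should_warn(tombstones_in_page, tombstone_warn_threshold):
    """Decide whether a page read should log a tombstone warning.

    Hypothetical helper: a threshold of 0 disables the warning; otherwise
    warn only when the page holds strictly more tombstones than the limit.
    """
    if tombstone_warn_threshold == 0:
        return False
    return tombstones_in_page > tombstone_warn_threshold


warn_over_limit = should_warn(1500, 1000)
warn_at_limit = should_warn(1000, 1000)
warn_when_disabled = should_warn(1500, 0)
```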
Avi Kivity	0cbaef31c1	dirty_memory_manager: fold do_update() into region_group::update() There is just one caller, and folding the two functions enables simplification.	2022-09-22 15:51:19 +03:00
Avi Kivity	8672f2248c	dirty_memory_manager: simplify memory_hard_limit's do_update do_update() has an output parameter (top_relief) which can either be set to an input parameter or left alone. Simplify it by returning bool and letting the caller reuse the parameter's value instead.	2022-09-22 15:50:48 +03:00
Avi Kivity	1858268377	dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit They are write-only. This corresponds to the fact that memory_hard_limit does not do flushing (which is initiated by crossing the soft limit), it only blocks new allocations.	2022-09-22 14:59:38 +03:00
Asias He	9ed401c4b2	streaming: Add finished percentage metrics for node ops using streaming We have added the finished percentage for repair based node operations. This patch adds the finished percentage for node ops using the old streaming. Example output: scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000 scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000 scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945 scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000 scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000 scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000 In addition to the metrics, log shows the percentage is added. [shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646 Fixes #11600 Closes #11601	2022-09-22 14:19:34 +03:00
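The log line in the commit above can be checked with a short calculation. `finished_percentage` is a hypothetical helper, and returning 1.0 when there are no ranges is an assumption. Despite the metric's name, the values in the example output are fractions between 0 and 1.

```python
def finished_percentage(finished_ranges, total_ranges):
    """Return the fraction of ranges already streamed, between 0.0 and 1.0.

    Hypothetical helper; treating an empty operation as complete (1.0)
    is an assumption, not something the commit message states.
    """
    if total_ranges == 0:
        return 1.0
    return finished_ranges / total_ranges


# Numbers taken from the log line quoted in the commit message.
rebuild_progress = finished_percentage(2698, 2817)
```

The result agrees with the logged value 0.95775646 to within single-precision rounding.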
Avi Kivity	8369741063	dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit) We made this function a template to prevent code duplication, but now memory_hard_limit was sufficiently simplified so that the implementations can start to diverge.	2022-09-22 14:16:43 +03:00
Avi Kivity	76ced5a60c	dirty_memory_manager: adjust soft_limit threshold check Use `>` rather than `>=` to match the hard limit check. This will aid simplification, since for memory_hard_limit the soft and hard limits are identical. This should not cause any material behavior change, we're not sensitive to single byte accounting. Typical limits are on the order of gigabytes.	2022-09-22 14:06:01 +03:00
Avi Kivity	b9eb26cd77	dirty_memory_manager: drop memory_hard_limit::_name It's write-only.	2022-09-22 14:01:57 +03:00
Avi Kivity	c64fb66cc3	dirty_memory_manager: simplify memory_hard_limit configuration We observe that memory_hard_limit's reclaim_config is only ever initialized as default, or with just the hard_limit parameter. Since soft_limit defaults to hard_limit, we can collapse the two into a limit. The reclaim callbacks are always left as the default no-op functions, so we can eliminate them too. This fits with memory_hard_limit only being responsible for the hard limit, and for it not having any memtables to reclaim on its own.	2022-09-22 13:56:59 +03:00
Avi Kivity	2f907dc47d	dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group} region_group_reclaimer is used to initialize (by reference) instances of memory_hard_limit and region_group. Now that it is a final class, we can fold it into its users by pasting its contents into those users, and using the initializer (reclaim_config) to initialize the users. Note there is a 1:1 relationship between a region_group_reclaimer instance and a {memory_hard_limit,region_group} instance. It may seem like code duplication to paste the contents of one class into two, but the two classes use region_group_reclaimer differently, and most of the code is just used to glue different classes together, so the next patches will be able to get rid of much of it. Some notes: - no_reclaimer was replaced by a default reclaim_config, as that's how no_reclaimer was initialized - all members were added as private, except when a caller required one to be public - an under_presssure() member already existed, forwarding to the reclaimer; this was just removed.	2022-09-22 13:56:59 +03:00
Avi Kivity	d8f857e74b	dirty_memory_manager: stop inheriting from region_group_reclaimer This inheritance makes it harder to get rid of the class. Since there are no longer any virtual functions in the class (apart from the destructor), we can just convert it to a data member. In a few places, we need forwarding functions to make formerly-inherited functions visible to outside callers. The virtual destructor is removed and the class is marked final to verify it is no longer a base class anywhere.	2022-09-22 13:56:59 +03:00
Avi Kivity	26f3a123a5	dirty_memory_manager: test: unwrap region_group_reclaimer In one test, region_group_reclaimer is wrapped in another class just to toggle a bool, but with the new callbacks it's easy to just use a bool instead.	2022-09-22 13:56:59 +03:00
Avi Kivity	1d3508e02c	dirty_memory_manager: change region_group_reclaimer configuration to a struct It's just so much nicer. The "threshold" limit was renamed to "hard_limit" to contrast it with "soft_limit" (in fact threshold is a good name for soft_limit, since it's a point where the behavior begins to change, but that's too much of a change).	2022-09-22 13:56:59 +03:00
Avi Kivity	2c54c7d51e	dirty_memory_manager: convert region_group_reclaimer to callbacks region_group_reclaimer is partially policy (deciding when to reclaim) and partially mechanism (implementing reclaim via virtual functions). Move the mechanism to callbacks. This will make it easy to fold the policy part into region_group and memory_hard_limit. This folding is expected to simplify things since most of region_group_reclaimer is cross-class communication.	2022-09-22 13:56:59 +03:00
Avi Kivity	8fa0652e68	dirty_memory_manager: consolidate region_group_reclaimer constructors Delegate to other constructors rather than repeating the code. Doesn't help much here, but simplifies the next patch.	2022-09-22 13:56:59 +03:00
Avi Kivity	5efbfa4cab	dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief It clashes with region_group_reclaimer::notify_relief, which does something different. Since we plan to merge region_group_reclaimer into memory_hard_limit and region_group (this can simplify the code), we need to avoid duplicate function names.	2022-09-22 13:56:59 +03:00
Avi Kivity	a72ac14154	dirty_memory_manager: drop unused parameter to memory_hard_limit constructor	2022-09-22 13:56:59 +03:00
Avi Kivity	fca5689052	dirty_memory_manager: drop memory_hard_limit::shutdown() It is empty.	2022-09-22 13:56:59 +03:00
Avi Kivity	152136630c	dirty_memory_manager: split region_group hierarchy into separate classes Currently, region_group forms a hierarchy. Originally it was a tree, but previous work whittled it down to a parent-child relationship (with a single, possibly optional parent, and a single child). The actual behavior of the parent and child are very different, so it makes sense to split them. The main difference is that the parent does not contain any regions (memtables), but the child does. This patch mechanically splits the class. The parent is named memory_hard_limit (reflecting its role to prevent lsa allocation above the memtable configured hard limit). The child is still named region_group. Details of the transformation: - each function or data member in region_group is either moved to memory_hard_limit, duplicated in memory_hard_limit, or left in region_group. - the _regions and _blocked_requests members, which were always empty in the parent, were not duplicated. Any member that only accessed them was similarly left alone. - the "no_reclaimer" static member which was only used in the parent was moved there. Similarly the constructor which accepted it was moved. - _child was moved to the parent, and _parent was kept in the child (more or less the defining change of the split) Similarly add(region_group) and del(region_group) (which manage _child) were moved. - do_for_each_parent(), which iterated to the top of the tree, was removed and its callers manually unroll the loop. For the parent, this is just a single iteration (since we're iterating towards the root), for the child, this can be two iterations, but the second one is usually simpler since the parent has many members removed. - do_update(), introduced in the previous patch, was made a template that can act on either the parent or the child. It will be further simplified later. - some tests that check now-impossible topologies were removed. - the parent's shutdown() is trivial since it has no _blocked_requests, but it was kept to reduce churn in the callers.	2022-09-22 13:56:59 +03:00
Avi Kivity	009bd63217	dirty_memory_manager: extract code block from region_group::update A mechanical transformation intended to allow reuse later. The function doesn't really deserve to exist on its own, so it will be swallowed back by its callers later.	2022-09-22 13:56:59 +03:00
Avi Kivity	34d5322368	dirty_memory_manager: move more allocation_queue functions out of region_group More mechanical changes, reducing churn for later patches.	2022-09-22 13:56:59 +03:00
Avi Kivity	4bc2638cf9	dirty_memory_manager: move some allocation queue related function definitions outside class scope It's easier to move them to a new owner (allocation_queue) if they are not defined in the class.	2022-09-22 13:56:59 +03:00
Avi Kivity	71493c2539	dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue region_group currently fulfills two roles: in one role, when instantiated as dirty_memory_manager::_virtual_region_group, it is responsible for holding functions that allocate memtable memory (writes) and only allowing them to run when enough dirty memory has been flushed from other memtables. The other role, when instantiated as dirty_memory_manager::_real_region_group, is to provide a hard stop when the total amount of dirty memory exceeds the limit, since the other limit is only estimated. We want to simplify the whole thing, which means not using the same class for two different roles (or rather, we can use it for both roles if we simplify the internals significantly). As a first step towards clarifying what functionality is used in what role, move some classes related to holding allocating functions to a new class allocation_queue. We will gradually move more content there, reducing the amount of role confusion in region_group. Type aliases are added to reduce churn.	2022-09-22 13:56:59 +03:00
Avi Kivity	d21d2cdb3e	dirty_memory_manager: remove support for multiple subgroups We only have one parent/child relationship in the region group hierarchy, so support for more is unneeded complexity. Replace the subgroup vector with a single pointer, and delete a test for the removed functionality.	2022-09-22 13:56:59 +03:00
Botond Dénes	0ccb23d02b	shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied Commit `8ab57aa` added a yield to the buffer-copy loop, which means that the copy can yield before it is done and the multishard reader might see the half-copied buffer and consider the reader done (because `_end_of_stream` is already set), resulting in dropping the remaining part of the buffer and in an invalid stream if the last copied fragment wasn't a partition-end. Fixes: #11561	2022-09-22 13:54:36 +03:00
Botond Dénes	16a0025dc3	readers/mutation_readers: compacting_reader: remember injected partition-end Currently injecting a partition-end doesn't update `_last_uncompacted_kind`, which will allow for a subsequent `next_partition()` call to trigger injecting a partition-end, leading to an invalid mutation fragment stream (partition-end after partition-end). Fix by changing `_last_uncompacted_kind` to `partition_end` when injecting a partition-end, making subsequent injection attempts noop. Fixes: #11608	2022-09-22 13:54:36 +03:00
Botond Dénes	681e6ae77f	db/view: view_builder::execute(): only inject partition-start if needed When resuming a build-step, the view builder injects the partition-start fragment of the last processed partition, to bring the consumer (compactor) into the correct state before it starts to consume the remainder of the partition content. This results in an invalid fragment stream when the partition was actually over or there is nothing left for the build step. Make the inject conditional on when the reader contains more data for the partition. Fixes: #11607	2022-09-22 13:54:36 +03:00
Nadav Har'El	517c1529aa	docs: update docs/alternator/getting-started.md Update several aspects of the alternator/getting-started.md which were not up-to-date: * When the document was written, Alternator was moving quickly so we recommended running a nightly version. This is no longer the case, so we should recommend running the latest stable build. * The download link is no longer helpful for getting Docker instructions (it shows some generic download options). Instead point to our dockerhub page. * Replace mentions of "Scylla" by the new official name, "ScyllaDB". * Miscellaneous copy-edits. Fixes #11218 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11605	2022-09-22 11:08:05 +03:00
Piotr Sarna	481240b8b4	Merge 'Alternator: Run more TTL tests by default (and add a test for metrics)' from Nadav Har'El We had quite a few tests for Alternator TTL in test/alternator, but most of them did not run as part of the usual Jenkins test suite, because they were considered "very slow" (and require a special "--runveryslow" flag to run). In this series we enable six tests which run quickly enough to run by default, without an additional flag. We also make them even quicker - the six tests now take around 2.5 seconds. I also noticed that we don't have a test for the Alternator TTL metrics - and added one. Fixes #11374. Refs https://github.com/scylladb/scylla-monitoring/issues/1783 Closes #11384 * github.com:scylladb/scylladb: test/alternator: insert test names into Scylla logs rest api: add a new /system/log operation alternator ttl: log warning if scan took too long. alternator,ttl: allow sub-second TTL scanning period, for tests test/alternator: skip fewer Alternator TTL tests test/alternator: test Alternator TTL metrics	2022-09-22 09:47:50 +02:00
Botond Dénes	ef7471c460	readers/mutation_reader: stream validator: fix log level detection logic The mutation fragment stream validator filter has a detailed debug log in its constructor. To avoid putting together this message when the log level is above debug, it is enclosed in an if, activated when log level is debug or trace... at least that was intended. Actually the if is activated when the log level is debug or above (info, warn or error) but is only actually logged if the log level is exactly debug. Fix the logic to work as intended. Closes #11603	2022-09-22 09:41:45 +03:00
Pavel Emelyanov	7ae73c665b	gossiper: Remove some dead code Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11599	2022-09-22 06:58:29 +03:00
Pavel Emelyanov	5edeecf39b	token_metadata: Provide dc/rack for bootstrapping nodes The token_metadata::calculate_pending_ranges_for_bootstrap() makes a clone of itself and adds bootstrapping nodes to the clone to calculate ranges. Currently added nodes lack the dc/rack which affects the calculations the bad way. Unfortunately, the dc/rack for those nodes is not available on topology (yet) and needs pretty heavy patching to have. Fortunately, the only caller of this method has gossiper at hand to provide the dc/rack from. fixes: #11531 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11596	2022-09-22 06:55:52 +03:00
Petr Gusev	210d9dd026	raft: fix snapshots leak applier_fiber could create multiple snapshots between io_fiber run. The fsm_output.snp variable was overwritten by applier_fiber and io_fiber didn't drop the previous snapshot. In this patch we introduce the variable fsm_output.snps_to_drop, store in it the current snapshot id before applying a new one, and then sequentially drop them in io_fiber after storing the last snapshot_descriptor. _sm_events.signal() is added to fsm::apply_snapshot, since this method mutates the _output and thus gives a reason to run io_fiber. The new test test_frequent_snapshotting demonstrates the problem by causing frequent snapshots and setting the applier queue size to one. Closes #11530	2022-09-21 12:46:26 +02:00
Kamil Braun	3b096b71c1	test/topology_raft_disabled: disable `test_raft_upgrade` For some reason, the test is currently flaky on Jenkins. Apparently the Python driver does not reconnect to the cluster after the cluster restarts (well it does, but then it disconnects from one of the nodes and never reconnects again). This causes the test to hang on "waiting until driver reconnects to every server" until it times out. Disable it for now so it doesn't block next promotion.	2022-09-21 12:32:40 +02:00
Nadav Har'El	22bb35e2cb	Merge 'doc: update the "Counting all rows in a table is slow" page' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11373 - Updated the information on the "Counting all rows in a table is slow" page. - Added COUNT to the list of selectors of the SELECT statement (somehow it was missing). - Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page. Closes #11417 * github.com:scylladb/scylladb: doc: add a comment to remove the note in version 5.1 doc: update the information on the Countng all rows page and add the recommendation to upgrade ScyllaDB doc: add a note to the description of COUNT with a reference to the KB article doc: add COUNT to the list of acceptable selectors of the SELECT statement	2022-09-21 12:32:40 +02:00
Alejo Sanchez	510215d79a	test.py: fix ScyllaClusterManager start/stop Check existing is_running member to avoid re-starting. While there, set it to false after stopping. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-21 11:42:02 +02:00
Alejo Sanchez	933d93d052	test.py: fix topology init error handling Start ScyllaClusterManager within error handling so the ScyllaCluster logs are available in case of error starting up. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-21 09:15:25 +02:00
Nadav Har'El	91bccee9be	Update tools/java submodule * tools/java b004da9d1b...5f2b91d774 (1): > install.sh is using wrong permissions for install cqlsh files Fixes #11584	2022-09-20 14:42:34 +03:00
Avi Kivity	2cec417426	Merge 'tools: use the standard allocator' from Botond Dénes Tools want to be as little disrupting to the environment they run in as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand. To achieve this, tools want to use the standard allocator. To achieve this they need a seastar option to instruct seastar to not configure and use the seastar allocator and they need LSA to cooperate with the standard allocator. The former is provided by https://github.com/scylladb/seastar/pull/1211. The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator. Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend. Refs: https://github.com/scylladb/scylladb/issues/9882 This is the last major code piece in scylla for making tools production ready. Closes #11510 * github.com:scylladb/scylladb: test/boost: add alternative variant of logalloc test tools: use standard allocator utils/logalloc: add use_standard_allocator_segment_pool_backend() utils/logalloc: introduce segment store backend for standard allocator utils/logalloc: rebase release segment-store on segment-store-backend utils/logalloc: introduce segment_store_backend utils/logalloc: push segment alloc/dealloc to segment_store test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe	2022-09-20 12:59:34 +03:00
Nadav Har'El	4a453c411d	Merge 'doc: add the upgrade guide from 5.0 to 5.1' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11376 This PR adds the upgrade guide from version 5.0 to 5.1. It involves adding new files (5.0-to-5.1) and language/formatting improvements to the existing content (shared by several upgrade guides). Closes #11577 * github.com:scylladb/scylladb: doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1 doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1	2022-09-20 11:52:59 +03:00
Nadav Har'El	d81bedd3be	Merge 'doc: add ScyllaDB image upgrade guides for patch releases' from Anna Stuchlik This PR adds the missing upgrade guides for upgrading the ScyllaDB image to a patch release: - ScyllaDB 5.0: /upgrade/upgrade-opensource/upgrade-guide-from-5.x.y-to-5.x.z/upgrade-guide-from-5.x.y-to-5.x.z-image/ - ScyllaDB Enterprise: /upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image/ (the file name is wrong and will be fixed with another PR) In addition, the section regarding the recommended upgrade procedure has been improved. Fixes https://github.com/scylladb/scylladb/issues/11450 Fixes https://github.com/scylladb/scylladb/issues/11452 Closes #11460 * github.com:scylladb/scylladb: doc: update the commands to upgrade the ScyllaDB image doc: fix the filename in the index to resolve the warnings and fix the link doc: apply feedback by adding she step fo load the new repo and fixing the links doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file doc: update the Enterprise guide to include the Enterprise-onlyimage file doc: update the image files doc: split the upgrade-image file to separate files for Open Source and Enterprise doc: clarify the alternative upgrade procedures for the ScyllaDB image doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z	2022-09-20 11:51:26 +03:00
Botond Dénes	4ef7b080e3	docs/using-scylla/migrate-scylla.rst: remove link to unirestore It points to a private scylladb repo, which has no place in user-facing documentation. For now there is no public replacement, but a similar functionality is in the works for Scylla Manager. Fixes: #11573 Closes #11580	2022-09-20 11:46:28 +03:00
Anna Stuchlik	7b2209f291	doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1	2022-09-20 10:42:47 +02:00
Anna Stuchlik	db75adaf9a	doc: update the commands to upgrade the ScyllaDB image	2022-09-20 10:36:18 +02:00
Nadav Har'El	4c93a694b7	cql: validate bloom_filter_fp_chance up-front Scylla's Bloom filter implementation has a minimal false-positive rate that it can support (6.71e-5). When setting bloom_filter_fp_chance any lower than that, the compute_bloom_spec() function, which writes the bloom filter, throws an exception. However, this is too late - it only happens while flushing the memtable to disk, and a failure at that point causes Scylla to crash. Instead, we should refuse the table creation with the unsupported bloom_filter_fp_chance. This is also what Cassandra did six years ago - see CASSANDRA-11920. This patch also includes a regression test, which crashes Scylla before this patch but passes after the patch (and also passes on Cassandra). Fixes #11524. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11576	2022-09-20 06:18:51 +03:00
Botond Dénes	60991358e8	Merge 'Improvements to test/lib/sstable_utils.hh' from Raphael "Raph" Carvalho Changes done to avoid pitfalls and fix issues of sstable-related unit tests Closes #11578 * github.com:scylladb/scylladb: test: Make fake sstables implicitly belong to current shard test: Make it clearer that sstables::test::set_values() modify data size	2022-09-20 06:14:07 +03:00
Nadav Har'El	a1ff865c77	Merge 'test/topology_raft_disabled: write basic raft upgrade test' from Kamil Braun The test changes the servers' configuration to include `raft` in the `experimental-features` list, then restarts them. It waits until driver reconnects to every server after restarting. Then it checks that upgrade eventually finishes on every server by querying `group0_upgrade_state` key in `system.scylla_local`. Finally, it performs a schema change and verifies that a corresponding entry has appeared in `system.group0_history`. The commit also increases the number of clusters in the suite cluster pool. Since the suite contains only one test at this time this only has an effect if we run the test multiple times (using `--repeat`). Closes #11563 * github.com:scylladb/scylladb: test/topology_raft_disabled: write basic raft upgrade test test: setup logging in topology suites	2022-09-19 20:27:08 +03:00
Alejo Sanchez	087ae521c5	test.py: make client fail if before test check fails Check if request to server side (test.py) failed and raise if so. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11575	2022-09-19 18:04:07 +02:00
Raphael S. Carvalho	2f52698a26	test: Make fake sstables implicitly belong to current shard Fake SSTables will be implicitly owned by the shard that created them, allowing them to be called on procedures that assert the SSTables are owned by the current shard, like the table's one that rebuilds the sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-19 12:05:24 -03:00
Raphael S. Carvalho	697f200319	test: Make it clearer that sstables::test::set_values() modify data size By adding a param with default value, we make it clear in the interface that the procedure modifies sstable data size. It can happen one calls this function without noticing it overrides the data size previously set using a different function. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-19 12:01:24 -03:00
Anna Stuchlik	2513497f9a	doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1	2022-09-19 16:06:24 +02:00
Kamil Braun	b770443300	test/topology_raft_disabled: write basic raft upgrade test The test changes the servers' configuration to include `raft` in the `experimental-features` list, then restarts them. It waits until driver reconnects to every server after restarting. Then it checks that upgrade eventually finishes on every server by querying `group0_upgrade_state` key in `system.scylla_local`. Finally, it performs a schema change and verifies that a corresponding entry has appeared in `system.group0_history`. The commit also increases the number of clusters in the suite cluster pool. Since the suite contains only one test at this time this only has an effect if we run the test multiple times (using `--repeat`).	2022-09-19 13:29:35 +02:00
Kamil Braun	fd986bfed1	test: setup logging in topology suites Make it possible to use logging from within tests in the topology suites. The tests are executed using `pytest`, which uses a `pytest.ini` file for logging configuration. Also cleanup the `pytest.ini` files a bit.	2022-09-19 12:23:11 +02:00
Nadav Har'El	711dcd56b6	docs/alternator: refer to the right issue In compatibility.md where we refer to the missing ability to add a GSI to an existing table - let's refer to a new issue specifically about this feature, instead of the old bigger issue about UpdateItem. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11568	2022-09-19 11:05:07 +03:00
Piotr Sarna	5597bc8573	Merge 'Alternator: test and fix crashes and errors... when using ":attrs" attribute' from Nadav Har'El This PR improves the testing for issue #5009 and fixes most of it (but not all - see below). Issue #5009 is about what happens when a user tries to use the name `:attrs` for an attribute - while Alternator uses a map column with that name to hold all the schema-less attributes of an item. The tests we had for this issue were partial, and missed the worst cases which could result in Scylla crashing on specially-crafted PutItem or UpdateItem requests. What the tests missed were the cases that `:attrs` is used as a non-key. So in this PR we add additional tests for this case, several of them fail or even crash Scylla, and then we fix all these cases. Issue #5009 remains open because using `:attrs` as the name of a key is still not allowed. But because it results in a clean error message when attempting to create a table with such a key, I consider this remaining problem very minor. Refs #5009. Closes #11572 * github.com:scylladb/scylladb: alternator: fix crashes an errors when using ":attrs" attribute alternator: improve tests for reserved attribute name ":attrs"	2022-09-19 09:48:06 +02:00
Nadav Har'El	999ca2d588	alternator: fix crashes an errors when using ":attrs" attribute Alternator uses a single column, a map, with the deliberately strange name ":attrs", to hold all the schema-less attributes of an item. The existing code is buggy when the user tries to write to an attribute with this strange name ":attrs". Although it is extremely unlikely that any user would happen to choose such a name, it is nevertheless a legal attribute name in DynamoDB, and should definitely not cause Scylla to crash as it does in some cases today. The bug was caused by the code assuming that to check whether an attribute is stored in its own column in the schema, we just need to check whether a column with that name exists. This is almost true, except for the name ":attrs" - a column with this name exists, but it is a map - the attribute with that name should be stored in the map, not as the map. The fix is to modify that check to special-case ":attrs". This fix makes the relevant tests, which used to crash or fail, now pass. This fix solves most of #5009, but one point is not yet solved (and perhaps we don't need to solve): It is still not allowed to use the name ":attrs" for a key attribute. But trying to do that fails cleanly (during the table creation) with an appropriate error message, so is only a very minor compatibility issue. Refs #5009 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-19 10:30:11 +03:00
Nadav Har'El	6f8dca3760	alternator: improve tests for reserved attribute name ":attrs" As explained in issue #5009, Alternator currently forbids the special attribute name ":attrs", whereas DynamoDB allows any string of appropriate length (including the specific string ":attrs") to be used. We had only a partial test for this incompatibility, and this patch improves the testing of this issue. In particular, we were missing a test for the case that the name ":attrs" was used for a non-key attribute (we only tested the case it was used as a sort key). It turns out that Alternator crashes on the new test, when the test tries to write to a non-key attribute called ":attrs", so we needed to mark the new test with "skip". Moreover, it turns out that different code paths handle the attribute name ":attrs" differently, and also crash or fail in other ways - so we added more than one xfailing and skipped tests that each fails in a different place (and also a few tests that do pass). As usual, the new tests were checked to pass on DynamoDB. Refs #5009 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-19 10:30:06 +03:00
Botond Dénes	3003f1d747	Merge 'alternator: small documentation and comment fixes' from Nadav Har'El This tiny series fixes some small error and out-of-date information in Alternator documentation and code comments. Closes #11547 * github.com:scylladb/scylladb: alternator ttl: comment fixes docs/alternator: fix mention of old alternator-test directory	2022-09-19 09:27:53 +03:00
Kamil Braun	348582c4c8	test/pylib: pool: make it possible to free up space Some tests mark clusters as 'dirty', which makes them non-reusable by later tests; we don't want to return them to the pool of clusters. This use-case was covered by the `add_one` function in the `Pool` class. However, it had the unintended side effect of creating extra clusters even if there were no more tests that were waiting for new clusters. Rewrite the implementation of `Pool` so it provides 3 interface functions: - `get` borrows an object, building it first if necessary - `put` returns a borrowed object - `steal` is called by a borrower to free up space in the pool; the borrower is then responsible for cleaning up the object. Both `put` and `steal` wake up any outstanding `get` calls. Objects are built only in `get`, so no objects are built if none are needed. Closes #11558	2022-09-18 12:05:57 +03:00
Botond Dénes	22128977e4	test/boost: add alternative variant of logalloc test Which initializes LSA with use_standard_allocator_segment_pool_backend(), running the logalloc_test suite on the standard allocator segment pool backend. To avoid duplicating the test code, the new test-file pulls in the test code via #include. I'm not proud of it, but it works and we test LSA with both the debug and standard memory segment stores without duplicating code.	2022-09-16 14:57:23 +03:00
Botond Dénes	13ace7a05e	Merge "Fix RPC sockets configuration wrt topology" from Pavel Emelyanov " Messaging service checks dc/rack of the target node when creating a socket. However, this information is not available for all verbs, in particular gossiper uses RPC to get topology from other nodes. This generates a chicken-and-egg problem -- to create a socket messaging service needs topology information, but in order to get one gossiper needs to create a socket. Other than gossiper, raft starts sending its APPEND_ENTRY messages early enough so that topology info is not available either. The situation is extra-complicated with the fact that sockets are not created for individual verbs. Instead, verbs are grouped into several "indices" and socket is created for it. Thus, the "gossiping" index that includes non-gossiper verbs will create topology-less socket for all verbs in it. Worse -- raft sends messages w/o solicited topology, the corresponding socket is created with the assumption that the peer lives in default dc and rack which doesn't match the local nodes' dc/rack and the whole index group gets the "randomly" configured socket. Also, the tcp-nodelay tries to implement similar check, but uses wrong index of 1, so it's also fixed here. " * 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla: messaging_service: Fix gossiper verb group messaging_service: Mind the absence of topology data when creating sockets messaging_service: Templatize and rename remove_rpc_client_one	2022-09-16 13:27:56 +03:00
Botond Dénes	6a0db84706	tools: use standard allocator Use the new seastar option to instruct seastar to not initialize and use the seastar allocator, relying on the standard allocator instead. Configure LSA with the standard allocator based segment store backend: * scylla-types reserves 1MB for LSA -- in theory nothing here should use LSA, but just in case... * scylla-sstable reserves 100MB for LSA, to avoid excessive thrashing in the sstable index caches. With this, tools now should allocate memory on demand, without reserving a large chunk of (or all of) the available memory, as regular seastar apps do.	2022-09-16 13:07:01 +03:00
Botond Dénes	a55903c839	utils/logalloc: add use_standard_allocator_segment_pool_backend() Creating a standard-memory-allocator backend for the segment store. This is targeted towards tools, which want to configure LSA with a segment store backend that is appropriate for the standard allocator (which they want to use). We want to be able to use this in both release and debug mode. The former will be used by tools and the latter will be used to run the logalloc tests with this new backend, making sure it works and doesn't regress. For this latter, we have to allow the release and debug stores to coexist in the same build and for the debug store to be able to delegate to the release store when the standard allocator backend is used.	2022-09-16 13:02:40 +03:00
Kamil Braun	595472ac59	Merge 'Don't use qctx in CDC tables quering' from Pavel Emelyanov There's a bunch of helpers for CDC gen service in db/system_keyspace.cc. All are static and use global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods via it, thus reducing the global qctx usage. Closes #11557 * github.com:scylladb/scylladb: system_keyspace: De-static get_cdc_generation_id() system_keyspace: De-static cdc_is_rewritten() system_keyspace: De-static cdc_set_rewritten() system_keyspace: De-static update_cdc_generation_id()	2022-09-16 11:52:01 +02:00
Kamil Braun	0a6f601996	Merge 'Raft test topology fix request paths and API response handling' from Alecco - Raise on response not HTTP 200 for `.get_text()` helper - Fix API paths - Close and start a fresh driver when restarting a server and it's the only server in the cluster - Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions) Closes #11496 * github.com:scylladb/scylladb: test.py: handle duplicate result from driver test.py: log server restarts for topology tests test.py: log actions for topology tests Revert "test.py: restart stopped servers before... test.py: ManagerClient API fix return text test.py: ManagerClient raise on HTTP != 200 test.py: ManagerClient fix paths to updated resource	2022-09-16 11:29:10 +02:00
Botond Dénes	c1c74005b7	utils/logalloc: introduce segment store backend for standard allocator To be used by tools, this store backend is compatible with the standard allocator as it acquires the memory arena for segments via mmap().	2022-09-16 12:16:57 +03:00
Botond Dénes	d2a7ebbe66	utils/logalloc: rebase release segment-store on segment-store-backend Rebase the seastar allocator based segment store implementation on the recently introduced segment store backend which now abstracts away how memory for segments is obtained. This patch also introduces an explicit `segment_npos` to be used for cases when a segment -> index mapping fails (segment doesn't belong to the store). Currently the seastar allocator based store simply doesn't handle this case, while the standard allocator based store uses 0 as the implicit invalid index.	2022-09-16 12:16:57 +03:00
Botond Dénes	3717f7740d	utils/logalloc: introduce segment_store_backend We want to make it possible to select the segment-store to be used for LSA -- the seastar allocator based one or the standard allocator based one -- at runtime. Currently this choice is made at compile time via preprocessor switches. The current standard memory based store is specialized for debug build, we want something more similar to the seastar standard memory allocator based one. So we introduce a segment store backend for the current seastar allocator based store, which abstracts how the backing memory for all segments is allocated/freed, while keeping the segment <-> index mapping common. In the next patches we will rebase the current seastar allocator based segment store on this backend and later introduce another backend for standard allocator, targeted for release builds.	2022-09-16 12:16:57 +03:00
Botond Dénes	5ea4d7fb39	utils/logalloc: push segment alloc/dealloc to segment_store Currently the actual alloc/dealloc of memory for segments is located outside the segment stores. We want to abstract away how segments are allocated, so we move this logic too into the segment store. For now this results in duplicate code in the two segment store implementations, but this will soon be gone.	2022-09-16 12:16:57 +03:00
Botond Dénes	e82ea2f3ad	test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe Said test creates two vectors, the vector storage being allocated with the default allocator, while its content being allocated on LSA. If an exception is thrown however, both are freed via the default allocator, triggering an assert in LSA code. Move the cleanup into a `defer()` so the correct cleanup sequence is executed even on exceptions.	2022-09-16 12:16:57 +03:00
Pavel Emelyanov	e221bb0112	system_keyspace: De-static get_cdc_generation_id() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-16 08:34:15 +03:00
Pavel Emelyanov	fe48b66c0a	cross-shard-barrier: Capture shared barrier in complete When cross-shard barrier is abort()-ed it spawns a background fiber that will wake-up other shards (if they are sleeping) with exception. This fiber is implicitly waited by the owning sharded service .stop, because barrier usage is like this: sharded<service> s; co_await s.invoke_on_all([] { ... barrier.abort(); }); ... co_await s.stop(); If abort happens, the invoke_on_all() will only resolve _after_ it queues up the waking lambdas into smp queues, thus the subsequent stop will queue its stopping lambdas after barrier's ones. However, in debug mode the queue can be shuffled, so the owning service can suddenly be freed from under the barrier's feet causing use after free. Fortunately, this can be easily fixed by capturing the shared pointer on the shared barrier instead of a regular pointer on the shard-local barrier. fixes: #11303 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11553	2022-09-16 08:21:02 +03:00
Michał Chojnowski	78850884d2	test: perf: perf_fast_forward: fix an error message The test is supposed to give a helpful error message when the user forgets to run --populate before the benchmark. But this must have become broken at some point, because execute_cql() terminates the program with an unhelpful ("unconfigured table config") message, which doesn't mention --populate. Fix that by catching the exception and adding the helpful tip. Closes #11533	2022-09-15 19:30:10 +02:00
Avi Kivity	d3b8c0c8a6	logalloc: don't crash while reporting reclaim stalls if --abort-on-seastar-bad-alloc is specified The logger is proof against allocation failures, except if --abort-on-seastar-bad-alloc is specified. If it is, it will crash. The reclaim stall report is likely to be called in low memory conditions (reclaim's job is to alleviate these conditions after all), so we're likely to crash here if we're reclaiming a very low memory condition and have a large stall simultaneously (AND we're running in a debug environment). Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily. Fixes #11549 Closes #11555	2022-09-15 19:24:39 +02:00
Pavel Emelyanov	4f67898e7b	system_keyspace: De-static cdc_is_rewritten() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-15 18:44:59 +03:00
Pavel Emelyanov	736021ee98	system_keyspace: De-static cdc_set_rewritten() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-15 18:44:53 +03:00
Pavel Emelyanov	b3d139bbdb	system_keyspace: De-static update_cdc_generation_id() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-15 18:44:40 +03:00
Michał Chojnowski	cdb3e71045	sstables: add a flag for disabling long-term index caching Long-term index caching in the global cache, as introduced in 4.6, is a major pessimization for workloads where accesses to the index are (spatially) sparse. We want to have a way to disable it for the affected workloads. There is already infrastructure in place for disabling it for BYPASS CACHE queries. One way of solving the issue is hijacking that infrastructure. This patch adds a global flag (and a corresponding CLI option) which controls index caching. Setting the flag to `false` causes all index reads to behave like they would in BYPASS CACHE queries. Consequences of this choice: - The per-SSTable partition_index_cache is unused. Every index_reader has its own, and they die together. Independent reads can no longer reuse the work of other reads which hit the same index pages. This is not crucial, since partition accesses have no (natural) spatial locality. Note that the original reason for partition_index_cache -- the ability to share reads for the lower and upper bound of the query -- is unaffected. - The per-SSTable cached_file is unused. Every index_reader has its own (uncached) input stream from the index file, and every bsearch_clustered_cursor has its own cached_file, which dies together with the cursor. Note that the cursor still can perform its binary search with caching. However, it won't be able to reuse the file pages read by index_reader. In particular, if the promoted index is small, and fits inside the same file page as its index_entry, that page will be re-read. It can also happen that index_reader will read the same index file page multiple times. When the summary is so dense that multiple index pages fit in one index file page, advancing the upper bound, which reads the next index page, will read the same index file page. Since summary:disk ratio is 1:2000, this is expected to happen for partitions with size greater than 2000 partition keys.
Fixes #11202	2022-09-15 17:16:26 +03:00
David Garcia	3cc80da6af	docs: update theme 1.3 Update conf.py Closes #11330	2022-09-15 16:56:41 +03:00
Anna Stuchlik	e5c9f3c8a2	doc: fix the filename in the index to resolve the warnings and fix the link	2022-09-15 15:53:23 +02:00
Anna Stuchlik	338b45303a	doc: apply feedback by adding the step to load the new repo and fixing the links	2022-09-15 15:40:20 +02:00
Alejo Sanchez	92129f1d47	test.py: handle duplicate result from driver Sometimes the driver calls twice the callback on ready done future with a None result. Log it and avoid setting the local future twice. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-15 15:12:50 +02:00
Alejo Sanchez	2da7304696	test.py: log server restarts for topology tests Add missing logging for server restart. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-15 15:10:29 +02:00
Alejo Sanchez	61a92afa2d	test.py: log actions for topology tests For debugging, log driver connection, before and after checks, and topology changes. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-15 15:10:29 +02:00
Botond Dénes	05ef13a627	Merge 'Add support to split large partitions across SSTables' from Raphael "Raph" Carvalho Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundary, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction can also not fulfill its promise as the file storing the large partition can only be released once exhausted. The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading. The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run. The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold. What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned. 
Closes #11233 * github.com:scylladb/scylladb: test: Add test for large partition splitting on compaction compaction: Add support to split large partitions sstable: Extend sstable_run to allow disjointness on the clustering level sstables: simplify will_introduce_overlapping() test: move sstable_run_disjoint_invariant_test into sstable_datafile_test test: lib: Fix inefficient merging of mutations in make_sstable_containing() sstables: Keep track of first partition's first pos and last partition's last pos sstables: Rename min/max position_range to a descriptive name sstables_manager: Add sstable metadata reader concurrency semaphore sstables: Add ability to find first or last position in a partition	2022-09-15 16:08:56 +03:00
Alejo Sanchez	604f7353ef	Revert "test.py: restart stopped servers before... teardown..." This reverts commit `df1ca57fda`. In order to prevent timeouts on teardown queries, the previous commit added functionality to restart servers that were down. This issue is fixed in fc0263fc9b so there's no longer need to restart stopped servers on test teardown.	2022-09-15 14:47:01 +02:00
Alejo Sanchez	ed81f1a85c	test.py: ManagerClient API fix return text For ManagerClient request API, don't return status, raise an exception. Server side errors are signaled by status 500, not text body. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-15 14:47:01 +02:00
Alejo Sanchez	4a5f2418ec	test.py: ManagerClient raise on HTTP != 200 Raise an exception if the request result is not HTTP 200 for .get() helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-15 14:47:01 +02:00
Alejo Sanchez	a84bde38c0	test.py: ManagerClient fix paths to updated resource Fix missing path renames for server-side rename "node" -> "server" API. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-15 14:47:01 +02:00
Kamil Braun	728161003a	Merge 'raft server, abort on background errors' from Gusev Petr Halted background fibers render raft server effectively unusable, so report this explicitly to the clients. Fix: #11352 Closes #11370 * github.com:scylladb/scylladb: raft server, status metric raft server, abort group0 server on background errors raft server, provide a callback to handle background errors raft server, check aborted state on public server public api's	2022-09-15 14:12:11 +02:00
Alejo Sanchez	b8f68729b0	test.py: Pool add fresh when item not returned Pool.get() might have waiting callers, so if an item is not returned to the pool after use, tell the pool to add a new one and tell the pool an entry was taken (used for total running entries, i.e. clusters). Use it when a ScyllaCluster is dirty and not returned. While there improve logging and docstrings. Issue reported by @kbr-. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11546	2022-09-15 13:56:44 +03:00
Pavel Emelyanov	82162be1f1	messaging_service: Remove init/uninit helpers These two are just getting in the way when touching inter-components dependencies around messaging service. Without it m.-s. start/stop just looks like any other service out there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11535	2022-09-15 11:54:46 +03:00
Raphael S. Carvalho	0a8afe18ca	cql: Reject create and alter table with DateTieredCompactionStrategy It's been ~1 year (`2bf47c902e`) since we set restrict_dtcs config option to WARN, meaning users have been warned about the deprecation process of DTCS. Let's set the config to TRUE, meaning that create and alter statements specifying DTCS will be rejected at the CQL level. Existing tables will still be supported. But the next step will be about throwing DTCS code into the shadow realm, and after that, Scylla will automatically fallback to STCS (or ICS) for users which ignored the deprecation process. Refs #8914. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11458	2022-09-15 11:46:18 +03:00
Alejo Sanchez	7e3389ee43	test.py: schema timeout less than request timeout When a server is down, the driver expects multiple schema timeouts within the same request to handle it properly. Found by @kbr- Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11544	2022-09-15 11:43:52 +03:00
Raphael S. Carvalho	a04047f390	compaction: Properly handle stop request for off-strategy If user stops off-strategy via API, compaction manager can decide to give up on it completely, so data will sit unreshaped in maintenance set, preventing it from being compacted with data in the main set. That's problematic because it will probably lead to a significant increase in read and space amplification until off-strategy is triggered again, which cannot happen anytime soon. Let's handle it by moving data in maintenance set into main one, even if unreshaped. Then regular compaction will be able to continue from where off-strategy left off. Fixes #11543. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11545	2022-09-15 09:21:22 +03:00
Nadav Har'El	33e6a88d9a	alternator ttl: comment fixes This patch fixes a few errors and out-of-date descriptions in comments in alternator/ttl.cc. No functional changes. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-15 00:03:43 +03:00
Nadav Har'El	8af9437508	docs/alternator: fix mention of old alternator-test directory The directory that used to be called alternator-test is now (and has been for a long time) really test/alternator. So let's fix the references to it in docs/alternator/alternator.md. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-15 00:03:43 +03:00
Pavel Emelyanov	2c74062962	messaging_service: Fix gossiper verb group When configuring tcp-nodelay unconditionally, messaging service thinks gossiper uses group index 1, though it had changed some time ago and now those verbs belong to group 0. fixes: #11465 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-14 20:40:47 +03:00
Pavel Emelyanov	7bdad47de2	messaging_service: Mind the absence of topology data when creating sockets When a socket is created to serve a verb there may be no topology information regarding the target node. In this case current code configures socket as if the peer node lived in "default" dc and rack of the same name. If topology information appears later, the client is not re-connected, even though it could provide a more relevant configuration (e.g. -- w/o encryption) This patch checks if the topology info is needed (sometimes it's not) and if missing it configures the socket in the most restrictive manner, but notes that the socket ignored the topology on creation. When topology info appears -- and this happens when a node joins the cluster -- the messaging service is kicked to drop all sockets that ignored the topology, so that they reconnect later. The mentioned "kick" comes from storage service on-join notification. More correct fix would be if topology had on-change notification and messaging service subscribed on it, but there are two cons: - currently dc/rack do not change on the fly (though they can, e.g. if gossiping property file snitch is updated without restart) and topology update effectively comes from a single place - updating topology on token-metadata is not like topology.update() call. Instead, a clone of token metadata is created, then update happens on the clone, then the clone is committed into t.m. It's possible to find out at commit time which nodes changed their topology, but since it only happens on join this complexity is likely not worth the effort (yet) fixes: #11514 fixes: #11492 fixes: #11483 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-14 20:30:51 +03:00
Pavel Emelyanov	5ffc9d66ec	messaging_service: Templatize and rename remove_rpc_client_one It actually finds and removes a client and in its new form it also applies a filtering function to it, so a better name is called for Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-14 20:30:07 +03:00
Raphael S. Carvalho	20a6483678	test: Add test for large partition splitting on compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:23:19 -03:00
Raphael S. Carvalho	e2ccafbe38	compaction: Add support to split large partitions Adds support for splitting large partitions during compaction. Large partitions introduce many problems, like memory overhead and breaking the promise of incremental compaction. We want to split large partitions across fixed-size fragments. We'll allow a partition to exceed size limit by 10%, as we don't want to unnecessarily split partitions that just crossed the limit boundary. To avoid having to open a minimum of 2 fragments in a read, partition tombstone will be replicated to every fragment storing the partition. The splitting isn't enabled by default, and can be used by strategies that are run aware like ICS. LCS still cannot support it as it's still using physical level metadata, not run id. An incremental reader for sstable runs will follow soon. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:23:16 -03:00
Raphael S. Carvalho	4bc24acf81	sstable: Extend sstable_run to allow disjointness on the clustering level After commit `0796b8c97a`, sstable_run won't accept a fragment that introduces key overlapping. But once we split large partitions, fragments in the same run may store disjoint clustering ranges of the same partition. So we're extending sstable_run to look at clustering dimension, so fragments storing disjoint clustering ranges of the same large partition can co-exist in the same run. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	574e656793	sstables: simplify will_introduce_overlapping() An element S1 is completely ordered before S2, if S1's last key is lower than S2's first key. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
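The ordering rule stated in `574e656793` can be illustrated with a small sketch. This is not the Scylla C++ implementation: `Fragment`, `ordered_before` and the integer keys are hypothetical stand-ins chosen for illustration, and only the rule "S1 is completely ordered before S2 if S1's last key is lower than S2's first key" comes from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for an sstable fragment; keys are plain ints here."""
    first_key: int
    last_key: int


def ordered_before(s1: Fragment, s2: Fragment) -> bool:
    # S1 is completely ordered before S2 if S1's last key is lower than S2's first key.
    return s1.last_key < s2.first_key


def will_introduce_overlapping(run: list[Fragment], candidate: Fragment) -> bool:
    # The candidate overlaps the run if, for some existing fragment, neither
    # one is completely ordered before the other.
    return any(
        not (ordered_before(f, candidate) or ordered_before(candidate, f))
        for f in run
    )


run = [Fragment(0, 9), Fragment(20, 29)]
fits_in_gap = will_introduce_overlapping(run, Fragment(10, 19))
crosses_boundary = will_introduce_overlapping(run, Fragment(5, 12))
```

Under these assumptions a fragment that fits in the gap between two existing fragments keeps the run disjoint, while one that crosses a boundary does not.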
Raphael S. Carvalho	13942ec947	test: move sstable_run_disjoint_invariant_test into sstable_datafile_test That's where it belongs. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	e1560c6b7f	test: lib: Fix inefficient merging of mutations in make_sstable_containing() make_sstable_containing() was absurdly slow when merging thousands of mutations belonging to the same key, as it was unnecessarily copying the mutation for every merge, producing bad complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	5937765009	sstables: Keep track of first partition's first pos and last partition's last pos With first partition's first position and last partition's last position, we'll be able to determine which fragments composing a sstable run store a large partition that was split. Then sstable run will be able to detect if all fragments storing a given large partition are disjoint in the clustering level. Fixes #10637. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	a4bbdfcc58	sstables: Rename min/max position_range to a descriptive name The new descriptive name is important to make a distinction when sstable stores position range for first and last rows instead of min and max. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	e099a9bf3b	sstables_manager: Add sstable metadata reader concurrency semaphore Let's introduce a reader_concurrency_semaphore for reading sstable metadata, to avoid an OOM due to unlimited concurrency. The concurrency on startup is not controlled, so it's important to enforce a limit on the amount of memory used by the parallel readers. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	9bcad9ffa8	sstables: Add ability to find first or last position in a partition This new method allows sstable to load the first row of the first partition and last row of last partition. That's useful for incremental reading of sstable run which will be split at clustering boundary. To get the first row, it consumes the first row (which can be either a clustering row or range tombstone change) and returns its position_in_partition. To get the last row, it does the same as above but in reverse mode instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:48 -03:00
Nadav Har'El	77467bcbcd	Merge 'test/pylib: APIs to read and modify configuration from tests' from Kamil Braun We introduce `server_get_config` to fetch the entire configuration dict and `update_config` to update a value under the given key. Closes #11493 * github.com:scylladb/scylladb: test/pylib: APIs to read and modify configuration from tests test/pylib: ScyllaServer: extract _write_config_file function test/pylib: ScyllaCluster: extend ActionReturn with dict data test/pylib: ManagerClient: introduce _put_json test/pylib: ManagerClient: replace `_request` with `_get`, `_get_text` test: pylib: store server configuration in `ScyllaServer`	2022-09-14 18:49:55 +03:00
Kefu Chai	2a74a0086f	docs: fix typos * s/udpates/updates/ * s/opetarional/operational/ Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes #11541	2022-09-14 17:04:05 +03:00
Kamil Braun	73bf781e17	test/pylib: APIs to read and modify configuration from tests We introduce `server_get_config` to fetch the entire configuration dict and `update_config` to update a value under the given key.	2022-09-14 12:46:41 +02:00
Kamil Braun	1f550428a9	test/pylib: ScyllaServer: extract _write_config_file function For refreshing the on-disk config file with the config stored in dict form in the `self.config` field.	2022-09-14 12:46:41 +02:00
Kamil Braun	52e52e8503	test/pylib: ScyllaCluster: extend ActionReturn with dict data For returning types more complex than text. Also specify a default empty string value for the `msg` field for non-text return values.	2022-09-14 12:46:41 +02:00
Kamil Braun	c9348ae8ea	test/pylib: ManagerClient: introduce _put_json For sending PUT requests to the Manager (such as updating configuration).	2022-09-14 12:46:41 +02:00
Kamil Braun	d81c722476	test/pylib: ManagerClient: replace `_request` with `_get`, `_get_text` `_request` performed a GET request and extracted a text body out of the response. Split it into `_get`, which only performs the request, and `_get_text`, which calls `_get` and extracts the body as text. Also extract a `_resource_uri` function which will be used for other request types.	2022-09-14 12:46:41 +02:00
Kamil Braun	9d39e14518	test: pylib: store server configuration in `ScyllaServer` In following commits we will make this configuration accessible from tests through the Manager (for fetching and updating).	2022-09-14 12:46:41 +02:00
Nadav Har'El	cf30432715	Merge 'test: add a topology suite with Raft disabled' from Kamil Braun Add a suite which is basically equivalent to `topology` except that it doesn't start servers with Raft enabled. The suite will be used to test the Raft upgrade procedure. The suite contains a basic test just to check the suite itself can run; the test will be removed when 'real' tests are added. Closes #11487 * github.com:scylladb/scylladb: test.py: PythonTestSuite: sum default config params with user-provided ones test: add a topology suite with Raft disabled test: pylib: use Python dicts to manipulate `ScyllaServer` configuration test: pylib: store `config_options` in `ScyllaServer`	2022-09-14 13:37:44 +03:00
Pavel Emelyanov	43131976e9	updateable_value: Update comment about cross-shard copying refs: #7316 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11538	2022-09-14 12:35:56 +02:00
Michał Chojnowski	9b6fc553b4	db: commitlog: don't print INFO logs on shutdown The intention was for these logs to be printed during the database shutdown sequence, but it was overlooked that it's not the only place where commitlog::shutdown is called. Commitlogs are started and shut down periodically by hinted handoff. When that happens, these messages spam the log. Fix that by adding INFO commitlog shutdown logs to database::stop, and change the level of the commitlog::shutdown log call to DEBUG. Fixes #11508 Closes #11536	2022-09-14 11:30:53 +03:00
Avi Kivity	a24a8fd595	Update seastar submodule * seastar cbb0e888d8...601e0776c0 (1): > coroutine: explain and mitigate the lambda coroutine fiasco Closes #11537	2022-09-13 22:37:29 +03:00
Petr Gusev	4ff0807cd0	raft server, status metric	2022-09-13 19:34:22 +04:00
Alejo Sanchez	6799e766ca	test.py: topology increment timeouts even more Due to slow debug machines timing out, bump up all timeouts significantly. The cause was ExecutionProfile request_timeout. Also set a high heartbeat timeout and bump already set timeouts to be safe, too. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11516	2022-09-13 11:57:31 +02:00
Piotr Dulikowski	e69b44a60f	exception: fix the error code used for rate_limit_exception Per-partition rate limiting added a new error type which should be returned when Scylla decides to reject an operation due to per-partition rate limit being exceeded. The new error code requires drivers to negotiate support for it, otherwise Scylla will report the error as `Config_error`. The existing error code override logic works properly, however due to a mistake Scylla will report the `Config_error` code even if the driver correctly negotiated support for it. This commit fixes the problem by specifying the correct error code in `rate_limit_exception`'s constructor. Tested manually with a modified version of the Rust driver which negotiates support for the new error. Additionally, tested what happens when the driver doesn't negotiate support (Scylla properly falls back to `Config_error`). Branches: 5.1 Fixes: #11517 Closes #11518	2022-09-13 11:46:15 +02:00
Nadav Har'El	8ece63c433	Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails' This series introduces two configurable options when working with TWCS tables: - `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting - `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check. Refs: #6923 Fixes: #9029 Closes #11445 * github.com:scylladb/scylladb: tests: cql_query_test: add mixed tests for verifying TWCS guard rails tests: cql_query_test: add test for TWCS window size tests: cql_query_test: add test for TWCS tables with no TTL defined cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables cql: add max window restriction for TimeWindowCompactionStrategy time_window_compaction_strategy: reject invalid window_sizes cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr	2022-09-12 23:55:51 +03:00
Botond Dénes	045b053228	Update seastar submodule * seastar 2b2f6c08...cbb0e888 (10): > memory: allow user to select allocator to be used at runtime > perftune.py: correct typos > Merge 'seastar-addr2line: support more flexible syslog-style backtraces' from Benny Halevy > Fix instruction count for start_measuring_time > build: s/c-ares::c-ares/c-ares::cares/ > Merge 'shared_ptr_debug_helper: turn assert into on_internal_error_abort' from Benny Halevy > test: fix use after free in the loopback socket > doc/tutorial.md: fix docker command for starting hello-world_demo > httpd: add a ctor without addr parameter > dns: dns_resolver: sock_entry: move-construct tcp/udp entries in place Closes #11526	2022-09-12 18:34:22 +03:00
Avi Kivity	62ac3432c9	Merge "Always notify dropped RPC connections" from Pavel E " This set makes messaging service notify connection drop listeners when connection is dropped for _any_ reason and cleans things up around it afterwards " * 'br-messaging-notify-connection-drop' of https://github.com/xemul/scylla: messaging_service: Relax connection drop on re-caching messaging_service: Simplify remove_rpc_client_one() messaging_service: Notify connection drop when connection is removed	2022-09-12 17:02:51 +03:00
Yaron Kaikov	27e326652b	build_docker.sh:fix python2 dependency Following the revert of `b004da9d1b` which solved https://github.com/scylladb/scylla-pkg/issues/3094 updating docker dependency to match `scylla-tools-java` requirements Closes #11522	2022-09-12 13:33:06 +03:00
Kamil Braun	2fe3e67a47	gms: feature_service: don't distinguish between 'known' and 'supported' features `feature_service` provided two sets of features: `known_feature_set` and `supported_feature_set`. The purpose of both and the distinction between them was unclear and undocumented. The 'supported' features were gossiped by every node. Once a feature is supported by every node in the cluster, it becomes 'enabled'. This means that whatever piece of functionality is covered by the feature, it can be used by the cluster from now on. The 'known' set was used to perform feature checks on node start; if the node saw that a feature is enabled in the cluster, but the node does not 'know' the feature, it would refuse to start. However, if the feature was 'known', but wasn't 'supported', the node would not complain. This means that we could in theory allow the following scenario: 1. all nodes support feature X. 2. X becomes enabled in the cluster. 3. the user changes the configuration of some node so feature X will become unsupported but still known. 4. The node restarts without error. So now we have a feature X which is enabled in the cluster, but not every node supports it. That does not make sense. It is not clear whether it was accidental or purposeful that we used the 'known' set instead of the 'supported' set to perform the feature check. What I think is clear, is that having two sets makes the entire thing unnecessarily complicated and hard to think about. Fortunately, at the base to which this patch is applied, the sets are always the same. So we can easily get rid of one of them. I decided that the name which should stay is 'supported', I think it's more specific than 'known' and it matches the name of the corresponding gossiper application state. Closes #11512	2022-09-12 13:09:12 +03:00
Takuya ASADA	cd5320fe60	install.sh: add --without-systemd option Since we fail to write files to $USER/.config on Jenkins jobs, we need an option to skip installing systemd units. Let's add --without-systemd to do that. Also, to detect the option availability, we need to increment relocatable package version. See scylladb/scylla-dtest#2819 Closes #11345	2022-09-12 13:04:00 +03:00
Avi Kivity	521127a253	Update tools/jmx submodule * tools/jmx 06f2735...88d9bdc (1): > install.sh: add --without-systemd option	2022-09-12 13:02:16 +03:00
Kamil Braun	ce7bb8b6d0	test.py: PythonTestSuite: sum default config params with user-provided ones Previously, if the suite.yaml file provided `extra_scylla_config_options` but didn't provide values for `authorizer` or `authenticator` inside the config options, the harness wouldn't give any defaults for these keys. It would only provide defaults for these keys if suite.yaml didn't specify `extra_scylla_config_options` at all. It makes sense to give the user the ability to provide extra options while relying on harness defaults for `authenticator` and `authorizer` if the user doesn't care about them.	2022-09-12 11:58:05 +02:00
Kamil Braun	1661fe9f37	test: add a topology suite with Raft disabled Add a suite which is basically equivalent to `topology` except that it doesn't start servers with Raft enabled. The suite will be used to test the Raft upgrade procedure. The suite contains a basic test just to check the suite itself can run; the test will be removed when 'real' tests are added.	2022-09-12 11:58:05 +02:00
Kamil Braun	311806244d	test: pylib: use Python dicts to manipulate `ScyllaServer` configuration Previously we used a formattable string to represent the configuration; values in the string were substituted by Python's formatting mechanism and the resulting string was stored to obtain the config file. This approach had some downsides, e.g. it required boilerplate work to extend: to add a new config option, you would have to modify this template string. Instead we can represent the configuration as a Python dictionary. Dicts are easy to manipulate, for example you can sum two dicts; if a key appears in both, the second dict 'wins': ``` {1:1} | {1:2} == {1:2} ``` This makes the configuration easy to extend without having to write boilerplate: if the user of `ScyllaServer` wants to add or override a config option, they can simply add it to the `config_options` dict and that's it - no need to modify any internal template strings in `ScyllaServer` implementation like before. The `config_options` dict is simply summed with the 'base' config dict of `ScyllaServer` (`config_options` is the right summand so anything in there overrides anything in the base dict). An example of this extensibility is the `authenticator` and `authorizer` options which no longer appear in `scylla_cluster.py` module after this change, they only appear in the suite.yaml file. Also, use "workdir" option instead of specifying data dir, commitlog dir etc. separately.	2022-09-12 11:57:58 +02:00
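The dict "sum" described in `311806244d` can be sketched as follows. The keys and values below are made up for illustration and are not taken from `scylla_cluster.py`; only the rule that the right-hand dict overrides the base one comes from the commit message. The `|` operator on dicts assumes Python 3.9 or newer.

```python
# Hypothetical base configuration, standing in for the harness defaults.
base_config = {
    "cluster_name": "test-cluster",
    "workdir": "/tmp/scylla-1",
    "authenticator": "AllowAllAuthenticator",
}

# Hypothetical per-suite overrides, standing in for `config_options`.
config_options = {
    "authenticator": "PasswordAuthenticator",
    "authorizer": "CassandraAuthorizer",
}


def merged_config(base: dict, overrides: dict) -> dict:
    # The right operand wins on duplicate keys, so overrides replace the
    # base values while untouched base keys are kept as defaults.
    return base | overrides


config = merged_config(base_config, config_options)
```

Neither input dict is modified, so the same base dict could be reused for several servers with different overrides.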
Kamil Braun	fd19825eaa	test: pylib: store `config_options` in `ScyllaServer` Previously the code extracted `authenticator` and `authorizer` keys from the config options and stored them. Store the entire dict instead. The new code is easier to extend if we want to make more options configurable.	2022-09-12 11:57:18 +02:00
Pavel Emelyanov	5663b3eda5	messaging_service: Relax connection drop on re-caching When messaging_service::get_rpc_client() picks up cached socket and notices error on it, it drops the connection and creates a new one. The method used to drop the connection is the one that re-lookups the verb index again, which is excessive. Tune this up while at it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-12 12:05:02 +03:00
Nadav Har'El	b0371b6bf8	test/alternator: insert test names into Scylla logs The output of test/alternator/run ends in Scylla's full log file, where it is hard to understand which log messages are related to which test. In this patch, we add a log message (using the new /system/log REST API) every time a test is started and ends. The messages look like this: INFO 2022-08-29 18:07:15,926 [shard 0] api - /system/log: test/alternator: Starting test_ttl.py::test_describe_ttl_without_ttl ... INFO 2022-08-29 18:07:15,930 [shard 0] api - /system/log: test/alternator: Ended test_ttl.py::test_describe_ttl_without_ttl Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Nadav Har'El	a81310e23d	rest api: add a new /system/log operation Add a new REST API operation, taking a log level and a message, and printing it into the Scylla log. This can be useful when a test wants to mark certain positions in the log (e.g., to see which other log messages we get between the two positions). An alternative way to achieve this could have been for the test to write directly into the log file - but an on-disk log file is only one of the logging options that Scylla supports, and the approach in this patch allows adding log messages regardless of how Scylla keeps the logs. The motivation for this feature is that in the following patch the test/alternator framework will add log messages when starting and ending tests, which can help debug test failures. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Nadav Har'El	b9792ffb06	alternator ttl: log warning if scan took too long. Currently, we log at "info" level how much time remained at the end of a full TTL scan until the next scanning period (we sleep for that time). If the scan was slower than the period, we didn't print anything. Let's print a warning in this case - it can be useful for debugging, and also users should know when their desired scan period is not being honored because the full scan is taking longer than the desired scan period. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Nadav Har'El	e7e9adc519	alternator,ttl: allow sub-second TTL scanning period, for tests Alternator has the "alternator_ttl_period_in_seconds" parameter for controlling how often the expiration thread looks for expired items to delete. It is usually a very large number of seconds, but for tests to finish quickly, we set it to 1 second. With 1 second expiration latency, test/alternator/test_ttl.py took 5 seconds to run. In this patch, we change the parameter to allow a floating-point number of seconds instead of just an integer. Then, this allows us to halve the TTL period used by tests to 0.5 seconds, and as a result, the run time of test_ttl.py halves to 2.5 seconds. I think this is fast enough for now. I verified that even if I change the period to 0.1, there is no noticeable slowdown to other Alternator tests, so 0.5 is definitely safe. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Nadav Har'El	746c4bd9eb	test/alternator: skip fewer Alternator TTL tests Most of the Alternator TTL tests are extremely slow on DynamoDB because item expiration may be delayed up to 24 hours (!), and in practice for 10 to 30 minutes. Because of this, we marked most of these tests with the "veryslow" mark, causing them to be skipped by default - unless pytest is given the "--runveryslow" option. The result was that the TTL tests were not run in the normal test runs, which can allow regressions to be introduced (luckily, this hasn't happened). However, this "veryslow" mark was excessive. Many of the tests are very slow only on DynamoDB, but aren't very slow on Scylla. In particular, many of the tests involve waiting for an item to expire, something that happens after the configurable alternator_ttl_period_in_seconds, which is just one second in our tests. So in this patch, we remove the "veryslow" mark from 6 tests of Alternator TTL tests, and instead use two new fixtures - waits_for_expiration and veryslow_on_aws - to only skip the test when running on DynamoDB or when alternator_ttl_period_in_seconds is high - but in our usual test environment they will not get skipped. Because 5 of these 6 tests wait for an item to expire, they take one second each and this patch adds 5 seconds to the Alternator test runtime. This is unfortunate (it's more than 25% of the total Alternator test runtime!) but not a disaster, and we plan to reduce this 5 second time further in the following patch, by decreasing the TTL scanning period even further. This patch also increases the timeout of several of these tests, to 120 seconds from the previous 10 seconds. As mentioned above, normally, these tests should always finish in alternator_ttl_period_in_seconds (1 second) with a single scan taking less than 0.2 seconds, but in extreme cases of debug builds on overloaded test machines, we saw even 60 seconds being passed, so let's increase the maximum. I also needed to make the sleep time between retries smaller, not a function of the new (unrealistic) timeout. 4 more tests remain "veryslow" (and won't run by default) because they take 5-10 seconds each (e.g., a test which waits to see that an item does not get expired, and a test involving writing a lot of data). We should reconsider this in the future - to perhaps run these tests in our normal test runs - but even for now, the 6 extra tests that we start running are a much better protection against regressions than what we had until now. Fixes #11374 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Nadav Har'El	297109f6ee	test/alternator: test Alternator TTL metrics This patch adds a test for the metrics generated by the background expiration thread run for Alternator's TTL feature. We test three of the four metrics: scylla_expiration_scan_passes, scylla_expiration_scan_table and scylla_expiration_items_deleted. The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the number of times that this node took over another node's expiration duty, so it requires a multi-node cluster to test, and we can't test it in the single-node cluster test framework. To see TTL expiration in action this test may need to wait up to the setting of alternator_ttl_period_in_seconds. For a setting of 1 second (the default set by test/alternator/run), this means this test can take up to 1 second to run. If alternator_ttl_period_in_seconds is set higher, the test is skipped unless --runveryslow is requested. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Botond Dénes	a0392bc1eb	Merge 'doc: update the default SStable format' from Anna Stuchlik The purpose of this PR is to update the information about the default SStable format. Closes #11431 * github.com:scylladb/scylladb: doc: simplify the information about default formats in different versions doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page doc: fix additional information regarding the ME format on the SStable 3.x page doc: add the ME format to the table add a comment to remove the information when the documentation is versioned (in 5.1) doc: replace Scylla with ScyllaDB doc: fix the formatting and language in the updated section doc: fix the default SStable format	2022-09-12 09:50:01 +03:00
Pavel Emelyanov	f3dfc9dbd4	system_keyspace: Don't load preferred IPs if not asked for If snitch->prefer_local() is false, advertised (via gossiper) INTERNAL_IPs are not suggested to messaging service to use. The same should apply to boot-time when messaging service is loaded with those IPs taken from the system.peers table. fixes: #11353 tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2172/ Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220909144800.23122-1-xemul@scylladb.com>	2022-09-12 09:48:23 +03:00
Botond Dénes	9db940ff1b	Merge "Make network_topology_strategy_test use topology" from Pavel Emelyanov " The test in question plays with snitches to simulate the topology over which tokens are spread. This set replaces explicit snitch usage with temporary topology object. Some snitch traces are still left, but those are for token_metadata internal which still call global snitch for DC/RACK. " * 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla: network_topology_strategy_test: Use topology instead of snitch network_topology_strategy_test: Populate explicit topology	2022-09-12 09:40:17 +03:00
Avi Kivity	6c797587c7	dirty_memory_manager: region_group: remove sorting of subgroups dirty_memory_manager tracks lsa regions (memtables) under region_group:s, in order to be able to pick up the largest memtable as a candidate for flushing. Just as region_group:s contain regions, they can also contain other region_group:s in a nested structure. It also tracks the nested region_group that contains the largest region in a binomial heap. This latter facility is no longer used. It saw use when we had the system dirty_memory_manager nested under the user dirty_memory_manager, but that proved too complicated so it was undone. We still nest a virtual region_group under the real region_group, and in fact it is the virtual region_group that holds the memtables, but it is accessed directly to find the largest memtable (region_group::get_largest_region) and so all the mechanism that sorts region_group:s is bypassed. Start to dismantle this house of cards by removing the subgroup sorting. Since the hierarchy has exactly one parent and one child, it's clearly useless. This is seen by the fact that we can just remove everything related. We still need the _subgroups member to hold the virtual region_group; it's replaced by a vector. I verified that the non-intrusive vector is exception safe since push_back() happens at the very end; in any case this is early during setup where we aren't under memory pressure. A few tests that check the removed functionality are deleted. Closes #11515	2022-09-12 09:29:08 +03:00
Botond Dénes	0e2d6cfd61	Merge 'Introduce Compaction Groups' from Raphael "Raph" Carvalho Compaction group can be defined as a set of files that can be compacted together. Today, all sstables belonging to a table in a given shard belong to the same group. So we can say there's one group per table per shard. As we want to eventually allow isolation of data that shouldn't be mixed, e.g. data from different vnodes, then we want to have more than one group per table per shard. That's why compaction groups is being introduced here. Today, all memtables and sstables are stored in a single structure per table. After compaction groups, there will be memtables and sstables for each group in the table. As we're taking an incremental approach, table still supports a single group. But work was done on preparing table for supporting multiple groups. Completing that work is actually the next step. Also, a procedure for deriving the group from token is introduced, but today it always return the single group owned by the table. Once multiple groups are supported, then that procedure should be implemented to map a token to a group. No semantics was changed by this series. Closes #11261 * github.com:scylladb/scylladb: replica: Move memtables to compaction_group replica: move compound SSTable set to compaction group replica: move maintenance SSTable set to compaction_group replica: move main SSTable set to compaction_group replica: Introduce compaction_group replica: convert table::stop() into coroutine compaction_manager: restore indentation compaction_manager: Make remove() and stop_ongoing_compactions() noexcept test: sstable_compaction_test: Don't reference main sstable set directly test: sstable_utils: Set data size fields for fake SSTable test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable	2022-09-12 09:28:44 +03:00
Botond Dénes	5374f0edbf	Merge 'Task manager' from Aleksandra Martyniuk Task manager for observing and managing long-running, asynchronous tasks in Scylla with the interface for the user. It will allow listing of tasks, getting detailed task status and progression, waiting for their completion, and aborting them. The task manager will be configured with a “task ttl” that determines how long the task status is kept in memory after the task completes. At first it will support repair and compaction tasks, and possibly more in the future. Currently: Sharded `task_manager` is started in `main.cc` where it is further passed to `http_context` for the purpose of user interface. Task manager's tasks are implemented in two layers: the abstract and the implementation one. The latter is a pure virtual class which needs to be overridden by each module. Abstract layer provides the methods that are shared by all modules and the access to module-specific methods. Each module can access task manager, create and manage its tasks through `task_manager::module` object. This way data specific to a module can be separated from the other modules. User can access task manager rest api interface to track asynchronous tasks. The available options consist of: - getting a list of modules - getting a list of basic stats of all tasks in the requested module - getting the detailed status of the requested task - aborting the requested task - waiting for the requested task to finish To enable testing of the provided api, test specific task implementation and module are provided. Their lifetime can be simulated with the standalone test api. These components are compiled and the tests are run in all but release build modes. Fixes: #9809 Closes #11216 * github.com:scylladb/scylladb: test: task manager api test task_manager: test api layer implementation task_manager: add test specific classes task_manager: test api layer task_manager: api layer implementation task_manager: api layer task_manager: keep task_manager reference in http_context start sharded task manager task_manager: create task manager object	2022-09-12 09:26:46 +03:00
Petr Gusev	1b5fa4088e	raft server, abort group0 server on background errors	2022-09-12 10:16:43 +04:00
Petr Gusev	e92dc9c15b	raft server, provide a callback to handle background errors Fix: #11352	2022-09-12 10:16:43 +04:00
Petr Gusev	c57238d3d6	raft server, check aborted state on public server public api's Fix: #11352	2022-09-12 10:16:40 +04:00
Felipe Mendes	6a3d8607b4	tests: cql_query_test: add mixed tests for verifying TWCS guard rails This patch adds a set of 10 scenarios that have been unveiled during additional testing. In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled - may break the guardrails safe-mode. The situations covered are: - STCS->TWCS with no TTL defined - STCS->TWCS with small TTL - STCS->TWCS with large TTL value - TWCS table with small to large TTL - No TTL TWCS to large TTL and then small TTL - twcs_max_window_count LiveUpdate - Decrease TTL - twcs_max_window_count LiveUpdate - Switch CompactionStrategy - No TTL TWCS table to STCS - Large TTL TWCS table, modify attribute other than compaction and default_time_to_live - Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined	2022-09-11 17:57:14 -03:00
Felipe Mendes	a7a91e3216	tests: cql_query_test: add test for TWCS window size This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy with an incorrect number of compaction windows. The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the test in order to ensure that users can effectively disable the enforcement, should they want.	2022-09-11 17:38:25 -03:00
Felipe Mendes	1c5d46877e	tests: cql_query_test: add test for TWCS tables with no TTL defined This patch adds a testcase for TimeWindowCompactionStrategy tables created with no default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_without_default_ttl parameter in order to determine whether TWCS tables without TTL should be forbidden or not. The test replays all 3 possible variations of the tri_mode_restriction and verifies tables are correctly created/altered according to the current setting on the replica which receives the request.	2022-09-11 16:55:46 -03:00
Felipe Mendes	7fec4fcaa6	cql: add configurable restriction of default_time_to_live for TimeWindowCompactionStrategy tables TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth. However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0. The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn": We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database omitting the `USING TTL` keyword.	2022-09-11 16:50:42 -03:00
Felipe Mendes	a3356e866b	cql: add max window restriction for TimeWindowCompactionStrategy The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation where users of TWCS underestimate their window buckets. Unfortunately, scenarios in which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough. Such a configuration is known to only do harm to a workload: as more and more windows are created, the number of SSTables will grow at the same pace, and the situation will only get worse as the number of shards increases. This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the creation or alteration of tables which get past this threshold. A value of 0 will explicitly skip this check. Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.	2022-09-11 16:50:22 -03:00
Felipe Mendes	f1ffb501f0	time_window_compaction_strategy: reject invalid window_sizes Scylla mistakenly allows a user to configure an invalid TWCS window_size <= 0, which effectively breaks the notion of compaction windows. Interestingly enough, a <= 0 window size should be considered undefined behavior as either we would create a new window every 0 duration (?) or the table would behave as STCS, the reader is encouraged to figure out which one of these is true. :-) Cassandra, on the other hand, will properly throw a ConfigurationException when receiving such invalid window sizes and we now match the behavior to the same as Cassandra's. Refs: #2336	2022-09-11 16:40:03 -03:00
Raphael S. Carvalho	f5715d3f0b	replica: Move memtables to compaction_group Now memtables live in compaction_group. Also introduced function that selects group based on token, but today table always return the single group managed by it. Once multiple groups are supported, then the function should interpret token content to select the group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	f4579795e6	replica: move compound SSTable set to compaction group The group is now responsible for providing the compound set. table still has one compound set, which will span all groups for the cases we want to ignore the group isolation. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	6717d96684	replica: move maintenance SSTable set to compaction_group This commit is restricted to moving maintenance set into compaction_group. Next, we'll introduce compound set into it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	ce8e5f354c	replica: move main SSTable set to compaction_group This commit is restricted to moving main set into compaction_group. Next, we'll move maintenance set into it and finally the memtable. A method is introduced to figure out which group a sstable belongs to, but it's still unimplemented as table is still limited to a single group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	4871f1c97c	replica: Introduce compaction_group Compaction group is a new abstraction used to group SSTables that are eligible to be compacted together. By this definition, a table in a given shard has a single compaction group. The problem with this approach is that data from different vnodes is intermixed in the same sstable, making it hard to move data in a given sstable around. Therefore, we'll want to have multiple groups per table. A group can be thought of as an isolated LSM tree where its memtable and sstable files are isolated from other groups. As for the implementation, the idea is to take a very incremental approach. In this commit, we're introducing a single compaction group to table. Next, we'll migrate sstable and maintenance set from table into that single compaction group. And finally, the memtable. Cache will be shared among the groups, for simplicity. It works due to its ability to invalidate a subset of the token range. There will be 1:1 relationship between compaction_group and table_state. We can later rename table_state to compaction_group_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	a6ecadf3de	replica: convert table::stop() into coroutine await_pending_ops() is today marked noexcept, so doesn't have to be implemented with finally() semantics. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	44913ebbd0	compaction_manager: restore indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	888660fa44	compaction_manager: Make remove() and stop_ongoing_compactions() noexcept stop_ongoing_compactions() is made noexcept too as it's called from remove() and we want to make the latter noexcept, to allow compaction group to qualify its stop function as noexcept too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	65414e6756	test: sstable_compaction_test: Don't reference main sstable set directly Preparatory change for main sstable set to be moved into compaction group. After that, tests can no longer directly access the main set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	dfa7273127	test: sstable_utils: Set data size fields for fake SSTable So methods that look at data size and require it to be higher than 0 will work on fake SSTables created using set_values(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	4fa8159a13	test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable column_family_test::add_sstable will soon be changed to run in a thread, and it's not needed in this procedure, so let's remove its usage. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Jadw1	ba461aca8b	cql-pytest: more neutral command in cql_test_connection fixture I found `use system` to not be neutral enough (e.g. in case of testing describe statement). `BEGIN BATCH APPLY BATCH` sounds better. Closes #11504	2022-09-11 18:49:06 +03:00
Nadav Har'El	d71098a3b8	Update tools/java submodule * tools/java b7a0c5bd31...b004da9d1b (1): > Revert "dist/debian:add python3 as dependency"	2022-09-11 17:45:43 +03:00
Pavel Emelyanov	bbad3eac63	pylib: Cast port number config to int explicitly Otherwise it crashes some python versions. The cast was there before `a2dd64f68f` explicitly dropped one while moving the code between files. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11511	2022-09-09 18:08:08 +02:00
Kamil Braun	be1ef9d2a7	gms: feature_service: remove the USES_RAFT feature It was not and won't be used for anything. Note that the feature was always disabled or masked so no node ever announced it, thus it's safe to get rid of. Closes #11505	2022-09-09 18:05:46 +02:00
Michał Chojnowski	47844689d8	token_metadata: make local_dc_filter a lambda, not a std::function This std::function causes allocations, both on construction and in other operations. This costs ~2200 instructions for a DC-local query. Fix that. Closes #11494	2022-09-09 18:05:46 +02:00
Michał Chojnowski	af7ace3926	utils: config_file: fix handling of workdir,W in the YAML file Option names given in db/config.cc are handled for the command line by passing them to boost::program_options, and by YAML by comparing them with YAML keys. boost::program_options has logic for understanding the long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W worked, as intended. But our YAML config parsing doesn't have this logic and expected "workdir,W" verbatim, which is obviously not intended. Fix that. Fixes #7478 Fixes #9500 Fixes #11503 Closes #11506	2022-09-09 18:05:46 +02:00
Kamil Braun	dba595d347	Merge 'Minimal implementation of Broadcast Tables' from Mikołaj Grzebieluch Broadcast tables are tables for which all statements are strongly consistent (linearizable), replicated to every node in the cluster and available as long as a majority of the cluster is available. If a user wants to store a “small” volume of metadata that is not modified “too often” but provides high resiliency against failures and strong consistency of operations, they can use broadcast tables. The main goal of the broadcast tables project is to solve problems which need to be solved when we eventually implement general-purpose strongly consistent tables: designing the data structure for the Raft command, ensuring that the commands are idempotent, handling snapshots correctly, and so on. In this MVP (Minimum Viable Product), statements are limited to simple SELECT and UPDATE operations on the built-in table. In the future, other statements and data types will be available but with this PR we can already work on features like idempotent commands or snapshotting. Snapshotting is not handled yet which means that restarting a node or performing too many operations (which would cause a snapshot to be created) will give incorrect results. In a follow-up, we plan to add end-to-end Jepsen tests (https://jepsen.io/). With this PR we can already simulate operations on lists and test linearizability in linear complexity. This can also test Scylla's implementation of persistent storage, failure detector, RPC, etc. Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing Closes #11164 * github.com:scylladb/scylladb: raft: broadcast_tables: add broadcast_kv_store test raft: broadcast_tables: add returning query result raft: broadcast_tables: add execution of intermediate language raft: broadcast_tables: add compilation of cql to intermediate language raft: broadcast_tables: add definition of intermediate language db: system_keyspace: add broadcast_kv_store table db: config: add BROADCAST_TABLES feature flag	2022-09-09 18:05:37 +02:00
Aleksandra Martyniuk	55cd8fe3bf	test: task manager api test Test of a task manager api.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	ec86410094	task_manager: test api layer implementation The implementation of a test api that helps testing task manager api. It provides methods to simulate the operations that can happen on modules and their tasks. Through the api user can: register and unregister the test module and the tasks belonging to the module, and finish the tasks with success or custom error.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	b1fa6e49af	task_manager: add test specific classes Add test_module and test_task classes inheriting from respectively task_manager::module and task_manager::task::impl that serve task manager testing.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	42f36db55b	task_manager: test api layer The test api that helps testing task manager api. It can be used to simulate the operations that can happen on modules and their tasks. Through the api user can: register and unregister the test module and the tasks belonging to the module, and finish the tasks with success or custom error.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	c9637705a6	task_manager: api layer implementation The implementation of a task manager api layer. It provides methods to list the modules registered in task_manager, list tasks belonging to the given module, abort, wait for or retrieve a status of the given task.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	07043cee68	task_manager: api layer The task manager api layer. It can be used to list the modules registered in task_manager, list tasks belonging to the given module, abort, wait for or retrieve a status of the given task.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	b87a0a74ab	task_manager: keep task_manager reference in http_context Keep a reference to sharded<task_manager> as a member of http_context so it can be reached from rest api.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	9e68c8d445	start sharded task manager Sharded task manager object is started in main.cc.	2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk	2439e55974	task_manager: create task manager object Implementation of a task manager that allows tracking and managing asynchronous tasks. The tasks are represented by task_manager::task class providing members common to all types of tasks. The methods that differ among tasks of different modules can be overridden in a class inheriting from task_manager::task::impl class. Each task stores its status containing parameters like id, sequence number, begin and end time, state etc. After the task finishes, it is kept in memory for configurable time or until it is unregistered. Tasks need to be created with make_task method. Each module is represented by task_manager::module type and should have access to task manager through task_manager::module methods. That makes it easy to separate and collectively manage data belonging to each module.	2022-09-09 14:29:28 +02:00
Pavel Emelyanov	24d68e1995	messaging_service: Simplify remove_rpc_client_one() Make it void, as after the previous patch no code is interested in this value Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-09 12:50:41 +03:00
Pavel Emelyanov	ca92ed65e5	messaging_service: Notify connection drop when connection is removed There are two methods to close an RPC socket in m.s. -- one that's called on error path of messaging_service::send_... and the other one that's called upon gossiper down/leave/cql-off notifications. The former one notifies listeners about connection drop, the latter one doesn't. The only listener is the storage-proxy which, in turn, kicks database to release per-table cache hitrate data. Said that, when a node goes down (or when an operator shuts down its transport) the hit-rate stats regarding this node are leaked. This patch moves notification so that any socket drop calls notification and thus releases the hitrates. fixes: #11497 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-09 12:47:38 +03:00
Kamil Braun	0efdc45d59	Merge 'test.py: remove top level conftest and improve logging' from Alecco - To isolate the different pytest suites, remove the top level conftest and move needed contents to existing `test/pylib/cql_repl/conftest.py` and `test/topology/conftest.py`. - Add logging to CQL and Python suites. - Log driver version for CQL and topology tests. Closes #11482 * github.com:scylladb/scylladb: test.py: enable log capture for Python suite test.py: log driver name/version for cql/topology test.py: remove top level conftest.py	2022-09-08 16:25:24 +02:00
Anna Stuchlik	54d6d8b8cc	doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst	2022-09-08 15:38:11 +02:00
Anna Stuchlik	6ccc838740	doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file	2022-09-08 15:38:11 +02:00
Anna Stuchlik	22317f8085	doc: update the Enterprise guide to include the Enterprise-only image file	2022-09-08 15:38:11 +02:00
Anna Stuchlik	593f987bb2	doc: update the image files	2022-09-08 15:38:10 +02:00
Anna Stuchlik	42224dd129	doc: split the upgrade-image file to separate files for Open Source and Enterprise	2022-09-08 15:38:10 +02:00
Anna Stuchlik	64a527e1d3	doc: clarify the alternative upgrade procedures for the ScyllaDB image	2022-09-08 15:38:10 +02:00
Anna Stuchlik	5136d7e6d7	doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z	2022-09-08 15:38:10 +02:00
Anna Stuchlik	f1ef6a181e	doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z	2022-09-08 15:38:10 +02:00
Mikołaj Grzebieluch	eb610c45fe	raft: broadcast_tables: add broadcast_kv_store test Test queries scylla with following statements: * SELECT value FROM system.broadcast_kv_store WHERE key = CONST; * UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST; * UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST IF value = CONST; where CONST is string randomly chosen from small set of random strings and half of conditional updates has condition with comparison to last written value.	2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch	803115d061	raft: broadcast_tables: add returning query result Intermediate language added new layer of abstraction between cql statement and querying mutations, thus this commit adds new layer of abstraction between mutations and returning query result. Result can't be directly returned from `group0_state_machine::apply`, so we decided to hold query results in map inside `raft_group0_client`. It can be safely read after `add_entry_unguarded`, because this method waits for applying raft command. After translating result to `result_message` or in case of exception, map entry is erased.	2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch	db88525774	raft: broadcast_tables: add execution of intermediate language Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`. Added `add_entry_unguarded` method in `raft_group0_client` for dispatching raft commands without `group0_guard`. Queries on group0_kv_store are executed in `group_0_state_machine::apply`, but for now don't return results. They don't use previous state id, so they will block concurrent schema changes, but these changes won't block queries. In this version snapshots are ignored.	2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch	82df8a9905	raft: broadcast_tables: add compilation of cql to intermediate language We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement` and `strongly_consistent_select_statement`. Statements operating on system.broadcast_kv_store will be compiled to these new subclasses if BROADCAST_TABLES flag is enabled. If the query is executed on a shard other than 0 it's bounced to that shard.	2022-09-08 15:25:36 +02:00
Wojciech Mitros	7effd4c53a	wasm: directly handle recycling of invalidated instance An instance may be invalidated before we try to recycle it. We perform this by setting its value to a nullopt. This patch adds a check for it when calculating its size. This behavior didn't cause issues before because the catch clause below caught errors caused by calling value() on a nullopt, even though it was intended for errors from get_instance_size. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #11500	2022-09-08 15:39:28 +03:00
Geoffrey Beausire	f435276d2e	Merge tokens for everywhere_topology With EverywhereStrategy, we know that all tokens will be on the same node and the data is typically sparse like LocalStrategy. Result of testing the feature: Cluster: 2 DC, 2 nodes in each DC, 256 tokens per nodes, 14 shards per node Before: 154 scanning operations After: 14 scanning operations (~10x improvement) On bigger cluster, it will probably be even more efficient. Closes #11403	2022-09-08 15:33:23 +03:00
Mikołaj Grzebieluch	c541d5c363	raft: broadcast_tables: add definition of intermediate language In broadcast tables, raft command contains a whole program to be executed. Sending and parsing on each node entire CQL statement is inefficient, thus we decided to compile it to an intermediate language which can be easily serializable. This patch adds a definition of such a language. For now, only the following types of statements can be compiled: * select value where key = CONST from system.broadcast_kv_store; * update system.broadcast_kv_store set value = CONST where key = CONST; * update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST; where CONST is string literal.	2022-09-08 14:03:51 +02:00
Michał Chojnowski	0c54e7c5c7	sstables: index_reader: remove a stray vsprintf call from the hot path sstable::get_filename() constructs the filename from components, which takes some work. It happens to be called on every index_reader::index_reader() call even though it's only used for TRACE logs. That's 1700 instructions (~1% of a full query) wasted on every SSTable read. Fix that. Closes #11485	2022-09-08 14:29:23 +03:00
Michał Chojnowski	c61b901828	utils: logalloc: prefer memory::free_memory() to memory::stats().free_memory The former is a shortcut that does not involve a copy of all stats. This saves some instructions in the hot path. Closes #11495	2022-09-08 14:12:20 +03:00
Botond Dénes	438aaf0b85	Merge 'Deglobalize repair history maps' from Benny Halevy Change `a8ad385ecd` introduced ``` thread_local std::unordered_map<utils::UUID, seastar::lw_shared_ptr<repair_history_map>> repair_history_maps; ``` We're trying to avoid global scoped variables as much as we can so this should probably be embedded in some sharded service. This series moves the thread-local `repair_history_maps` instances to `compaction_manager` and passes a reference to the shard compaction_manager to functions that need it for compact_for_query and compact_for_compaction. Since some paths don't need it and don't have access to the compaction_manager, the series introduced `utils::optional_reference<T>` that allows to pass nullopt. In this case, `get_gc_before_for_key` behaves in `tombstone_gc_mode::repair` as if the table wasn't repaired and tombstones are not garbage-collected. Fixes #11208 Closes #11366 * github.com:scylladb/scylladb: tombstone_gc: deglobalize repair_history_maps mutation_compactor: pass tombstone_gc_state to compact_mutation_state mutation_partition: compact_for_compaction_v2: get tombstone_gc_state mutation_partition: compact_for_compaction: get tombstone_gc_state mutation_readers: pass tombstone_gc_state to compating_reader sstables: get_gc_before_*: get tombstone_gc_state from caller compaction: table_state: add virtual get_tombstone_gc_state method db: view: get_tombstone_gc_state from compaction_manager db: view: pass base table to view_update_builder repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager replica: database: get_tombstone_gc_state from compaction_manager compaction_manager: add tombstone_gc_state replica: table: add get_compaction_manager function tombstone_gc: introduce tombstone_gc_state repair_service: simplify update_repair_time error handling tombstone_gc: update_repair_time: get table_id rather than schema_ptr tombstone_gc: delete unused forward declaration database: do not drop_repair_history_map_for_table in detach_column_family	2022-09-08 14:08:38 +03:00
Botond Dénes	9d1cc5e616	Merge 'doc: update the OS support for versions 2022.1 and 2022.2' from Anna Stuchlik The scope of this PR: - Removing support for Ubuntu 16.04 and Debian 9. - Adding support for Debian 11. Closes #11461 * github.com:scylladb/scylladb: doc: remove support for Debian 9 from versions 2022.1 and 2022.2 doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2 doc: add support for Debian 11 to versions 2022.1 and 2022.2	2022-09-08 13:27:47 +03:00
Anna Stuchlik	0dee507c48	doc: fix the upgrade version in the upgrade guide for RHEL and CentOS Closes #11477	2022-09-08 13:26:49 +03:00
Alejo Sanchez	4190c61dbf	test.py: enable log capture for Python suite Enable pytest log capture for Python suite. This will help debugging issues in remote machines. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-08 11:37:32 +02:00
Alejo Sanchez	c6a048827a	test.py: log driver name/version for cql/topology Log the python driver name and version to help debugging on third party machines. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-08 11:37:32 +02:00
Alejo Sanchez	a2dd64f68f	test.py: remove top level conftest.py Remove top level conftest so different suites have their own (as it was before). Move minimal functionality into existing test/pylib/cql_repl/conftest.py so cql tests can run on their own. Move param setting into test/topology/conftest.py. Use uuid module for unique keyspace name for cql tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-09-08 11:37:32 +02:00
Alejo Sanchez	d892d194fb	test.py: remove spurious after test check Before/after test checks are done per test case, there's no longer need to check after pytest finishes. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11489	2022-09-08 11:33:37 +02:00
Felipe Mendes	7ccf8ed221	cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr As check_restricted_table_properties() is invoked both within CREATE TABLE and ALTER TABLE CQL statements, we currently have no way to determine whether the operation was either a CREATE or ALTER. In many situations, it is important to be able to distinguish among both operations, such as - for example - whether a table already has a particular property set or if we are defining it within the statement. This patch simply adds a std::optional<schema_ptr> to check_restricted_table_properties() and updates its caller. Whenever a CREATE TABLE statement is issued, the method is called as a std::nullopt, whereas if an ALTER TABLE is issued instead, we call it with a schema_ptr.	2022-09-07 21:27:32 -03:00
Kamil Braun	ff4430d8ea	test: topology: make imports friendlier for tools (such as `mypy`) When importing from `pylib`, don't modify `sys.path` but use the fact that both `test/` and `test/pylib/` directories contain an `__init__.py` file, so `test.pylib` is a valid module if we start with `test/` as the Python package root. Both `pytest` and `mypy` (and I guess other tools) understand this setup. Also add an `__init__.py` to `test/topology/` so other modules under the `test/` directory will be able to import stuff from `test/topology/` (i.e. from `test.topology.X import Y`). Closes #11467	2022-09-07 23:52:50 +03:00
Karol Baryła	1c2eef384d	transport/server.cc: Return correct size of decompressed lz4 buffer An incorrect size is returned from the function, which could lead to crashes or undefined behavior. Fix by erroring out in these cases. Fixes #11476	2022-09-07 10:58:23 +03:00
Nadav Har'El	e5f6adf46c	test/alternator: improve tests for DescribeTable for indexes I created new issues for each missing field in DescribeTable's response for GSIs and LSIs, so in this patch we edit the xfail messages in the test to refer to these issues. Additionally, we only had a test for these fields for GSIs, so this patch also adds a similar test for LSIs. It turns out there is a difference between the two tests - the two fields IndexStatus and ProvisionedThroughput are returned for GSIs, but not for LSIs. Refs #7750 Refs #11466 Refs #11470 Refs #11471 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11473	2022-09-07 09:50:16 +02:00
Benny Halevy	e9cfe9e572	tombstone_gc: deglobalize repair_history_maps Move the thread-local instances of the per-table repair history maps into compaction_manager. Fixes #11208 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	8b38893895	mutation_compactor: pass tombstone_gc_state to compact_mutation_state Used in get_gc_before. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	d86810d22c	mutation_partition: compact_for_compaction_v2: get tombstone_gc_state To be passed down to compact_mutation_state in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	0627667a06	mutation_partition: compact_for_compaction: get tombstone_gc_state And pass down to `do_compact`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	7e4612d3aa	mutation_readers: pass tombstone_gc_state to compating_reader To be passed further down to `compact_mutation_state` in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:14 +03:00
Benny Halevy	572d534d0d	sstables: get_gc_before_: get tombstone_gc_state from caller Pass the tombstone_gc_state from the compaction_strategy to sstables get_gc_before_ functions using the table state to get to the tombstone_gc_state. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:05:39 +03:00
Benny Halevy	2cd3fc2f36	compaction: table_state: add virtual get_tombstone_gc_state method and override it in table::table_state to get the tombstone_gc_state from the table's compaction_manager. It is going to be used in the next patches to pass the gc state from the compaction_strategy down to sstables and compaction. table_state_for_test was modified to just keep a null tombstone_gc_state. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:05:39 +03:00
Benny Halevy	6fb4b5555d	db: view: get_tombstone_gc_state from compaction_manager Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:05:39 +03:00
Benny Halevy	71ede6124a	db: view: pass base table to view_update_builder To be used by generate_update() for getting the tombstone_gc_state via the table's compaction_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:04:23 +03:00
Benny Halevy	6a11c410fd	repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:04:16 +03:00
Benny Halevy	3b0147390b	replica: database: get_tombstone_gc_state from compaction_manager Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Benny Halevy	8b841e1207	compaction_manager: add tombstone_gc_state Add a tombstone_gc_state member and methods to get it. Currently the tombstone_gc_state is default constructed, but a following patch will move the thread-local repair history maps into the compaction_manager as a member and then the _tombstone_gc_state member will be initialized from that member. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Benny Halevy	1ce50439af	replica: table: add get_compaction_manager function so to let a view get the tombstone_gc_state via the compaction_manager of the base table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Benny Halevy	5dd15aa3c8	tombstone_gc: introduce tombstone_gc_state and use it to access the repair history maps. At this introductory patch, we use default-constructed tombstone_gc_state to access the thread-local maps temporarily and those use sites will be replaced in following patches that will gradually pass the tombstone_gc_state down from the compaction_manager to where it's used. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Benny Halevy	b2b211568e	repair_service: simplify update_repair_time error handling There's no need for per-shard try/catch here. Just catch exceptions from the overall sharded operation to update_repair_time. Also, update warning to indicate that only updating the repair history time failed, not "Loading repair history". Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 22:43:08 +03:00
Benny Halevy	7d13811297	tombstone_gc: update_repair_time: get table_id rather than schema_ptr The function doesn't need access to the whole schema. The table_id is just enough to get by. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 22:43:08 +03:00
Benny Halevy	442d43181c	tombstone_gc: delete unused forward declaration Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 22:43:08 +03:00
Benny Halevy	3d88fe9729	database: do not drop_repair_history_map_for_table in detach_column_family drop_repair_history_map_for_table is called on each shard when database::truncate is done, and the table is stopped. dropping it before the table is stopped is too early. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 22:43:08 +03:00
Nadav Har'El	ee606a5d52	Merge 'doc: fix the CQL version in the Interfaces table' from Anna Stuchlik Fix https://github.com/scylladb/scylla-doc-issues/issues/816 Fix https://github.com/scylladb/scylla-docs/issues/1613 This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487). To make sure the same CQL version is used across the docs, we should use the `\|cql-version\|` variable rather than hardcode the version number on several pages. The variable is specified in the conf.py file: ``` rst_prolog = """ .. \|cql-version\| replace:: 3.3.1 """ ``` Closes #11320 * github.com:scylladb/scylladb: doc: add the Cassandra version on which the tools are based doc: fix the version number doc: update the Enterprise version where the ME format was introduced doc: add the ME format to the Cassandra Compatibility page doc: replace Scylla with ScyllaDB doc: rewrite the Interfaces table to the new format to include more information about CQL support doc: remove the CQL version from pages other than Cassandra compatibility doc: fix the CQL version in the Interfaces table	2022-09-06 19:02:44 +03:00
Asias He	792a91b1fa	storage_service: Drop ignore dead nodes option for bootstrap and decommission in log The ignore dead node options are not really supported at the moment. Drop it in the log to reduce confusion. Closes #11426	2022-09-06 18:21:21 +03:00
Anna Stuchlik	4c7aa5181e	doc: remove support for Debian 9 from versions 2022.1 and 2022.2	2022-09-06 14:04:22 +02:00
Anna Stuchlik	dfc7203139	doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2	2022-09-06 14:01:35 +02:00
Anna Stuchlik	dd4979ffa8	doc: add support for Debian 11 to versions 2022.1 and 2022.2	2022-09-06 13:54:08 +02:00
Pavel Emelyanov	398e9f8593	network_topology_strategy_test: Use topology instead of snitch Most of the test's cases use rack-inferring snitch driver and get DC/RACK from it via the test_dc_rack() helper. The helper was introduced in one of the previous sets to populate token metadata with some DC/RACK as normal tokens manipulations required respective endpoint in topology. This patch removes the usage of global snitch and replaces it with the pre-populated topology. The pre-population is done in rack-inferring snitch like manner, since token_metadata still uses global snitch and the locations from snitch and this temporary topology should match. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-06 12:26:30 +03:00
Pavel Emelyanov	d8b2940cd8	network_topology_strategy_test: Populate explicit topology There's a test case that makes its own snitch driver that generates pre-calculated DC/RACK data for test endpoints. This patch replaces this custom snitch driver with a standalone topology object. Note: to get DC/RACK info from this topo the get_location() is used since the get_rack()/get_datacenter() are still wrappers around global snitch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-06 12:24:39 +03:00
Botond Dénes	b89b84ad3c	compaction: scrub/abort: be more verbose Currently abort-mode scrub exits with a message which basically says "some problem was found", with no details on what problem it found. Add a detailed error report on the found problem before aborting the scrub. Closes #11418	2022-09-06 11:42:34 +03:00
Avi Kivity	3dc39474ec	Merge 'tools/scylla-types: add tokenof and shardof actions' from Botond Dénes `tokenof` calculates and prints the token of a partition-key. `shardof` calculates the token and finds the owner shard of a partition-key. The number of shards has to be provided by the `--shards` parameter. Ignore msb bits param can be tweaked with the `--ignore-msb-bits` parameter, which defaults to 12. Examples: ``` $ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196 (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888 $ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196 (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1 ``` Closes #11436 * github.com:scylladb/scylladb: tools/scylla-types: add shardof action tools/scylla-types: pass variable_map to action handlers tools/scylla-types: add tokenof action tools/scylla-types: extract printing code into functions	2022-09-06 11:25:54 +03:00
Pavel Emelyanov	42c9f35374	topology: Mark compare_endpoints() arguments as const Continuation to `debfcc0e` (snitch: Move sort_by_proximity() to topology). The passed addresses are not modified by the helper. They are not yet const because the method was copy-n-pasted from snitch where it wasn't such. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220906074708.29574-1-xemul@scylladb.com>	2022-09-06 11:03:13 +03:00
Yaron Kaikov	4459cecfd6	Docs: fix wrong manifest file for enterprise releases In https://docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image.html, the manifest file location points to the wrong filename for enterprise. Fixing. Closes #11446	2022-09-06 06:28:16 +03:00
Avi Kivity	ae4b2ee583	locator: token_metadata: drop unused and dangerous accessors The mutable get_datacenter_endpoints() and get_datacenter_racks() are dangerous since they expose internal members without enforcing class invariants. Fortunately they are unused, so delete them. Closes #11454	2022-09-06 06:08:02 +03:00
Avi Kivity	3f8cb608c3	Merge "Move auxiliary topology sorters from snitch" from Pavel E " There are two helpers on snitch that manipulate lists of nodes taking their dc/rack into account. This set moves these methods from snitch to topology and storage proxy. " * 'br-snitch-move-proximity-sorters' of https://github.com/xemul/scylla: snitch: Move sort_by_proximity() to topology topology: Add "enable proximity sorting" bit code: Call sort_endpoints_by_proximity() via topology snitch, code: Remove get_sorted_list_by_proximity() snitch: Move is_worth_merging_for_range_query to proxy	2022-09-05 17:25:08 +03:00
Anna Stuchlik	b0ebf0902c	doc: add the Cassandra version on which the tools are based	2022-09-05 14:45:15 +02:00
Pavel Emelyanov	debfcc0eff	snitch: Move sort_by_proximity() to topology Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-05 15:17:04 +03:00
Pavel Emelyanov	41973c5bf7	topology: Add "enable proximity sorting" bit There's one corner case in nodes sorting by snitch. The simple snitch code overloads the call and doesn't sort anything. The same behavior should be preserved by (future) topology implementation, but it doesn't know the snitch name. To address that the patch adds a boolean switch on topology that's turned off by main code when it sees the snitch is "simple" one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-05 15:15:07 +03:00
Pavel Emelyanov	b6fdea9a79	code: Call sort_endpoints_by_proximity() via topology The method is about to be moved from snitch to topology, this patch prepares the rest of the code to use the latter to call it. The topology's method just calls snitch, but it's going to change in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-05 15:14:01 +03:00
Pavel Emelyanov	4184091f1c	snitch, code: Remove get_sorted_list_by_proximity() There are two sorting methods in snitch -- one sorts the list of addresses in place, the other one creates a sorted copy of the passed const list (in fact -- the passed reference is not const, but it's not modified by the method). However, both callers of the latter anyway create their own temporary list of address, so they don't really benefit from snitch generating another copy. So this patch leaves just one sorting method -- the in-place one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-05 15:11:37 +03:00
Pavel Emelyanov	642e50f3e3	snitch: Move is_worth_merging_for_range_query to proxy Proxy is the only place that calls this method. Also the method name suggests it's not something "generic", but rather an internal logic of proxy's query processing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-05 15:10:46 +03:00
Anna Stuchlik	7294fce065	doc: simplify the information about default formats in different versions	2022-09-05 11:36:24 +02:00
Avi Kivity	3ef8d616f6	Merge 'Fix wrong commit on scylla_raid_setup: prevent mount failed for /var/lib/scylla(#11399)' from Takuya ASADA On #11399, I mistakenly committed the bug fix for the first patch (`40134ef`) to the second one (`8835a34`). So the script is broken with only `40134ef` applied, which is a problem if we backport it to older versions. Let's revert the commits and combine them into a single commit. Closes #11448 * github.com:scylladb/scylladb: scylla_raid_setup: prevent mount failed for /var/lib/scylla Revert "scylla_raid_setup: check uuid and device path are valid" Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla"	2022-09-05 12:16:10 +03:00
Mikołaj Grzebieluch	726658f073	db: system_keyspace: add broadcast_kv_store table First implementation of strongly consistent everywhere tables operates on simple table representing string to string map. Add hard-coded schema for broadcast_kv_store table (key text primary key, value text). This table is under system keyspace and is created if and only if BROADCAST_TABLES feature is enabled.	2022-09-05 11:11:08 +02:00
Mikołaj Grzebieluch	5b1421cc33	db: config: add BROADCAST_TABLES feature flag Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature. This feature requires raft group0, thus enabling it without RAFT will cause an error.	2022-09-05 11:11:08 +02:00
Avi Kivity	e3cdc8c4d3	Update tools/java submodule (python3 dependency) * tools/java 6995a83cc1...b7a0c5bd31 (1): > dist/debian:add python3 as dependency	2022-09-05 12:08:24 +03:00
Takuya ASADA	d676c22f09	scylla_raid_setup: prevent mount failed for /var/lib/scylla Just like `4a8ed4c`, we also need to wait for udev event completion to create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the disk just after formatting. Also added code to make sure the uuid and the uuid-based device path are valid. Fixes #11359 Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2022-09-05 17:52:49 +09:00
Takuya ASADA	ede7da366b	Revert "scylla_raid_setup: check uuid and device path are valid" This reverts commit `40134efee4`.	2022-09-05 17:52:42 +09:00
Takuya ASADA	841c686301	Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla" This reverts commit `8835a34ab6`.	2022-09-05 17:52:41 +09:00
Piotr Sarna	2379a25ade	alternator: propagate authenticated user in client state From now on, when an alternator user correctly passed an authentication step, their assigned client_state will have that information, which also means proper access to service level configuration. Previously the username was only used in tracing.	2022-09-05 10:43:29 +02:00
Anna Stuchlik	39e6002fc8	doc: fix the version number	2022-09-05 10:04:34 +02:00
Piotr Sarna	66f7ab666f	client_state: add internal constructor with auth_service The constructor can be used as backdoor from frontends other than CQL to create a session with an authenticated user, with access to its attached service level information.	2022-09-05 10:03:00 +02:00
Piotr Sarna	9511c21686	alternator: pass auth_service and sl_controller to server It's going to be needed to recreate a client state for an authenticated user.	2022-09-05 10:03:00 +02:00
Anna Stuchlik	81f32899d0	doc: update the Enterprise version where the ME format was introduced	2022-09-05 10:02:57 +02:00
Botond Dénes	f8b38cbe09	Merge 'doc: add support for Ubuntu 22.04 in ScyllaDB Enterprise' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11430 @tzach I've added support for Ubuntu 22.04 to the row for version 2022.2. Does that version support Debian 11? That information is also missing (it was only added to OSS 5.0 and 5.1). Closes #11437 * github.com:scylladb/scylladb: doc: add support for Ubuntu 22.04 to the Enterprise table doc: rename the columns in the Enterprise section to be in sync with the OSS section	2022-09-05 06:42:55 +03:00
Anna Stuchlik	41b91e3632	doc: fix the architecture type on the upgrade page Closes #11438	2022-09-05 06:30:51 +03:00
Botond Dénes	21ef0c64f1	tools/scylla-types: add shardof action Decorates a partition key and calculates which shard it belongs to, given the shard count (--shards) and the ignore msb bits (--ignore-msb-bits) parameters. The latter is optional and is defaulted to 12. Example: $ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196 (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1	2022-09-05 06:22:57 +03:00
Botond Dénes	4333d33f01	tools/scylla-types: pass variable_map to action handlers Allowing them to get the value of extra command line parameters.	2022-09-05 06:22:55 +03:00
Botond Dénes	58d4f22679	tools/scylla-types: add tokenof action Calculate and print the token of a partition-key. Example: $ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196 (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888	2022-09-05 06:20:10 +03:00
Botond Dénes	be9d1c4df4	sstables: crawling mx-reader: make on_out_of_clustering_range() no-op Said method currently emits a partition-end. This method is only called when the last fragment in the stream is a range tombstone change with a position after all clustered rows. The problem is that consume_partition_end() is also called unconditionally, resulting in two partition-end fragments being emitted. The fix is simple: make this method a no-op, there is nothing to do there. Also add two tests: one targeted to this bug and another one testing the crawling reader with random mutations generated for random schema. Fixes: #11421 Closes #11422	2022-09-04 20:02:50 +03:00
Botond Dénes	3e69fe0fe7	scylla-gdb.py: scylla repairs: print only address of repair_meta Instead of the entire object. Repair meta is a large object, its printout floods the output of the command. Print only its address, the user can print the objects it is interested in. Closes #11428	2022-09-04 19:58:42 +03:00
Yaron Kaikov	9f9ee8a812	build_docker.sh: Build docker based on Ubuntu:22.04 Ubuntu 20.04 has less than 3 years of OS support remaining. We should switch to Ubuntu 22.04 to reduce the need for OS upgrades in newly installed clusters. Closes #11440	2022-09-04 14:00:27 +03:00
Avi Kivity	61769d3b21	Merge "Make messaging service use topology for DC/RACK" from Pavel E " Messaging needs to know DC/RACK for nodes to decide whether it needs to do encryption or compression depending on the options. As all the other services did, it still uses snitch to get it, but a simple switch to topology needs extra care. The thing is that messaging can use internal IP instead of endpoints. Currently it's snitch who tries har^w somehow to resolve this, in particular -- if the DC/RACK is not found for the given argument it assumes that it might be internal IP and calls back messaging to convert it to the endpoint. However, messaging does know when it uses which address and can do this conversion itself. So this set eliminates a few more global snitch usages and drops the knot tying snitch, gossiper and messaging with each other. " * 'br-messaging-use-topology-1.2' of https://github.com/xemul/scylla: messaging: Get DC/RACK from topology messaging, topology: Keep shared_token_metadata* on messaging messaging: Add is_same_{dc\|rack} helpers snitch, messaging: Don't relookup dc/rack on internal IP	2022-09-04 13:54:34 +03:00
Pavel Emelyanov	6dedc69608	topology: Do not add bootstrapping nodes to topology Recent change in topology (commit `4cbe6ee9` titled "topology: Require entry in the map for update_normal_tokens()") made token_metadata::update_normal_tokens() require the entry's presence in the embedded topology object. Accordingly, the commit in question equipped most callers of update_normal_tokens() with a preceding topology update call to satisfy the requirement. However, tokens are put into token_metadata not only for normal state, but also for bootstrapping, and one place that added bootstrapping tokens erroneously got a topology update. This is wrong -- a node must not be present in the topology until switching into normal state. As a result several tests with bootstrapping nodes started to fail. The fix removes the topology update for bootstrapping nodes, but this change reveals a few other places that piggy-backed this mistaken update, so now _they_ need to update topology themselves. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/ update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc repair_based_node_operations_test.py::test_lcs_reshape_efficiency Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220902082753.17827-1-xemul@scylladb.com>	2022-09-04 13:53:38 +03:00
Avi Kivity	16a3e55aa1	Update seastar submodule * seastar f2d70c4a17...2b2f6c080e (4): > perftune.py: special case a former 'MQ' mode in the new auto-detection code > iostream: Generalize flush and batched flush > Merge "Equip sharded<>::invoke_on_all with unwrap_sharded_args" from Pavel E > Merge "perftune.py: cosmetic fixes" from VladZ Closes #11434	2022-09-04 10:19:48 +03:00
Anna Stuchlik	5d09e1a912	doc: add the ME format to the Cassandra Compatibility page	2022-09-02 15:12:30 +02:00
Anna Stuchlik	dfb7a221db	doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format	2022-09-02 14:55:11 +02:00
Anna Stuchlik	f1184e1470	doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page	2022-09-02 14:48:58 +02:00
Anna Stuchlik	177a1d4396	doc: fix additional information regarding the ME format on the SStable 3.x page	2022-09-02 14:41:55 +02:00
Anna Stuchlik	b3eacdca75	doc: add the ME format to the table	2022-09-02 14:26:16 +02:00
Anna Stuchlik	af4d1b80d8	doc: add support for Ubuntu 22.04 to the Enterprise table	2022-09-02 12:43:04 +02:00
Anna Stuchlik	947f8769f4	doc: rename the columns in the Enterprise section to be in sync with the OSS section	2022-09-02 12:31:57 +02:00
Pavel Emelyanov	f0580aedaf	messaging: Get DC/RACK from topology Now everything is prepared for that Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-02 11:34:57 +03:00
Botond Dénes	be70fcf587	tools/scylla-types: extract printing code into functions To make the individual overloads on the exact type usable on their own.	2022-09-02 07:46:18 +03:00
Botond Dénes	2c46c24608	Merge 'doc: change the tool names to "Scylla SStable" and "Scylla Types"' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11393 - Rename the tool names across the docs. - Update the examples to replace `scylla-sstable` and `scylla-types` with `scylla sstable` and `scylla types`, respectively. Closes #11432 * github.com:scylladb/scylladb: doc: update the tool names in the toctree and reference pages doc: rename the scylla-types tool as Scylla Types doc: rename the scylla-sstable tool as Scylla SStable	2022-09-01 16:32:18 +03:00
Anna Stuchlik	18da200669	doc: update the tool names in the toctree and reference pages	2022-09-01 15:09:12 +02:00
Anna Stuchlik	c255399f27	doc: rename the scylla-types tool as Scylla Types	2022-09-01 15:05:44 +02:00
Anna Stuchlik	d0cb24feaa	doc: rename the scylla-sstable tool as Scylla SStable	2022-09-01 14:45:19 +02:00
Anna Stuchlik	1834d5d121	add a comment to remove the information when the documentation is versioned (in 5.1)	2022-09-01 12:57:15 +02:00
Anna Stuchlik	476107912c	doc: replace Scylla with ScyllaDB	2022-09-01 12:52:58 +02:00
Anna Stuchlik	8aae8a3cef	doc: fix the formatting and language in the updated section	2022-09-01 12:50:04 +02:00
Anna Stuchlik	ff4ae879cb	doc: fix the default SStable format	2022-09-01 12:47:11 +02:00
Pavel Emelyanov	e147681d85	messaging, topology: Keep shared_token_metadata* on messaging Messaging will need to call topology methods to compare DC/RACK of peers with local node. Topology now resides on token metadata, so messaging needs to get the dependency reference. However, messaging only needs the topology when it's up and running, so instead of producing a life-time reference, add a pointer, that's set up on .start_listen(), before any client pops up, and is cleared on .shutdown() after all connections are dropped. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-01 11:32:34 +03:00
Pavel Emelyanov	551c51b5bf	messaging: Add is_same_{dc\|rack} helpers For convenience of future patching Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-01 11:32:34 +03:00
Pavel Emelyanov	c08c370c2c	snitch, messaging: Don't relookup dc/rack on internal IP When getting dc/rack snitch may perform two lookups -- first time it does it using the provided IP, if nothing is found snitch assumes that the IP is an internal one, gets the corresponding public one and searches again. The thing is that the only code that may come to snitch with internal IP is the messaging service. It does so in two places: when it tries to connect to the given endpoint and when it accepts a connection. In the former case messaging performs public->internal IP conversion itself and goes to snitch with the internal IP value. This place can get simpler by just feeding the public IP to snitch, and converting it to the internal only to initiate the connection. In the latter case the accepted IP can be either, but messaging service has the public<->private map onboard and can do the conversion itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-01 11:32:34 +03:00
Avi Kivity	fe401f14de	Merge 'scylla_raid_setup: prevent mount failed for /var/lib/scylla' from Takuya ASADA Just like `4a8ed4cc6f`, we also need to wait for udev event completion to create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the disk just after formatting. Also added code to make sure uuid and uuid based device path are valid. Fixes #11359 Closes #11399 * github.com:scylladb/scylladb: scylla_raid_setup: prevent mount failed for /var/lib/scylla scylla_raid_setup: check uuid and device path are valid	2022-08-31 19:59:38 +03:00
Kefu Chai	a5e696fab8	storage_service, test: drop unused storage_service_config this setting was removed back in `dcdd207349`, so despite that we are still passing `storage_service_config` to the ctor of `storage_service`, `storage_service::storage_service()` just drops it on the floor. in this change, `storage_service_config` class is removed, and all places referencing it are updated accordingly. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes #11415	2022-08-31 19:49:13 +03:00
Botond Dénes	2a3012db7f	docs/README.md: expand prerequisites list poetry and make were missing from the list. Closes #11391	2022-08-31 17:00:59 +03:00
Botond Dénes	cb98d4f5da	docs: admin-tools: remove defunct sstable-index Said tool was supplanted by scylla-sstable in 4.6. Remove the page as well as all references to it. Closes #11392	2022-08-31 17:00:04 +03:00
Avi Kivity	a9a230afbe	scripts: introduce script to apply email, working around google groups brokenness Google Groups recently started rewriting the From: header, garbling our git log. This script rewrites it back, using the Reply-To header as a still working source. Closes #11416	2022-08-31 14:47:24 +03:00
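The header repair described in the entry above can be illustrated with Python's standard `email` package. This is a hedged sketch of the idea only, not the script from the commit: the function name `restore_from_header` and the sample addresses are hypothetical, and the actual script may handle more cases.

```python
from email import message_from_string


def restore_from_header(raw_message: str) -> str:
    """Replace the From: header with Reply-To: when the latter is present."""
    msg = message_from_string(raw_message)
    reply_to = msg.get("Reply-To")
    if reply_to:
        if "From" in msg:
            # replace_header keeps the header's original position.
            msg.replace_header("From", reply_to)
        else:
            msg["From"] = reply_to
    return msg.as_string()


# Hypothetical message whose From: was rewritten by a mailing list.
raw = (
    "From: \"'A Dev' via list\" <list@example.com>\n"
    "Reply-To: A Dev <dev@example.com>\n"
    "Subject: [PATCH] fix\n"
    "\n"
    "body\n"
)
print(restore_from_header(raw))
```

A message without a Reply-To header passes through with its From: header unchanged, so the function is safe to apply to every incoming patch.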
Botond Dénes	b9fc504fb2	Merge 'doc: cql-extensions.md: improve description of synchronous views' from Nadav Har'El It was pointed out to me that our description of the synchronous_updates materialized-view option does not make it clear enough what is the default setting, or why a user might want to use this option. This patch changes the description to (I hope) better address these issues. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11404 * github.com:scylladb/scylladb: doc: cql-extensions.md: replace "Scylla" by "ScyllaDB" doc: cql-extensions.md: improve description of synchronous views	2022-08-31 14:33:39 +03:00
Avi Kivity	2ab5cbd841	Merge 'Docs: document how scylla-sstable obtains its schema' from Botond Dénes This is a very important aspect of the tool that was completely missing from the document before. Also add a comparison with SStableDump. Fixes: https://github.com/scylladb/scylladb/issues/11363 Closes #11390 * github.com:scylladb/scylladb: docs: scylla-sstable.rst: add comparison with SStableDump docs: scylla-sstable.rst: add section about providing the schema	2022-08-31 14:28:52 +03:00
Anna Stuchlik	72b77b8c78	doc: add a comment to remove the note in version 5.1	2022-08-31 12:49:10 +02:00
Anna Stuchlik	b4bbd1fd53	doc: update the information on the Counting all rows page and add the recommendation to upgrade ScyllaDB	2022-08-31 12:39:05 +02:00
Nadav Har'El	ad0f6158c4	doc: cql-extensions.md: replace "Scylla" by "ScyllaDB" It was recently decided that the database should be referred to as "ScyllaDB", not "Scylla". This patch changes existing references in docs/cql/cql-extensions.md. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-31 13:23:24 +03:00
Nadav Har'El	ec8e98e403	doc: cql-extensions.md: improve description of synchronous views It was pointed out to me that our description of the synchronous_updates materialized-view option does not make it clear enough what is the default setting, or why a user might want to use this option. This patch changes the description to (I hope) better address these issues. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-31 13:22:24 +03:00
Anna Stuchlik	180fc73695	doc: add a note to the description of COUNT with a reference to the KB article	2022-08-31 12:11:12 +02:00
Anna Stuchlik	cff849d845	doc: add COUNT to the list of acceptable selectors of the SELECT statement	2022-08-31 11:59:20 +02:00
Avi Kivity	421557b40a	Merge "Provide DC/RACK when populating topology" from Pavel E " The topology object maintains all sorts of node/DC/RACK mappings on board. When new entries are added to it the DC and RACK are taken from the global snitch instance which, in turn, checks gossiper, system keyspace and its local caches. This set makes the topology population API require DC and RACK via the call argument. In most of the cases the populating code is the storage service that knows exactly where to get those from. After this set it will be possible to remove the dependency knot consisting of snitch, gossiper, system keyspace and messaging. " * 'br-topology-dc-rack-info' of https://github.com/xemul/scylla: topology: Use the provided dc/rack info test: Provide testing dc/rack infos storage_service: Provide dc/rack for snitch reconfiguration storage_service: Provide dc/rack from system ks on start storage_service: Provide dc/rack from gossiper for replacement storage_service: Provide dc/rack from gossiper for remotes storage_service,dht,repair: Provide local dc/rack from system ks system_keyspace: Cache local dc-rack on .start() topology: Some renames after previous patch topology: Require entry in the map for update_normal_tokens() topology: Make update_endpoint() accept dc-rack info replication_strategy: Accept dc-rack as get_pending_address_ranges argument dht: Carry dc-rack over boot_strapper and range_streamer storage_service: Make replacement info a real struct	2022-08-31 12:53:06 +03:00
Benny Halevy	c284c32f74	Update seastar submodule * seastar f9f5228b74...f2d70c4a17 (51): > cmake: attach property to Valgrind not to hwloc > Create the seastar_memory logger in all builds > drop unused parameters > Merge "Unify pollable_fd shutdown and abort_{reader\|writer}" from Pavel E > > pollable_fd: Replace two booleans with a mask > > pollable_fd: Remove abort_reader/_writer > Merge "Improve Rx channels assignment" from Vlad > > perftune.py: fix comments of IRQ ordering functors > > perftune.py: add VIRTIO fast path IRQs ordering functor > > perftune.py: reduce number of Rx channels to the number of IRQ CPUs > > perftune.py: introduce a --num-rx-queues parameter > program_options: enable optional selection_value > .gitignore: ignore the directories generated by VS Code and CLion. > httpd: compare the Connection header value in a case-insensitive manner. > httpd: move the logic of keepalive to a separate method. > register one default priority class for queue > Reset _total_stats before each run > log: add colored logging support > Merge "perftune.py: add NUMA aware auto-detection for big machines" from Vlad > > perftune.py: mention 'irq_cpu_mask' in the description of the script operation > > perftune.py: NetPerfTuner: fix bits counting in self.irqs_cpu_mask wider than 32 bits > > perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): cosmetics: fix a comment and a variable name > > perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): take into account omitted zero components of the mask > > perftune.py: PerfTuneBase.compute_cpu_mask_for_mode(): cosmetics: fix a variable name > > perftune.py: stop printing 'mode' in --dump-options-file > > perftune.py: introduce a generic auto_detect_irq_mask(cpu_mask) function > > perftune.py: DiskPerfTuner: use self.irqs_cpu_mask for tuning non-NVME disks > > perftune.py: stop auto-detecting and using 'mode' internally > > perftune.py: introduce --get-irq-cpu-mask command line parameter > > perftune.py: introduce 
--irq-core-auto-detection-ratio parameter > build: add a space after function name > Update HACKING.md > log: do not inherit formatter<seastar::log_level> from formatter<string_view> > Merge "Mark connected_socket::shutdown_...'s internals noexcept" from Pavel E > > native-stack: Mark tcp::in_state() (and its wrappers) const noexcept > > native-stack: Mark tcb::close and tcb::abort_reader noexcept > > native-stack: Mark tcp::connection::close_{read\|write}() noexcept > > native-stack: Mark tcb::clear_delayed_ack() and tcb::stop_retransmit_timer() noexcept > > tls: Mark session::close() noexcept > > file_desc: Add fdinfo() helper > > posix-stack: Mark posix_connected_socket_impl::shutdown_{input\|output}() noexcept > > tests: Mark loopback_buffer::shutdown() noexcept > Merge "Enhance RPC connection error injector" from Pavel E > > loopback_socket: Shuffle error injection > > loopback_socket: Extend error injection > > loopback_socket: Add one-shot errors > > loopback_socket: Add connection error injection > > rpc_test: Extend error injector with kind > > rpc_test: Inject errors on all paths > > rpc_test: Use injected connect error > > rpc_test: De-duplicate test socket creation > Merge 'tls: vec_push: handle async errors rather than throwing on_internal_error' from Benny Halevy > > tls: do_handshake: handle_output_error of gnutls_handshake > > tls: session: vec_push: return output_pending error > > tls: session: vec_push: reindent > log: disambiguate formatter<log_level> from operator<< > tls_test: Fix spurious fail in test_x509_client_with_builder_system_trust_multiple (et al) Fixes scylladb/scylladb#11252 Closes #11401	2022-08-31 12:12:48 +03:00
Botond Dénes	dca351c2a6	Merge 'doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1' from Anna Stuchlik This PR is related to https://github.com/scylladb/scylla-docs/issues/4124 and https://github.com/scylladb/scylla-docs/issues/4123. New Enterprise Upgrade Guide from 2021.1 to 2022.2 I've added the upgrade guide for ScyllaDB Enterprise image. It consists of 3 files: /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst upgrade/_common/upgrade-image.rst /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst Modified Enterprise Upgrade Guides 2021.1 to 2022.2 I've modified the existing guides for Ubuntu and Debian to use the same files as above, but exclude the image-related information: /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst + /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst = /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian.rst To make things simpler and remove duplication, I've replaced the guides for Ubuntu 18 and 20 with a generic Ubuntu guide. Modified Enterprise Upgrade Guides from 4.6 to 5.0 These guides included a bug: they included the image-related information (about updating OS packages), because a file that includes that information was included by mistake. What's worse, it was duplicated. After the includes were removed, image-related information is no longer included in the Ubuntu and Debian guides (this fixes https://github.com/scylladb/scylla-docs/issues/4123). I've modified the index file to be in sync with the updates.
Closes #11285 * github.com:scylladb/scylladb: doc: reorganize the content to list the recommended way of upgrading the image first doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information doc: update the guides for Ubuntu and Debian to remove image information and the OS version number doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1	2022-08-31 07:24:55 +03:00
Gleb Natapov' via ScyllaDB development	0d20830863	direct_failure_detector: reduce severity of ping error logging Having an error while pinging a peer is not a critical error. The code retries and moves on. Let's log the message with less severity since sometimes those errors may happen (for instance during node replace operation some nodes refuse to answer to pings) and dtest complains that there are unexpected errors in the logs. Message-Id: <Ywy5e+8XVwt492Nc@scylladb.com>	2022-08-31 07:11:59 +03:00
Raphael S. Carvalho	631b2d8bdb	replica: rename table::on_compaction_completion and coroutinize it on_compaction_completion() is not very descriptive. let's rename it, following the example of update_sstable_lists_on_off_strategy_completion(). Also let's coroutinize it, so to remove the restriction of running it inside a thread only. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11407	2022-08-31 06:17:20 +03:00
Nadav Har'El	a797512148	Merge 'Raft test topology start stopped servers' from Alecco Test teardown involves dropping the test keyspace. If there are stopped servers occasionally we would see timeouts. Start stopped servers after a test is finished (and passed). Revert previous commit making teardown async again. Closes #11412 * github.com:scylladb/scylladb: test.py: restart stopped servers before teardown... Revert "test.py: random tables make DDL queries async"	2022-08-30 22:48:47 +03:00
Pavel Emelyanov	e5e75ba43c	Merge 'scylla-gdb.py: bring scylla reads-stats up-to-date' from Botond Dénes Said command is broken since 4.6, as the type of `reader_concurrency_semaphore::_permit_list` was changed without an accompanying update to this command. This series updates said command and adds it to the list of tested commands so we notice if it breaks in the future. Closes #11389 * github.com:scylladb/scylladb: test/scylla-gdb: test scylla read-stats scylla-gdb.py: read_stats: update w.r.t. post 4.5 code scylla-gdb.py: improve string_view_printer implementation	2022-08-30 20:24:02 +03:00
Nadav Har'El	56d714b512	Merge 'Docs: Update support OS' from Tzach Livyatan This PR changes the CentOS 8 support to Rocky, and adds 5.1, 2022.1 and 2022.2 rows to the list of Scylla releases Closes #11383 * github.com:scylladb/scylladb: OS support page: use CentOS not Centos OS support page: add 5.1, 2022.1 and 2022.2 OS support page: Update CentOS 8 to Rocky 8	2022-08-30 18:02:44 +03:00
Anna Stuchlik	0d3285dd3c	doc: replace Scylla with ScyllaDB	2022-08-30 15:35:06 +02:00
Anna Stuchlik	ab04ed2fda	doc: rewrite the Interfaces table to the new format to include more information about CQL support	2022-08-30 15:31:41 +02:00
Anna Stuchlik	4ac5574f1d	doc: remove the CQL version from pages other than Cassandra compatibility	2022-08-30 13:58:26 +02:00
Alejo Sanchez	df1ca57fda	test.py: restart stopped servers before teardown... for topology tests Test teardown involves dropping the test keyspace. If there are stopped servers occasionally we would see timeouts. Start stopped servers after a test is finished. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-30 11:40:40 +02:00
Alejo Sanchez	e5eac22a37	Revert "test.py: random tables make DDL queries async" This reverts commit `67c91e8bcd`. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-30 10:54:33 +02:00
Anna Stuchlik	99cee1aceb	doc: reorganize the content to list the recommended way of upgrading the image first	2022-08-30 10:11:02 +02:00
Anna Stuchlik	ffe6f97c06	doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file	2022-08-30 10:01:56 +02:00
Tzach Livyatan	4e413787d2	doc: Fix nodetool flush example `nodetool flush` has a space between keyspace and table names See https://docs.scylladb.com/stable/operating-scylla/nodetool-commands/flush for the right syntax. Fixes #11314 Closes #11334	2022-08-29 15:06:38 +03:00
Nadav Har'El	eed65dfc2d	Merge 'db: schema_tables: Make table creation shadow earlier concurrent changes' from Tomasz Grabiec Issuing two CREATE TABLE statements with a different name for one of the partition key columns leads to the following assertion failure on all replicas: scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id \|\| def.id == id - column_offset(def.kind)' failed. The reason is that once the create table mutations are merged, the columns table contains two entries for the same position in the partition key tuple. If the schemas were the same, or not conflicting in a way which leads to abort, the current behavior would be to drop the older table as if the last CREATE TABLE was preceded by a DROP TABLE. The proposed fix is to make CREATE TABLE mutation include a tombstone for all older schema changes of this table, effectively overriding them. The behavior will be the same as if the schemas were not different, older table will be dropped. Fixes #11396 Closes #11398 * github.com:scylladb/scylladb: db: schema_tables: Make table creation shadow earlier concurrent changes db: schema_tables: Fix formatting db: schema_mutations: Make operator<<() print all mutations schema_mutations: Make it a monoid by defining appropriate += operator	2022-08-29 14:21:07 +03:00
Tomasz Grabiec	ae8d2a550d	db: schema_tables: Make table creation shadow earlier concurrent changes Issuing two CREATE TABLE statements with a different name for one of the partition key columns leads to the following assertion failure on all replicas: scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id \|\| def.id == id - column_offset(def.kind)' failed. The reason is that once the create table mutations are merged, the columns table contains two entries for the same position in the partition key tuple. If the schemas were the same, or not conflicting in a way which leads to abort, the current behavior would be to drop the older table as if the last CREATE TABLE was preceded by a DROP TABLE. The proposed fix is to make CREATE TABLE mutation include a tombstone for all older schema changes of this table, effectively overriding them. The behavior will be the same as if the schemas were not different, older table will be dropped. Fixes #11396	2022-08-29 12:06:02 +02:00
Benny Halevy	d588e2a7c5	release: properly evaluate SCYLLA_BUILD_MODE_* macros Patch `765d2f5e46` did not evaluate the #if SCYLLA_BUILD_MODE directives properly and it always matched SCYLLA_BUILD_MODE == release. This change fixes that by defining numerical codes for each build mode and using macro expansion to match the defined SCYLLA_BUILD_MODE against these codes. Also, ./configure.py was changed to pass SCYLLA_BUILD_MODE to all .cc source files, and makes sure it is defined in build_mode.hh. Support was added for coverage build mode, and an #error was added if SCYLLA_BUILD_MODE was not recognized by the #if ladder directives. Additional checks verifying the expected SEASTAR_DEBUG against SCYLLA_BUILD_MODE were added as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11387	2022-08-29 10:20:19 +03:00
Botond Dénes	0f4666010a	docs: scylla-sstable.rst: add comparison with SStableDump The two tools have very similar goals, so users might wonder when to use one or the other. Also add a link to sstabledump.rst to scylla-sstable.	2022-08-29 08:29:14 +03:00
Botond Dénes	65da6a26a3	docs: scylla-sstable.rst: add section about providing the schema Providing the schema for the scylla-sstable tool is an important topic that was completely missing from the description so far.	2022-08-29 08:29:09 +03:00
Alejo Sanchez	67c91e8bcd	test.py: random tables make DDL queries async There are async timeouts for ALTER queries. Seems related to other issues with the driver and async. Make these queries synchronous for now. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11394	2022-08-28 10:38:39 +03:00
Felipe Mendes	fd5cb85a7a	alternator - Doc - Update DescribeTable response and introduce hashing function differences This commit introduces the following changes to Alternator compatibility doc: * As of https://github.com/scylladb/scylladb/pull/11298 Alternator will return ProvisionedThroughput in DescribeTable API calls. We add the fact that tables will default to a BillingMode of PAY_PER_REQUEST (this wasn't made explicit anywhere in the docs), and that the values for RCUs/WCUs are hardcoded to 0. * Mention the fact that ScyllaDB (thus Alternator) hashing function is different than AWS proprietary implementation for DynamoDB. This is mostly an implementation aspect rather than a bug, but it may cause user confusion when/if comparing the ResultSet between DynamoDB and Alternator returned from Table Scans. Refs: https://github.com/scylladb/scylladb/issues/11222 Fixes: https://github.com/scylladb/scylladb/issues/11315 Closes #11360	2022-08-28 10:29:07 +03:00
Takuya ASADA	8835a34ab6	scylla_raid_setup: prevent mount failed for /var/lib/scylla Just like `4a8ed4c`, we also need to wait for udev event completion to create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the disk just after formatting. Fixes #11359	2022-08-27 03:27:44 +09:00
Takuya ASADA	40134efee4	scylla_raid_setup: check uuid and device path are valid Added code to make sure uuid and uuid based device path are valid.	2022-08-27 03:08:31 +09:00
Tomasz Grabiec	661db2706f	db: schema_tables: Fix formatting	2022-08-26 17:37:48 +02:00
Tomasz Grabiec	a020c4644c	db: schema_mutations: Make operator<<() print all mutations	2022-08-26 16:48:15 +02:00
Tomasz Grabiec	cf034c1891	schema_mutations: Make it a monoid by defining appropriate += operator	2022-08-26 16:48:15 +02:00
Kamil Braun	6c16ae4868	Merge 'raft, limit for command size' from Gusev Petr Commitlog imposes a limit on the size of mutations and throws an exception if it's exceeded. In case of schema changes before raft this exception was delivered to the client. Now it happens while saving the raft command in io_fiber in persistence->store_log_entries and what the client gets is just a timeout exception, which doesn't say much about the cause of the problem. This patch introduces an explicit command size limit and provides a clear error message in this case. Closes #11318 * github.com:scylladb/scylladb: raft, use max_command_size to satisfy commitlog limit raft, limit for command size	2022-08-26 12:20:58 +02:00
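The idea in the entry above, rejecting an oversized command up front with a clear message instead of letting it surface later as a timeout, can be sketched as follows. This is an illustrative Python sketch only: `CommandTooLargeError` and `check_command_size` are hypothetical names, and the limit value shown is an arbitrary assumption, not the one used by ScyllaDB.

```python
class CommandTooLargeError(Exception):
    """Raised when a command exceeds the configured size limit (hypothetical)."""


def check_command_size(command: bytes, max_command_size: int) -> None:
    """Fail fast, with an explicit message, if the command is too large."""
    size = len(command)
    if size > max_command_size:
        raise CommandTooLargeError(
            f"command size {size} exceeds the limit of {max_command_size} bytes"
        )


# Usage: validate before handing the command to the log/persistence layer.
try:
    check_command_size(b"x" * 2048, max_command_size=1024)  # assumed limit
except CommandTooLargeError as err:
    print(err)
```

Checking at submission time means the caller sees the cause of the failure directly, rather than inferring it from a timeout raised further down the write path.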
Pavel Emelyanov	6405aba748	topology: Use the provided dc/rack info Previous patches made all the callers of topology.update_endpoint() (via token_metadata.update_topology()) provide correct dc/rack info for the endpoint. It's now possible to stop using global snitch by topology and just rely on the dc/rack argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 10:02:00 +03:00
Pavel Emelyanov	10e8804417	test: Provide testing dc/rack infos There's a test that's sensitive to correct dc/rack info for testing entries. To populate them it uses global rack-inferring snitch instance or a special "testing" snitch. To make it continue working add a helper that would populate the topology properly (spoiler: next branch will replace it with explicitly populated topology object). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 10:00:04 +03:00
Pavel Emelyanov	f6abc3f759	storage_service: Provide dc/rack for snitch reconfiguration When snitch reconfigures (gossiper-property-file one) it kicks storage service so that it updates itself. This place also needs to update the dc/rack info about itself, the correct (new) values are taken from the snitch itself. There's a bug here -- the system.local table is not updated with new data until restart. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:58:34 +03:00
Pavel Emelyanov	f8614fe039	storage_service: Provide dc/rack from system ks on start When a node starts it loads the information about peers from system.peers table and populates token metadata and topology with this information. The dc/rack are taken from the sys-ks cache here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:57:15 +03:00
Pavel Emelyanov	5d5782a086	storage_service: Provide dc/rack from gossiper for replacement When a node is started to replace another node it updates token metadata and topology with the target information early. The tokens are now taken from gossiper shadow round, this patch does the same for dc/rack info. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:55:31 +03:00
Pavel Emelyanov	6b70358616	storage_service: Provide dc/rack from gossiper for remotes When a node is notified about other nodes state change it may want to update the topology information about it. In all those places the dc/rack info about the peer is provided by the gossiper. Basically, these updates mirror the relevant updates of tokens on the token metadata object. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:53:54 +03:00
Pavel Emelyanov	43e83c5415	storage_service,dht,repair: Provide local dc/rack from system ks When a node starts it adds itself to the topology. Mostly it's done in the storage_service::join_cluster() and whoever it calls. In all those places the dc/rack for the added node is taken from the system keyspace (its cache was populated with local dc/rack by the previous patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:52:16 +03:00
Pavel Emelyanov	a03d6f7751	system_keyspace: Cache local dc-rack on .start() There's a cache of endpoint:{dc,rack} on system keyspace cache, but the local node is not there, because this data is populated from the peers table, while local node's dc/rack is in snitch (or system.local table). At the same time, storage_service::join_cluster() and whoever it calls (e.g. -- the repair) will need this info on start and it's convenient to have this data on sys-ks cache. It's not on the peers part of the cache because next branch removes this map and it's going to be very clumsy to have a whole container with just one entry in it. There's peer code in system_keyspace::setup() that gets the local node dc/rack and commits it into the system.local table. However, putting the data into cache is done on .start(). This is because cql-test-env needs this data cached too, but it doesn't call sys_ks.setup(). Will be cleaned some other day. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:47:30 +03:00
Pavel Emelyanov	c043f6fa96	topology: Some renames after previous patch The topology::update_endpoint() is now a plain wrapper over private ::add_endpoint() method of the same class. It's simpler to merge them Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:46:26 +03:00
Pavel Emelyanov	4cbe6ee9f4	topology: Require entry in the map for update_normal_tokens() The method in question tries to be on the safest side and adds the endpoint for which it updates the tokens into the topology. From now on it's up to the caller to put the endpoint into topology in advance. So most of what this patch does is place topology.update_endpoint() into the relevant places of the code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:44:08 +03:00
Pavel Emelyanov	5fc9854eae	topology: Make update_endpoint() accept dc-rack info The method in question populates topology's internal maps with endpoint vs dc/rack relations. As for today the dc/rack values are taken from the global snitch object (which, in turn, goes to gossiper, system keyspace and its internal non-updateable cache for that). This patch prepares the ground for providing the dc/rack externally via argument. By now it's just an argument with empty strings, but next patches will populate it with real values (spoiler: in 99% it's storage service that calls this method and each call will know where to get it from for sure) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:41:09 +03:00
Pavel Emelyanov	7305061674	replication_strategy: Accept dc-rack as get_pending_address_ranges argument The method creates a copy of token metadata and pushes an endpoint (with some tokens) into it. Next patches will require providing dc/rack info together with the endpoint, this patch prepares for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:39:44 +03:00
Pavel Emelyanov	360c4f8608	dht: Carry dc-rack over boot_strapper and range_streamer Both classes may populate (temporary clones of) token metadata object with endpoint:tokens pairs for the endpoint they work with. Next patches will require that endpoint comes with the dc/rack info. This patch makes sure dht classes have the necessary information at hand (for now it's just empty pair of strings). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:37:02 +03:00
Pavel Emelyanov	c7a3fed225	storage_service: Make replacement info a real struct This is to extend it in one of the next patches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:36:16 +03:00
Botond Dénes	4d33812a77	test/scylla-gdb: test scylla read-stats This command was not run before, allowing it to silently break.	2022-08-26 08:08:28 +03:00
Botond Dénes	82c157368a	scylla-gdb.py: read_stats: update w.r.t. post 4.5 code scylla_read_stats is not up-to-date w.r.t. the type of `reader_concurrency_semaphore::_permit_list`, which was changed in 4.6. Bring it up-to-date, keeping it backwards compatible with 4.5 and older releases.	2022-08-26 07:25:40 +03:00
Botond Dénes	107fd97f45	scylla-gdb.py: improve string_view_printer implementation The `_M_str` member of an `std::string_view` is not guaranteed to be a valid C string (e.g. be null-terminated). Printing it directly often resulted in printing partial strings or printing gibberish, affecting in particular the semaphore diagnostics dumps (scylla read-stats). Use a more reliable method: read `_M_len` amount of bytes from `_M_str` and decode as UTF-8.	2022-08-26 07:25:11 +03:00
Avi Kivity	0dbcd13a0f	config: change logging::settings constructor call to use designated initializer Safer wrt reordering, and more readable too. Closes #11382	2022-08-26 06:14:01 +03:00
Konstantin Osipov	4e128bafb5	docs: clarify the tricky field of row existence in LWT Closes #11372	2022-08-26 06:10:45 +03:00
Vlad Zolotarov	c538cc2372	scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions This patch fixes the regression introduced by `3a51e78` which broke a very important contract: perftune.yaml should not be "touched" by Scylla scriptology unless explicitly requested. And a call for scylla_cpuset_setup is such an explicit request. The issue that the offending patch was intending to fix was that cpuset.conf was always generated anew for every call of scylla_cpuset_setup - even if a resulting cpuset.conf would come out exactly the same as the one present on the disk before that call. And since the original code was following the contract mentioned above it was also deleting perftune.yaml every time too. However, this was just an unavoidable side-effect of that cpuset.conf re-generation. The above also means that if scylla_cpuset_setup doesn't write to cpuset.conf we should not "touch" perftune.yaml and vice versa. This patch implements exactly that together with reverting the dangerous logic introduced by `3a51e78`. Fixes #11385 Fixes #10121	2022-08-25 13:03:02 -04:00
Vlad Zolotarov	80917a1054	scylla_prepare: stop generating 'mode' value in perftune.yaml Modern perftune.py supports a more generic way of defining IRQ CPUs: 'irq_cpu_mask'. This patch makes our auto-generation code create a perftune.yaml that uses this new parameter instead of using outdated 'mode'. As a side effect, this change eliminates the notion of "incorrect" value in cpuset.conf - every value is valid now as long as it fits into the 'all' CPU set of the specific machine. Auto-generated 'irq_cpu_mask' is going to include all bits from 'all' CPU mask except those defined in cpuset.conf. Fixes #9903	2022-08-25 13:02:57 -04:00
Benny Halevy	765d2f5e46	release: define SCYLLA_BUILD_MODE_STR by stringifying SCYLLA_BUILD_MODE Currently SCYLLA_BUILD_MODE is defined as a string by the cxxflags generated by configure.py. This is not very useful since one cannot use it in a #if preprocessor directive. Instead, use -DSCYLLA_BUILD_MODE=release, for example, and define a SCYLLA_BUILD_MODE_STR as the string representation of it. In addition define the respective SCYLLA_BUILD_MODE_{RELEASE,DEV,DEBUG,SANITIZE} macros that can be easily used in #ifdef (or #ifndef :)) for conditional compilation. The planned use case for it is to enable a task_manager test module only in non-release modes. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11357	2022-08-25 16:50:42 +02:00
Tzach Livyatan	e86cd3684e	OS support page: use CentOS not Centos	2022-08-25 17:33:50 +03:00
Wojciech Mitros	49dba4f0c1	functions: fix dropping of a keyspace with an aggregate in it Currently, if a keyspace has an aggregate and the keyspace is dropped, the keyspace becomes corrupted and another keyspace with the same name cannot be created again. This is caused by the fact that when removing an aggregate, we call create_aggregate() to get values for its name and signature. In the create_aggregate(), we check whether the row and final functions for the aggregate exist. Normally, that's not an issue, because when dropping an existing aggregate alone, we know that its UDFs also exist. But when dropping an entire keyspace, we first drop the UDFs, making us unable to drop the aggregate afterwards. This patch fixes this behavior by removing the create_aggregate() from the aggregate dropping implementation and replacing it with specific calls for getting the aggregate name and signature. Additionally, a test that would previously fail is added to cql-pytest/test_uda.py where we drop a keyspace with an aggregate. Fixes #11327 Closes #11375	2022-08-25 16:28:57 +02:00
Tzach Livyatan	f6157a38a0	OS support page: add 5.1, 2022.1 and 2022.2	2022-08-25 16:44:40 +03:00
Tzach Livyatan	1697f17d90	OS support page: Update CentOS 8 to Rocky 8	2022-08-25 16:43:24 +03:00
Tomasz Grabiec	83850e247a	Merge 'raft: server: handle aborts when waiting for config entry to commit' from Kamil Braun Changing configuration involves two entries in the log: a 'joint configuration entry' and a 'non-joint configuration entry'. We use `wait_for_entry` to wait on the joint one. To wait on the non-joint one, we use a separate promise field in `server`. This promise wasn't connected to the `abort_source` passed into `set_configuration`. The call could get stuck if the server got removed from the configuration and lost leadership after committing the joint entry but before committing the non-joint one, waiting on the promise. Aborting wouldn't help. Fix this by subscribing to the `abort_source` in resolving the promise exceptionally. Furthermore, make sure that two `set_configuration` calls don't step on each other's toes by one setting the other's promise. To do that, reset the promise field at the end of `set_configuration` and check that it's not engaged at the beginning. Fixes #11288. Closes #11325 * github.com:scylladb/scylladb: test: raft: randomized_nemesis_test: additional logging raft: server: handle aborts when waiting for config entry to commit	2022-08-25 12:49:09 +02:00
Avi Kivity	df87949241	Merge "Remove batch tokens update helper" from Pavel E " On token_metadata there are two update_normal_tokens() overloads -- one updates tokens for a single endpoint, another one -- for a set (well -- std::map) of them. Other than updating the tokens both methods also may add an endpoint to the t.m.'s topology object. There's an ongoing effort in moving the dc/rack information from snitch to topology, and one of the changes made in it is -- when adding an entry to topology, the dc/rack info should be provided by the caller (which in 99% of the cases is the storage service). The batched tokens update is extremely unfriendly to the latter change. Fortunately, this helper is only used by tests, the core code always uses fine-grained tokens updating. " * 'br-tokens-update-relax' of https://github.com/xemul/scylla: token_metadata: Indentation fix after previous patch token_metadata: Remove excessive empty tokens check token_metadata: Remove batch tokens updating method tests: Use one-by-one tokens updating method	2022-08-25 12:01:58 +02:00
Wojciech Mitros	9e6e8de38f	tests: prevent test_wasm from occasionally failing Some cases in test_wasm.py assumed that all cases are run in the same order every time and depended on values that should have been added to tables in previous cases. Because of that, they were sometimes failing. This patch removes this assumption by adding the missing inserts to the affected cases. Additionally, an assert that confirms low miss rate of udfs is made more precise, and a comment is added to explain it clearly. Closes #11367	2022-08-25 11:32:06 +03:00
Kamil Braun	90233551be	test: raft: randomized_nemesis_test: don't access failure detector service after it's stopped It could happen that we accessed failure detector service after it was stopped if a reconfiguration happened in the 'right' moment. This would result in an assertion failure. Fix this. Closes #11326	2022-08-25 11:32:06 +03:00
Tomasz Grabiec	1d0264e1a9	Merge 'Implement Raft upgrade procedure' from Kamil Braun Start with a cluster with Raft disabled, end up with a cluster that performs schema operations using group 0. Design doc: https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/ (TODO: replace this with .md file - we can do it as a follow-up) The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations. With the procedure comes a recovery mode in case the upgrade procedure gets stuck (and it may if we lose a node during recovery - the procedure, to correctly establish a single group 0 cluster, requires contacting every node). This recovery mode can also be used to recover clusters with group 0 already established if they permanently lose a majority of nodes - killing two birds with one stone. Details in the last commit message. Read the design doc, then read the commits in topological order for best reviewing experience. --- I did some manual tests: upgrading a cluster, using the cluster to add nodes, remove nodes (both with `decommission` and `removenode`), replacing nodes. Performing recovery. As a follow-up, we'll need to implement tests using the new framework (after it's ready). It will be easy to test upgrades and recovery even with a single Scylla version - we start with a cluster with the RAFT flag disabled, then rolling-restart while enabling the flag (and recovery is done through simple CQL statements). 
Closes #10835 * github.com:scylladb/scylladb: service/raft: raft_group0: implement upgrade procedure service/raft: raft_group0: extract `tracker` from `persistent_discovery::run` service/raft: raft_group0: introduce local loggers for group 0 and upgrade service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb service/raft: raft_group0_client: prepare for upgrade procedure service/raft: introduce `group0_upgrade_state` db: system_keyspace: introduce `load_peers` idl-compiler: introduce cancellable verbs message: messaging_service: cancellable version of `send_schema_check`	2022-08-25 11:32:06 +03:00
Pavel Emelyanov	d8c5044eee	token_metadata: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-24 08:24:21 +03:00
Pavel Emelyanov	8238c38e9f	token_metadata: Remove excessive empty tokens check After the previous patch empty passed tokens make the helper co_return early, so this `if` is dead code Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-24 08:24:21 +03:00
Pavel Emelyanov	056d21c050	token_metadata: Remove batch tokens updating method No users left. The endpoint_tokens.empty() check is removed, only tests could trigger it, but they didn't and are patched out. Indentation is left broken Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-24 08:24:21 +03:00
Pavel Emelyanov	1d437302a8	tests: Use one-by-one tokens updating method Tests are the only users of batch tokens updating "sugar" which actually makes things more complicated Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-24 08:24:21 +03:00
Pavel Emelyanov	18fa5038b1	replication_strategy: Remove unused method The get_pending_address_ranges() accepting a single token is not in use, its peer that accepts a set of tokens is Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11358	2022-08-23 20:23:50 +02:00
Avi Kivity	6ce5e9079c	Merge 'utils/logalloc: consolidate lsa state in shard tracker' from Botond Dénes Currently the state of LSA is scattered across a handful of global variables. This series consolidates all these into a single one: the shard tracker. Beyond reducing the number of globals (the fewer globals, the better) this paves the way for a planned de-globalization of the shard tracker itself. There is one separate global left, the static migrators registry. This is left as-is for now. Closes #11284 * github.com:scylladb/scylladb: utils/logalloc: remove reclaim_timer:: globals utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg utils/logalloc: move global stat accessors to tracker utils/logalloc: allocating_section: don't use the global tracker utils/logalloc: pass down tracker::impl reference to segment_pool utils/logalloc: move segment pool into tracker utils/logalloc: add tracker member to basic_region_impl utils/logalloc: make segment independent of segment pool	2022-08-23 18:51:14 +02:00
Benny Halevy	a980510654	table: seal_active_memtable: handle ENOSPC error Aborting too soon on ENOSPC is too harsh, leading to loss of availability of the node for reads, while restarting it won't solve the ENOSPC condition. Fixes #11245 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11246	2022-08-23 17:58:20 +02:00
Tomasz Grabiec	9c4e32d2e2	Merge 'raft: server: drop waiters in `applier_fiber` instead of `io_fiber`' from Kamil Braun When `io_fiber` fetched a batch with a configuration that does not contain this node, it would send the entries committed in this batch to `applier_fiber` and proceed by dropping the waiters of any remaining entries (if the node was no longer a leader). If there were waiters for entries committed in this batch, it could either happen that `applier_fiber` received and processed those entries first, notifying the waiters that the entries were committed and/or applied, or it could happen that `io_fiber` reached the dropping waiters code first, causing the waiters to be resolved with `commit_status_unknown`. The second scenario is undesirable. For example, when a follower tries to remove the current leader from the configuration using `modify_config`, if the second scenario happens, the follower will get `commit_status_unknown` - this can happen even though there are no node or network failures. In particular, this caused `randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail from time to time. Fix it by serializing the notifying and dropping of waiters in a single fiber - `applier_fiber`. We decided to move all management of waiters into `applier_fiber`, because most of that management was already there (there was already one `drop_waiters` call, and two `notify_waiters` calls). Now, when `io_fiber` observes that we've been removed from the config and are no longer a leader, instead of dropping waiters, it sends a message to `applier_fiber`. `applier_fiber` will drop waiters when receiving that message. Improve an existing test to reproduce this scenario more frequently. Fixes #11235. 
Closes #11308 * github.com:scylladb/scylladb: test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes` raft: server: drop waiters in `applier_fiber` instead of `io_fiber` raft: server: use `visit` instead of `holds_alternative`+`get`	2022-08-23 17:19:44 +02:00
Avi Kivity	fd9d8ddb3e	Merge 'distributed_loader: Restore separate processing of keyspace init prio/normal' from Calle Wilund Fixes #11349 In `7396de7` (and refactorings before it) the set of prioritized keyspaces (and processing thereof) was removed, due to apparent non-usage (which is true for open-source version). This functionality is however required for certain features of the enterprise version (ear). As such, it needs to be restored and reenabled. This patch set does so, adapted to the recent version of this file. Closes #11350 * github.com:scylladb/scylladb: distributed_loader: Restore separate processing of keyspace init prio/normal Revert "distributed_loader: Remove unused load-prio manipulations"	2022-08-23 16:25:48 +02:00
Kamil Braun	e350e37605	service/raft: raft_group0: implement upgrade procedure A listener is created inside `raft_group0` for acting when the SUPPORTS_RAFT feature is enabled. The listener is established after the node enters NORMAL status (in `raft_group0::finish_setup_after_join()`, called at the end of `storage_service::join_cluster()`). The listener starts the `upgrade_to_group0` procedure. The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled (see earlier commit which implemented this logic) - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations (only those for now). The devil lies in the details, and the implementation is ugly compared to this nice description; for example there are many retry loops for handling intermittent network failures. Read the code. `leave_group0` and `remove_group0` were adjusted to handle the upgrade procedure being run correctly; if necessary, they will wait for the procedure to finish. If the upgrade procedure gets stuck (and it may, since it requires all nodes to be available to contact them to correctly establish a single group 0 raft cluster); or if a running cluster permanently loses a majority of nodes, causing group 0 unavailability; the cluster admin is not left without help. We introduce a recovery mode, which allows the admin to completely get rid of traces of existing group 0 and restart the upgrade procedure - which will establish a new group 0. This works even in clusters that never upgraded but were bootstrapped using group 0 from scratch. 
To do that, the admin does the following on every node: - writes 'recovery' under 'group0_upgrade_state' key in `system.scylla_local` table, - truncates the `system.discovery` table, - truncates the `system.group0_history` table, - deletes group 0 ID and group 0 server ID from `system.scylla_local` (the keys are `raft_group0_id` and `raft_server_id`). Then the admin performs a rolling restart of their cluster. The nodes restart in a "group 0 recovery mode", which simply means that the nodes won't try to perform any group 0 operations. Then the admin calls `removenode` to remove the nodes that are down. Finally, the admin removes the `group0_upgrade_state` key from `system.scylla_local`, rolling-restarts the cluster, and the cluster should establish group 0 anew. Note that this recovery procedure will have to be extended when new stuff is added to group 0 - like topology change state. Indeed, observe that a minority of nodes aren't able to receive committed entries from a leader, so they may end up in inconsistent group 0 states. It wouldn't be safe to simply create group 0 on those nodes without first ensuring that they have the same state from which group 0 will start. Right now the state only consists of schema tables, and the upgrade procedure makes sure to synchronize them, so even if the nodes started in inconsistent schema states, group 0 will correctly be established. (TODO: create a tracking issue? something needs to remind us of this whenever we extend group 0 with new stuff...)	2022-08-23 13:51:01 +02:00
Kamil Braun	b42dfbc0aa	test: raft: randomized_nemesis_test: additional logging Add some more logging to `randomized_nemesis_test` such as logging the start and end of a reconfiguration operation in a way that makes it easy to find one given the other in the logs.	2022-08-23 13:14:30 +02:00
Kamil Braun	efad6fe9b4	raft: server: handle aborts when waiting for config entry to commit Changing configuration involves two entries in the log: a 'joint configuration entry' and a 'non-joint configuration entry'. We use `wait_for_entry` to wait on the joint one. To wait on the non-joint one, we use a separate promise field in `server`. This promise wasn't connected to the `abort_source` passed into `set_configuration`. The call could get stuck if the server got removed from the configuration and lost leadership after committing the joint entry but before committing the non-joint one, waiting on the promise. Aborting wouldn't help. Fix this by subscribing to the `abort_source` in resolving the promise exceptionally. Furthermore, make sure that two `set_configuration` calls don't step on each other's toes by one setting the other's promise. To do that, reset the promise field at the end of `set_configuration` and check that it's not engaged at the beginning. Fixes #11288.	2022-08-23 13:14:29 +02:00
Calle Wilund	54aca8e814	distributed_loader: Restore separate processing of keyspace init prio/normal Fixes #11349 In `7396de7` (and refactorings before it) the set of prioritized keyspaces (and processing thereof) was removed, due to apparent non-usage (which is true for open-source version). This functionality is however required for certain features of the enterprise version (ear). As such, it needs to be restored and reenabled. This patch and the revert before it do so, adapted to the recent version of this file.	2022-08-23 10:39:19 +00:00
Calle Wilund	d9c391e366	Revert "distributed_loader: Remove unused load-prio manipulations" This reverts commit `7396de72b1`. In `7396de7` (and refactorings before it) the set of prioritized keyspaces (and processing thereof) was removed, due to apparent non-usage (which is true for open-source version). This functionality is however required for certain features of the enterprise version (ear). As such, it needs to be restored and reenabled. This reverts the actual commit; the patch after it ensures we use the prio set.	2022-08-23 10:34:05 +00:00
Avi Kivity	5d1ff17ddf	Merge 'Streaming: define plan_id as a strong tagged_uuid type' from Benny Halevy This series turns plan_id from a generic UUID into a strong type so it can't be used interchangeably with other uuid's. While at it, streaming/stream_fwd.hh was added for forward declarations and the definition of plan_id. Also, `stream_manager::update_progress` parameter name was renamed to plan_id to represent its assumed content, before changing its type to `streaming::plan_id`. Closes #11338 * github.com:scylladb/scylladb: streaming: define plan_id as a strong tagged_uuid type stream_manager: update_progress: rename cf_id param to plan_id streaming: add forward declarations in stream_fwd.hh	2022-08-23 10:48:34 +02:00
Petr Gusev	aa88d58539	raft, use max_command_size to satisfy commitlog limit Commitlog imposes a limit on the size of mutations and throws an exception if it's exceeded. In case of schema changes before raft this exception was delivered to the client. Now it happens while saving the raft command in io_fiber in persistence->store_log_entries and what the client gets is just a timeout exception, which doesn't say much about the cause of the problem. This patch introduces an explicit command size limit and provides a clear error message in this case.	2022-08-23 12:09:32 +04:00
Tomasz Grabiec	0e5b86d3da	Merge 'Optimize mutation consume of range tombstones in reverse' from Benny Halevy Reversing the whole range_tombstone_list into reversed_range_tombstones is inefficient and can lead to reactor stalls with a large number of range tombstones. Instead, iterate over the range_tombstone_list in reverse direction and reverse each range_tombstone as we go, keeping the result in the optional cookie.reversed_rt member. While at it, this series contains some other cleanups on this path to improve the code readability and maybe make the compiler's life easier as for optimizing the cleaned-up code. Closes #11271 * github.com:scylladb/scylladb: mutation: consume_clustering_fragments: get rid of reversed_range_tombstones; mutation: consume_clustering_fragments: reindent mutation: consume_clustering_fragments: shuffle emit_rt logic around mutation: consume, consume_gently: simplify partition_start logic mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor mutation: consume_clustering_fragments: keep the reversed schema in cookie mutation: clustering_iterators: get rid of current_rt mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently	2022-08-23 10:05:39 +02:00
Botond Dénes	5bc499080d	utils/logalloc: remove reclaim_timer:: globals One of them (_active_timer) is moved to shard tracker, the other is made a simple local in reclaim_timer.	2022-08-23 10:38:58 +03:00
Botond Dénes	5f8971173e	utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker We want to consolidate all the logalloc state into a single object: the shard tracker. Replacing this global with a member in said object is part of this effort.	2022-08-23 10:38:58 +03:00
Botond Dénes	499b9a3a7c	utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg	2022-08-23 10:38:58 +03:00
Botond Dénes	7d17d675af	utils/logalloc: move global stat accessors to tracker These are pretend free functions, accessing globals in the background. Make them members of the tracker instead, which has everything needed locally to compute them. Callers still have to access these stats through the global tracker instance, but this can be changed to happen through a local instance. Soon....	2022-08-23 10:38:58 +03:00
Botond Dénes	f406151a86	utils/logalloc: allocating_section: don't use the global tracker Instead, get the tracker instance from the region. This requires adding a `region&` parameter to `with_reserve()`. This brings us one step closer to eliminating the global tracker.	2022-08-23 10:38:58 +03:00
Botond Dénes	e968866fa1	utils/logalloc: pass down tracker::impl reference to segment_pool To get rid of some usages of `shard_tracker()`.	2022-08-23 10:38:58 +03:00
Botond Dénes	3bd94e41bf	utils/logalloc: move segment pool into tracker Instead of a separate global segment pool instance, make it a member of the already global tracker. Most users are inside the tracker instance anyway. Outside users can access the pool through the global tracker instance.	2022-08-23 10:38:58 +03:00
Botond Dénes	5b86dfc35a	utils/logalloc: add tracker member to basic_region_impl For now this member is initialized from the global tracker instance. But it allows the members of region impl to be detached from said global, making a step towards removing it.	2022-08-23 10:38:58 +03:00
Botond Dénes	f4056bd344	utils/logalloc: make segment independent of segment pool segment has some members, which simply forward the call to a segment_pool method, via the global segment_pool instance. Remove these and make the callers use the segment pool directly instead.	2022-08-23 10:38:58 +03:00
Nadav Har'El	9c15659194	Merge 'test.py: bump timeout of async requests for topology' from Alecco Topology tests do async requests using the Python driver. The driver's API for async doesn't use the session timeout. Pass 60 seconds timeout (default is 10) to match the session's. Fixes https://github.com/scylladb/scylladb/issues/11289 Closes #11348 * github.com:scylladb/scylladb: test.py: bump schema agreement timeout for topology tests test.py: bump timeout of async requests for topology test.py: fix bad indent	2022-08-23 10:30:59 +03:00
Raya Kurlyand	bc7539cff0	Update auditing.rst https://github.com/scylladb/scylladb/issues/11341 Closes #11347	2022-08-23 06:59:41 +03:00
Botond Dénes	331033adae	Merge 'Fix frozen mutation consume ordering' from Benny Halevy Currently, frozen_mutation is not consumed in position_in_partition order as all range tombstones are consumed before all rows. This violates the range_tombstone_generator invariants as its lower_bound needs to be monotonically increasing. Fix this by adding mutation_partition_view::accept_ordered and rewriting do_accept_gently to do the same, both making sure to consume the range tombstones and clustering rows in position_in_partition order, similar to the mutation consume_clustering_fragments function. Add a unit test that verifies that. Fixes #11198 Closes #11269 * github.com:scylladb/scylladb: mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable frozen_mutation: consume and consume_gently in-order frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen	2022-08-23 06:37:04 +03:00
Alejo Sanchez	01cac33472	test.py: bump schema agreement timeout for topology tests Increase the schema agreement timeout to match other timeouts. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-22 21:07:55 +02:00
Alejo Sanchez	f9d31112cf	test.py: bump timeout of async requests for topology Topology tests do async requests using the Python driver. The driver's API for async doesn't use the session timeout. Pass 60 seconds timeout (default is 10) to match the session's. This will hopefully fix timeout failures on debug mode. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-22 21:07:03 +02:00
Benny Halevy	357e805e1f	mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable So that the frozen_mutation consumer can return stop_iteration::yes if it wishes to stop consuming at some clustering position. In this case, on_end_of_partition must still be called so a closing range_tombstone_change can be emitted to the consumer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-22 20:12:58 +03:00
Mikołaj Sielużycki	b5380baf8a	frozen_mutation: consume and consume_gently in-order Currently, frozen_mutation is not consumed in position_in_partition order as all range tombstones are consumed before all rows. This violates the range_tombstone_generator invariants as its lower_bound needs to be monotonically increasing. Fix this by adding mutation_partition_view::accept_ordered and rewriting do_accept_gently to do the same, both making sure to consume the range tombstones and clustering rows in position_in_partition order, similar to the mutation consume_clustering_fragments function. Add a unit test that verifies that. Fixes #11198 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-22 20:12:20 +03:00
Kamil Braun	e0c6153adf	test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes` Improve the randomness of this test, making it a bit easier to reproduce the scenarios that the test aims to catch. Increase timeouts a bit to account for this additional randomness.	2022-08-22 18:53:48 +02:00
Kamil Braun	db2a3deda1	raft: server: drop waiters in `applier_fiber` instead of `io_fiber` When `io_fiber` fetched a batch with a configuration that does not contain this node, it would send the entries committed in this batch to `applier_fiber` and proceed by dropping waiters for any remaining entries (if the node was no longer a leader). If there were waiters for entries committed in this batch, it could either happen that `applier_fiber` received and processed those entries first, notifying the waiters that the entries were committed and/or applied, or it could happen that `io_fiber` reaches the dropping waiters code first, causing the waiters to be resolved with `commit_status_unknown`. The second scenario is undesirable. For example, when a follower tries to remove the current leader from the configuration using `modify_config`, if the second scenario happens, the follower will get `commit_status_unknown` - this can happen even though there are no node or network failures. In particular, this caused `randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail from time to time. Fix it by serializing the notifying and dropping of waiters in a single fiber - `applier_fiber`. We decided to move all management of waiters into `applier_fiber`, because most of that management was already there (there was already one `drop_waiters` call, and two `notify_waiters` calls). Now, when `io_fiber` observes that we've been removed from the config and are no longer a leader, instead of dropping waiters, it sends a message to `applier_fiber`. `applier_fiber` will drop waiters when receiving that message. Fixes #11235.	2022-08-22 18:53:44 +02:00
Kamil Braun	5badf20c7a	raft: server: use `visit` instead of `holds_alternative`+`get` In `std::holds_alternative`+`std::get` version, the `get` performs a redundant check. Also `std::visit` gives a compile-time exhaustiveness check (whether we handled all possible cases of the `variant`).	2022-08-22 18:47:48 +02:00
Benny Halevy	314e45d957	streaming: define plan_id as a strong tagged_uuid type Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-22 19:45:30 +03:00
Benny Halevy	add612bc52	mutation: consume_clustering_fragments: get rid of reversed_range_tombstones Reversing the whole range_tombstone_list into reversed_range_tombstones is inefficient and can lead to reactor stalls with a large number of range tombstones. Instead, iterate over the range_tombstone_list in reverse direction and reverse each range_tombstone as we go, keeping the result in the optional cookie.reversed_rt member. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-22 19:42:52 +03:00
Alejo Sanchez	87c233b36b	test.py: fix bad indent Fix leftover bad indent Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-22 14:29:54 +02:00
Nadav Har'El	941c719a23	alternator: return ProvisionedThroughput in DescribeTable DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable operation must return a ProvisionedThroughput structure, listing both ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not stated in some DynamoDB documentation but is explicitly mentioned in https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html Also, empirically, DynamoDB returns ProvisionedThroughput with zeros even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this. The ProvisionedThroughput structure being missing was a problem for applications like DynamoDB connectors for Spark, if they implicitly assume that ProvisionedThroughput is returned by DescribeTable, and fail (as described in issue #11222) if it's outright missing. So this patch adds the missing ProvisionedThroughput structure, and the xfailing test starts to pass. Note that this patch doesn't change the fact that attempting to set a table to PROVISIONED billing mode is ignored: DescribeTable continues to always return PAY_PER_REQUEST as the billing mode and zero as the provisioned capacities. Fixes #11222 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11298	2022-08-22 09:58:09 +02:00
Takuya ASADA	60e8f5743c	systemd: drop StandardOutput=syslog On recent versions of systemd, StandardOutput=syslog is obsolete. We should use StandardOutput=journal instead, but since it's the default value, we can just drop it. Fixes #11322 Closes #11339	2022-08-22 10:47:37 +03:00
Benny Halevy	fa7033bc2b	configure: add --perf-tests-debuginfo option Provides separate control over debuginfo for perf tests since enabling --tests-debuginfo affects both today causing the Jenkins archives of perf tests binaries to inflate considerably. Refs https://github.com/scylladb/scylla-pkg/issues/3060 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11337	2022-08-21 19:08:21 +03:00
Konstantin Osipov	4b6ed7796b	test.py: extend documentation Add documentation about python, topology tests, server pooling and provide some debugging tips. Closes #11317	2022-08-21 17:55:49 +03:00
Benny Halevy	3554533e2c	stream_manager: update_progress: rename cf_id param to plan_id Before changing its type to streaming::plan_id this patch clarifies that the parameter actually represents the plan id and not the table id as its name suggests. For reference, see the call to update_progress in `stream_transfer_task::execute`, as well as the function using _stream_bytes whose map key is the plan id. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-21 16:56:41 +03:00
Benny Halevy	c1fc0672a5	streaming: add forward declarations in stream_fwd.hh To be used for defining streaming::plan_id in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-21 16:00:02 +03:00
Anna Stuchlik	3b93184680	doc: fix the description of ORDER BY for the SELECT statement Closes #11272	2022-08-21 15:28:15 +03:00
Tzach Livyatan	8fc58300ea	Update Alternator Markdown file to use automatic link notation Closes #11335	2022-08-21 13:32:57 +03:00
Piotr Sarna	484004e766	Merge 'Fix mutation commutativity with shadowable tombstone' from Tomasz Grabiec This series fixes lack of mutation associativity which manifests as sporadic failures in row_cache_test.cc::test_concurrent_reads_and_eviction due to differences in mutations applied and read. No known production impact. Refs https://github.com/scylladb/scylladb/issues/11307 Closes #11312 * github.com:scylladb/scylladb: test: mutation_test: Add explicit test for mutation commutativity test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones db: mutation_partition: Drop unnecessary maybe_shadow() db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone mutation_partition: row: make row marker shadowing symmetric	2022-08-20 16:46:32 +02:00
Kamil Braun	2ba1fb0490	service/raft: raft_group0: extract `tracker` from `persistent_discovery::run` Extract it to a top-level abstraction, write comments. It will be reused in the following commit.	2022-08-19 19:15:19 +02:00
Kamil Braun	f7e02a7de9	service/raft: raft_group0: introduce local loggers for group 0 and upgrade	2022-08-19 19:15:19 +02:00
Kamil Braun	ac5f4248a9	service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb During the upgrade procedure nodes will want to obtain the upgrade state of other nodes to proceed. This is what the new verb is for.	2022-08-19 19:15:19 +02:00
Kamil Braun	43687be1f1	service/raft: raft_group0_client: prepare for upgrade procedure Now, whether a 'group 0 operation' (today it means schema change) is performed using the old or new methods, doesn't depend on the local RAFT feature being enabled, but on the state of the upgrade procedure. In this commit the state of the upgrade is always `use_pre_raft_procedures` because the upgrade procedure is not implemented yet. But stay tuned. The upgrade procedure will need certain guarantees: at some point it switches from `use_pre_raft_procedures` to `synchronize` state. During `synchronize` schema changes must be disabled, so the procedure can ensure that schema is in sync across the entire cluster before establishing group 0. Thus, when the switch happens, no schema change can be in progress. To handle all this weirdness we introduce `_upgrade_lock` and `get_group0_upgrade_state` which takes this lock whenever it returns `use_pre_raft_procedures`. Creating a `group0_guard` - which happens at the start of every group 0 operation - will take this lock, and the lock holder shall be stored inside the guard (note: the holder only holds the lock if `use_pre_raft_procedures` was returned, no need to hold it for other cases). Because `group0_guard` is held for the entire duration of a group 0 operation, and because the upgrade procedure will also have to take this lock whenever it wants to change the upgrade state (it's an rwlock), this ensures that no group 0 operation that uses the old ways is happening when we change the state. We also implement `wait_until_group0_upgraded` using a condition variable. It will be used by certain methods during upgrade (later commits; stay tuned). Some additional comments were written.	2022-08-19 19:15:19 +02:00
Kamil Braun	7e56251aea	service/raft: introduce `group0_upgrade_state` Define an enum class, `group0_upgrade_state`, describing the state of the upgrade procedure (implemented in later commits). Provide IDL definitions for (de)serialization. The node will have its current upgrade state stored on disk in `system.scylla_local` under the `group0_upgrade_state` key. If the key is not present we assume `use_pre_raft_procedures` (meaning we haven't started upgrading yet or we're at the beginning of upgrade). Introduce `system_keyspace` accessor methods for storing and retrieving the on-disk state.	2022-08-19 19:15:19 +02:00
Kamil Braun	547134faf4	db: system_keyspace: introduce `load_peers` Load the addresses of our peers from `system.peers`. Will be used by the Raft upgrade procedure to obtain the set of all peers.	2022-08-19 19:15:18 +02:00
Kamil Braun	a5b465b796	idl-compiler: introduce cancellable verbs The compiler allowed passing a `with_timeout` flag to a verb definition; it then generated functions for sending and handling RPCs that accepted a timeout parameter. We would like to generate functions that accept an `abort_source` so an RPC can be cancelled from the sender side. This is both more and less powerful than `with_timeout`. More powerful because you can abort on other conditions than just reaching a certain point in time. Less powerful because you can't abort the receiver. In any case, sometimes useful. For this the `cancellable` flag was added. You can't use `with_timeout` and `cancellable` on the same verb. Note that this uses an already existing function in RPC module, `send_message_cancellable`.	2022-08-19 19:15:18 +02:00
Kamil Braun	9e5a81da4a	message: messaging_service: cancellable version of `send_schema_check` This RPC will be used during the Raft upgrade procedure during schema synchronization step. Make a version which can be cancelled when the upgrade procedure gets aborted.	2022-08-19 19:15:18 +02:00
Nadav Har'El	516089beb0	Merge 'Raft test topology II part 1' from Alecco - Remove `ScyllaCluster.__getitem__()` (pending request by @kbr- in a previous pull request), for this remove all direct access to servers from caller code - Increase Python driver timeouts (req by @nyh) - Improve `ManagerClient` API requests: use `http+unix://<sockname>/<resource>` instead of `http://localhost/<resource>` and callers of the helper method only pass the resource - Improve lint and type hints Closes #11305 * github.com:scylladb/scylladb: test.py: remove ScyllaCluster.__getitem__() test.py: ScyllaCluster check keyspace with any server test.py: ScyllaCluster server error log method test.py: ScyllaCluster read_server_log() test.py: save log point for all running servers test.py: ScyllaCluster provide endpoint test.py: build host param after before_test test.py: manager client disable lint warnings test.py: scylla cluster lint and type hint fixes test.py: increase more timeouts test.py: ManagerClient improve API HTTP requests	2022-08-18 20:27:50 +03:00
Alejo Sanchez	fe07f9ceed	test.py: make topology conftest module paths work when imported To allow other suites to use topology suite conftest, add pylib to the module lookup path. Closes #11313	2022-08-18 20:22:35 +03:00
Konstantin Osipov	7481f0d404	test.py: simplify CQL test search No need to repeat code available in the base class. Closes #11156	2022-08-18 19:28:43 +03:00
Benny Halevy	7747b8fa33	sstables: define run_identifier as a strong tagged_uuid type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11321	2022-08-18 19:03:10 +03:00
Avi Kivity	35fbba3a5b	Revert "gms: gossiper: include nodes with empty feature sets when calculating enabled features" This reverts commit `08842444b4`. It causes a failure in test_shutdown_all_and_replace_node. Fixes #11316.	2022-08-18 15:01:50 +03:00
Kamil Braun	b52429f724	Merge 'raft: relax some error severity' from Gleb Natapov Dtest fails if it sees unknown errors in the logs. This series reduces severity of some errors (since they are actually expected during shutdown) and removes some others that duplicate already existing errors that dtest knows how to deal with. Also fix one case of unhandled exception in schema management code. * 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla: raft: getting abort_requested_exception exception from a sm::apply is not a critical error schema_registry: fix abandoned feature warning service: raft: silence rpc::closed_errors in raft_rpc	2022-08-18 12:16:44 +02:00
Anna Stuchlik	dc307b6895	doc: fix the CQL version in the Interfaces table	2022-08-18 12:02:42 +02:00
Petr Gusev	eedfd7ad9b	raft, limit for command size Adds max_command_size to the raft configuration and restricts commands to this limit.	2022-08-18 13:35:49 +04:00
Tomasz Grabiec	5a9df433c6	test: mutation_test: Add explicit test for mutation commutativity	2022-08-17 17:39:54 +02:00
Tomasz Grabiec	3d9efee3bf	test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones Given 3 row mutations: m1 = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}} } m2 = { marker: {row_marker: timestamp=-9223372036854775805} } m3 = { tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}} } We get different shadowable tombstones depending on the order of merging: (m1 + m2) + m3 = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}} m1 + (m2 + m3) = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}} } The reason is that in the second case the shadowable tombstone in m3 is shadowed by the row marker in m2. In the first case, the marker in m2 is cancelled by the dead marker in m1, so shadowable tombstone in m3 is not cancelled (the marker in m1 does not cancel because it's dead). This wouldn't happen if the dead marker in m1 was accompanied by a hard tombstone of the same timestamp, which would effectively make the difference in shadowable tombstones irrelevant. Found by row_cache_test.cc::test_concurrent_reads_and_eviction. I'm not sure if this situation can be reached in practice (dead marker in mv table but no row tombstone). Work it around for tests by producing a row tombstone if there is a dead marker. Refs #11307	2022-08-17 17:39:54 +02:00
Tomasz Grabiec	56e5b6f095	db: mutation_partition: Drop unnecessary maybe_shadow() It is performed inside row_tombstone::apply() invoked in the preceding line.	2022-08-17 17:39:54 +02:00
Tomasz Grabiec	9c66c9b3f0	db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone When the row has a live row marker which shadows the shadowable tombstone, the shadowable tombstone should not be effective. The code assumes that _shadowable always reflects the current tombstone, so maybe_shadow() needs to be called whenever marker or regular tombstone changes. This was not ensured by row::apply(tombstone). This causes problems in tests which use random_mutation_generator, which generates mutations which would violate this invariant, and as a result, mutation commutativity would be violated. I am not aware of problems in production code.	2022-08-17 17:34:13 +02:00
Botond Dénes	778f5adde7	mutation_partition: row: make row marker shadowing symmetric Currently row marker shadowing the shadowable tombstone is only checked in `apply(row_marker)`. This means that shadowing will only be checked if the shadowable tombstone and row marker are set in the correct order. This at the very least can cause flakiness in tests when a mutation produced just the right way has a shadowable tombstone that can be eliminated when the mutation is reconstructed in a different way, leading to artificial differences when comparing those mutations. This patch fixes this by checking shadowing in `apply(shadowable_tombstone)` too, making the shadowing check symmetric. There is still one vulnerability left: `row_marker& row_marker()`, which allows overwriting the marker without triggering the corresponding checks. We cannot remove this overload as it is used by compaction so we just add a comment to it warning that `maybe_shadow()` has to be manually invoked if it is used to mutate the marker (compaction takes care of that). A caller which didn't do the manual check is mutation_source_test: this patch updates it to use `apply(row_marker)` instead. Fixes: #9483 Tests: unit(dev) Closes #9519	2022-08-17 17:22:13 +02:00
Benny Halevy	8f0376bba1	mutation: consume_clustering_fragments: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 16:45:20 +03:00
Benny Halevy	749371c2b0	mutation: consume_clustering_fragments: shuffle emit_rt logic around To prepare for a following patch that will get rid of the cookie.reversed_range_tombstones list. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 16:44:23 +03:00
Benny Halevy	0e21073c38	mutation: consume, consume_gently: simplify partition_start logic Concentrate the logic in a single (!cookie.partition_start_consumed) block Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 15:49:12 +03:00
Benny Halevy	d661b84d51	mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor and set crs and rts only in the block where they are used, so we can get rid of reversed_range_tombstones. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 15:30:36 +03:00
Benny Halevy	f1b7a1a6f1	mutation: consume_clustering_fragments: keep the reversed schema in cookie Rather than reversing the schema on every call just keep the potentially reversed schema in cookie. Otherwise, cookie.schema was write only. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 15:30:36 +03:00
Benny Halevy	a230ea0019	mutation: clustering_iterators: get rid of current_rt It is currently write-only. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 15:30:16 +03:00
Benny Halevy	017f9b4131	mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 14:43:52 +03:00
Alejo Sanchez	d732d776ed	test.py: remove ScyllaCluster.__getitem__() Users of ScyllaCluster should not directly manage its ScyllaServers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	729f8e2834	test.py: ScyllaCluster check keyspace with any server Directly pick any server instead of calling self[0]. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	7ad7a5e718	test.py: ScyllaCluster server error log method Provide server error logs to caller (test.py). Avoids direct access to list of servers. To be done later: pick the failed server. For now it just provides the log of one server. While there, fix type hints. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	e755207fcc	test.py: ScyllaCluster read_server_log() Instead of accessing the first server, now test.py asks ScyllaCluster for the server log. In a later commit, ScyllaCluster will pick the appropriate server. Also removes another direct access to the list of servers we want to get rid of.	2022-08-17 10:24:48 +02:00
Alejo Sanchez	f141ab95f9	test.py: save log point for all running servers For error reporting, before a test a mark of the log point in time is saved. Previously, only the log of the first server was saved. Now it's done for all running servers. While there, remove direct access to servers on test.py. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	8fff636776	test.py: ScyllaCluster provide endpoint For pytest CQL driver connections a host id (IP) is used. Provide it with a method. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	5bd266424e	test.py: build host param after before_test If no server started, there is no server in the cluster list. So only build the pytest --host param after before_test check is done. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	30c8e961ba	test.py: manager client disable lint warnings Disable noisy lint warnings. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	2b4c7fbb8a	test.py: scylla cluster lint and type hint fixes Add missing docstrings, reorder imports, add type hints, improve formatting, fix variable names, fix line lengths, iterate over dicts not keys, and disable noisy lint warnings. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	566a4ebf4e	test.py: increase more timeouts Increase Python driver connection timeouts to deal with extreme cases for slow debug builds in slow machines as done (and explained) in `95bd02246a`. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Alejo Sanchez	ce27c02d91	test.py: ManagerClient improve API HTTP requests Use the AF Unix socket name as host name instead of localhost and avoid repeating the full URL for callers of _request() for the Manager API requests from the client. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-17 10:24:48 +02:00
Benny Halevy	1b997a8514	frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc It is a range_tombstone_change, not a range_tombstone. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 10:17:42 +03:00
Benny Halevy	87fd4a7d82	frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes If the consumer returns stop_iteration::yes for a flushed row (static or clustered), we should return early and not consume any more fragments, until `on_end_of_partition`, where we may still consume a closing range_tombstone_change past the last consumed row. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 10:17:42 +03:00
Benny Halevy	f11a5e2ec8	frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally Consuming the static row is the first opportunity for the consumer to return stop_iteration::yes, so there's no point in checking `_stop_consuming` before consuming it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 10:17:42 +03:00
Benny Halevy	4b4eb9037a	frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen We already flushed rt_gen when building the current_row. When we get to flush_rows_and_tombstones, we should just consume it, as the passed position is not of the current_row but rather a position following it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-17 10:17:42 +03:00
Nadav Har'El	055340ae39	cql-pytest: increase more timeouts In commit `7eda6b1e90`, we increased the request_timeout parameter used by cql-pytest tests from the default of 10 seconds to 120 seconds. 10 seconds was usually more than enough for finishing any Scylla request, but it turned out that in some extreme cases of a debug build running on an extremely over-committed machine, the default timeout was not enough. Recently, in issue #11289 we saw additional cases of timeouts which the request_timeout setting did not solve. It turns out that the Python CQL driver has two additional timeout settings - connect_timeout and control_connection_timeout, which default to 5 seconds and 2 seconds respectively. I believe that most of the timeouts in issue #11289 come from the control_connection_timeout setting - by changing it to a tiny number (e.g., 0.0001) I got the same error messages as those reported in #11289. The default of that timeout - 2 seconds - is certainly low enough to be reached on an extremely over-committed machine. So this patch significantly increases both connect_timeout and control_connection_timeout to 60 seconds. We don't care that this timeout is ridiculously large - under normal operations it will never be reached. There is no code which loops for this amount of time, for example. Refs #11289 (perhaps even Fixes, we'll need to see that the test errors go away). NOTE: This patch only changes test/cql-pytest/util.py, which is only used by the cql-pytest test suite. We have multiple other test suites which copied this code, and those test suites might need fixing separately. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11295	2022-08-16 19:11:59 +03:00
Kamil Braun	08842444b4	gms: gossiper: include nodes with empty feature sets when calculating enabled features Right now, if there's a node for which we don't know the features supported by this node (they are neither persisted locally, nor gossiped by that node), we would skip this node in calculating the set of enabled features and potentially enable a feature which shouldn't be enabled - because that node may not know it. We should only enable a feature when we know that all nodes have upgraded and know the feature. This bug caused us problems when we tried to move RAFT out of experimental. There are dtests such as `partitioner_tests.py` in which nodes would enable features prematurely, which caused the Raft upgrade procedure to break (the procedure starts only when all nodes upgrade and announce that they know the SUPPORTS_RAFT cluster feature). Closes #11225	2022-08-16 19:07:41 +03:00
Piotr Sarna	cf30d4cbcf	Merge 'Secondary index of collection columns' from Nadav Har'El This pull request introduces global secondary-indexing for non-frozen collections. The intent is to enable such queries: ``` CREATE TABLE test(int id, somemap map<int, int>, somelist<int>, someset<int>, PRIMARY KEY(id)); CREATE INDEX ON test(keys(somemap)); CREATE INDEX ON test(values(somemap)); CREATE INDEX ON test(entries(somemap)); CREATE INDEX ON test(values(somelist)); CREATE INDEX ON test(values(someset)); -- index on test(c) is the same as index on (values(c)) CREATE INDEX IF NOT EXISTS ON test(somelist); CREATE INDEX IF NOT EXISTS ON test(someset); CREATE INDEX IF NOT EXISTS ON test(somemap); SELECT * FROM test WHERE someset CONTAINS 7; SELECT * FROM test WHERE somelist CONTAINS 7; SELECT * FROM test WHERE somemap CONTAINS KEY 7; SELECT * FROM test WHERE somemap CONTAINS 7; SELECT * FROM test WHERE somemap[7] = 7; ``` We use here all-familiar materialized views (MVs). Scylla treats all the collections the same way - they're a list of pairs (key, value). In case of sets, the value type is dummy one. In case of lists, the key type is TIMEUUID. When describing the design, I will forget that there is more than one collection type. Suppose that the columns in the base table were as follows: ``` pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2) ``` The MV schema is as follows (the names of columns which are not the same as in base might be different). All the columns here form the primary key. ``` -- for index over entries indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int -- for index over keys indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int -- for index over values indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int ``` The reason for the last additional column is that the values from a collection might not be unique. 
Fixes #2962 Fixes #8745 Fixes #10707 This patch does not implement local secondary indexes for collection columns: Refs #10713. Closes #10841 * github.com:scylladb/scylladb: test/cql-pytest: un-xfail yet another passing collection-indexing test secondary index: fix paging in map value indexing test/cql-pytest: test for paging with collection values index cql, view: rename and explain bytes_with_action cql, index: make collection indexing a cluster feature test/cql-pytest: failing tests for oversized key values in MV and SI cql: fix secondary index "target" when column name has special characters cql, index: improve error messages cql, index: fix default index name for collection index test/cql-pytest: un-xfail several collecting indexing tests test/cql-pytest/test_secondary_index: verify that local index on collection fails. docs/design-notes/secondary_index: add `VALUES` to index target list test/cql-pytest/test_secondary_index: add randomized test for indexes on collections cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail. 
test/boost/secondary_index_test: update for non-frozen indexes on collections test/cql-pytest: Uncomment collection indexes tests that should be working now cql, index: don't use IS NOT NULL on collection column cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections cql3, index: Use entries() indexes on collections for queries cql3, index: Use keys() and values() indexes on collections for queries. types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function db/view/view.cc: compute view_updates for views over collections view info: has_computed_column_depending_on_base_non_primary_key column_computation: depends_on_non_primary_key_column schema, index/secondary_index_manager: make schema for index-induced mv index/secondary_index_manager: extract keys, values, entries types from collection cql3/statements/: validate CREATE INDEX for index over a collection cql3/statements/create_index_statement,index_target: rewrite index target for collection column_computation.hh, schema.cc: collection_column_computation column_computation.hh, schema.cc: compute_value interface refactor Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`	2022-08-16 14:18:51 +02:00
Nadav Har'El	fbb0b66d0c	test/cql-pytest: fix run's "--ssl" option Commit `23acc2e848` broke the "--ssl" option of test/cql-pytest/run (which makes Scylla - and cqlpytest - use SSL-encrypted CQL). The problem was that there was a confusion between the "ssl" module (Python's SSL support) and a new "ssl" variable. A rename and a missing "import" solves the breakage. We never noticed this because Jenkins does not run cql-pytest/run with --ssl (actually, it no longer runs cql-pytest/run at all). It is still a useful option for checking SSL-related problems in Scylla and Seastar. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11292	2022-08-16 12:29:05 +02:00
Kamil Braun	4e35e62597	Merge 'Raft test topology part 3' from Alecco Test schema changes when there was an underlying topology change. - per test case checks of cluster health and cycling - helper class to do cluster manager API requests - tests can perform topology changes: stop/start/restart servers - modified clusters are marked dirty and discarded after the test case - cql connection is updated per topology change and per cluster change Closes #11266 * github.com:scylladb/scylladb: test.py: test topology and schema changes test.py: ClusterManager API mark cluster dirty test.py: call before/after_test for each test case test.py: handle driver connection in ManagerClient test.py: ClusterManager API and ManagerClient test.py: improve topology docstring	2022-08-16 11:00:26 +02:00
Avi Kivity	afa7960926	Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later. Fixes: https://github.com/scylladb/scylladb/issues/11264 Closes #11273 * github.com:scylladb/scylladb: querier: querier_cache: remove now unused evict_all_for_table() database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table() reader_concurrency_semaphore: add evict_inactive_reads_for_table()	2022-08-15 19:05:59 +03:00
Botond Dénes	d56dcb842c	db/virtual_table: add virtual destructor to virtual_table It should have had one: derived instances are stored and destroyed via the base class. The only reason this hasn't caused bugs yet is that derived instances happen to not have any non-trivial members yet. Closes #11293	2022-08-15 16:58:05 +03:00
Avi Kivity	73d4930815	Merge 'test/lib: various improvements to sstable test env' from Botond Dénes A mixed bag of improvements developed as part of another PR (https://github.com/scylladb/scylladb/pull/10736). Said PR was closed so I'm submitting these improvements separately. Closes #11294 * github.com:scylladb/scylladb: test/lib: move convenience table config factory to sstable_test_env test/lib/sstable_test_env: move members to impl struct test/lib/sstable_utils: use test_env::do_with_async()	2022-08-15 16:57:01 +03:00
Botond Dénes	92e5f438a4	querier: querier_cache: remove now unused evict_all_for_table()	2022-08-15 14:16:41 +03:00
Botond Dénes	2b1eb6e284	database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table() Instead of querier_cache::evict_all_for_table(). The new method covers all queriers and, in addition, any other inactive reads registered on the semaphore. In theory, by the time we detach a table, no regular inactive reads should be in the semaphore anymore, but if any remain, we had better evict them before the table is destroyed, since they might attempt to access it when destroyed later.	2022-08-15 14:16:41 +03:00
Botond Dénes	e55ccbde8f	reader_concurrency_semaphore: add evict_inactive_reads_for_table() Allowing for evicting all inactive reads that belong to a certain table.	2022-08-15 14:16:41 +03:00
Botond Dénes	c8ef356859	test/lib: move convenience table config factory to sstable_test_env All users of `column_family_test_config()`, get the semaphore parameter for it from `sstable_test_env`. It is clear that the latter serves as the storage space for stable objects required by the table config. This patch just enshrines this fact by moving the config factory method to `sstable_test_env`, so it can just get what it needs from members.	2022-08-15 11:23:59 +03:00
Botond Dénes	c0e017e0f7	test/lib/sstable_test_env: move members to impl struct All present members of sstable_test_env are std::unique_ptr<>:s because they require stable addresses. This makes their handling somewhat awkward. Move all of them into an internal `struct impl` and make that member a unique ptr.	2022-08-15 11:20:09 +03:00
Botond Dénes	a9f296ed47	test/lib/sstable_utils: use test_env::do_with_async() Instead of manually instantiating test_env.	2022-08-15 11:19:27 +03:00
Botond Dénes	a9573b84c5	Merge 'commitlog: Revert/modify `fac2bc4` - do footprint add in delete' from Calle Wilund Fixes #11184 Fixes #11237 In the previous (broken) fix for https://github.com/scylladb/scylladb/issues/11184 we added the footprint for left-over files (replay candidates) to disk footprint on commitlog init. This effectively prevents us from creating segments if we have tight limits. Since we nowadays do quite a bit of inserts _before_ commitlog replay (system.local, but...) we can end up in a situation where we deadlock start because we cannot get to the actual replay that will eventually free things. Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances - even though only shard 0 will get to actually replay + delete (i.e. drop footprint). So shards 1-X would all be either locked out or performance degraded. Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved. To further emphasize this, don't store segments found on init scan in all shard instances, instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at least don't store pointless string lists in shards 1 to X, and also we can potentially ask for the list twice. More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is considered candidate for recycling, and included in footprint. Closes #11251 * github.com:scylladb/scylladb: commitlog: Make get_segments_to_replay on-demand commitlog: Revert/modify `fac2bc4` - do footprint add in delete	2022-08-15 09:10:32 +03:00
Botond Dénes	8f10413087	Merge 'doc: describe specifying workload attributes with service levels' from Anna Stuchlik Fix https://github.com/scylladb/scylladb/issues/11197 This PR adds a new page where specifying workload attributes with service levels is described and adds it to the menu. Also, I had to fix some links because of the warnings. Closes #11209 * github.com:scylladb/scylladb: doc: remove the reduntant space from index doc: update the syntax for defining service level attributes doc: rewording doc: update the links to fix the warnings doc: add the new page to the toctree doc: add the descrption of specifying workload attributes with service levels doc: add the definition of workloads to the glossary	2022-08-15 07:14:28 +03:00
Nadav Har'El	c8b5c3595e	Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity Increase readability in preparation for managing topology with effective_replication_map (continuing `69aea59d9`). Closes #11290 * github.com:scylladb/scylladb: cql3: select_statement: improve loop termination condition in indexed_table_select_statement::do_execute_base_query() cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query() cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query() cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()	2022-08-14 23:26:06 +03:00
Nadav Har'El	4a4231ea53	Merge 'storage_proxy: coroutinize some counter mutate functions' from Avi Kivity In preparation for effective_replication_map hygiene, convert some counter functions to coroutines to simplify the changes. Closes #11291 * github.com:scylladb/scylladb: storage_proxy: mutate_counters_on_leader: coroutinize storage_proxy: mutate_counters: coroutinize storage_proxy: mutate_counters: reorganize error handling	2022-08-14 23:16:42 +03:00
Avi Kivity	8070cdbbf9	storage_proxy: mutate_counters_on_leader: coroutinize Simplify ahead of refactoring for consistent effective_replication_map.	2022-08-14 17:36:58 +03:00
Avi Kivity	6e330d98d2	storage_proxy: mutate_counters: coroutinize Simplify ahead of refactoring for consistent effective_replication_map. This is probably a pessimization of the error case, but the error case will be terrible in any case unless we resultify it.	2022-08-14 17:28:46 +03:00
Avi Kivity	105b066ff7	storage_proxy: mutate_counters: reorganize error handling Move the error handling function where it's used so the code is more straightforward. Due to some std::move()s later, we must still capture the schema early.	2022-08-14 17:13:22 +03:00
Avi Kivity	fbaa280acd	cql3: select_statement: improve loop termination condition in indexed_table_select_statement::do_execute_base_query() Move the termination condition to the front of the loop so it's clear why we're looping and when we stop. It's less than perfectly clean since we widen the scope of some variables (from loop-internal to loop-carried), but IMO it's clearer.	2022-08-14 15:40:45 +03:00
Avi Kivity	60c7c11c96	cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query() Reindent after coroutinization. No functional changes.	2022-08-14 15:35:36 +03:00
Avi Kivity	492dc6879e	cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query() It's much easier to maintain this way. Since it uses ranges_to_vnodes, it interacts with topology and needs integration into effective_replication_map management. The patch leaves bad indentation and an infinite-looking loop in the interest of minimization, but that will be corrected later. Note, the test for `!r.has_value()` was eliminated since it was short-circuited by the test for `!rqr.has_value()` returning from the coroutine rather than propagating an error.	2022-08-14 15:31:45 +03:00
Avi Kivity	973034978c	cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query() We use result_wrap() in two places, but that makes coroutinizing the containing function a little harder, since it's composed of more lambdas. Remove the wrappers, gaining a bit of performance in the error case.	2022-08-14 15:22:18 +03:00
Kamil Braun	b4c5b79f5e	db: system_distributed_keyspace: don't call `on_internal_error` in `check_exists` The function `check_exists` checks whether a given table exists, giving an error otherwise. It previously used `on_internal_error`. `check_exists` is used in some old functions that insert CDC metadata to CDC tables. These tables are no longer used in newer Scylla versions (they were replaced with other tables with different schema), and this function is no longer called. The table definitions were removed and these tables are no longer created. They will only exist in clusters that were upgraded from old versions of Scylla (4.3) through a sequence of upgrades. If you tried to upgrade from a very old version of Scylla which had neither the old nor the new tables to a modern version, say from 4.2 to 5.0, you would get `on_internal_error` from this `check_exists` function. Fortunately: 1. we don't support such upgrade paths 2. `on_internal_error` in production clusters does not crash the system, only throws. The exception would be caught, printed, and the system would run (just without CDC - until you finished upgrade and called the proper nodetool command to fix the CDC module). Unfortunately, there is a dtest (`partitioner_tests.py`) which performs an unsupported upgrade scenario - it starts Scylla from Cassandra (!) work directories, which is like upgrading from a very old version of Scylla. This dtest was not failing due to another bug which masked the problem. When we try to fix the bug - see #11225 - the dtest starts hitting the assertion in `check_exists`. Because it's a test, we configure `on_internal_error` to crash the system. The point of this commit is to not crash the system in this rare scenario which happens only in some weird tests. We now throw `std::runtime_error` instead of calling `on_internal_error`. In the dtest, we already ignore the resulting CDC error appearing in the logs (see scylladb/scylla-dtest#2804). 
Together with this change, we'll be able to fix the #11225 bug and pass this test. Closes #11287	2022-08-14 13:12:03 +03:00
Nadav Har'El	329068df99	test/cql-pytest: un-xfail yet another passing collection-indexing test After collection indexing has been implemented, yet another test which failed because of #2962 now passes. So remove the "xfail" marker. Refs #2962 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	f6f18b187a	secondary index: fix paging in map value indexing When indexing a map column's values, if the same value appears more than once, the same row will appear in the index more than once. We had code that removed these duplicates, but this deduplication did not work across page boundaries. We had two xfailing tests to demonstrate this bug. In this patch we fix this bug by looking at the page's start and not generating the same row again, thereby getting the same deduplication we had inside pages - now across pages. The previously-xfailing tests now pass, and their xfail tag is removed. I also added another test, for the case where the base table has only partition keys without clustering keys. This second test is important because the code path for the partition-key-only case is different, and the second test exposed a bug in it as well (which is also fixed in this patch). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
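The cross-page deduplication described in this commit can be illustrated with a small sketch. It assumes, as a related commit in this series states, that duplicate base-table keys arrive consecutively, and it carries the last key of the previous page into the next one. The helper name `dedup_page` and the tuple-based keys are hypothetical; the actual fix inspects the page's start rather than passing state around like this.

```python
from typing import Hashable, Iterable, List, Tuple

_NO_KEY = object()  # sentinel meaning "no row has been seen yet"


def dedup_page(keys: Iterable[Hashable], last_seen: object = _NO_KEY) -> Tuple[List[Hashable], object]:
    """Drop consecutive duplicate keys, including a duplicate of the
    last key seen on the previous page.

    Hypothetical helper: returns the deduplicated page and the last key
    seen, which the caller passes in when processing the next page.
    """
    out: List[Hashable] = []
    for key in keys:
        if key != last_seen:
            out.append(key)
        last_seen = key
    return out, last_seen
```

Passing the second return value into the next call gives the same deduplication across page boundaries as within a single page.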
Nadav Har'El	dc445b9a73	test/cql-pytest: test for paging with collection values index If a map has several keys with the same value, then the "values(m)" index must remember all of them as matching the same row - because later we may remove one of these keys from the map but the row would still need to match the value because of the remaining keys. We already had a test (test_index_map_values) that although the same row appears more than once for this value, when we search for this value the result only returns the row once. Under the hood, Scylla does find the same value multiple times, but then eliminates the duplicate matched row and returns it only once. But there is a complication, that this de-duplication does not easily span paging. So in this patch we add a test that checks that paging does not cause the same row to be returned more than once. Unfortunately, this test currently fails on Scylla, so it is marked "xfail". It passes on Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	5d556115a1	cql, view: rename and explain bytes_with_action The structure "bytes_with_action" was very hard to understand because of its mysterious and general-sounding name, and no comments. In this patch I add a large comment explaining its purpose, and rename it to a more suitable name, view_key_and_action, which suggests that each such object is about one view key (where to add a view row), and an additional "action" that we need to take beyond adding the view row. This is the best I can do to make this code easier to understand without completely reorganizing it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	8b00c91c13	cql, index: make collection indexing a cluster feature Prevent a user from creating a secondary index on a collection column if the cluster has any nodes which don't support this feature. Such nodes will not be able to correctly handle requests related to this index, so better not allow creating one. Attempting to create an index on a collection before the entire cluster supports this feature will result in the error: Indexing of collection columns not supported by some older nodes in this cluster. Please upgrade them. Tested by manually disabling this feature in feature_service.cc and seeing this error message during collection indexing test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	aa86f808a6	test/cql-pytest: failing tests for oversized key values in MV and SI In issue #9013, we noticed that if a value larger than 64 KB is indexed, the write fails in a bad way, and we fixed it. But the test we wrote when fixing that issue already suggested that something was still wrong: Cassandra failed the write cleanly, with an InvalidRequest, while Scylla failed with a mysterious WriteFailure (with a relevant error message only in the log). This patch adds several xfailing tests which demonstrate what's still wrong. This is also summarized in issue #8627: 1. A write of an oversized value to an indexed column returns the wrong error message. 2. The same problem also exists when indexing a collection, and the indexed key or value is oversized. 3. The situation is even less pleasant when adding an index to a table with pre-existing data and an oversized value. In this case, the view building will fail on the bad row, and never finish. 4. We have exactly the same bugs not just with indexes but also with materialized views. Interestingly, Cassandra has similar bugs in materialized views as well (but not in the secondary index case, where Cassandra does behave as expected). Refs #8627. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	2c244c6e09	cql: fix secondary index "target" when column name has special characters Unfortunately, we encode the "target" of a secondary index in one of three ways: 1. It can be just a column name 2. It can be a string like keys(colname) - for the new type of collection indexes introduced in this series. 3. It can be a JSON map ({ ... }). This form is used for local indexes. The code parsing this target - target_parser::parse() - must not confuse these different formats. Before this patch, if the column name contains special characters like braces or parentheses (this is allowed in CQL syntax, via quoting), we can confuse cases 1, 2, and 3: A column named "keys(colname)" will be confused for case 2, and a column named "{123}" will be confused with case 3. This problem can break indexing of some specially-crafted column names - as reproduced by test_secondary_index.py::test_index_quoted_names. The solution adopted in this patch is that the column name in case 1 should be escaped somehow so it cannot possibly be confused with either case 2 or 3. The way we chose is to convert the column name to CQL (with column_definition::as_cql_name()). In other words, if the column name contains non-alphanumeric characters, it is wrapped in quotes and also quotes are doubled, as in CQL. The result of this can't be confused with case 2 or 3, neither of which may begin with a quote. This escaping is not the minimal we could have done, but incidentally it is exactly what Cassandra does as well, so I used it as well. This change is mostly backward compatible: Already-existing indexes will still have unescaped column names stored for their "target" string, and the unescaping code will see they are not wrapped in quotes, and not change them. Backward compatibility will only fail on existing indexes on columns whose names begin and end with the quote character - but this case is extremely unlikely. 
This patch illustrates how un-ideal our index "target" encoding is, but isn't what made it un-ideal. We should not have used three different formats for the index target - the third representation (JSON) should have sufficed. However, the two other representations are identical to Cassandra's, so using them when we can has its compatibility advantages. The patch makes test_secondary_index.py::test_index_quoted_names pass. Fixes #10707. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
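The escaping scheme described in this commit (wrap the name in double quotes and double any embedded quotes) can be sketched as follows. This is a simplified illustration, not the behaviour of `column_definition::as_cql_name()` itself: the function names are hypothetical, and the rule for which names may stay unquoted (lowercase letter followed by lowercase letters, digits or underscores) is an assumption that ignores reserved keywords.

```python
import re

# Assumed rule for names that need no quoting; reserved words are ignored here.
_UNQUOTED = re.compile(r"[a-z][a-z0-9_]*")


def escape_target(column_name: str) -> str:
    """Hypothetical: encode a plain column name so it cannot be mistaken
    for a keys(...)-style target or a JSON target."""
    if _UNQUOTED.fullmatch(column_name):
        return column_name
    return '"' + column_name.replace('"', '""') + '"'


def unescape_target(target: str) -> str:
    """Hypothetical inverse: names not wrapped in quotes (for example from
    indexes created before the change) are returned unchanged."""
    if len(target) >= 2 and target.startswith('"') and target.endswith('"'):
        return target[1:-1].replace('""', '"')
    return target
```

An escaped name always starts with a quote when it contains special characters, so it cannot collide with a `keys(colname)` string or a JSON map starting with `{`.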
Nadav Har'El	56204a3794	cql, index: improve error messages Before this patch, trying to create an index on entries(x) where x is not a map results in an error message: Cannot create index on index_keys_and_values of column x The string "index_keys_and_values" is strange - Cassandra prints the easier to understand string "entries()" - which better corresponds to what the user actually did. It turns out that this string "index_keys_and_values" comes from an elaborate set of variables and functions spanning multiple source files, used to convert our internal target_type variable into such a string. But although this code was called "index_option" and sounded very important, it was actually used just for one thing - error messages! So in this patch we drop the entire "index_option" abstraction, replacing it by a static trivial function defined exactly where it's used (create_index_statement.cc), which prints a target type. While at it, we print "entries()" instead of "index_keys_and_values" ;-) After this patch, the test_secondary_index.py::test_index_collection_wrong_type finally passes (the previous patch fixed the default table names it assumes, and this patch fixes the expected error messages), so its "xfail" tag is removed. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	84461f1827	cql, index: fix default index name for collection index When creating an index "CREATE INDEX ON tbl(keys(m))", the default name of the index should be tbl_m_idx - with just "m". The current code incorrectly used the default name tbl_m_keys_idx, so this patch adds a test (which passes on Cassandra, and after this patch also on Scylla) and fixes the default name. It turns out that the default index name was based on a mysterious index_target::as_string(), which printed the target "keys(m)" as "m_keys" without explaining why it was so. This method was actually used only in three places, and all of them wanted just the column name, without the "_keys" suffix! So in this patch we rename the mysterious as_string() to column_name(), and use this function instead. Now that the default index name uses column_name() and gets just column_name(), the correct default index name is generated, and the test passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Nadav Har'El	94ba03a4d6	test/cql-pytest: un-xfail several collecting indexing tests After the previous patches implemented collection indexing, several tests in test/cql-pytest/test_secondary_index.py that were marked with "xfail" started to pass - so here we remove the xfail. Only three collection indexing tests continue to xfail: test_secondary_index.py::test_index_collection_wrong_type test_secondary_index.py::test_index_quoted_names (#10707) test_secondary_index.py::test_local_secondary_index_on_collection (#10713) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Michał Radwański	2690ecd65d	test/cql-pytest/test_secondary_index: verify that local index on collection fails. Collection indexing is being tracked by #2962. Global secondary index over collection is enabled by #10123. Leave this test to track this behaviour. Related issue: #10713	2022-08-14 10:29:52 +03:00
Michał Radwański	1d852a9c7f	docs/design-notes/secondary_index: add `VALUES` to index target list A new secondary index target is being supported, which is `VALUES(v)`.	2022-08-14 10:29:52 +03:00
Michał Radwański	25f4c905f5	test/cql-pytest/test_secondary_index: add randomized test for indexes on collections	2022-08-14 10:29:52 +03:00
Michał Radwański	2a8289c101	cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra	2022-08-14 10:29:52 +03:00
Michał Radwański	fb476702a7	cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.	2022-08-14 10:29:52 +03:00
Michał Radwański	f572051ee9	test/boost/secondary_index_test: update for non-frozen indexes on collections	2022-08-14 10:29:52 +03:00
Karol Baryła	9e377b2824	test/cql-pytest: Uncomment collection indexes tests that should be working now	2022-08-14 10:29:52 +03:00
Nadav Har'El	67990d2170	cql, index: don't use IS NOT NULL on collection column When the secondary-index code builds a materialized view on column x, it adds "x IS NOT NULL" to the where-clause of the view, as required. However, when we index a collection column, we index individual pieces of the collection (keys, values), not the entire collection, so checking if the entire collection is null does not make sense. Moreover, for a collection column x, "x IS NOT NULL" currently doesn't work and throws errors when evaluating that expression when data is written to the table. The solution used in this patch is to simply avoid adding the "x IS NOT NULL" when creating the materialized view for a collection index. Everything works just fine without it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Michał Radwański	bd44bc3e35	cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows The index on collection values is special in a way, as its clustering key contains not only the base primary key, but also a column that holds the keys of the cells in the collection, which makes it possible to distinguish cells with different keys but the same value. This has an unwanted consequence: it is possible to receive two identical base table primary keys from indexed_table_select_statement::find_index_clustering_rows. Thankfully, the duplicate primary keys are guaranteed to occur consecutively.	2022-08-14 10:29:52 +03:00
Michał Radwański	10e241988e	cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections	2022-08-14 10:29:52 +03:00
Karol Baryła	ac97086855	cql3, index: Use entries() indexes on collections for queries Previous commit added the ability to use GSI over non-frozen collections in queries, but only the keys() and values() indexes. This commit adds support for the missing index type - entries() index. Signed-off-by: Karol Baryła <karol.baryla@scylladb.com> Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Karol Baryła	7966841d37	cql3, index: Use keys() and values() indexes on collections for queries. Previous commits added the possibility of creating GSI on non-frozen collections. This (and next) commit allow those indexes to actually be used by queries. This commit enables both keys() and values() indexes, as they are pretty similar.	2022-08-14 10:29:52 +03:00
Karol Baryła	aa47f4a15c	types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented std::begin in concept for build_value_fragmented's parameter allows creating it from an array	2022-08-14 10:29:52 +03:00
Michał Radwański	e6521ff8ba	cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function GCC doesn't consider switches over enums to be exhaustive. Replace the bogus return value after a switch where each of the cases returns, with an exception.	2022-08-14 10:29:52 +03:00
Michał Radwański	32289d681f	db/view/view.cc: compute view_updates for views over collections For collection indexes, the logic for computing the values of each column needed to change, since a single column might produce more than one value. The liveness info from individual cells of the collection impacts the liveness info of the resulting rows. Therefore the control flow had to be rewritten: instead of functions getting a row from get_view_row and later computing row markers and applying them, they compute these values themselves. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:49 +03:00
Michał Radwański	112086767c	view info: has_computed_column_depending_on_base_non_primary_key In case of secondary indexes, if an index does not contain any column from the base which makes up for the primary key, then it is assumed that during update, a change to some cells from the base table cannot cause that we're dealing with a different row in the view. This however doesn't take into account the possibility of computed columns which in fact do depend on some non-primary-key columns. Introduce additional property of an index, has_computed_column_depending_on_base_non_primary_key.	2022-08-14 10:29:14 +03:00
Michał Radwański	4cfd264e5d	column_computation: depends_on_non_primary_key_column depends_on_non_primary_key_column for a column computation is needed to detect a case where the primary key of a materialized view depends on a non primary key column from the base table, but at the same time, the view itself doesn't have non-primary key columns. This is an issue, since as for now, it was assumed that no non-primary key columns in view schema meant that the update cannot change the primary key of the view, and therefore the update path can be simplified.	2022-08-14 10:29:14 +03:00
Michał Radwański	f1a9def2e1	schema, index/secondary_index_manager: make schema for index-induced mv Indexes over collections use materialized views. Supposing that we're dealing with global indexes, and that pk, ck were the partition and clustering keys of the base table, the schema of the materialized view, apart from having idx_token (which is used to preserve the order on the entries in the view), has a computed column coll_value (the name is not guaranteed to be exactly) and potentially also coll_keys_for_values_index, if the index was over collection values. This is needed, since values in a specific collection need not be unique. To summarize, the primary key is as follows: coll_value, idx_token, pk, ck, coll_keys_for_values_index? where coll_value is the computed value from the collection, be it a key from the collection, a value from the collection, or the tuple containing both.	2022-08-14 10:29:14 +03:00
Michał Radwański	60d50f6016	index/secondary_index_manager: extract keys, values, entries types from collection These functions are relevant for indexes over collections (creating schema for a materialized view related to the index). Signed-off-by: Michał Radwański <michal.radwanski@scylladb.com> Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:14 +03:00
Michał Radwański	cbe33f8d7a	cql3/statements/: validate CREATE INDEX for index over a collection Allow CQL like this: CREATE INDEX idx ON table(some_map); CREATE INDEX idx ON table(KEYS(some_map)); CREATE INDEX idx ON table(VALUES(some_map)); CREATE INDEX idx ON table(ENTRIES(some_map)); CREATE INDEX idx ON table(some_set); CREATE INDEX idx ON table(VALUES(some_set)); CREATE INDEX idx ON table(some_list); CREATE INDEX idx ON table(VALUES(some_list)); This is needed to support creating indexes on collections.	2022-08-14 10:29:13 +03:00
Michał Radwański	997682ed72	cql3/statements/create_index_statement,index_target: rewrite index target for collection The syntax used for creating indexes on collections that is present in Cassandra is unintuitive from the internal representation point of view. For instance, index on VALUES(some_set) indexes the set elements, which in the internal representation are keys of collection. Rewrite the index target after receiving it, so that the index targets are consistent with the representation.	2022-08-14 10:29:13 +03:00
Michał Radwański	ebc4ad4713	column_computation.hh, schema.cc: collection_column_computation This type of column computation will be used for creating updates to materialized views that are indexes over collections. This type features additional function, compute_values_with_action, which depending on an (optional) old row and new row (the update to the base table) returns multiple bytes_with_action, a vector of pairs (computed value, some action), where the action signifies whether a deletion of row with a specific key is needed, or creation thereby.	2022-08-14 10:29:13 +03:00
Michał Radwański	2babee2cdc	column_computation.hh, schema.cc: compute_value interface refactor The compute_value function of column_computation has had previously the following signature: virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override; This is superfluous, since never in the history of Scylla was the last parameter (row) used in any implementation, and never did it happen that it returned bytes_opt. The absurdity of this interface can be seen especially when looking at call sites like the following, where a dummy empty row was created: ``` token_column.get_computation().compute_value( *_schema, pkv_linearized, clustering_row(clustering_key_prefix::make_empty())); ```	2022-08-14 10:29:13 +03:00
Michał Radwański	166afd46b5	Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))` Brings support of cql syntax `INDEX ON table(VALUES(collection))`, even though there is still no support for indexes over collections. Previously, index_target::target_type::values was referring to values of a regular (non-collection) column. Rename it to `regular_values`. Fixes #8745.	2022-08-14 10:29:13 +03:00
Piotr Sarna	fe617ed198	Merge 'db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column' from Piotr Dulikowski Previously, the `system.local`'s `rpc_address` column kept local node's `rpc_address` from the scylla.yaml configuration. Although it sounds like it makes sense, there are a few reasons to change it to the value of scylla.yaml's `broadcast_rpc_address`: - The `broadcast_rpc_address` is the address that the drivers are supposed to connect to. `rpc_address` is the address that the node binds to - it can be set for example to 0.0.0.0 so that Scylla listens on all addresses, however this gives no useful information to the driver. - The `system.peers` table also has the `rpc_address` column and it already keeps other nodes' `broadcast_rpc_address`es. - Cassandra is going to do the same change in the upcoming version 4.1. Fixes: #11201 Closes #11204 * github.com:scylladb/scylladb: db/system_keyspace: fix indentation after previous patch db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column	2022-08-12 16:24:28 +02:00
Anna Stuchlik	41362829b5	doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information	2022-08-12 14:39:10 +02:00
Anna Stuchlik	b45ba69a6c	doc: update the guides for Ubuntu and Debian to remove image information and the OS version number	2022-08-12 14:05:49 +02:00
Anna Stuchlik	24acffc2ce	doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1	2022-08-12 13:47:03 +02:00
Piotr Sarna	1ab4c6aab3	Merge 'cql3: enable collections as UDA accumulators' from Wojciech Mitros Currently, the initial values of UDA accumulators are converted to strings using the to_string() method and from strings using the from_string() method. The from_string() method is not implemented for collections, and it can't be implemented without changing the string format, because in that format, we cannot differentiate whether a separator is a part of a value or is an actual separator between values. In particular, the separators are not escaped in the collection values. Instead of from_string()/to_string() the cql parser is used for creating a value from a string (the same , and to_parsable_string() is used to converting a value into a string. A test using a list as an accumulator is added to cql-pytest/test_uda.py. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #11250 * github.com:scylladb/scylladb: cql3: enable collections as UDA accumulators cql3: extend implementation of to_bytes for raw_value	2022-08-12 12:51:17 +02:00
Botond Dénes	ceb1cdcb7a	Merge 'doc: fix the typo on the Fault Tolerance page' from Anna Stuchlik Fix https://github.com/scylladb/scylla-doc-issues/issues/438 In addition, I've replaced "Scylla" with "ScyllaDB" on that page. Closes #11281 * github.com:scylladb/scylladb: doc: replace Scylla with ScyllaDB on the Fault Tolerance page doc: fis the typo in the note	2022-08-12 06:58:39 +03:00
Nadav Har'El	c27f431580	test/alternator: fix a flaky test for full-table scan page size This patch fixes the test test_scan.py::test_scan_paging_missing_limit which failed in a Jenkins run once (that we know of). That test verifies that an Alternator Scan operation without an explicit "Limit" is nevertheless paged: DynamoDB (and also Scylla) wanted this page size to be 1 MB, but it turns out (see #10327) that because of the details of how Scylla's scan works, the page size can be larger than 1 MB. How much larger? I ran this test hundreds of times and never saw it exceed a 3 MB page - so the test asserted the page must be smaller than 4 MB. But now in one run - we got to this 4 MB and failed the test. So in this patch we increase the table to be scanned from 4 MB to 6 MB, and assert the page size isn't the full 6 MB. The chance that this size will eventually fail as well should be (famous last words...) very small for two reasons: First because 6 MB is even higher than the maximum I saw in practice, and second because empirically I noticed that adding more data to the table reduces the variance of the page size, so it should become closer to 1 MB and reduce the chance of it reaching 6 MB. Refs #10327 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11280	2022-08-12 06:57:45 +03:00
Botond Dénes	2a39d6518d	Merge 'doc: clarify the disclaimer about reusing deleted counter column values' from Anna Stuchlik Fix https://github.com/scylladb/scylla-doc-issues/issues/857 Closes #11253 * github.com:scylladb/scylladb: doc: language improvemens to the Counrers page doc: fix the external link doc: clarify the disclaimer about reusing deleted counter column values	2022-08-12 06:56:28 +03:00
Botond Dénes	10371441c9	Merge 'docs: add a disclaimer about not supporting local counters by SSTableLoader' from Anna Stuchlik Fix https://github.com/scylladb/scylla-doc-issues/issues/867 Plus some language, formatting, and organization improvements. Closes #11248 * github.com:scylladb/scylladb: doc: language, formatting, and organization improvements doc: add a disclaimer about not supporting local counters by SSTableLoader	2022-08-12 06:55:00 +03:00
Benny Halevy	d295d8e280	everywhere: define locator::host_id as a strong tagged_uuid type So it can be distinguished from other uuid-based identifiers in the system. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11276	2022-08-12 06:01:44 +03:00
Botond Dénes	69aea59d97	Merge 'storage_proxy: use consistent topology, prepare for fencing' from Avi Kivity Replication is a mix of several inputs: tokens and token->node mappings (topology), the replication strategy, replication strategy parameters. These are all captured in effective_replication_map. However, if we use effective_replication_map:s captured at different times in a single query, then different uses may see different inputs to effective_replication_map. This series protects against that by capturing an effective_replication_map just once in a query, and then using it. Furthermore, the captured effective_replication_map is held until the query completes, so topology code can know when a topology is no longer in use (although this isn't exploited in this series). Only the simple read and write paths are covered. Counters and paxos are left for later. I don't think the series fixes any bugs - as far as I could tell everything was happening in the same continuation. But this series ensures it. Closes #11259 * github.com:scylladb/scylladb: storage_proxy: use consistent topology storage_proxy: use consistent replication map on read path storage_proxy: use consistent replication map on write path storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map consistency_level: accept effective_replication_map as parameter, rather than keyspace consistency_level: be more const when using replication_strategy	2022-08-12 06:00:30 +03:00
Alejo Sanchez	10baac1c84	test.py: test topology and schema changes Add support for topology changes: add/stop/remove/restart/replace node. Test simple schema changes when changing topology. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-11 23:39:13 +02:00
Alejo Sanchez	7f32fc0cc7	test.py: ClusterManager API mark cluster dirty Allow tests to manually mark current cluster dirty. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-11 23:39:13 +02:00
Alejo Sanchez	a585a82ad1	test.py: call before/after_test for each test case Preparing for topology tests with changing clusters, run before and after checks per test case. Change scope of pytest fixtures to function as we need them per test case. Add server and client API logic. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-11 23:39:13 +02:00
Alejo Sanchez	eedc866433	test.py: handle driver connection in ManagerClient Preparing for cluster cycling, handle driver connection in ManagerClient. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-11 23:39:13 +02:00
Alejo Sanchez	fe561a7dbd	test.py: ClusterManager API and ManagerClient Add an API via Unix socket to Manager so pytests can query information about the cluster. Requests are managed by ManagerClient helper class. The socket is placed inside a unique temporary directory for the Manager (as safe temporary socket filename is not possible in Python). Initial API services are manager up, cluster up, if cluster is dirty, cql port, configured replicas (RF), and list of host ids. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-11 23:39:13 +02:00
Alejo Sanchez	aad015d4e2	test.py: improve topology docstring Improve docstring of TopologyTestSuite to reflect its differences with other test suites. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-11 23:39:13 +02:00
Avi Kivity	a2c4f5aa1a	storage_proxy: use consistent topology Derive the topology from captured and stable effective_replication_map instead of getting a fresh topology from storage_proxy, since the fresh topology may be inconsistent with the running query. digest_read_resolver did not capture an effective_replication_map, so that is added.	2022-08-11 17:58:42 +03:00
Avi Kivity	883518697b	storage_proxy: use consistent replication map on read path Capture a replication map just once in abstract_read_executor::_effective_replication_map_ptr. Although it isn't used yet, it serves to keep a reference count on topology (for fencing), and some accesses to topology within reads still remain, which can be converted to use the member in a later patch.	2022-08-11 17:58:42 +03:00
Avi Kivity	01a614fb4d	storage_proxy: use consistent replication map on write path Capture a replication map just once in abstract_write_handler::_effective_replication_map_ptr and use it in all write handlers. A few accesses to get the topology still remain, they will be fixed up in a later patch.	2022-08-11 17:58:42 +03:00
Avi Kivity	f1b0e3d58e	storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map Allow callers to use consistent effective_replication_map:s across calls by letting the caller select the object to use.	2022-08-11 17:58:42 +03:00
Avi Kivity	46bd0b1e62	consistency_level: accept effective_replication_map as parameter, rather than keyspace A keyspace is a mutable object that can change from time to time. An effective_replication_map captures the state of a keyspace at a point in time and can therefore be consistent (with care from the caller). Change consistency_level's functions to accept an effective_replication_map. This allows the caller to ensure that separate calls use the same information and are consistent with each other. Current callers are likely correct since they are called from one continuation, but it's better to be sure.	2022-08-11 17:58:42 +03:00
Avi Kivity	1078d1bfda	consistency_level: be more const when using replication_strategy We don't modify the replication_strategy here, so use const. This will help when the object we get is const itself, as it will be in the next patches.	2022-08-11 17:58:42 +03:00
Wojciech Mitros	48bd752971	cql3: enable collections as UDA accumulators Currently, the initial values of UDA accumulators are converted to strings using the to_string() method and from strings using the from_string() method. The from_string() method is not implemented for collections, and it can't be implemented without changing the string format, because in that format, we cannot differentiate whether a separator is a part of a value or is an actual separator between values. In particular, the separators are not escaped in the collection values. For example, a list with string elements: 'a, b', 'c' would be represented as a string 'a, b, c', while now it is represented as "['a, b', 'c']". Some types that were parsable are now represented in a different way. For example, a tuple ('a', null, 0) was represented as "a:\@:0", and now it is "('a', null, 0)". Instead of from_string()/to_string() the cql parser is used for creating a value from a string (the same , and to_parsable_string() is used for converting a value into a string. A test using a list as an accumulator is added to cql-pytest/test_uda.py. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-08-11 16:23:57 +02:00
Anna Stuchlik	f5a49688ae	doc: replace Scylla with ScyllaDB on the Fault Tolerance page	2022-08-11 16:14:33 +02:00
Anna Stuchlik	7218a977df	doc: fix the typo in the note	2022-08-11 16:09:49 +02:00
Botond Dénes	d407d3b480	Merge 'Calculate effective_replication_map: prevent stalls with everywhere_replication_strategy' from Benny Halevy For replication strategies like "everywhere" and "local" that return the same set of endpoints for all tokens, we can call rs->calculate_natural_endpoints only once and reuse the result for all tokens. Note that ideally the replication_map could contain only a single token range for this case, but that doesn't seem to work yet. Add `maybe_yield()` calls to the tight loop to prevent reactor stalls on large clusters when copying a long vector returned by everywhere_replication_strategy to potentially 1000's of tokens in the map. Nicholas Peshek wrote in https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370 about similar patch by Geoffrey Beausire: `994c6ecf3c` > Yep. That dropped our startup from 3000+ seconds to about 40. Fixes #10337 Closes #11277 * github.com:scylladb/scylladb: abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies abstract_replication_strategy: add has_uniform_natural_endpoints	2022-08-11 15:19:47 +03:00
Gleb Natapov	e5157b27ad	raft: getting abort_requested_exception exception from a sm::apply is not a critical error During shutdown it is normal to get abort_requested_exception exception from a state machine "apply" method. Do not rethrow it as state_machine_error, just abort an applier loop with an info message.	2022-08-11 15:11:21 +03:00
Gleb Natapov	9977851eb1	schema_registry: fix abandoned feature warning maybe_sync ignores failed feature in case waiting is aborted. Fix it.	2022-08-11 15:11:21 +03:00
Gleb Natapov	eed8e19813	service: raft: silence rpc::closed_errors in raft_rpc Before the patch if an RPC connection was established already then the close error was reported by the RPC layer and then duplicated by raft_rpc layer. If a connection cannot be established because the remote node is already dead RPC does not report the error since we decided that in that case gossiper and failure detector messages can be used to detect the dead node case and there is no reason to pollute the logs with recurring errors. This aligns raft behaviour with what we already have in storage_proxy that does not report closed errors as well.	2022-08-11 15:11:21 +03:00
Anna Stuchlik	1603129275	doc: remove the redundant space from index	2022-08-11 12:36:16 +02:00
Anna Stuchlik	ee258cb0af	doc: update the syntax for defining service level attributes	2022-08-11 12:32:38 +02:00
Petr Gusev	4bc6611829	raft read_barrier, retry over intermittent rpc failures If the leader was unavailable during read_barrier, closed_error occurs, which was not handled in any way and eventually reached the client. This patch adds retries in this case. Fix: scylladb#11262 Refs: #11278 Closes #11263	2022-08-11 13:31:19 +03:00
Amnon Heiman	5ac20ac861	Reduce the number of per-scheduling group metrics This patch reduces the number of metrics ScyllaDB generates. Motivation: The combination of per-shard with per-scheduling group generates a lot of metrics. When combined with histograms, which require many metrics, the problem becomes even bigger. The two tools we are going to use: 1. Replace per-shard histograms with summaries 2. Do not report unused metrics. The storage_proxy stats hold information for the API and the metrics layer. We replaced timed_rate_moving_average_and_histogram and time_estimated_histogram with the unified timed_rate_moving_average_summary_and_histogram which gives us an option to report per-shard summaries instead of histograms. All the counters, histograms, and summaries were marked as skip_when_empty. The API was modified to use timed_rate_moving_average_summary_and_histogram. Closes #11173	2022-08-11 13:31:19 +03:00
Benny Halevy	9167b857e9	abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies For replication strategies like "everywhere" and "local" that return the same set of endpoints for all tokens, we can call rs->calculate_natural_endpoints only once and reuse the result for all tokens. Note that ideally the replication_map could contain only a single token range for this case, but that doesn't seem to work yet. Add maybe_yield() calls to the tight loop to prevent reactor stalls on large clusters when copying a long vector returned by everywhere_replication_strategy to potentially 1000's of tokens in the map. Nicholas Peshek wrote in https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370 about similar patch by Geoffrey Beausire: `994c6ecf3c` > Yep. That dropped our startup from 3000+ seconds to about 40. Fixes #10337 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-11 10:35:29 +03:00
Benny Halevy	eb678e723b	abstract_replication_strategy: add has_uniform_natural_endpoints So that using calculate_natural_endpoints can be optimized for strategies that return the same endpoints for all tokens, namely everywhere_replication_strategy and local_strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-11 10:34:14 +03:00
Calle Wilund	a729c2438e	commitlog: Make get_segments_to_replay on-demand Refs #11237 Don't store segments found on init scan in all shard instances, instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at least don't store pointless string lists in shards 1 to X, and also we can potentially ask for the list twice. More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is considered candidate for recycling, and included in footprint.	2022-08-11 06:41:23 +00:00
Nadav Har'El	d03bd82222	Revert "test: move scylla_inject_error from alternator/ to cql-pytest/" This reverts commit `8e892426e2` and fixes the code in a different way: That commit moved the scylla_inject_error function from test/alternator/util.py to test/cql-pytest/util.py and renamed test/alternator/util.py. I found the rename confusing and unnecessary. Moreover, the moved function isn't even usable today by the test suite that includes it, cql-pytest, because it lacks the "rest_api" fixture :-) so test/cql-pytest/util.py wasn't the right place for it anyway. test/rest_api/rest_util.py could have been a good place for this function, but there is another complication: Although the Alternator and rest_api tests both had a "rest_api" fixture, it has a different type, which led to the code in rest_api which used the moved function to have to jump through hoops to call it instead of just passing "rest_api". I think the best solution is to revert the above commit, and duplicate the short scylla_inject_error() function. The duplication isn't an exact copy - the test/rest_api/rest_util.py version now accepts the "rest_api" fixture instead of the URL that the Alternator version used. In the future we can remove some of this duplication by having some shared "library" code but we should do it carefully and starting with agreeing on the basic fixtures like "rest_api" and "cql", without that it's not useful to share small functions that operate on them. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11275	2022-08-11 06:43:26 +03:00
Wojciech Mitros	42e0fb90ea	cql3: extend implementation of to_bytes for raw_value When called with a null_value or an unset_value, raw_value::to_bytes() threw an std::get error for wrong variant. This patch adds a description for the errors thrown, and adds a to_bytes_opt() method that instead of throwing returns a std::nullopt.	2022-08-10 16:40:22 +02:00
Avi Kivity	e9cbc9ee85	Merge 'Add support for empty replica pages' from Botond Dénes Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones. The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by `3131cbea62`, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones. The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set. Upgrade sanity test was conducted as following: * Created cluster of 3 nodes with RF=3 with master version * Wrote small dataset of 1000 rows. * Deleted prefix of 980 rows. * Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100` * Also did some manual queries via `cqlsh` with smaller page size and tracing on. 
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`. * Confirmed there are no errors or read-repairs. Perf regression test: ``` build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60 ``` Before: ``` median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors) median absolute deviation: 973.40 maximum: 135511.63 minimum: 104978.74 ``` After: ``` median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors) median absolute deviation: 2979.13 maximum: 134538.13 minimum: 114688.07 ``` Diff: +~200 instruction/op. Fixes: https://github.com/scylladb/scylla/issues/7689 Fixes: https://github.com/scylladb/scylla/issues/3914 Fixes: https://github.com/scylladb/scylla/issues/7933 Refs: https://github.com/scylladb/scylla/issues/3672 Closes #11053 * github.com:scylladb/scylladb: test/cql-pytest: add test for query tombstone page limit query-result-writer: stop when tombstone-limit is reached service/pager: prepare for empty pages service/storage_proxy: set smallest continue pos as query's continue pos service/storage_proxy: propagate last position on digest reads query: result_merger::get() don't reset last-pos on short-reads and last pages query: add tombstone-limit to read-command service/storage_proxy: add get_tombstone_limit() query: add tombstone_limit type db/config: add config item for query tombstone limit gms: add cluster feature for empty replica pages tree: don't use query::read_command's IDL constructor	2022-08-10 13:38:06 +03:00
Avi Kivity	32b405d639	Merge 'doc: change the default for the overprovisioned option' from Anna Stuchlik Fix https://github.com/scylladb/scylla-doc-issues/issues/842 This PR changes the default for the `overprovisioned` option from `disabled` to `enabled`, according to https://github.com/scylladb/scylla-doc-issues/issues/842. In addition, I've used this opportunity to replace "Scylla" with "ScyllaDB" on the updated page. Closes #11256 * github.com:scylladb/scylladb: doc: replace Scylla with ScyllaDB in the product name doc: change the default for the overprovisioned option	2022-08-10 12:43:44 +03:00
Raphael S. Carvalho	ace6334619	replica: table: kill unused _sstables_staging Good change as it's one less thing to worry about in compaction group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-08-10 12:32:13 +03:00
Kamil Braun	cff595211e	Merge 'Raft test topology part 2' from Alecco Give cluster control to pytests. While there add missing stop gracefully and add server to ScyllaCluster. Clusters can be marked dirty but they are not recycled yet. This will be done in a later series. Closes #11219 * github.com:scylladb/scylladb: test.py: ScyllaCluster add_server() mark dirty test.py: ScyllaCluster add server management test.py: improve seeds for new servers test.py: Topology tests and Manager for Scylla clusters test.py: rename scylla_server to scylla_cluster test.py: function for python driver connection test.py: ScyllaCluster add_server helper test.py: shutdown control connection during graceful shutdown test.py: configurable authenticator and authorizer test.py: ScyllaServer stop gracefully test.py: FIXME for bad cluster log handling logic	2022-08-10 11:13:21 +02:00
Michał Chojnowski	de0f2c21ec	configure.py: make messaging_service.cc the first source file Currently messaging_service.o takes the longest of all core objects to compile. For a full build of build/release/scylla, with current ninja scheduling, on a 32-hyperthread machine, the last ~16% of the total build time is spent just waiting on messaging_service.o to finish compiling. Moving the file to the top of the list makes ninja start its compilation early and gets rid of that single-threaded tail, improving the total build time. Closes #11255	2022-08-10 11:18:09 +03:00
Calle Wilund	8116c56807	commitlog: Revert/modify `fac2bc4` - do footprint add in delete Fixes #11184 Fixes #11237 In prev (broken) fix for #11184 we added the footprint for left-over files (replay candidates) to disk footprint on commitlog init. This effectively prevents us from creating segments iff we have tight limits. Since we nowadays do quite a bit of inserts _before_ commitlog replay (system.local, but...) we can end up in a situation where we deadlock start because we cannot get to the actual replay that will eventually free things. Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances - even though only shard 0 will get to actually replay + delete (i.e. drop footprint). So shards 1-X would all be either locked out or performance degraded. Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved.	2022-08-10 08:04:03 +00:00
Botond Dénes	e27127bb7f	test/cql-pytest: add test for query tombstone page limit Check that the replica returns empty pages as expected, when a large tombstone prefix/span is present. Large = larger than the configured query_tombstone_limit (using a tiny value of 10 in the test to avoid having to write many tombstones).	2022-08-10 09:14:59 +03:00
Tomasz Grabiec	8ee5b69f80	test: row_cache: Use more narrow key range to stress overlapping reads more This makes catching issues related to concurrent access of same or adjacent entries more likely. For example, catches #11239. Closes #11260	2022-08-10 06:53:54 +03:00
Botond Dénes	7730419f5c	query-result-writer: stop when tombstone-limit is reached The query result writer now counts tombstones and cuts the page (marking it as a short one) when the tombstone limit is reached. This is to avoid timing out on a large span of tombstones, especially prefixes. In the case of unpaged queries, we fail the read instead, similarly to how we do with max result size. If the limit is 0, the previous behaviour is used: tombstones are not taken into consideration at all.	2022-08-10 06:03:38 +03:00
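The page-cutting rule described in the commit above can be illustrated with a minimal sketch. This is not the actual query result writer: `fragment_kind`, `page` and `build_page` are hypothetical names invented for illustration, and the input is simplified to a flat list of fragments. It only shows the counting rule, under the assumption that a limit of 0 disables tombstone counting.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fragment kinds; the real reader deals with far richer types.
enum class fragment_kind { row, tombstone };

struct page {
    std::vector<std::size_t> rows; // input indexes of the live rows on this page
    bool short_read = false;       // set when the page was cut by the tombstone limit
    std::size_t consumed = 0;      // number of input fragments consumed
};

// Builds one page starting at `start`.
// A tombstone_limit of 0 means tombstones are not taken into account at all.
inline page build_page(const std::vector<fragment_kind>& input,
                       std::size_t start,
                       std::uint64_t tombstone_limit) {
    page p;
    std::uint64_t tombstones = 0;
    for (std::size_t i = start; i < input.size(); ++i) {
        ++p.consumed;
        if (input[i] == fragment_kind::row) {
            p.rows.push_back(i);
            continue;
        }
        ++tombstones;
        if (tombstone_limit != 0 && tombstones >= tombstone_limit) {
            // Cut the page here, even if it contains no rows yet.
            p.short_read = true;
            break;
        }
    }
    return p;
}
```

With a long tombstone prefix and a small limit, the sketch returns an empty short page, which is the case the pager changes in the neighbouring commits prepare for.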
Botond Dénes	8066dbc635	service/pager: prepare for empty pages The pager currently assumes that an empty page means the query is exhausted. Lift this assumption, as we will soon have empty short pages. Paging with filtering also needs to use the replica-provided last-position when the page is empty.	2022-08-10 06:03:38 +03:00
Botond Dénes	6a7dedfe34	service/storage_proxy: set smallest continue pos as query's continue pos We expect each replica to stop at exactly the same position when the digests match. Soon however, if replicas have a lot of tombstones, some may stop earlier than the others. As long as all digests match, this is fine but we need to make sure we continue from the smallest such position on the next page.	2022-08-10 06:03:38 +03:00
Botond Dénes	2656968db2	service/storage_proxy: propagate last position on digest reads We want to transmit the last position as determined by the replica on both result and digest reads. Result reads already do that via the query::result, but digest reads don't yet as they don't return the full query::result structure, just the digest field from it. Add the last position to the digest read's return value and collect these in the digest resolver, along with the returned digests.	2022-08-10 06:03:37 +03:00
Botond Dénes	8c0dd99f7c	query: result_merger::get() don't reset last-pos on short-reads and last pages When merging multiple query-results, we use the last-position of the last result in the combined one as the combined result's last position. This only works however if said last result was included fully. Otherwise we have to discard the last-position included with the result and the pager will use the position of the last row in the combined result as the last position. The commit introducing the above logic mistakenly discarded the last position when the result is a short read or a page is not full. This is not necessary and even harmful as it can result in an empty combined result being delivered to the pager, without a last-position.	2022-08-10 06:01:49 +03:00
Botond Dénes	d1d53f1b84	query: add tombstone-limit to read-command Propagate the tombstone-limit from coordinator to replicas, to make sure all of them use the same limit.	2022-08-10 06:01:47 +03:00
Anna Stuchlik	43cc17bf5d	doc: replace Scylla with ScyllaDB in the product name	2022-08-09 16:19:55 +02:00
Anna Stuchlik	d21b92fb13	doc: change the default for the overprovisioned option	2022-08-09 16:09:29 +02:00
Anna Stuchlik	c3dbb9706e	doc: language improvements to the Counters page	2022-08-09 14:35:44 +02:00
Alejo Sanchez	05afca2199	test.py: ScyllaCluster add_server() mark dirty When changing topology, tests will add servers. Make add_server mark the cluster dirty. But mark the cluster as not dirty after calling add_server when installing the cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	f1a6e4bda9	test.py: ScyllaCluster add server management Preparing for topology changes, implement the primitives for managing ScyllaServers in ScyllaCluster. The states are started, stopped, and removed. Started servers can be stopped or restarted. Stopped servers can be started. Stopped servers can be removed (destroyed). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	a6448458bb	test.py: improve seeds for new servers Instead of only using last started server as seed, use all started servers as seed for new servers. This also avoids tracking last server's state. Pass empty list instead of None. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	83dab6045b	test.py: Topology tests and Manager for Scylla clusters Preparing to cycle clusters modified (dirty) and use multiple clusters per topology pytest, introduce Topology tests and Manager class to handle clusters. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	14328d1e42	test.py: rename scylla_server to scylla_cluster This file's most important class is ScyllaCluster, so rename it accordingly. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	dcd8d77f34	test.py: function for python driver connection Isolate python driver connection on its own function. Preparing for harness client fixture to handle the connection. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	1db31ebfdc	test.py: ScyllaCluster add_server helper For future use from API. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Konstantin Osipov	c81c8af1ba	test.py: shutdown control connection during graceful shutdown	2022-08-09 14:26:13 +02:00
Alejo Sanchez	bc494754e8	test.py: configurable authenticator and authorizer For scylla servers, keep default PasswordAuthenticator and CassandraAuthorizer but allow this to be configurable per test suite. Use AllowAll* for topology test suite. Disabling authentication avoids complications later for topology tests as system_auth keyspace starts with RF=1 and tests take down nodes. The keyspace would need to change RF and run repair. Using AllowAll avoids this problem altogether. A different cql fixture is created without auth for topology tests. Topology tests require servers without auth from scylla.yaml conf. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	6437a7f467	test.py: ScyllaServer stop gracefully Add stop_gracefully() method. Terminates a server in a clean way. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Alejo Sanchez	573ed429ad	test.py: FIXME for bad cluster log handling logic The code in test.py using a ScyllaCluster is getting a server id and taking logs from only the first server. If there is a failure in another server it's not reported properly. And CQL connection will go only to the first server. Also, it might be better to have ScyllaCluster to handle these matters and be more opaque. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-09 14:26:13 +02:00
Anna Stuchlik	15c24ba3e0	doc: fix the external link	2022-08-09 14:20:54 +02:00
Anna Stuchlik	82d1f67378	doc: clarify the disclaimer about reusing deleted counter column values	2022-08-09 14:12:37 +02:00
Avi Kivity	be44fd63f9	Merge 'Make get_range_addresses async and hold effective_replication_map_ptr around it' from Benny Halevy This series converts the synchronous `effective_replication_map::get_range_addresses` to async by calling the replication strategy async entry point with the same name, as its callers are already async or can be made so easily. To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map, let the callers of this patch hold a effective_replication_map per keyspace and pass it down to the (now asynchronous) functions that use it (making affected storage_service methods static where possible if they no longer depend on the storage_service instance). Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate that is true for local_strategy and everywhere_replication_strategy, and is false otherwise. With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow. 
Refs #11005 Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping) Closes #11009 * github.com:scylladb/scylladb: effective_replication_map: make get_range_addresses asynchronous range_streamer: add_ranges and friends: get erm as param storage_service: get_new_source_ranges: get erm as param storage_service: get_changed_ranges_for_leaving: get erm as param storage_service: get_ranges_for_endpoint: get erm as param repair: use get_non_local_strategy_keyspaces_erms database: add get_non_local_strategy_keyspaces_erms database: add get_non_local_strategy_keyspaces storage_service: coroutinize update_pending_ranges effective_replication_map: add get_replication_strategy effective_replication_map: get_range_addresses: use the precalculated replication_map abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies abstract_replication_strategy: reindent utils: sequenced_set: expose set and `contains` method abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set utils: sequenced_set: templatize VectorType utils: sanitize sequenced_set utils: sequenced_set: delete mutable get_vector method	2022-08-09 13:25:53 +03:00
Benny Halevy	f01a526887	docs: debugging: mention use of release number on backtrace.scylladb.com Following scylladb/scylla_s3_reloc_server@af17e4ffcd (scylladb/scylla_s3_reloc_server#28), the release number can be used to search the relocatable package and/or decode a respective backtrace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11247	2022-08-09 12:49:59 +03:00
Avi Kivity	d4c986e4fa	Merge 'doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 20.04' from Anna Stuchlik Ubuntu 22.04 is supported by both ScyllaDB Open Source 5.0 and Enterprise 2022.1. Closes #11227 * github.com:scylladb/scylladb: doc: add the redirects from Ubuntu version specific to version generic pages doc: remove version-specific content for Ubuntu and add the generic page to the toctree doc: rename the file to include Ubuntu doc: remove the version number from the document and add the link to Supported Versions doc: add a generic page for Ubuntu doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1	2022-08-09 12:49:16 +03:00
Asias He	12ab2c3d8d	storage_service: Prevent removed node to restart and join the cluster 1) Start node1,2,3 2) Stop node3 3) Run nodetool removenode $host_id_of_node3 4) Restart node3 Step 4 is wrong and not allowed. If it happens it will bring back node3 to the cluster. This patch adds a check during node restart to detect such operation error and reject the restart. With this patch, we would see the following in step 4. ``` init - Startup failed: std::runtime_error (The node 127.0.0.3 with host_id fa7e500a-8617-4de4-8efd-a0e177218ee8 is removed from the cluster. Can not restart the removed node to join the cluster again!) ``` Refs #11217 Closes #11244	2022-08-09 12:46:21 +03:00
Avi Kivity	1d4bf115e2	Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec Scenario: cache = [ row(pos=2, continuous=false), row(pos=after(2), dummy=true) ] Scanning read starts, starts populating [-inf, before(2)] from sstables. row(pos=2) is evicted. cache = [ row(pos=after(2), dummy=true) ] Scanning read finishes reading from sstables. Refreshes cache cursor via partition_snapshot_row_cursor::maybe_refresh(), which calls partition_snapshot_row_cursor::advance_to() because iterators are invalidated. This advances the cursor to after(2). no_clustering_row_between(2, after(2)) returns true, so advance_to() returns true, and maybe_refresh() returns true. This is interpreted by the cache reader as "the cursor has not moved forward", so it marks the range as complete, without emitting the row with pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads will also miss the row. The bug is in advance_to(), which is using no_clustering_row_between(a, b) to determine its result, which by definition excludes the starting key. Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction with reduced key range in the random_mutation_generator (1024 -> 16). Fixes #11239 Closes #11240 * github.com:scylladb/scylladb: test: mvcc: Fix illegal use of maybe_refresh() tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range() tests: row_cache_test: Introduce one_shot mode to throttle row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy	2022-08-09 12:39:10 +03:00
Anna Stuchlik	e753b4e793	doc: language, formatting, and organization improvements	2022-08-09 10:34:22 +02:00
Tomasz Grabiec	f59d2d9bf8	range_tombstone_list: Avoid amortized_reserve() We can use std::in_place_type<> to avoid constructing op before calling emplace_back(). As a result, we can avoid reserving space. The reserving was there to avoid the need to roll-back in case emplace_back() throws. Kudos to Kamil for suggesting this. Closes #11238	2022-08-09 11:34:16 +03:00
Avi Kivity	8d37370a71	Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj" This reverts commit `bcadd8229b`, reversing changes made to `cf528d7df9`. Since `4bd4aa2e88` ("Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"), memtable is self-compacting and the extra compaction step only reduces throughput. The unit test in memtable_test.cc is not reverted as proof that the revert does not cause a regression. Closes #11243	2022-08-09 11:23:29 +03:00
Anna Stuchlik	61d33cb2a8	doc: add a disclaimer about not supporting local counters by SSTableLoader	2022-08-09 10:00:14 +02:00
Anna Stuchlik	4be88e1a79	doc: add the redirects from Ubuntu version specific to version generic pages	2022-08-09 09:43:28 +02:00
Raphael S. Carvalho	337390d374	forward_service: execute_on_this_shard: avoid reallocation and copy avoid about log2(256)=8 reallocations when pushing partition ranges to be fetched. additionally, also avoid copying range into ranges container. current_range will not contain the last range, after moved, but will still be engaged by the end of the loop, allowing next iteration to happen as expected. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11242	2022-08-09 09:08:53 +02:00
Botond Dénes	1b669cefed	service/storage_proxy: add get_tombstone_limit() To be used by coordinator side code to determine the correct tombstone limit to pass to read-command (tombstone limit field added in the next commit). When this limit is non-zero, the replica will start cutting pages after the tombstone limit is surpassed. This getter works similarly to `get_max_result_size()`: if the cluster feature for empty replica pages is set, it will return the value configured via db::config::query_tombstone_limit. System queries always use a limit of 0 (unlimited tombstones).	2022-08-09 10:00:40 +03:00
Botond Dénes	8cd2ef7a42	query: add tombstone_limit type Will be used in read_command. Add it before it is added to read-command so we can use the unlimited constant in code added in preparation to that.	2022-08-09 10:00:40 +03:00
Botond Dénes	33f0447ba0	db/config: add config item for query tombstone limit This will be the value used to break pages, after processing the specified amount of tombstones. The page will be cut even if empty. We could maybe use the already existing tombstone_{warn,fail}_threshold instead and use them as a soft/hard limit pair, like we did with page sizes.	2022-08-09 10:00:40 +03:00
Botond Dénes	1bc14b5e3b	gms: add cluster feature for empty replica pages So we can start using them only when the entire cluster supports it.	2022-08-09 10:00:40 +03:00
Botond Dénes	60a0e3d88b	tree: don't use query::read_command's IDL constructor It is not type safe: has multiple limits passed to it as raw ints, as well as other types that ints implicitly convert to. Furthermore the row limit is passed in two separate fields (lower 32 bits and upper 32 bits). All this makes this constructor a minefield for humans to use. We have had a safer constructor for some time but some users of the old one remain. Move them to the safe one.	2022-08-09 10:00:37 +03:00
Tomasz Grabiec	05b0a62132	test: mvcc: Fix illegal use of maybe_refresh() maybe_refresh() can only be called if the cursor is pointing at a row. The code was calling it before the cursor was advanced, and was thus relying on implementation detail.	2022-08-09 02:28:56 +02:00
Tomasz Grabiec	ce624048d9	tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range() Reproducer for #11239.	2022-08-09 02:28:56 +02:00
Tomasz Grabiec	6aaa6f8744	tests: row_cache_test: Introduce one_shot mode to throttle	2022-08-09 02:28:56 +02:00
Tomasz Grabiec	a6a61eaf96	row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy Scenario: cache = [ row(pos=2, continuous=false), row(pos=after(2), dummy=true) ] Scanning read starts, starts populating [-inf, before(2)] from sstables. row(pos=2) is evicted. cache = [ row(pos=after(2), dummy=true) ] Scanning read finishes reading from sstables. Refreshes cache cursor via partition_snapshot_row_cursor::maybe_refresh(), which calls partition_snapshot_row_cursor::advance_to() because iterators are invalidated. This advances the cursor to after(2). no_clustering_row_between(2, after(2)) returns true, so advance_to() returns true, and maybe_refresh() returns true. This is interpreted by the cache reader as "the cursor has not moved forward", so it marks the range as complete, without emitting the row with pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads will also miss the row. The bug is in advance_to(), which is using no_clustering_row_between(a, b) to determine its result, which by definition excludes the starting key. Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction with reduced key range in the random_mutation_generator (1024 -> 16). Fixes #11239	2022-08-09 02:28:56 +02:00
Takuya ASADA	3ffc978166	main: move preinit_description to main() We don't need to wait until scylla_main() is called to handle version options; we can handle them in main() instead. Closes #11221	2022-08-08 18:31:43 +03:00
Benny Halevy	91ab8ee1c3	effective_replication_map: make get_range_addresses asynchronous So it may yield, preventing reactor stalls as seen in #11005. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	9b2af3f542	range_streamer: add_ranges and friends: get erm as param Rather than getting it in the callee, let the caller (e.g. storage_service) hold the erm and pass it down to potentially multiple async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	194b9af8d6	storage_service: get_new_source_ranges: get erm as param Rather than getting it in the callee, let the caller hold the erm and pass it down to potentially multiple async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	b50c79eab3	storage_service: get_changed_ranges_for_leaving: get erm as param Rather than getting it in the callee, let the caller hold the erm and pass it down to potentially multiple async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	a5d7ade237	storage_service: get_ranges_for_endpoint: get erm as param Let its caller pass the effective_replication_map ptr so we can get it at the top level and keep it alive (and coherent) through multiple asynchronous calls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	cffe00cc58	repair: use get_non_local_strategy_keyspaces_erms Use get_non_local_strategy_keyspaces_erms for getting a coherent set of keyspace names and their respective effective replication strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	db5c5ca59e	database: add get_non_local_strategy_keyspaces_erms To be used for getting a coherent set of all keyspaces with non-local replication strategy and their respective effective_replication_map. As an example, use it in this patch in storage_service::update_pending_ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	7ee6048255	database: add get_non_local_strategy_keyspaces For node operations, we currently call get_non_system_keyspaces but really want to work on all keyspaces that have non-local replication strategy as they are replicated on other nodes. Reflect that in the replica::database function name. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	d8484b3ee6	storage_service: coroutinize update_pending_ranges Before we make a change in getting the keyspaces and their effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	e541009f65	effective_replication_map: add get_replication_strategy And use it in storage_service::get_changed_ranges_for_leaving. A following patch will pass the e_r_m to storage_service::get_changed_ranges_for_leaving, rather than getting it there. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	6794e15163	effective_replication_map: get_range_addresses: use the precalculated replication_map There is no need to call get_natural_endpoints for every token in sorted_tokens order, since we can just get the precalculated per-token endpoints already in the _replication_map member. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	1d4aea4441	abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies Reduce large allocations and reactor stalls seen in #11005 by open coding `get_address_ranges` and using std::vector::insert to efficiently append the ranges returned by `get_primary_ranges_for` onto the returned token_range_vector in contrast to building an unordered_multimap<inet_address, dht::token_range> first in `get_address_ranges` and traversing it and adding one token_range at a time. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	7811b0d0aa	abstract_replication_strategy: reindent	2022-08-08 17:31:00 +03:00
Benny Halevy	ebe1edc091	utils: sequenced_set: expose set and `contains` method And use that in sites using the endpoint set returned by abstract_replication_strategy::calculate_natural_endpoints. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	7017ad6822	abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set So it could be used also for easily searching for an endpoint. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	38934413d4	utils: sequenced_set: templatize VectorType So we can use basic_sequenced_set<T, std::small_vector<T, N>> as locator::endpoint_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	df380c30b9	utils: sanitize sequenced_set And templatize its Vector type so it can be used with a small_vector for inet_address_vector_replica_set. Mark the methods const/noexcept as needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
Benny Halevy	57d9275d4a	utils: sequenced_set: delete mutable get_vector method It is dangerous to use since modifying the sequenced_set vector will make it go out of sync with the associated unordered_set member, making the object unusable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:00 +03:00
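The reasoning behind deleting the mutable `get_vector` accessor can be sketched as follows. This is a simplified illustration, not the code in utils/sequenced_set.hh: the class name `sequenced_set_sketch` and its member names are assumptions made for this example. It pairs an insertion-ordered vector with an unordered_set and exposes the vector only as const, so callers cannot push the two containers out of sync.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-in for a sequenced set:
// a vector keeps insertion order, an unordered_set answers membership queries.
template <typename T>
class sequenced_set_sketch {
    std::vector<T> _vec;
    std::unordered_set<T> _set;
public:
    // Returns true if the value was newly inserted, false if already present.
    bool push_back(const T& v) {
        auto [it, inserted] = _set.insert(v);
        if (!inserted) {
            return false;
        }
        try {
            _vec.push_back(v);
        } catch (...) {
            // Keep both containers consistent if the vector insertion fails.
            _set.erase(it);
            throw;
        }
        return true;
    }

    bool contains(const T& v) const {
        return _set.find(v) != _set.end();
    }

    std::size_t size() const noexcept {
        return _vec.size();
    }

    // Read-only access only: a mutable reference would let callers modify the
    // vector without updating the set, leaving the object unusable.
    const std::vector<T>& get_vector() const noexcept {
        return _vec;
    }
};
```

Every mutation goes through `push_back`, which updates both containers or neither, so the invariant that the vector and the set hold the same elements is preserved.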
Yaron Kaikov	2fe2306efb	configure.py: add date-stamp parameter When starting the `Build` job, the `x86` and `arm` builds can start on different dates, causing the whole process to fail. As suggested by @avikivity, add a date-stamp parameter and pass it through downstream jobs to get one release for each job. Ref: scylladb/scylla-pkg#3008 Closes #11234	2022-08-08 17:28:38 +03:00
Anna Stuchlik	7d4770c116	doc: remove version-specific content for Ubuntu and add the generic page to the toctree	2022-08-08 16:18:20 +02:00
Anna Stuchlik	eb60e5757a	doc: rename the file to include Ubuntu	2022-08-08 16:12:02 +02:00
Anna Stuchlik	011e2fad60	doc: remove the version number from the document and add the link to Supported Versions	2022-08-08 16:11:14 +02:00
Anna Stuchlik	83c08ac5fa	doc: add a generic page for Ubuntu	2022-08-08 16:04:59 +02:00
Avi Kivity	871127f641	Update tools/java submodule * tools/java ad6764b506...6995a83cc1 (1): > dist/debian: drop upgrading from scylla-tools < 2.0	2022-08-08 16:51:14 +03:00
Anna Stuchlik	260f85643d	doc: specify the recommended AWS instance types	2022-08-08 14:35:54 +02:00
Anna Stuchlik	2c69a8f458	doc: replace the tables with a generic description of support for Im4gn and Is4gen instances	2022-08-08 14:17:59 +02:00
Botond Dénes	49c00fa989	Merge 'Define strong uuid-class types for table_id, table_schema_version and query_id' from Benny Halevy We would like to define more distinct types that are currently defined as aliases to utils::UUID to identify resources in the system, like table id and schema version id. As with counter_id, the motivation is to restrict the usage of the distinct types so they can be used (assigned, compared, etc.) only with objects of the same type. Using with a generic UUID will then require explicit conversion, that we want to expose. This series starts with cleaning up the idl header definition by adding support for `import` and `include` statements in the idl-compiler. These allow the idl header to become self-sufficient and then remove manually-added includes from source files. The latter usually need only the top level idl header and it, in turn, should include other headers if it depends on them. Then, a UUID_class template was defined as a shared boiler plate for the various uuid-class. First, we convert counter_id to use it, rather than mimicking utils::UUID on its own. On top of utils::UUID_class<T>, we define table_id, table_schema_version, and query_id. Following up on this series, we should define more commonly used types like: host_id, streaming_plan_id, paxos_ballot_id. 
Fixes #11207 Closes #11220 * github.com:scylladb/scylladb: query-request, everywhere: define and use query_id as a strong type schema, everywhere: define and use table_schema_version as a strong type schema, everywhere: define and use table_id as a strong type schema: include schema_fwd.hh in schema.hh system_keyspace: get_truncation_record: delete unused lambda capture utils: uuid: define appending_hash<utils::tagged_uuid<Tag>> utils: tagged_uuid: rename to_uuid() to uuid() counters: counter_id: use base class create_random_id counters: base counter_id on utils::tagged_uuid utils: tagged_uuid: mark functions noexcept utils: tagged_uuid: bool: reuse uuid::bool operator raft: migrate tagged_id definition to utils::tagged_uuid utils: uuid: mark functions noexcept counters: counter_id delete requirement for triviality utils: bit_cast: require TriviallyCopyable To repair: delete unused include of utils/bit_cast.hh bit_cast: use std::bit_cast idl: make idl headers self-sufficient db: hints: sync_point: do not include idl definition file db/per_partition_rate_limit: tidy up headers self-sufficiency idl-compiler: include serialization impl and visitors in generated dist.impl.hh files idl-compiler: add include statements idl_test: add a struct depending on UUID	2022-08-08 13:20:40 +03:00
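The strong-typing idea behind this series can be shown with a small sketch. It is an illustration only, not the utils::tagged_uuid implementation: `uuid_sketch`, `tagged_uuid_sketch` and the `*_sketch` aliases are hypothetical names, and the UUID is reduced to two 64-bit words. The point is that two id types built from the same template with different tags cannot be assigned or compared with each other, and converting from a generic UUID requires an explicit constructor call.

```cpp
#include <cstdint>
#include <type_traits>

// Simplified stand-in for a generic UUID value.
struct uuid_sketch {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    friend bool operator==(const uuid_sketch& a, const uuid_sketch& b) noexcept {
        return a.msb == b.msb && a.lsb == b.lsb;
    }
};

// Hypothetical tagged wrapper: the Tag parameter only makes each alias a distinct type.
template <typename Tag>
class tagged_uuid_sketch {
    uuid_sketch _id;
public:
    tagged_uuid_sketch() = default;
    // Explicit, so a generic UUID never converts silently.
    explicit tagged_uuid_sketch(uuid_sketch id) noexcept : _id(id) {}

    // Explicit way back to the generic value.
    const uuid_sketch& uuid() const noexcept { return _id; }

    // True for any non-null id.
    explicit operator bool() const noexcept { return _id.msb != 0 || _id.lsb != 0; }

    friend bool operator==(const tagged_uuid_sketch& a, const tagged_uuid_sketch& b) noexcept {
        return a._id == b._id;
    }
};

using table_id_sketch = tagged_uuid_sketch<struct table_id_tag>;
using query_id_sketch = tagged_uuid_sketch<struct query_id_tag>;

// The compiler now rejects accidental mixing of id kinds.
static_assert(!std::is_same_v<table_id_sketch, query_id_sketch>);
static_assert(!std::is_convertible_v<uuid_sketch, table_id_sketch>);
static_assert(!std::is_convertible_v<table_id_sketch, query_id_sketch>);
```

Code that really needs the raw value calls `uuid()`, which keeps such conversions visible at the call site.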
Benny Halevy	c71ef330b2	query-request, everywhere: define and use query_id as a strong type Define query_id as a tagged_uuid So it can be differentiated from other uuid-class types. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:13:28 +03:00
Benny Halevy	2b017ce285	schema, everywhere: define and use table_schema_version as a strong type Define table_schema_version as a distinct tagged_uuid class, So it can be differentiated from other uuid-class types, in particular table_id. Added reversed(table_schema_version) for convenience and uniformity since the same logic is currently open coded in several places. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:45 +03:00
Benny Halevy	257d74bb34	schema, everywhere: define and use table_id as a strong type Define table_id as a distinct utils::tagged_uuid modeled after raft tagged_id, so it can be differentiated from other uuid-class types, in particular from table_schema_version. Fixes #11207 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:41 +03:00
Benny Halevy	26aacb328e	schema: include schema_fwd.hh in schema.hh And remove repeated definitions and forward declarations of the same types in both places. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:28 +03:00
Benny Halevy	6e77ad9392	system_keyspace: get_truncation_record: delete unused lambda capture Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:28 +03:00
Benny Halevy	a390b8475b	utils: uuid: define appending_hash<utils::tagged_uuid<Tag>> And simplify usage for appending_hash<counter_shard_view> respectively. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:28 +03:00
Benny Halevy	8235cfdf7a	utils: tagged_uuid: rename to_uuid() to uuid() To make it more generic, similar to other uuid() get methods we have. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	813cffc2b5	counters: counter_id: use base class create_random_id Rather than defining generate_random, and use respectively in unit tests. (It was inherited from raft::internal::tagged_id.) This allows us to shorten counter_id's definition to just using utils::tagged_uuid<struct counter_id_tag>. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	e9cc24bc18	counters: base counter_id on utils::tagged_uuid Use the common base class for uuid-based types. tagged_uuid::to_uuid defined here for backward compatibility, but it will be renamed in the next patch to uuid(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	082d5efca8	utils: tagged_uuid: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	1b78f8ba82	utils: tagged_uuid: bool: reuse uuid::bool operator utils::UUID defined operator bool the same way, rely on it rather than reimplementing it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	6436c614d7	raft: migrate tagged_id definition to utils::tagged_uuid So it can be used for other types in the system outside of raft, like counter_id, table_id, table_schema_version, and more. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	f0567ab853	utils: uuid: mark functions noexcept Before we define a tagged_uuid template. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	ea91ccaa20	counters: counter_id delete requirement for triviality This stemmed from utils/bit_cast overly strict requirement. Now that it was relaxed, there is no need for this static assert as counter_id is trivially copyable, and that is checked by bit_cast {read,write}_unaligned. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	c68e929b80	utils: bit_cast: require TriviallyCopyable To Like std::bit_cast (https://en.cppreference.com/w/cpp/numeric/bit_cast) we only require To to be trivially copyable. There's no need for it to be a trivial type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	2948a4feb6	repair: delete unused include of utils/bit_cast.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	79000bc02e	bit_cast: use std::bit_cast Now that scylla requires c++20 there's no need to define our own implementation in utils/bit_cast.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	1fda686f96	idl: make idl headers self-sufficient Add include statements to satisfy dependencies. Delete, now unneeded, include directives from the upper level source files. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	cfc7e9aa59	db: hints: sync_point: do not include idl definition file idl definition files are not intended for direct inclusion in .cc files. The data types they represent are supposed to be defined in a regular C++ header, so define them in db/hints/sync_point.hh and include it rather than idl/hinted_handoff.idl.hh. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	82fa205723	db/per_partition_rate_limit: tidy up headers self-sufficiency include what's needed where needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	83811b8e35	idl-compiler: include serialization impl and visitors in generated dist.impl.hh files They are generally required by the serialization implementation. This will simplify using them without having to hand pick what header to include in the .cc file that includes them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	da4f0aae37	idl-compiler: add include statements For generating #include directives in the generated files, so we don't have to hand-craft the includes of the dependencies in the right order. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	4f275a17b4	idl_test: add a struct depending on UUID For testing the next change which adds import and include statements to the idl language. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Avi Kivity	ba42852b0e	Merge 'Overhaul truncate and snapshot' from Benny Halevy This series is aimed at fixing #11132. To get there, the series untangles the functions that currently depend on the cross-shard coordination in table::snapshot, namely database::truncate and consequently database::drop_column_family. database::get_table_on_all_shards is added here as a helper to get a foreign shared ptr of the table shard from all shards, and it is later used by multiple functions to truncate and then take a snapshot of the sharded table. database::truncate_table_on_all_shards is defined to orchestrate the truncate process end-to-end, flushing or clearing all table shards before taking a snapshot if needed, using the newly defined table::snapshot_on_all_shards, and by that leaving only the discard_sstables job to the per-shard database::truncate function. The latter, snapshot_on_all_shards, orchestrates the snapshot process on all shards - getting rid of the per-shard table::snapshot function (after refactoring take_snapshot and finalize_snapshot out of it), and the associated dreaded data structures: snapshot_manager and pending_snapshots. Fixes #11132. 
Closes #11133 * github.com:scylladb/scylladb: table: reindent write_schema_as_cql table: coroutinize write_schema_as_cql table: seal_snapshot: maybe_yield when iterating over the table names table: reindent seal_snapshot table: coroutinize seal_snapshot table: delete unused snapshot_manager and pending_snapshots table: delete unused snapshot function table: snapshot_on_all_shards: orchestrate snapshot process table: snapshot: move pending_snapshots.erase from seal_snapshot table: finalize_snapshot: take the file sets as a param table: make seal_snapshot a static member table: finalize_snapshot: reindent table: refactor finalize_snapshot out of snapshot table: snapshot: keep per-shard file sets in snapshot_manager table: take_snapshot: return foreign unique ptr table: take_snapshot: maybe yield in per-sstable loop table: take_snapshot: simplify tables construction code table: take_snapshot: reindent table: take_snapshot: simplify error handling table: refactor take_snapshot out of snapshot utils: get rid of joinpoint database: get rid of timestamp_func database: truncate: snapshot table in all-shards layer database: truncate: flush table and views in all-shards layer database: truncate: stop and disable compaction in all-shards layer database: truncate: move call to set_low_replay_position_mark to all-shards layer database: truncate: enter per-shard table async_gate in all-shards layer database: truncate: move check for schema_tables keyspace to all-shards layer. 
database: snapshot_table_on_all_shards: reindent table: add snapshot_on_all_shards database: add snapshot_table_on_all_shards database: rename {flush,snapshot}_on_all and make static database: drop_table_on_all_shards: truncate and stop table in upper layer database: drop_table_on_all_shards: get all table shards before drop_column_family on each database: drop_column_family: define table& cf database: drop_column_family: reuse uuid for evict_all_for_table database: drop_column_family: move log message up a layer database: truncate: get rid of the unused ks param database: add truncate_table_on_all_shards database: drop_table_on_all_shards: do not accept a truncated_at timestamp_func database: truncate: get optional snapshot_name from caller database: truncate: fix assert about replay_position low_mark database_test: apply_mutation on the correct db shard	2022-08-07 19:15:42 +03:00
Benny Halevy	45ce635527	table: reindent write_schema_as_cql Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	3b2cce068a	table: coroutinize write_schema_as_cql and make sure to always close the output stream. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	dbae7807d1	table: seal_snapshot: maybe_yield when iterating over the table names Add maybe_yield calls in tight loop, potentially over thousands of sstable names to prevent reactor stalls. Although the per-sstable cost is very small, we've experienced stalls related to printing in O(#sstables) in compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	3ba0c72b77	table: reindent seal_snapshot Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	41a2d09a5d	table: coroutinize seal_snapshot Handle exceptions, making sure the output stream is properly closed in all cases, and an intermediate error, if any, is returned as the final future. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	5316dbbe78	table: delete unused snapshot_manager and pending_snapshots Now that snapshot orchestration in snapshot_on_all_shards doesn't use snapshot_manager, get rid of the data structure. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	cca9068cfb	table: delete unused snapshot function Now that snapshot orchestration is done solely in snapshot_on_all_shards, the per-shard snapshot function can be deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	351a3a313d	table: snapshot_on_all_shards: orchestrate snapshot process Call take_snapshot on each shard and collect the returned snapshot_file_set. When all are done, move the vector<snapshot_file_set> to finalize_snapshot. All that without resorting to using the snapshot_manager nor calling table::snapshot. Both will be deleted in the following patches. Fixes #11132 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	84dfd2cabb	table: snapshot: move pending_snapshots.erase from seal_snapshot Now that seal_snapshot doesn't need to lookup the snapshot_manager in pending_snapshots to get to the file_sets, erasing the snapshot_manager object can be done in table::snapshot which also inserted it there. This will make it easier to get rid of it in a later patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	39276cacc3	table: finalize_snapshot: take the file sets as a param and pass it to seal_snapshot, so that the latter won't need to lookup and access the snapshot_manager object. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	4dd56bbd6d	table: make seal_snapshot a static member Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	7cb0a3f6f4	table: finalize_snapshot: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	12716866a9	table: refactor finalize_snapshot out of snapshot Write schema.cql and the files manifest in finalize_snapshot. Currently call it from table::snapshot, but it will be called in a later patch by snapshot_on_all_shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	240f83546d	table: snapshot: keep per-shard file sets in snapshot_manager To simplify processing of the per-shard file names for generating the manifest. We only need to print them to the manifest at the end of the process, so there's no point in copying them around in the process, just move the foreign unique unordered_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	5100c1ba68	table: take_snapshot: return foreign unique ptr Currently the sstable file names are created and destroyed on each shard and are copied by the "coordinator" shard using submit_to, while the coroutine holds the source on its stack frame. To prepare for the next patches that refactor this code so that the coordinator shard will submit_to each shard to perform `take_snapshot` and return the set of sstrings in the future result, we need to wrap the result in a foreign_ptr so it gets freed on the shard that created it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	b54626ad0e	table: take_snapshot: maybe yield in per-sstable loop There could be thousands of sstables so we better consider yielding in the tight loop that copies the sstable names into the unordered_set we return. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	24a1a4069e	table: take_snapshot: simplify tables construction code Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	75e38ebccc	table: take_snapshot: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	67c1d00f44	table: take_snapshot: simplify error handling Don't catch exceptions but rather just return them in the returned future, as the exception is handled by the caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	ff6508aa53	table: refactor take_snapshot out of snapshot Do the actual snapshot-taking code in a per-shard take_snapshot function, to be called from snapshot_on_all_shards in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	37b7a9cce2	utils: get rid of joinpoint Now that it is no longer used. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	56f336d1aa	database: get rid of timestamp_func Pass an optional truncated_at time_point to truncate_table_on_all_shards instead of the over-complicated timestamp_func that returns the same time_point on all shards anyhow, and was only used for coordination across shards. Since now we synchronize the internal execution phase in truncate_table_on_all_shards, there is no longer need for this timestamp_func. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	b640c4fd17	database: truncate: snapshot table in all-shards layer With that the database layer no longer needs to invoke the private table::snapshot function, so it can be defriended from class table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	af0c71aa12	database: truncate: flush table and views in all-shards layer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	6e07e6b7ac	database: truncate: stop and disable compaction in all-shards layer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	e78dad1dfb	database: truncate: move call to set_low_replay_position_mark to all-shards layer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	a8bd3d97b6	database: truncate: enter per-shard table async_gate in all-shards layer Start moving the per-shard state establishment logic to truncate_table_on_all_shards, so that we would eventually do only the truncate logic per-se in the per-shard truncate function. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	ff028316f2	database: truncate: move check for schema_tables keyspace to all-shards layer. Now that the per-shard truncate function is called only from truncate_table_on_all_shards, we can reject the schema_tables keyspace in the upper layer. There's no need to check that on each shard. While at it, reuse `is_system_keyspace`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	fbe1fa1370	database: snapshot_table_on_all_shards: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	4d4ca40c38	table: add snapshot_on_all_shards Called from the respective database entry points. Will be called also from the database drop / truncate path and will be used for central coordination of per-shard table::snapshot so we don't have to depend on the snapshot_manager mechanism that is fragile and currently causes abort if we fail to allocate it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	be56a73e78	database: add snapshot_table_on_all_shards We need to snapshot a single table in several paths. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	d96b56fee2	database: rename {flush,snapshot}_on_all and make static Follow the convention of drop_table_on_all_shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	a1eed1a6e9	database: drop_table_on_all_shards: truncate and stop table in upper layer Truncate the table on all shards, then stop it on all shards in the upper layer rather than in the per-shard drop_column_family() function, so we can further refactor truncate later, flushing and taking snapshot on all shards, before truncating. With that, rename drop_column_family to detach_column_family as now it only deregisters the column family from containers that refer to it (even via its uuid) and then its caller is responsible to take it from there. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	92cb7d448b	database: drop_table_on_all_shards: get all table shards before drop_column_family on each So the upper layer can flush, snapshot, and truncate the table on all shards, step by step. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	0aaaefbb5c	database: drop_column_family: define table& cf To reduce the churn in the following patch that will pass the table& as a parameter. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	bb1e5ffb8c	database: drop_column_family: reuse uuid for evict_all_for_table cf->schema()->id() is the same one returned by find_uuid(ks_name, cf_name); As a follow up, we should define a concrete table_id type and rename schema::id() to schema::table_id() to return it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	e800e1e720	database: drop_column_family: move log message up a layer Print once on "coordinator" shard. And promote to info level as it's important to log when we're dropping a table (and if we're going to take a snapshot). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	ca78a63873	database: truncate: get rid of the unused ks param Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	46e2a7c83b	database: add truncate_table_on_all_shards As a first step to decouple truncate from flush and snapshot. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:53:05 +03:00
Benny Halevy	5e8c05f1a8	database: drop_table_on_all_shards: do not accept a truncated_at timestamp_func Since in the drop_table case we want to discard ALL sstables in the table, not only those with `max_data_age()` up until drop started. Fixes #11232 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:52:51 +03:00
Benny Halevy	574909c78f	database: truncate: get optional snapshot_name from caller Before we change drop_table_on_all_shards to always pass db_clock::time_point::max() in the next patch, let it pass a unique snapshot name, otherwise the snapshot name will always be based on the constant, max time_point. Refs #11232 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 12:03:19 +03:00
Benny Halevy	474b2fdf37	database: truncate: fix assert about replay_position low_mark This assert was tweaked several times: Introduced in `83323e155e`, then fixed in `b2b1a1f7e1` to account for no rp from discard_sstables, then in `9620755c7f` to account for cases where we do not flush the table, then again in `71c5dc82df` to make that more accurate. But, the assert wasn't correct in the first place in the sense that we first get `low_mark` which represents the highest replay_position at the time truncate was called, but then we call discard_sstables with a time_point of `truncated_at` that we get from the caller via the timestamp_func, and that one could be in the past, before truncate was called - hence discard_sstables with that timestamp may very well return a replay_position from older sstables, prior to flush that can be smaller than the low_mark. Fix this assert to account for that case. The real fix to this issue is to have a truncate_tombstone that will carry an authoritative api::timestamp (#11230) Fixes #11231 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 09:18:06 +03:00
Benny Halevy	9f5e13800d	database_test: apply_mutation on the correct db shard Following up on `1c26d49fba`, apply mutations on the correct db shard in all test cases before we define and use database::truncate_table_on_all_shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-07 09:18:06 +03:00
Tomasz Grabiec	7f80602b01	db: range_tombstone_list: Avoid quadratic behavior when applying Range tombstones are kept in memory (cache/memtable) in range_tombstone_list. It keeps them deoverlapped, so applying a range tombstone which covers many range tombstones will erase existing range tombstones from the list. This operation needs to be exception-safe, so range_tombstone_list maintains an undo log. This undo log will receive a record for each range tombstone which is removed. For exception safety reasons, before pushing an undo log entry, we reserve space in the log by calling std::vector::reserve(size() + 1). This is O(N) where N is the number of undo log entries. Therefore, the whole application is O(N^2). This can cause reactor stalls and availability issues when replicas apply such deletions. This patch avoids the problem by reserving exponentially increasing amount of space. Also, to avoid large allocations, switches the container to chunked_vector. Fixes #11211 Closes #11215	2022-08-05 20:34:07 +03:00
Kamil Braun	d84a93d683	Merge 'Raft test topology part 1' from Alecco These are the first commits out of #10815. It starts by moving pytest logic out of the common `test/conftest.py` and into `test/topology/conftest.py`, including removing the async support as it's not used anywhere else. There's a fix of a bug of leaving tables in `RandomTables.tables` after dropping all of them. Keyspace creation is moved out of `conftest.py` into `RandomTables` as it makes more sense and this way topology tests avoid all the workarounds for old version (topology needs ScyllaDB 5+ for Raft, anyway). And a minor fix. Closes #11210 * github.com:scylladb/scylladb: test.py: fix type hint for seed in ScyllaServer test.py: create/drop keyspace in tables helper test.py: RandomTables clear list when dropping all tables test.py: move topology conftest logic to its own test.py: async topology tests auto run with pytest_asyncio	2022-08-05 17:56:16 +02:00
Anna Stuchlik	d48ae5a9e0	doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1	2022-08-05 17:49:01 +02:00
Warren Krewenki	4178ccd27f	gossiper: Correct typo in log message Closes #11212	2022-08-05 18:21:36 +03:00
Anna Stuchlik	ceaf0c41bd	doc: add support for AWS i4g instances	2022-08-05 17:18:44 +02:00
Anna Stuchlik	7711436577	doc: extend the list of supported CPUs	2022-08-05 16:55:40 +02:00
Alejo Sanchez	ec70e26f12	test.py: fix type hint for seed in ScyllaServer Param seed can be None (e.g. first server) so fix type hint accordingly. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-05 13:05:26 +02:00
Alejo Sanchez	1d7789e5a9	test.py: create/drop keyspace in tables helper Since all topology tests will use the helper, create the keyspace in the helper. Avoid the need to drop all tables per test and just drop the keyspace. While there, use blocking CQL execution so it can be used in the constructor and avoids possible issues with scheduling on cleanup. Also, creation and drop should happen only once per cluster and no test should be running changes (either not started or finished). All topology tests are for Scylla with Raft. So don't use the Cassandra this_dc workaround as it's unnecessary for Scylla. Remove return type of random_tables fixture to match other fixtures everywhere else. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-05 13:05:26 +02:00
Alejo Sanchez	9a019628f5	test.py: RandomTables clear list when dropping all tables Clear the list of active tables when dropping them. While there do the list element exchange atomically across active and removed tables lists. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-05 13:05:26 +02:00
Alejo Sanchez	f6aa0d7bd7	test.py: move topology conftest logic to its own Move asyncio, Raft checks, and RandomTables to topology test suite's own conftest file. While there, use non-async version of pre-checks to avoid unnecessary complexity (we want async tests, not async setup, for now). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-05 13:05:26 +02:00
Alejo Sanchez	f665779cdb	test.py: async topology tests auto run with pytest_asyncio Async tests and fixtures in the topology directory are expected to run with pytest_asyncio (not other async frameworks). Force this with auto mode. CI has an older pytest_asyncio version lacking pytest_asyncio.fixture. Auto mode helps avoid the need for it, and tests and fixtures can just be marked with regular @pytest.mark.asyncio. This way tests can run in both older and newer versions of the packages. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-08-05 13:05:26 +02:00
Botond Dénes	fbbe2529c1	Merge "Remove global snitch usage from consistency_level.cc" from Pavel Emelyanov " There are several helpers in this .cc file that need to get datacenter for endpoints. For that they use the global snitch, because there's no other place out there to get that data from. The whole dc/rack info is now moving to topology, so this set patches the consistency_level.cc to get the topology. This is done in two ways. First, the helpers that have keyspace at hand may get the topology via ks's effective_replication_map. Two difficult cases are db::is_local() and db.count_local_endpoints() because both have just inet_address at hand. Those are patched to be methods of topology itself and all their callers already mess with token metadata and can get topology from it. " * 'br-consistency-level-over-topology' of https://github.com/xemul/scylla: consistency_level: Remove is_local() and count_local_endpoints() storage_proxy: Use topology::local_endpoints_count() storage_proxy: Use proxy's topology for DC checks storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver storage_proxy: Use topology local_dc_filter in its methods storage_proxy: Mark some digest_read_resolver methods private forwarding_service: Use topology local_dc_filter storage_service: Use topology local_dc_filter consistency_level: Use topology local_dc_filter consistency-level: Call count_local_endpoints from topology consistency_level: Get datacenter from topology replication_strategy: Remove hold snitch reference effective_replication_map: Get datacenter from topology topology: Add local-dc detection sugar	2022-08-05 13:31:55 +03:00
Anna Stuchlik	4bc7833a0b	doc: update the link to CQL3 type mapping on GitHub Closes #11224	2022-08-05 13:21:29 +03:00
Pavel Emelyanov	c3718b7a6e	consistency_level: Remove is_local() and count_local_endpoints() No code uses them now -- switched to use topology -- so these two can be dropped together with their calls for global snitch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:48 +03:00
Pavel Emelyanov	9c662ee0e5	storage_proxy: Use topology::local_endpoints_count() A continuation of the previous patches -- now all the code that needs this helper has a proxy pointer at hand Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:48 +03:00
Pavel Emelyanov	9a50d318b6	storage_proxy: Use proxy's topology for DC checks Several proxy helper classes need to filter endpoints by datacenter. Since they now have shared_ptr<proxy> on-board, they can get topology via proxy's token metadata Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:48 +03:00
Pavel Emelyanov	183a2d5a83	storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver It will be needed to get token metadata from proxy. The resolver in question is created and maintained by abstract_read_executor which already has shared_ptr<proxy>, so it just gives its copy Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:48 +03:00
Pavel Emelyanov	e1ea801b67	storage_proxy: Use topology local_dc_filter in its methods The proxy has token metadata pointer, so it can use its topology reference to filter endpoints by datacenter Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	6f515f852d	storage_proxy: Mark some digest_read_resolver methods private Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	9a19414c62	forwarding_service: Use topology local_dc_filter The service needs to filter out non-local endpoints for its needs. The service carries token metadata pointer and can get topology from it to fulfill this goal Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	2423e1c642	storage_service: Use topology local_dc_filter The storage-service API calls use db::is_local() helper to filter out tokens from non-local datacenter. In all those places topology is available from the token metadata pointer Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	0da8caba1d	consistency_level: Use topology local_dc_filter The filter_for_query() helper has keyspace at hand Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	de58b33eee	consistency-level: Call count_local_endpoints from topology Similar to previous patch, in those places with keyspace object at hand the topology can be obtained from ks' replication map Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	f84ee8f0fb	consistency_level: Get datacenter from topology In some of db/consistency_level.cc helpers the topology can be obtained from keyspace's effective replication map Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:47 +03:00
Pavel Emelyanov	00f166809e	replication_strategy: Remove hold snitch reference When the strategy is constructed there's no place to get snitch from so the global instance is used. However, after previous patch the replication strategy no longer needs snitch, so this dependency can be dropped Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:43 +03:00
Pavel Emelyanov	298213f27f	effective_replication_map: Get datacenter from topology Now it gets it from snitch, but the dc/rack info is being relocated onto topology. The topology is in turn already there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-05 12:19:31 +03:00
Calle Wilund	fac2bc41ba	commitlog: Include "segments_to_replay" in initial footprint Fixes #11184 Not including it here can cause our estimate of "delete or not" after replay to be skewed in favour of retaining segments as (new) recycles (or even flip a counter), and if we have repeated crash+restarts we could be accumulating an effectively ever increasing segment footprint Closes #11205	2022-08-05 12:16:53 +03:00
Pavel Emelyanov	527b345079	Merge 'storage_proxy: introduce a `remote` "subservice"' from Kamil Braun Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema. The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`. The long game here is to split the initialization of `storage_proxy` into two steps: - the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`. - the second step will take those references and add the `remote` part to `storage_proxy`. This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`. 
Closes #11088 * github.com:scylladb/scylladb: service: storage_proxy: pass `migration_manager*` to `init_messaging_service` service: storage_proxy: `remote`: make `_gossiper` a const reference gms: gossiper: mark some member functions const db: consistency_level: `filter_for_query`: take `const gossiper&` replica: table: `get_hit_rate`: take `const gossiper&` gms: gossiper: move `endpoint_filter` to `storage_proxy` module service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager` service: storage_proxy: establish private section in `remote` service: storage_proxy: remove `migration_manager` pointer service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote` service: storage_proxy: remove `_gossiper` field alternator: ttl: pass `gossiper&` to `expiration_service` service: storage_proxy: move `truncate_blocking` implementation to `remote` service: storage_proxy: introduce `is_alive` helper service: storage_proxy: remove `_messaging` reference service: storage_proxy: move `connection_dropped` to `remote` service: storage_proxy: make `encode_replica_exception_for_rpc` a static function service: storage_proxy: move `handle_write` to `remote` service: storage_proxy: move `handle_paxos_prune` to `remote` service: storage_proxy: move `handle_paxos_accept` to `remote` service: storage_proxy: move `handle_paxos_prepare` to `remote` service: storage_proxy: move `handle_truncate` to `remote` service: storage_proxy: move `handle_read_digest` to `remote` service: storage_proxy: move `handle_read_mutation_data` to `remote` service: storage_proxy: move `handle_read_data` to `remote` service: storage_proxy: move `handle_mutation_failed` to `remote` service: storage_proxy: move `handle_mutation_done` to `remote` service: storage_proxy: move `handle_paxos_learn` to `remote` service: storage_proxy: move `receive_mutation_handler` to `remote` service: storage_proxy: move `handle_counter_mutation` to `remote` service: storage_proxy: remove 
`get_local_shared_storage_proxy` service: storage_proxy: (de)register RPC handlers in `remote` service: storage_proxy: introduce `remote`	2022-08-04 17:50:20 +03:00
Alejo Sanchez	97f0e11c3a	test.py: handle properly pytest output file for CQL tests Previously, if pytest itself failed (e.g. bad import or unexpected parameter), there was no output file but test.py tried to copy it and failed. Change the logic of handling the output file to first check if the file is there. Then if it's worth keeping it, move it to the test directory for easier comparison and maintenance. Else, if it's not worth keeping, discard it. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11193	2022-08-04 16:48:53 +02:00
Pavel Emelyanov	cf0f912e59	cdc: Handle sleep-aborted exception on stop When update_streams_description() fails it spawns a fiber and retries the update in the background once every 60s. If the sleeping between attempts is aborted, the respective exceptional future happens to be ignored and warned in logs. fixes: #11192 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220802132148.20688-1-xemul@scylladb.com>	2022-08-04 13:03:29 +02:00
Kamil Braun	0a4e701b50	service: storage_proxy: pass `migration_manager*` to `init_messaging_service` `migration_manager` lifetime is longer than the lifetime of "storage proxy's messaging service part" - that is, `init_messaging_service` is called after `migration_manager` is started, and `uninit_messaging_service` is called before `migration_manager` is stopped. Thus we don't need to hold an owning pointer to `migration_manager` here. Later, when `init_messaging_service` will actually construct `remote`, this will be a reference, not a pointer. Also observe that `_mm` in `remote` is only used in handlers, and handlers are unregistered before `_mm` is nullified, which ensures that handlers are not running when `_mm` is nullified. (This argument shows why the code made sense regardless of our switch from shared_ptr to raw ptr).	2022-08-04 12:19:43 +02:00
Kamil Braun	a08be82ce2	service: storage_proxy: `remote`: make `_gossiper` a const reference	2022-08-04 12:19:43 +02:00
Kamil Braun	a1aa9cf3f7	gms: gossiper: mark some member functions const	2022-08-04 12:19:43 +02:00
Kamil Braun	a9fd156a1b	db: consistency_level: `filter_for_query`: take `const gossiper&`	2022-08-04 12:19:38 +02:00
Kamil Braun	7b4146dd2a	replica: table: `get_hit_rate`: take `const gossiper&` It doesn't use any non-const members.	2022-08-04 12:16:09 +02:00
Kamil Braun	566e5f2a4f	gms: gossiper: move `endpoint_filter` to `storage_proxy` module The function only uses one public function of `gossiper` (`is_alive`) and is used only in one place in `storage_proxy`. Make it a static function private to the `storage_proxy` module. The function used a `default_random_engine` field in `gossiper` for generating random numbers. Turn this field into a static `thread_local` variable inside the function - no other `gossiper` members used the field.	2022-08-04 12:16:09 +02:00
Kamil Braun	078900042f	service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager` No need to call `_remote->gossiper().shared_from_this()` from within storage_proxy now.	2022-08-04 12:16:09 +02:00
Kamil Braun	d9d10d87ec	service: storage_proxy: establish private section in `remote` Only the (un)init, send_*, and `is_alive` functions are public, plus a getter for gossiper.	2022-08-04 12:16:05 +02:00
Kamil Braun	7364d453dd	service: storage_proxy: remove `migration_manager` pointer The ownership is passed to `remote`, which now contains a `shared_ptr<migration_manager>`.	2022-08-04 12:15:36 +02:00
Kamil Braun	bcc22ed1dc	service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote` Catch `this` in the lambdas.	2022-08-04 12:15:36 +02:00
Kamil Braun	eddd3b8226	service: storage_proxy: remove `_gossiper` field Access `gossiper` through `_remote`. Later, all those accesses will handle missing `remote`. Note that there are also accesses through the `remote()` internal getter. The plan is as follows: - direct accesses through `_remote` will be modified to handle missing `_remote` (these won't cause an error) - `remote()` will throw if `_remote` is missing (`remote()` is only used for operations which actually need to send a message to a remote node).	2022-08-04 12:15:35 +02:00
Kamil Braun	ab946e392f	alternator: ttl: pass `gossiper&` to `expiration_service` This allows us to remove the `gossiper()` getter from `storage_proxy`.	2022-08-04 12:12:43 +02:00
Kamil Braun	242e31d56e	service: storage_proxy: move `truncate_blocking` implementation to `remote` The truncate operation always truncates a table on the entire cluster, even for local tables. And it always does it by sending RPCs (the node sends an RPC to itself too). Thus it fits in the remote class. If we want to add a possibility to "truncate locally only" and/or change the behavior for local tables, we can add a branch in `storage_proxy::truncate_blocking`. Refs: #11087	2022-08-04 12:12:43 +02:00
Kamil Braun	3e73de9a40	service: storage_proxy: introduce `is_alive` helper A helper is introduced both in `remote` and in `storage_proxy`. The `storage_proxy` one calls the `remote` one. In the future it will also handle a missing `remote`. Then it will report only the local node to be alive and other nodes dead while `remote` is missing. The change reduces the number of functions using the `_gossiper` field in `storage_proxy`.	2022-08-04 12:12:41 +02:00
Jenkins Promoter	0ce19e7812	release: prepare for 5.2.0-dev	2022-08-04 13:09:55 +03:00
Botond Dénes	df203a48af	Merge "Remove reconnectable_snitch_helper" from Pavel Emelyanov " The helper is in charge of receiving INTERNAL_IP app state from gossiper join/change notifications, updating system.peers with it and kicking messaging service to update its preferred ip cache along with initiating clients reconnection. Effectively this helper duplicates the topology tracking code in storage-service notifiers. Removing it makes less code and drops a bunch of unwanted cross-components dependencies, in particular: - one qctx call is gone - snitch (almost) no longer needs to get messaging from gossiper - public:private IP cache becomes local to messaging and can be moved to topology at low cost Some nice minor side effect -- this helper was left unsubscribed from gossiper on stop and snitch rename. Now its all gone. " * 'br-remove-reconnectible-snitch-helper-2' of https://github.com/xemul/scylla: snitch: Remove reconnectable snitch helper snitch, storage_service: Move reconnect to internal_ip kick snitch, storage_service: Move system.peers preferred_ip update snitch: Export prefer-local	2022-08-04 13:06:05 +03:00
Anna Stuchlik	532aa6e655	doc: update the links to Manager and Operator Closes #11196	2022-08-04 11:38:39 +03:00
Avi Kivity	785ea869fb	Merge 'tools/scylla-sstable: introduce the write operation' from Botond Dénes Implementing json2sstable functionality. It allows generating an sstable from a JSON description of its content. Uses identical schema to dump-data, so it is possible to regenerate an existing sstable, by feeding the output of dump-data to write. Most of the scylla storage engine features are supported. The only non-supported features are counters and non-strictly atomic data types (including frozen collections, tuples and UDTs). Example invocation: ``` scylla sstable write --system-schema system_schema.columns --input-file ./input.json --generation 0 ``` Refs: https://github.com/scylladb/scylladb/issues/9681 Future plans: * Complete support for remaining features (counters and non-atomic types). * Make sstable format configurable on the command line. Closes #11181 * github.com:scylladb/scylladb: test/cql-pytest: test_tools.py: add test for sstable write test/cql-pytest: test-tools.py actually test with multiple sstables test/cql-pytest: test_tools.py: reduce the number of test-cases tools/scylla-sstable: introduce the write operation tools/scylla-sstable: add support for writer operations tools/scylla-sstable: dump-data: write bound-weight as int tools/scylla-sstable: dump-data: always write deletion time for cell tombstones tools/scylla-sstable: dump-data: add timezone to deletion_time types: publish timestamp_from_string()	2022-08-03 19:18:31 +03:00
Wojciech Mitros	64c03a2d24	wasm: fix compilation without libwasmtime Some segments of code using wasmtime were not under an ifdef SCYLLA_ENABLE_WASMTIME, making Scylla unable to compile on machines without wasmtime. This patch adds the ifdef where needed. Closes #11200	2022-08-03 18:16:02 +03:00
Anna Stuchlik	143455d7ac	doc: rewording	2022-08-03 16:58:29 +02:00
Anna Stuchlik	f2af63ddd5	doc: update the links to fix the warnings	2022-08-03 15:12:41 +02:00
Anna Stuchlik	1d61550c64	doc: add the new page to the toctree	2022-08-03 15:03:48 +02:00
Anna Stuchlik	756b9a278f	doc: add the description of specifying workload attributes with service levels	2022-08-03 14:57:50 +02:00
Raphael S. Carvalho	5757cc5160	mutation_reader_merger: fix indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220803003010.11551-1-raphaelsc@scylladb.com>	2022-08-03 14:33:07 +03:00
Anna Stuchlik	2fa175a819	doc: add the definition of workloads to the glossary	2022-08-03 13:31:07 +02:00
Piotr Dulikowski	4f2adc14de	db/system_keyspace: fix indentation after previous patch	2022-08-03 13:19:19 +02:00
Piotr Dulikowski	eff8a6368c	db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column Previously, the `system.local`'s `rpc_address` column kept local node's `rpc_address` from the scylla.yaml configuration. Although it sounds like it makes sense, there are a few reasons to change it to the value of scylla.yaml's `broadcast_rpc_address`: - The `broadcast_rpc_address` is the address that the drivers are supposed to connect to. `rpc_address` is the address that the node binds to - it can be set for example to 0.0.0.0 so that Scylla listens on all addresses, however this gives no useful information to the driver. - The `system.peers` table also has the `rpc_address` column and it already keeps other nodes' `broadcast_rpc_address`es. - Cassandra is going to do the same change in the upcoming version 4.1. Fixes: #11201	2022-08-03 13:19:03 +02:00
Botond Dénes	19441881bc	test/cql-pytest: test_tools.py: add test for sstable write We can now do a full circle: dump an sstable to json, generate an sstable from it, then dump again and compare to the original json. Expand the existing simple_no_clustering_table and simple_clustering_table schema/data to improve coverage of things like TTL, tombstones and static rows.	2022-08-03 14:00:50 +03:00
Botond Dénes	5d5c3b3fe3	test/cql-pytest: test-tools.py actually test with multiple sstables The test-cases in this suite have a parameter to run with one or multiple input sstables. This was broken as each test table generated a single sstable. Fix this so we actually get single/multiple input sstable coverage.	2022-08-03 14:00:50 +03:00
Botond Dénes	bd772d095f	test/cql-pytest: test_tools.py: reduce the number of test-cases Currently this test-case exercises all the available component dumpers with many different schemas. This doesn't add any value for most of the dumpers, save for the dump-data one. It does have a cost however in run-time of these test-cases. Test the dumpers which are mostly indifferent to the schema with just a single one, cutting the number of generated test-cases from 70 to 30.	2022-08-03 14:00:50 +03:00
Botond Dénes	d0eaa72bd7	tools/scylla-sstable: introduce the write operation Allows generating an sstable based on a JSON description of its content. Uses identical schema to dump-data, so it is possible to regenerate an existing sstable, by feeding the output of dump-data to write. Most of the scylladb storage engine features are supported, with the exception of the following: * counters * non-strictly atomic types, including frozen collections, tuples or UDTs.	2022-08-03 14:00:02 +03:00
Botond Dénes	4377be30ba	tools/scylla-sstable: add support for writer operations Currently it is assumed that all operations read sstables. They get a non-empty list of sstables as input and have no means to create sstable-writers. We want to add support for operations that write sstables. For this, we relax the current top-level check about the sstable list not being empty. We defer this empty-check for operations that actually need input sstables. Furthermore, the operation_func gains an sstable_manager& argument, to allow operations to create sstable writers. Operations are now read-write capable. In addition to the above the documentation language is adjusted to not assume read-only operations.	2022-08-03 13:49:22 +03:00
Botond Dénes	87443d2da0	tools/scylla-sstable: dump-data: write bound-weight as int No reason for it to be written as a string, the documentation even says it is an integer.	2022-08-03 13:49:22 +03:00
Botond Dénes	ef786f9b85	tools/scylla-sstable: dump-data: always write deletion time for cell tombstones Said field is not optional for dead cells - it is mandatory for all tombstones, including cell tombstones.	2022-08-03 13:49:22 +03:00
Botond Dénes	833ed03533	tools/scylla-sstable: dump-data: add timezone to deletion_time Deletion time is always in UTC but whoever looks at the JSON has no way to know that. In particular, date-time parsers assume the local timezone in its absence, which of course results in an incorrect deletion_time after parsing.	2022-08-03 13:49:17 +03:00
Takuya ASADA	d7dfd0a696	main: run --version before app_template initialize Even in an environment that causes an error while initializing Scylla, "scylla --version" should be able to run without error. To do so, we need to parse and execute these options before initializing Scylla/Seastar classes. Fixes #11117 Closes #11179	2022-08-03 11:25:28 +03:00
Kamil Braun	2aff2fea00	service: storage_proxy: remove `_messaging` reference All uses of `messaging_service&` have been moved to `remote`.	2022-08-02 19:55:12 +02:00
Kamil Braun	cf931c7863	service: storage_proxy: move `connection_dropped` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	2203d4fa09	service: storage_proxy: make `encode_replica_exception_for_rpc` a static function No need for this ugly template to be part of the `storage_proxy` header.	2022-08-02 19:55:12 +02:00
Kamil Braun	3499bc7731	service: storage_proxy: move `handle_write` to `remote` It is a helper used by `receive_mutation_handler` and `handle_paxos_learn`.	2022-08-02 19:55:12 +02:00
Kamil Braun	ba88ad8db0	service: storage_proxy: move `handle_paxos_prune` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	548767f91e	service: storage_proxy: move `handle_paxos_accept` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	807c7f32de	service: storage_proxy: move `handle_paxos_prepare` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	0e431e7c03	service: storage_proxy: move `handle_truncate` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	f8c1ba357f	service: storage_proxy: move `handle_read_digest` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	43997af40f	service: storage_proxy: move `handle_read_mutation_data` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	80586a0c7e	service: storage_proxy: move `handle_read_data` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	00c0ee44bd	service: storage_proxy: move `handle_mutation_failed` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	b9c436c6e0	service: storage_proxy: move `handle_mutation_done` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	178536d5d2	service: storage_proxy: move `handle_paxos_learn` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	f309886fac	service: storage_proxy: move `receive_mutation_handler` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	fad14d2094	service: storage_proxy: move `handle_counter_mutation` to `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	93325a220f	service: storage_proxy: remove `get_local_shared_storage_proxy` Its remaining uses are trivial to remove. Note: in `handle_counter_mutation` we had this piece of code: ``` }).then([trace_state_ptr = std::move(trace_state_ptr), &mutations, cl, timeout] { auto sp = get_local_shared_storage_proxy(); return sp->mutate_counters_on_leader(...); ``` Obtaining a `shared_ptr` to `storage_proxy` at this point is no different from obtaining a regular pointer: - The pointer is obtained inside `then` lambda body, not in the capture list. So if the goal of obtaining a `shared_ptr` here was to keep `storage_proxy` alive until the `then` lambda body is executed, that goal wasn't achieved because the pointer was obtained too late. - The `shared_ptr` is destroyed as soon as `mutate_counters_on_leader` returns, it's not stored anywhere. So it doesn't prolong the lifetime of the service. I replaced this with a simple capture of `this` in the lambda.	2022-08-02 19:55:12 +02:00
Kamil Braun	5148eafbd6	service: storage_proxy: (de)register RPC handlers in `remote`	2022-08-02 19:55:12 +02:00
Kamil Braun	f174645ab5	service: storage_proxy: introduce `remote` Move most accesses to `_messaging` to this struct (functions that send RPCs).	2022-08-02 19:55:10 +02:00
Avi Kivity	a4844826fc	Merge 'Decouple compaction manager from database' from Benny Halevy Start compaction_manager as a sharded service and pass a reference to it to the database rather than having the database construct its own compaction_manager. This is part of the wider scope effort to decouple compaction from replica database and table. Closes #11099 * github.com:scylladb/scylladb: compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges main: start compaction_manager as a sharded service compaction_manager: keep config as member backlog_controller: keep scheduling_group by value backlog_controller: scheduling_group: keep io_priority_class by value backlog_controller: scheduling_group: define default member initializers backlog_controller: get rid of _interval member	2022-08-02 19:02:46 +03:00
Avi Kivity	6fd2496501	Merge 'token_metadata: keep the set of normal token owners as a member' from Benny Halevy token_metadata: impl: keep the set of normal token owners as a member We don't need to recalculate the unique set of normal token everytime we change `_token_to_endpoint_map`. Similarly, this doesn't have to be done in `get_all_endpoints`. Instead we can maintain it inexpensively in `remove_endpoint`, and let `count_normal_token_owners` just return its size and `get_all_endpoints` just return the saved set. Closes #11128 Fixes #11146 Closes #11158 * github.com:scylladb/scylladb: token_metadata: allow update_normal_token_owners to yield token_metadata: get_all_endpoints: return const unordered_set<inet_address>& token_metadata: impl: keep the set of normal token owners as a member	2022-08-02 16:49:41 +03:00
Avi Kivity	665c85aefe	Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes When stopping the read, the multishard reader will dismantle the compaction state, pushing back (unpopping) the currently processed partition's header to its originating reader. This ensures that if the reader stops in the middle of a partition, on the next page the partition-header is re-emitted as the compactor (and everything downstream from it) expects. It can happen however that there is nothing more for the current partition in the reader and the next fragment is another partition. Since we only push back the partition header (without a partition-end) this can result in two partitions being emitted without being separated by a partition end. We could just add the missing partition-end when needed but it is pointless, if the partition has no more data, just drop the header, we won't need it on the next page. The missing partition-end can generate an "IDL frame truncated" message as it ends up causing the query result writer to create a corrupt partition entry. Fixes: https://github.com/scylladb/scylladb/issues/9482 Closes #11175 * github.com:scylladb/scylladb: test/cql-pytest: add regression test for "IDL frame truncated" error mutation_compactor: detach_state(): make it no-op if partition was exhausted querier: use full_position in shard_mutation_querier	2022-08-02 16:41:15 +03:00
Avi Kivity	e526facef2	Merge 'Fix undefined behavior during eviction' from Tomasz Grabiec When the last non-dummy row is evicted from a partition, the partition entry is evicted as well. The existing logic in on_evicted() leaves the last dummy row in the partition version before evicting the partition entry. This row may still be attached to the LRU. Eviction of partition entry goes through mutation_cleaner::clear_gently(). If this is preempted, the destruction may proceed in the background. If eviction happens on the remaining row in that entry before it's destroyed, the code will hit undefined behavior. on_evicted() calls partition_version::is_referenced_from_entry(), which is unspecified when the version is enqueued in the mutation_cleaner. It returns an incorrect value for the last item remaining in the LRU (middle entries evict fine). In that case, eviction will try to access non-existent containing partition_entry, causing undefined behavior. Caught by debug-mode cql_query_test.test_clustering_filtering with raft enabled, where it manifested like this: partition_version.hh:328:16: runtime error: load of value 7, which is not a valid value for type 'bool' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior partition_version.hh:328:16 in Aborting on shard 0. Instances of this issue outside of the unit test environment are not known as of yet. This change makes is_referenced_from_entry() return the correct value even for versions which are queued in the mutation cleaner. Fixes https://github.com/scylladb/scylladb/issues/11140 The series also contains some related cleanups and minor fixes for issues which could come up later. Closes #11187 * github.com:scylladb/scylladb: cache_tracker: Make clear() leave no garbage partition_snapshot_row_cursor: Fix over-counting of rows row_cache: Fix undefined behavior during eviction under some conditions	2022-08-02 16:40:23 +03:00
Avi Kivity	268e4abe77	Merge 'wasm: reuse instances for wasm UDFs' from Wojciech Mitros Calling WebAssembly UDFs requires wasmtime instance. Creating such an instance is expensive, but these instances can be reused for subsequent calls of the same UDF on various inputs. This patch introduces a way of reusing wasmtime instances: a wasm instance cache. The cache stores a wasmtime instance for each UDF and scheduling group. The instances are evicted using LRU strategy and their size is based on the size of their wasm memories. The instances stored in the cache are also dropped when the UDF is dropped itself. For that reason, the first patch modifies the current implementation of UDF dropping, so that the instance dropping may be added later. The patch also removes the need of compiling the UDF again when dropping it. The second patch contains the implementation and use of the new cache. The cache is implemented in `lang/wasm_instance_cache.hh` and the main ways of using it are the `run_script` methods from `wasm.hh` The third patch adds tests to `test_wasm.py` that check the correctness and performance of the new cache. The tests confirm the instance reuse, size limits, instance eviction after timeout and after dropping the UDF. Closes #10306 * github.com:scylladb/scylladb: wasm: test instances reuse wasm: reuse UDF instances schema_tables: simplify merge_functions and avoid extra compilation	2022-08-02 13:51:16 +03:00
Botond Dénes	38d0db4be5	Merge 'doc: remove the Manager documentation from the core ScyllaDB docs' from Anna Stuchlik In this PR, I have: - removed the docs for Manager (including the sources for Manager 2.1 and the upgrade guides). - added redirects to https://manager.docs.scylladb.com/. - replaced the internal links with external links to https://manager.docs.scylladb.com/. Closes #11162 * github.com:scylladb/scylladb: doc: update the link to fix the warning about duplicate targets Update docs/kb/gc-grace-seconds.rst Update docs/_utils/redirects.yaml doc: update the links to Manager doc: add the link to manager.docs.scylladb.com to the toctree doc: remove the docs for Manager - the Manager page, the guide for Manager 2.1, Manager upgrade guides doc: add redirections from Manager 2.1 to the Manager docs doc: add redirections to manager.docs.scylladb.com	2022-08-02 12:29:37 +03:00
Anna Stuchlik	cec54229fa	doc: update the link to fix the warning about duplicate targets	2022-08-02 11:21:02 +02:00
Anna Stuchlik	780597b0f9	Update docs/kb/gc-grace-seconds.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-08-02 11:21:02 +02:00
Anna Stuchlik	849cdd715b	Update docs/_utils/redirects.yaml Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-08-02 11:20:59 +02:00
Anna Stuchlik	8cad0de042	doc: update the links to Manager	2022-08-02 11:18:16 +02:00
Anna Stuchlik	ba67dfeca6	doc: add the link to manager.docs.scylladb.com to the toctree	2022-08-02 11:15:14 +02:00
Anna Stuchlik	f72e16b013	doc: remove the docs for Manager - the Manager page, the guide for Manager 2.1, Manager upgrade guides	2022-08-02 11:14:07 +02:00
Anna Stuchlik	c9db3bd7ea	doc: add redirections from Manager 2.1 to the Manager docs	2022-08-02 11:12:10 +02:00
Anna Stuchlik	3b5add05a7	doc: add redirections to manager.docs.scylladb.com	2022-08-02 11:12:05 +02:00
Tomasz Grabiec	4c33d1650d	cache_tracker: Make clear() leave no garbage Preemption during partition entry eviction could put it in the mutation cleaner. No known issues caused by this. Affects only tests.	2022-08-02 11:02:22 +02:00
Tomasz Grabiec	a58fee1dcf	partition_snapshot_row_cursor: Fix over-counting of rows insert_before() may need to allocate memory for a btree, so may fail. Call cache_tracker::insert() only after successful insertion so that row counters reflect the correct state. On failure, the entry will be unlinked automatically by rows_entry destructor, but row counters in the cache_tracker will not be automatically decremented.	2022-08-02 11:02:22 +02:00
Benny Halevy	0dfd92d0b3	token_metadata: allow update_normal_token_owners to yield Given #11146, we see a 10ms stall when calculate_natural_endpoints calls get_all_endpoints that up until this patch performed a similar loop on the `_token_to_endpoint_map`, so to prevent such a stall with large number of tokens, turn update_normal_token_owners async, and allow yielding in the per-token tight loop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 10:49:32 +03:00
Benny Halevy	4f8ccef2c1	token_metadata: get_all_endpoints: return const unordered_set<inet_address>& There's no need to transform it into a vector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 10:49:08 +03:00
Benny Halevy	a980f94d85	token_metadata: impl: keep the set of normal token owners as a member We don't need to recalculate the unique set of normal token every time we change `_token_to_endpoint_map`. Similarly, this doesn't have to be done in `get_all_endpoints`. Instead we can maintain it inexpensively in `remove_endpoint`, and let `count_normal_token_owners` just return its size and `get_all_endpoints` just return the saved set. Note that currently topology is not updated accurately in update_normal_token() and it may contain endpoints that no longer own any tokens. If we did update topology accurately there, we could use its locations map instead as its keys are equivalent to the unordered_set<inet_address> we implement here. Closes #11128 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 10:49:07 +03:00
Botond Dénes	5ea6700e23	types: publish timestamp_from_string() It looks like it is a better option for timestamp parsing than anything current C++ stdlib can offer. What a pity.	2022-08-02 10:33:01 +03:00
Benny Halevy	14faa3b6f4	compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges And completely get rid of the dependency on replica::database. Also, add respective rest_api tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 08:08:11 +03:00
Benny Halevy	e1fe598760	compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges Currently they are copied for the get_sstables function so this change reduces copies. Also, it will allow further decoupling of compaction_manager from replica::database, by letting the caller of perform_cleanup and perform_sstable_upgrade get the owned token ranges from db and pass it to the perform_* functions in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:57:41 +03:00
Benny Halevy	e4e92d44ae	main: start compaction_manager as a sharded service And pass a reference to it to the database rather than having the database construct its own compaction_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:50:15 +03:00
Benny Halevy	7f70949693	compaction_manager: keep config as member Rather than keeping separate, duplicated members. And define helpers to get those members. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:48:01 +03:00
Benny Halevy	c9a9720247	backlog_controller: keep scheduling_group by value There is no need to keep a mutable reference to the scheduling_group passed at construction time since setting / updating shares is using the scheduling_group / io_priority_class id as a handle, and the id itself is never changed by the backlog_controller. Note that the class names are misleading, in hindsight, they would better be called scheduling_group_id and io_priority_class_id, respectively. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:38:40 +03:00
Benny Halevy	78ad1c70a2	backlog_controller: scheduling_group: keep io_priority_class by value Exactly like the cpu scheduling_group, io_priority_class contains the class id, which is a handle to the io_priority_class and so can be kept by value, rather than by reference, and be safely copied around. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:38:40 +03:00
Benny Halevy	450ecd60c6	backlog_controller: scheduling_group: define default member initializers To prepare for the next patch, implement default initialization of the scheduling_group and io_priority_class, to the default values. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:38:40 +03:00
Benny Halevy	3e6622180e	backlog_controller: get rid of _interval member It isn't used outside the constructor. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:38:40 +03:00
Botond Dénes	11af489e84	test/cql-pytest: add regression test for "IDL frame truncated" error	2022-08-02 06:43:24 +03:00
Botond Dénes	70b4158ce0	mutation_compactor: detach_state(): make it no-op if partition was exhausted detach_state() allows the user to resume a compaction process later, without having to keep the compactor object alive. This happens by generating and returning the mutation fragments the user has to re-feed to a newly constructed compactor to bring it into the exact same state the current compactor was at the point of stopping the compaction. This state includes the partition-header (partition-start and static-row if any) and the currently active range tombstone. Detaching the state is pointless however when the compaction was stopped such that the currently compacted partition was completely exhausted. Allowing the state to be detached in this case seems benign but it caused a subtle bug in the main user of this feature: the partition range scan algorithm, where the fragments included in the detached state were pushed back into the reader which produced them. If the partition happened to be exhausted -- meaning the next fragment in the reader was a partition-start or EOS -- this resulted in the partition being re-emitted later without a partition-end, resulting in corrupt query-result being generated, in turn resulting in an obscure "IDL frame truncated" error. This patch solves this seemingly benign but sinister bug by making the return value of `detach_state()` an std::optional and returning a disengaged optional when the partition was exhausted.	2022-08-02 06:43:24 +03:00
Botond Dénes	cdd3a364cb	querier: use full_position in shard_mutation_querier Instead of a separate partition key and position-in-partition. This continues the recently started effort to standardize storing of full positions on `full_position`. This patch is also a hidden preparation for read_context::save_readers() (multishard_mutation_query.cc) no longer being able to get partition key from compaction state in the future.	2022-08-02 06:43:24 +03:00
Botond Dénes	768a5c8b5a	Merge 'doc: add the upgrade guide from 5.0 to 2022.1' from Anna Stuchlik Fix https://github.com/scylladb/scylla-docs/issues/4125 I've added the upgrade guides from 5.0 to 2022.1. They are based on the previous upgrade guides from Open Source to Enterprise. Closes #11108 * github.com:scylladb/scylladb: doc: apply feedback about scylla-enterprise-machine-image doc: update the note about installing scylla-enterprise-machine-image update the info about installing scylla-enterprise-machine-image during upgrade doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image doc: update the info about metrics in 2022.1 compared to 5.0 doc: minor formatting and language fixes doc: add the new guide to the toctree doc: add the upgrade guide from 5.0 to 2022.1	2022-08-02 06:23:41 +03:00
Anna Stuchlik	8e0e603c48	doc: remove Drivers from getting started index to avoid duplication and reflect the project structure Closes #11163	2022-08-02 06:20:18 +03:00
Botond Dénes	d532fd7896	Merge 'doc: remove the Monitoring Stack documentation from the core ScyllaDB docs' from Anna Stuchlik I created this branch to remove the external docs (Manager, Monitoring, Operator) from the core ScyllaDB documentation. However, to make reviewing easier, this PR only covers removing the docs for ScyllaDB Monitoring Stack. I'm going to send other PRs to cover Manager and Operator. In this PR, I have: - removed the docs for ScyllaDB Monitoring Stack (including the sources for old versions). - added redirects to https://monitoring.docs.scylladb.com/. - replaced the internal links with external links to https://monitoring.docs.scylladb.com/. Closes #11151 * github.com:scylladb/scylladb: doc: fix the link to the Monitoring Stack doc: fix the links in the manager section doc: add the external link to Monitoring Stack to the menu doc: replace the links to Monitoring Stack doc: add the redirections for Monitoring Stack doc: delete the Monitoring Stack documentation from the ScyllaDB docs and remove it from the toctree	2022-08-02 06:03:52 +03:00
Tomasz Grabiec	a459d9ab98	row_cache: Fix undefined behavior during eviction under some conditions When the last non-dummy row is evicted from a partition, the partition entry is evicted as well. The existing logic in on_evicted() leaves the last dummy row in the partition version before evicting the partition entry. This row may still be attached to the LRU. Eviction of partition entry goes through mutation_cleaner::clear_gently(). If this is preempted, the destruction may proceed in the background. If eviction happens on the remaining row in that entry before it's destroyed, the code will hit undefined behavior. on_evicted() calls partition_version::is_referenced_from_entry(), which is unspecified when the version is enqueued in the mutation_cleaner. It returns incorrect value for the last item remaining in the LRU. In that case eviction will try to access non-existent containing partition_entry, causing undefined behavior. Caught by debug-mode cql_query_test.test_clustering_filtering with raft enabled. Where it manifested like this: partition_version.hh:328:16: runtime error: load of value 7, which is not a valid value for type 'bool' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior partition_version.hh:328:16 in Aborting on shard 0. Instances of this issue outside of the unit test environment are not known as of yet. This change makes is_referenced_from_entry() return the correct value even for versions which are queued in the mutation cleaner. Fixes #11140	2022-08-01 23:53:15 +02:00
Raphael S. Carvalho	934af9be52	mutation_reader_merger: Drop unneeded readers as soon as possible Today, mutation_reader_merger drops unneeded readers in batches of 4, meaning that the merger is having to keep the memory used by 3 unneeded readers in addition to the ones being currently read from. As each may own a lot of memory, the combined effect of this waste, coming from parallel reads, can potentially cause memory pressure. This batching behavior was introduced in `b524f96a74`, when readers had to be destroyed synchronously, as flat_mutation_reader lacked an async close interface. But we have gone a long way since then. Readers can be closed asynchronously and outstanding I/O requests will be cancelled on close. Now, we'll close readers as soon as they're unneeded, one at a time, using a continuation chain. If we're submitting close calls faster than we can retire them, then we wait for their completion, preventing memory usage from growing unbounded. The benefit of this new approach will be very good when combining disjoint readers, where only one is active at a time for producing fragments. As soon as we're done with the current one, then it will be closed allowing its memory to be released, before we move on to the next reader that follows. Refs #11040. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11167	2022-08-01 20:06:29 +03:00
Anna Stuchlik	f7269d0f3b	doc: update the description of virtual tables on the Enterprise Features page Closes #11097	2022-08-01 17:52:43 +03:00
Benny Halevy	edd308c705	config: use ordered map for experimental features So that the help string will be sorted lexicographically. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11178	2022-08-01 17:40:10 +03:00
Piotr Sarna	dd2417618e	forward_service: limit the number of partition ranges fetched The forward service uses a vector of ranges owned by a particular shard in order to split and delegate the work. The number can grow large though, which can cause large allocations. This commit limits the number of ranges handled at a time to 256. Fixes #10725 Closes #11182	2022-08-01 17:36:34 +03:00
Benny Halevy	663f2e2a8f	Update seastar submodule * seastar 1d4432ed28...f9f5228b74 (33): > intent: drop unused headers > resource: Improve incorrect --smp option handling > util/log: make the width shard_id field fixed in log message > Update building-docker.md > batch_flush: Replace circular buffer with slist > linux-aio: Sanitize get_user_data helpers > pipe: add missing return in pipe's operator='s > Merge 'core, rpc: silence couple warnings from GCC-12' from Kefu Chai > tls: vec_push: handle synchronous error from put Fixes #11118 > seastar/rpc: add fmt::ostream_formatter<> for rpc::connection_id > test: io_queue_test: remove unused lambda capture > Merge "Split oversized requests" from Pavel E > test: Add test for over-sized request submission > io_queue: Add AIO stats for requests splitting > io_queue: Remove capped ticket making > io_queue: Relax ticket making > io_queue: Split oversized request on submission > io_request: Add .split(size_t max_length) method > io_queue: Keep the iovec memory on io-desc > io_queue: Add io_direction_and_length::read/write aliases > io_request: Simplify io_direction_and_length manipulations > reactor: Move submit_io_...() into io_queue > file, reactor: Use iovec_len() > utils: Put iovec manipulation helpers into util > rpc: init all member variables > core/simple-stream: do not qualify return type with "const" > core, rpc: do not pass unused parameters > reactor: Check the io-properties being YAML::Map > rpc: Fix formatting on some fmt lib versions > Merge "Make RPC server connection negotiation synchronous" from Pavel E > rpc: Fix indentation after previous patch > rpc: Make server::connection::negotiate synchronous Fixes #10950 > doc: fix redundant double wording in tutorial.md Closes #11176	2022-08-01 17:06:28 +03:00
Anna Stuchlik	4204fc3096	doc: apply feedback about scylla-enterprise-machine-image	2022-08-01 14:35:24 +02:00
Anna Stuchlik	9fe7aa5c9a	doc: update the note about installing scylla-enterprise-machine-image	2022-08-01 14:22:45 +02:00
Piotr Sarna	d4abb73389	Merge 'scrub compaction: count validation errors... and return status over the rest api' from Aleksandra Martyniuk Currently, scrub returns to user the number indicating operation result as follows: - 1 when the operation was aborted; - 3 in validate and segregate modes when validation errors were found (and in segregate mode - fixed); - 0 if operation ended successfully. To achieve so, if an operation was aborted in abort mode, then the exception is propagated to storage_service.cc. Also the number of validation errors for current scrub is gathered and summed from each shard there. The number of validation errors is counted and registered in metrics. Metrics provide common counters for all scrub operations within a compaction manager, though. Thus, to check the exact number of validation errors, the comparison of counter value before and after scrub operation needs to be done. Closes #11074 * github.com:scylladb/scylladb: scrub compaction: return status indicating aborted operations over the rest api test: move scylla_inject_error from alternator/ to cql-pytest/ scrub compaction: count validation errors and return status over the rest api scrub compaction: count validation errors for specific scrub task compaction: extract statistics in compaction_result scrub compaction: register validation errors in metrics scrub compaction: count validation errors	2022-08-01 12:05:00 +02:00
Anna Stuchlik	d84d2e6faa	Merge branch 'master' into anna-remove-external-docs	2022-08-01 11:50:03 +02:00
guy9	4d24097b4b	adding Documentation website top banner options, with the current setting set to hide the banner Closes #11172	2022-08-01 10:32:07 +03:00
Botond Dénes	2c4e06330d	Merge "Remove _replicating_nodes and _removing_node" from Pavel Emelyanov " Commit `829b4c14` (repair: Make removenode safe by default) turned these two to be read only (in fact, erase- and clear- from too). " * 'br-dangling-replicating-nodes' of https://github.com/xemul/scylla: storage_service: Relax confirm_replication() storage_service: Remove _removing_node storage_service: Remove _replicating_nodes	2022-08-01 10:25:40 +03:00
Tzach Livyatan	6088bdea91	Docs: Add more information about Raft v2 Closes #11057	2022-08-01 09:00:18 +03:00
Tzach Livyatan	33aa50e783	Docs: move the consistency calculator to a dedicated page Closes #11149	2022-08-01 08:59:30 +03:00
Botond Dénes	8ea1ebdb88	Merge 'doc: remove the Operator documentation pages from the core ScyllaDB docs' from Anna Stuchlik This PR removes the existing Operator documentation pages from the core ScyllaDB docs. I have: - removed the Operator page and replaced it with the link to the Operator documentation. - created a redirect. - updated the links to the Operator. Closes #11154 * github.com:scylladb/scylladb: Update docs/operating-scylla/index.rst doc: fix the link to Operator add the redirect to the Operator replace the internal links with the external link to Operator add the external link to Operator to the toctree doc: remove the Operator docs from the core documentation	2022-08-01 07:00:21 +03:00
Tzach Livyatan	520b9c88b7	Docs: Update the git repo path to scylladb/scylladb Closes #11171	2022-07-31 15:32:09 +03:00
Pavel Emelyanov	29768a2d02	gitattributes: Mark *.svg as binary The goal is to put .svg files under git grep's radar. Otherwise a pretty innocent 'git grep db::is_local' dumps the contents of the docs/kb/flamegraph.svg on the screen, because it a) contains the grep pattern and b) is looooong one-liner Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220730090026.8537-1-xemul@scylladb.com>	2022-07-31 15:25:24 +03:00
Avi Kivity	00cec159d6	Revert "Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes" This reverts commit `c3bad157e5`, reversing changes made to `e66809d051`. The checks it adds are triggered by some dtests. While it's possible the check is triggered due to an existing problem, better to investigate it out-of-tree. Fixes #11169.	2022-07-31 15:24:33 +03:00
Pavel Emelyanov	ee0828b506	topology: Add local-dc detection sugar It's often needed to check if an endpoint sits in the same DC as the current node. It can be done by topo.get_datacenter() == topo.get_datacenter(endpoint) but in some cases a RAII filter function can be helpful. Also there's a db::count_local_endpoints() that is surprisingly in use, so add it to topology as well. Next patches will make use of both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-30 17:58:45 +03:00
Pavel Emelyanov	22fdc03b71	storage_service: Relax confirm_replication() This method is called from REPLICATION_FINISHED handler and now just logs a message. The verb is probably worth keeping for compatibility at least for some time. The logging itself can be moved into handler's lambda Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-29 11:47:37 +03:00
Pavel Emelyanov	c8f9d1237f	storage_service: Remove _removing_node This optional is always disengaged Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-29 11:47:11 +03:00
Pavel Emelyanov	4d08554a92	storage_service: Remove _replicating_nodes The set in question is read-and-erase-only Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-29 11:45:42 +03:00
Aleksandra Martyniuk	6ea5bc96d7	scrub compaction: return status indicating aborted operations over the rest api When performing a compaction scrub, the user did not know whether an operation was aborted. If a compaction scrub is aborted, the return status the user gets over the rest api is set to 1.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	8e892426e2	test: move scylla_inject_error from alternator/ to cql-pytest/ Move scylla_inject_error from alternator/ to cql-pytest/ so it can be reached from various tests dirs. alternator/util.py is renamed to alternator/alternator_util.py to avoid name shadowing.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	f1980f8dc6	scrub compaction: count validation errors and return status over the rest api When performing a compaction scrub, the user did not know whether any validation errors were encountered. The number of validation errors per given compaction scrub is gathered and summed from each shard. Based on that value, the return status over the rest api is set to 3 if any validation errors were encountered.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	7d457cffb8	scrub compaction: count validation errors for specific scrub task The number of validation errors per given compaction scrub on given shard is passed up to perform_task() function.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	3a805a9d9b	compaction: extract statistics in compaction_result Statistics from compaction_result are extracted to new struct compaction_stats and stored as a field of compaction_result.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	a80c187b20	scrub compaction: register validation errors in metrics The number of validation errors is registered in metrics. Metrics provide common counters for all scrub operations within a compaction manager, though. Thus, to check the exact number of validation errors, the comparison of counters before and after scrub operation needs to be done.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	ab85dab05d	scrub compaction: count validation errors The number of validation errors encountered during scrub compaction is counted.	2022-07-29 09:35:20 +02:00
Anna Stuchlik	5699da2357	Merge branch 'scylladb:master' into anna-remove-external-docs	2022-07-29 09:28:23 +02:00
Benny Halevy	cf47db2bdb	token_metadata: document that update_normal_tokens is unsafe Currently, if token_metadata_impl::update_normal_tokens throws an exception before it's done, it leaves the token_metadata_impl members partially updated and we have no way of recovering from that. The existing use cases take that into account and always call it on a cloned, temporary copy of the token metadata, so if it throws, the temporary copy is tossed away without being applied back. So just cement this, by adding cautions in the token_metadata class declaration. Closes #11127 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220728144821.130518-1-bhalevy@scylladb.com>	2022-07-29 05:38:56 +03:00
Avi Kivity	c3bad157e5	Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes When stopping the read, the multishard reader will dismantle the compaction state, pushing back (unpopping) the currently processed partition's header to its originating reader. This ensures that if the reader stops in the middle of a partition, on the next page the partition-header is re-emitted as the compactor (and everything downstream from it) expects. It can happen however that there is nothing more for the current partition in the reader and the next fragment is another partition. Since we only push back the partition header (without a partition-end) this can result in two partitions being emitted without being separated by a partition end. We could just add the missing partition-end when needed but it is pointless, if the partition has no more data, just drop the header, we won't need it on the next page. The missing partition-end can generate an "IDL frame truncated" message as it ends up causing the query result writer to create a corrupt partition entry. Fixes: https://github.com/scylladb/scylla/issues/9482 Closes #11137 * github.com:scylladb/scylladb: test/cql-pytest: add regression test for "IDL frame truncated" error query: query_result_builder: add check for missing partition-end mutation_compactor: detach_state(): make it no-op if partition was exhausted querier: use full_position in shard_mutation_querier	2022-07-28 20:14:15 +03:00
Avi Kivity	e66809d051	Merge 'Memtable flush: wait for sstable count reduction if needed' from Benny Halevy Called from try_flush_memtable_to_sstable, maybe_wait_for_sstable_count_reduction will wait for compaction to catch up with memtable flush if the bucket to compact is inflated, having too many sstables. In that case we don't want to add fuel to the fire by creating yet another sstable. Fixes #4116 Closes #10954 * github.com:scylladb/scylla: table: Add test where compaction doesn't keep up with flush rate. compaction_manager: add maybe_wait_for_sstable_count_reduction time_window_compaction_strategy: get_sstables_for_compaction: clean up code time_window_compaction_strategy: make get_sstables_for_compaction idempotent time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages leveled_manifest: pass compaction_counter as const&	2022-07-28 19:11:04 +03:00
Anna Stuchlik	4e2a41f53c	Update docs/operating-scylla/index.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-07-28 15:25:16 +02:00
Anna Stuchlik	2845b4e598	doc: fix the link to the Monitoring Stack	2022-07-28 15:23:26 +02:00
Anna Stuchlik	69bf768907	doc: fix the link to Operator	2022-07-28 15:18:03 +02:00
Anna Stuchlik	792d1412d6	add the redirect to the Operator	2022-07-28 15:16:29 +02:00
Anna Stuchlik	302da44859	replace the internal links with the external link to Operator	2022-07-28 15:13:03 +02:00
Anna Stuchlik	70b79c6867	add the external link to Operator to the toctree	2022-07-28 15:04:15 +02:00
Anna Stuchlik	966c3423ad	doc: remove the Operator docs from the core documentation	2022-07-28 15:00:57 +02:00
Anna Stuchlik	63a8ef7030	doc: fix the links in the manager section	2022-07-28 14:54:30 +02:00
Anna Stuchlik	3e2ffaf91e	doc: add the external link to Monitoring Stack to the menu	2022-07-28 14:44:07 +02:00
Avi Kivity	09a6b93ddf	Merge 'logalloc: region: properly track listeners when moved' from Benny Halevy Currently logalloc::region is relying on boost binomial_heap handle to properly move listeners registration when the region (when derived from dirty_memory_manager_logalloc::size_tracked_region) is moved, like boost::intrusive link hooks do - hence `81e20ceaab/dirty_memory_manager.cc (L89-L90)` does nothing. Unfortunately, this doesn't work as expected. This series adds a unit test that verifies the move semantics and a fix to size_tracked_region and region_group code to make it pass. Also "logalloc: region: get_impl might be called on disengaged _impl when moved" fixes a couple corner cases where the shared _impl could be dereferenced when disengaged, and the change also adds a unit test for that too. Closes #11141 * github.com:scylladb/scylla: logalloc: region: properly track listeners when moved logalloc: region_impl: add moved method logalloc: region: merge: optimize getting other impl logalloc: region: merge: call region_impl::unlisten logalloc: region: call unlisten rather than open coding it logalloc: region move-ctor: initialize _impl logalloc: region: get_impl might be called on disengaged _impl when moved	2022-07-28 15:29:54 +03:00
Mikołaj Sielużycki	e0c6e1ef3c	table: Add test where compaction doesn't keep up with flush rate. The test simulates a situation where 2 threads issue flushes to 2 tables. Both issue small flushes, but one has injected reactor stalls. This can lead to a situation where lots of small sstables accumulate on disk, and, if compaction never has a chance to keep up, resources can be exhausted. (cherry picked from commit `b5684aa96d`) (cherry picked from commit `25407a7e41`)	2022-07-28 14:43:33 +03:00
Benny Halevy	f26e655646	compaction_manager: add maybe_wait_for_sstable_count_reduction Called from try_flush_memtable_to_sstable, maybe_wait_for_sstable_count_reduction will wait for compaction to catch up with memtable flush if the bucket to compact is inflated, having too many sstables. In that case we don't want to add fuel to the fire by creating yet another sstable. Fixes #4116 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 14:43:30 +03:00
Benny Halevy	69d4a16908	time_window_compaction_strategy: get_sstables_for_compaction: clean up code Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 14:22:03 +03:00
Benny Halevy	c450f3ee11	time_window_compaction_strategy: make get_sstables_for_compaction idempotent To make sure fully_expired sstables are not missed if get_sstables_for_compaction is called just heuristically, change the state by setting _last_expired_check to the current time only when no fully_expired_sstables are found among the candidates. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 14:22:03 +03:00
Benny Halevy	3d07882431	time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages Print the compaction_strategy `this` pointer so we can distinguish between different instances of the compaction_strategy object (some code paths copy it and some may instantiate a brand new compaction_strategy object). The motivation is detecting when the side effects of this function are applied on the "master" instance, stored in the table shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 14:22:03 +03:00
Benny Halevy	a149022ed4	leveled_manifest: pass compaction_counter as const& It is not modified by the leveled_manifest functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 14:22:03 +03:00
Botond Dénes	11985bb173	Update tools/java submodule * tools/java 1e7b872a61...ad6764b506 (1): > scylla-tools-java: Update "six" library used by cqlsh/python driver Closes #11148	2022-07-28 13:43:21 +03:00
Anna Stuchlik	4b0ec11136	update the info about installing scylla-enterprise-machine-image during upgrade	2022-07-28 12:07:01 +02:00
Anna Stuchlik	30f564f0e6	doc: replace the links to Monitoring Stack	2022-07-28 11:53:07 +02:00
Anna Stuchlik	fc71f6facb	doc: add the redirections for Monitoring Stack	2022-07-28 11:18:07 +02:00
Anna Stuchlik	0ce10be0a9	doc: delete the Monitoring Stack documentation from the ScyllaDB docs and remove it from the toctree	2022-07-28 11:09:43 +02:00
Nadav Har'El	cae1a41b30	Merge 'doc: update the README file in the docs directory' from Anna Stuchlik The purpose of this PR is to update the README file in the `docs` folder to: - Explain the contents of the folder (user docs vs developer docs). - Add more information to help contributors. - Remove outdated information. Closes #11134 * github.com:scylladb/scylla: docs: remove outdated information -Vale support, Lint, warning about livereload doc: improve the section about knowledge base articles in README doc: replace distribution names with a generic phrase: Linux distributions doc: remove irrelevant guidelines for contributors from README doc: language improvements in the doc's README doc: reorganize the content in the doc's README doc: update the Prerequisites section in the doc's README doc: remove redundant information from README in the docs folder doc: add key information to the introduction in README in the docs folder	2022-07-28 11:48:27 +03:00
Botond Dénes	26f1295536	Merge 'mutation: Ignore dummy rows when consuming clustering fragments' from Mikołaj Sielużycki consume_clustering_fragments already ignores dummy rows, but does it in the wrong place. Currently they're ignored after comparing them with range tombstones. This change skips them before any useful work is done with them. Consider a simplified mutation reversal scenario (ckp is clustering key prefix, -1, 0, 1 are bound_weights): schema_ptr s = schema_builder{"ks", "cf"} .with_column("pk", bytes_type, column_kind::partition_key) .with_column("ck1", bytes_type, column_kind::clustering_key) .build(); Input range tombstone positions: {clustered, ckp{}, before} {clustered, ckp{1}, after} Clustering rows: {clustered, ckp{2}, equal} {clustered, ckp{}, after} // dummy row During reversal, clustering rows are read backwards, and reversed range tombstone positions are read forwards (because the range tombstones are reversed and applied backwards). The read order in the example above is: Reversed range tombstone positions: 1: {clustered, ckp{}, before} 2: {clustered, ckp{1}, before} Clustering rows read backwards: 3: {clustered, ckp{}, after} // dummy row 4: {clustered, ckp{2}, equal} Then we effectively do the merge part of merge sort, trying to put all fragments in order according to their positions from the two lists above. However, the dummy row is used in the comparison, and it compares to be gt each of the reversed range tombstone positions. Then we try to emit the clustering row, but only at that point we notice it's dummy and should be skipped. Subsequent row with ckp{2} is compared to the last used range tombstone position and the fragments are out of order (in reversed schema, ckp{2} should come before ckp{1}). The solution is to move the logic skipping the dummy clustering rows to the beginning of the loop, so they can be ignored before they're used. Fixes: https://github.com/scylladb/scylla/issues/11147 Closes #11129 * github.com:scylladb/scylla: mutation: Add test if mutations are consumed in order test: Move validating_consumer to test/lib/mutation_assertions.hh mutation: Ignore dummy rows when consuming clustering fragments	2022-07-28 11:18:36 +03:00
Benny Halevy	f6645313d8	logalloc: region: properly track listeners when moved And add targeted unit tests for that. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 11:17:55 +03:00
Anna Stuchlik	da7f6cdec4	doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image	2022-07-28 09:53:27 +02:00
Benny Halevy	1d9862dab3	logalloc: region_impl: add moved method Don't open-code calling the region_impl _listeners->moved() in region move-constructor and move-assignment op. The other._impl->_region might be different than &other post region::merge so let the region_impl decide which region* is moved from. The new_region is also set to region_impl->_region so no need to open-code that either in the said call sites. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 10:49:49 +03:00
Benny Halevy	cd4dbb1cae	logalloc: region: merge: optimize getting other impl The other _impl is presumed to be engaged already, so just call other.get_impl() once for both use cases. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 10:49:36 +03:00
Benny Halevy	a547cb79e8	logalloc: region: merge: call region_impl::unlisten We can't be sure that the other_impl->_region == &other since it could be a result of a previous merge, so don't decide for it which region to unlisten to, let it use its current _region. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 10:49:27 +03:00
Benny Halevy	003216de59	logalloc: region: call unlisten rather than open coding it Current ~region and region::operator= open-code region_impl::unlisten. Just call it so it will be easier to maintain. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 10:49:11 +03:00
Benny Halevy	cff953535c	logalloc: region move-ctor: initialize _impl There's no need to default-initialize it and then move-assign it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 10:49:05 +03:00
Benny Halevy	c7d77e4076	logalloc: region: get_impl might be called on disengaged _impl when moved First check if _impl is engaged before accessing it to set its _region = this in the move constructor and move assignment operator. Add unit test for these odd corner cases. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-28 10:48:58 +03:00
Botond Dénes	079e425ef1	test/cql-pytest: add regression test for "IDL frame truncated" error	2022-07-28 09:02:28 +03:00
Botond Dénes	b23ce76b27	query: query_result_builder: add check for missing partition-end If the reader feeding the result builder is missing a partition-end between two partitions, or at end-of-stream, the result builder will write a corrupt partition-entry into the result, ending up in an "IDL frame truncated" error. It is trivial to add a check for this and this will result in a much clearer error message than the mysterious frame truncated error mentioned above.	2022-07-28 09:02:28 +03:00
Botond Dénes	f119554106	mutation_compactor: detach_state(): make it no-op if partition was exhausted detach_state() allows the user to resume a compaction process later, without having to keep the compactor object alive. This happens by generating and returning the mutation fragments the user has to re-feed to a newly constructed compactor to bring it into the exact same state the current compactor was at the point of stopping the compaction. This state includes the partition-header (partition-start and static-row if any) and the currently active range tombstone. Detaching the state is pointless however when the compaction was stopped such that the currently compacted partition was completely exhausted. Allowing the state to be detached in this case seems benign but it caused a subtle bug in the main user of this feature: the partition range scan algorithm, where the fragments included in the detached state were pushed back into the reader which produced them. If the partition happened to be exhausted -- meaning the next fragment in the reader was a partition-start or EOS -- this resulted in the partition being re-emitted later without a partition-end, resulting in corrupt query-result being generated, in turn resulting in an obscure "IDL frame truncated" error. This patch solves this seemingly benign but sinister bug by making the return value of `detach_state()` an std::optional and returning a disengaged optional when the partition was exhausted.	2022-07-28 09:02:26 +03:00
Botond Dénes	afa694a20c	querier: use full_position in shard_mutation_querier Instead of a separate partition key and position-in-partition. This continues the recently started effort to standardize storing of full positions on `full_position`. This patch is also a hidden preparation for read_context::save_readers() (multishard_mutation_query.cc) no longer being able to get partition key from compaction state in the future.	2022-07-28 08:19:23 +03:00
Anna Stuchlik	82f96327d4	docs: remove outdated information -Vale support, Lint, warning about livereload	2022-07-27 22:07:40 +02:00
Botond Dénes	c54d19427d	mutation_compactor: don't ignore consumer's stop request on range tombstone Broken since the v2 output support was introduced (`ad435dc`). No known adverse effects, besides mutation reads stopping a little later than desired (on the next non-range-tombstone-change fragment) and hence consuming more memory than the limit set for them. Fixes: #11138 Closes #11139	2022-07-27 22:24:29 +03:00
Avi Kivity	2c0932cc41	Merge 'Reduce the amount of per-table metrics' from Amnon Heiman This series is the first step in the effort to reduce the number of metrics reported by Scylla. The series focuses on the per-table metrics. The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode. The following series uses multiple tools to reduce the number of metrics. 1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added. 2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node. 3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas). Closes #11058 * github.com:scylladb/scylla: Add summary_test database: Reduce the number of per-table metrics replica/table.cc: Do not register per-table metrics for system histogram_metrics_helper.hh: Add to_metrics_summary function Unified histogram, estimated_histogram, rates, and summaries Split the timed_rate_moving_average into data and timer utils/histogram.hh: should_sample should use a bitmask estimated_histogram: add missing getter method	2022-07-27 22:01:08 +03:00
Avi Kivity	2d4caa0134	Update tools/java submodule * tools/java d0143b447c...1e7b872a61 (2): > scylla-tools-java: Update "six" library used by cqlsh/python driver > Add Scylla-specific table options to Option enum Fixes scylladb/scylla#10856.	2022-07-27 21:41:18 +03:00
Avi Kivity	4438865a26	Merge 'memtable flush error handling' from Benny Halevy The series unifies memtable flush error handling into table::seal_active_memtable following up on `f6d9d6175f`. The goal here is to prevent an infinite retry loop as in #10498 by aborting on any error that is not bad_alloc. Fixes #10498 Closes #10691 * github.com:scylladb/scylla: test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once test: memtable_test: failed_flush_prevents_writes: extend error injection table: seal_active_memtable: abort if retried for too long table: seal_active_memtable: abort on unexpected error table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable dirty_memory_manager: flush_permit: add has_sstable_write_permit dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept dirty_memory_manager: flush_permit: make _sstable_write_permit optional table: reindent seal_active_memtable table: coroutinize seal_active_memtable memtable_list: mark functions noexcept commitlog: make discard_completed_segments and friends noexcept dirty_memory_manager: flush_when_needed: target error handling at flush_one database: delete unused seal_delayed_fn_type dirty_memory_manager: mark functions noexcept memtable: mark functions noexcept memtable: memtable_encoding_stats_collector: mark functions noexcept encoding_state: mark functions noexcept logalloc: mark free functions noexcept logalloc: allocating_section: mark functions noexcept logalloc: allocating_section: guard: mark constructor noexcept logalloc: reclaim_lock: mark functions noexcept logalloc: tracker_reclaimer_lock: mark constructor noexcept logalloc: mark shard_tracker noexcept logalloc: region: mark functions const/noexcept logalloc: basic_region_impl: mark functions noexcept logalloc: region_impl: mark functions noexcept utils: log_heap: mark functions 
noexcept logalloc: region_impl: object_descriptor: mark functions noexcept logalloc: region_group: mark functions noexcept logalloc: tracker: mark functions const/noexcept logalloc: tracker::impl: make region_occupancy and friends const logalloc: tracker::impl: occupancy: get rid of reclaiming_lock logalloc: tracker::impl: mark functions noexcept logalloc: segment: mark functions const / noexcept logalloc: segment_pool: add const variant of descriptor method logalloc: segment_pool: move descriptor method to class definition logalloc: segment_pool: mark functions const/noexcept logalloc: segment_pool: delete unused free_or_restore_to_reserve method utils: dynamic_bitset: mark functions noexcept utils: dynamic_bitset: delete unused members logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const logalloc: segment_store: make can_allocate_more_segments const logalloc: segment_store: mark functions noexcept logalloc: segment_descriptor: mark functions noexcept logalloc: occupancy_stats: mark functions noexcept min_max_tracker: mark functions noexcept gc_clock, db_clock: mark functions noexcept dirty_memory_manager: region_group: mark functions noexcept dirty_memory_manager: region_group: make simple constructor noexcept dirty_memory_manager: region_group_reclaimer mark functions noexcept logalloc: lsa_buffer: mark functions noexcept	2022-07-27 19:08:59 +03:00
Amnon Heiman	3658aa9ec2	Add summary_test This patch adds unit tests for the summary implementation.	2022-07-27 16:58:52 +03:00
Amnon Heiman	99a060126d	database: Reduce the number of per-table metrics This patch reduces the number of metrics that is reported per table, when the per-table flag is on. When possible, it moves from time_estimated_histogram and timed_rate_moving_average_and_histogram to use the unified timer. Instead of a histogram per shard, it will now report a summary per shard and a histogram per node. Counters, histograms, and summaries will not be reported if they were never used. The API was updated accordingly so it would not break. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-27 16:58:52 +03:00
Amnon Heiman	c31a58f2e9	replica/table.cc: Do not register per-table metrics for system There is a set of per-table metrics that should only be registered for user tables. As time passes there are more keyspaces that are not for the user keyspace and there is now a function that covers all those cases. This patch replaces the implementation to use is_internal_keyspace. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-27 16:58:52 +03:00
Amnon Heiman	9a3e70adfb	histogram_metrics_helper.hh: Add to_metrics_summary function The to_metrics_summary is a helper function that creates a metrics type summary from a timed_rate_moving_average_with_summary object. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-27 16:58:52 +03:00
Amnon Heiman	c220e3a00f	Unified histogram, estimated_histogram, rates, and summaries Currently, there are two metrics reporting mechanisms: the metrics layer and the API. In most cases, they use the same data sources. The main difference is around histograms and rate. The API calculates an exponentially weighted moving average using a timer that decays the average on each time tick. It calculates a poor-man histogram by holding the last few entries (typically the last 256 entries). The caller to the API uses those last entries to build a histogram. We want to add summaries to Scylla. Similar to the API rate and histogram, summaries are calculated per time interval. This patch creates a unified mechanism by introducing an object that would hold both the old-style histogram and the new (estimated_histogram). On each time tick, a summary would be calculated. In the future, we'll replace the API to report summaries instead of the old-style histogram and deprecate the old style completely. summary_calculator uses two estimated_histogram to calculate a summary. timed_rate_moving_average_summary_and_histogram is a unified class for ihistogram, rates, summary, and estimated_histogram and will replace timed_rate_moving_average_and_histogram. Follow-up patches would move code from using timed_rate_moving_average_and_histogram to timed_rate_moving_average_summary_and_histogram. By keeping the API it would make the transition easy. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-27 16:58:25 +03:00
Avi Kivity	a03a33dcaf	Merge 'forward_service: reduce allocations in forward_service' from Piotr Sarna This series refactors the code to get rid of unnecessary allocations by extracting a helper requires_thread() function, as well as by removing std::optional usage in forward_result, now that it's possible to merge empty results with each other, both ways (#11064). Closes #11120 * github.com:scylladb/scylla: forward_service: remove redundant optional from forward_service forward_service: open-code running a Seastar thread forward_service: add requires_thread helper	2022-07-27 16:29:00 +03:00
Avi Kivity	71bec22117	Merge 'Speed up bootstrap with large number of tokens in the cluster 10X' from Asias He === Setup === 1) start node1 with ``` scylla --num-tokens 20000 --smp 1 ``` The large number of tokens per node is used to simulate large number of nodes in the cluster (large total number of tokens for the cluster). 2) start node2 with ``` scylla --num-tokens 20000 --smp 1 ``` 3) Measure the time to finish bootstrap === Result === 1) With speed up patch: ``` node1 (16s) INFO 2022-06-21 14:30:00,038 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ... INFO 2022-06-21 14:30:16,019 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed. node2 (bootstrap node,174s) INFO 2022-06-21 14:30:40,954 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ... INFO 2022-06-21 14:33:34,899 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed. ``` 2) Without speed up patch: ``` node1 (171s) INFO 2022-06-21 14:38:49,065 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ... INFO 2022-06-21 14:41:40,601 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed. node2 (bootstrap node, 1181s) INFO 2022-06-21 14:41:46,997 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ... INFO 2022-06-21 15:01:27,507 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed. 
``` The improvements for bootstrap time: node1: 171s / 16s = 10.68X node2: 1181s / 174s = 6.78X Refs #10337 Refs #10817 Refs #10836 Refs #10837 Closes #10850 * github.com:scylladb/scylla: locator: Speed up abstract_replication_strategy::get_address_ranges locator: Speed up simple_strategy::calculate_natural_endpoint token_metadata: Speed up count_normal_token_owners	2022-07-27 16:04:09 +03:00
Anna Stuchlik	b31cb94944	doc: improve the section about knowledge base articles in README	2022-07-27 13:49:53 +02:00
Raphael S. Carvalho	0796b8c97a	sstables: Enforce disjoint invariant in sstable_run We know that sstable_run is supposed to contain disjoint files only, but this assumption can temporarily break when switching strategies as TWCS, for example, can incorrectly pick the same run id for sstables in different windows during segregation. So when switching from TWCS to ICS, an sstable_run may end up not containing disjoint files. We should definitely fix TWCS and any other strategy doing that, but sstable_run should enforce disjointness as an actual invariant rather than relax it. Otherwise, we cannot build readers on this assumption, so more complicated logic would have to be added to merge overlapping files. After this patch, sstable_run will reject insertion of a file that would cause the invariant to break, so the caller will have to check that and push that file into a different sstable run. Closes #11116	2022-07-27 14:48:28 +03:00
Anna Stuchlik	ecf1633cb3	doc: replace distribution names with a generic phrase: Linux distributions	2022-07-27 13:28:05 +02:00
Anna Stuchlik	456f2f7c47	doc: remove irrelevant guidelines for contributors from README	2022-07-27 13:18:20 +02:00
Benny Halevy	bb9eddc67f	test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once Now that memtable flush error handling was moved entirely to table::seal_active_memtable, we don't need to notify_soft_pressure to keep retry going. The infinite retry loop should eventually either succeed or die (by isolating the node or aborting) on its own. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:06:59 +03:00
Benny Halevy	b5abbb971f	test: memtable_test: failed_flush_prevents_writes: extend error injection Inject errors into all seal_active_memtable distinct error handling sites. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:06:59 +03:00
Benny Halevy	a5911619c0	table: seal_active_memtable: abort if retried for too long If we haven't been able to flush the memtable in ~30 minutes (based on the number of retries) just abort assuming that the OOM condition is permanent rather than transient. Refs #4344 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:06:59 +03:00
Benny Halevy	bc18f750c6	table: seal_active_memtable: abort on unexpected error Currently when we can't write the flushed sstable due to corruption in the memtable we get into an infinite retry loop (see #10498). Until we can go into maintenance mode, the next best thing would be to abort, though there is still a risk that commitlog replay will reproduce the corruption in the memtable and we'd end up with an infinite crash loop. (hence #10498 is not Fixed with this patch) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:06:57 +03:00
Benny Halevy	f0a597a252	table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable And let seal_active_memtable decide about how to handle them as now all flush error handling logic is implemented there. In particular, unlike today, sstable write errors will cause internal error rather than loop forever. Also, check for shutdown earlier to ignore errors like semaphore_broken that might happen when the table is stopped. Refs #10498 (The issue will be considered fixed when going into maintenance mode on write errors rather than throwing internal error and potentially retrying forever) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:04:55 +03:00
Benny Halevy	d55a2ac762	dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable Currently flush is retried both by dirty_memory_manager::flush_when_needed and table::seal_active_memtable, which may be called by other paths like table::flush. Unify the retry logic into seal_active_memtable so that we have similar error handling semantics on all paths. Refs #4174 Refs #10498 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	93f835a2dd	dirty_memory_manager: flush_permit: add has_sstable_write_permit after release_sstable_write_permit is called, _sstable_write_permit will have no value. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	b3dcc77c66	dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept Neither exchanging the std::optional nor moving the sstable_write_permit throws. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	53355eb95d	dirty_memory_manager: flush_permit: make _sstable_write_permit optional So we can safely test whether it was released or not by release_sstable_write_permit in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	67479e4243	table: reindent seal_active_memtable	2022-07-27 13:43:17 +03:00
Benny Halevy	00941452d5	table: coroutinize seal_active_memtable As a first step to making it robust using state machine driven retries. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	d3acd80cf5	memtable_list: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	5991482049	commitlog: make discard_completed_segments and friends noexcept To simplify table::seal_active_memtable error handling and retry logic. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	863e9d9e6a	dirty_memory_manager: flush_when_needed: target error handling at flush_one Now that everything prior to flush_one is noexcept make table::seal_active_memtable and the paths that call it noexcept, making sure that any errors are returned only as exceptional futures, and handle them in flush_when_needed(). The original handle_exception had a broader scope than now needed, so this change is mostly technical, to show that we can narrow down the error handling to the continuation of flush_one - and verify that the unit test is not broken. A later patch moves this error handling logic away to seal_active_memtable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	73e50bc97d	database: delete unused seal_delayed_fn_type Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	73e5cd0448	dirty_memory_manager: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	fcb3347c7a	memtable: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	2d1ba0d7d8	memtable: memtable_encoding_stats_collector: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	ad85e720f9	encoding_state: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	6e961ead3b	logalloc: mark free functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	705b42efe2	logalloc: allocating_section: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	f9db708376	logalloc: allocating_section: guard: mark constructor noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	5416808367	logalloc: reclaim_lock: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	95b0e41abb	logalloc: tracker_reclaimer_lock: mark constructor noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	ed9e036509	logalloc: mark shard_tracker noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	d6e6ffc741	logalloc: region: mark functions const/noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	2beee4a6cd	logalloc: basic_region_impl: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	3ba85c3bbd	logalloc: region_impl: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	d838456be2	utils: log_heap: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	3f96818c03	logalloc: region_impl: object_descriptor: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	0866548b27	logalloc: region_group: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	fe50c76dbc	logalloc: tracker: mark functions const/noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:40:50 +03:00
Benny Halevy	71c21a83ad	logalloc: tracker::impl: make region_occupancy and friends const Now that they don't modify the tracker impl. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:40:18 +03:00
Benny Halevy	1c0c01cc24	logalloc: tracker::impl: occupancy: get rid of reclaiming_lock It was added in `d20fae96a2` as a precaution not to invalidate iterators while traversing _regions. However it is not required as no allocation is done on this synchronous path - therefore there is no point in preventing reclaim. This will allow making the respective functions const as they merely return stats and do not modify the tracker impl. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:39:18 +03:00
Benny Halevy	888e225113	logalloc: tracker::impl: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:39:16 +03:00
Benny Halevy	f0027f60d4	logalloc: segment: mark functions const / noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:34:56 +03:00
Benny Halevy	830912cfa0	logalloc: segment_pool: add const variant of descriptor method Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:34:48 +03:00
Benny Halevy	f318d1664e	logalloc: segment_pool: move descriptor method to class definition To make the implementation inline and to prepare for the next patch that adds a const overload of this method. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:34:37 +03:00
Benny Halevy	35899463d4	logalloc: segment_pool: mark functions const/noexcept Some methods were also marked inline when declared in the class definition and at their definition site to provide a hint to the compiler to inline them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:33:47 +03:00
Benny Halevy	02e74696f2	logalloc: segment_pool: delete unused free_or_restore_to_reserve method Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:33:21 +03:00
Benny Halevy	00dae56e19	utils: dynamic_bitset: mark functions noexcept dynamic_bitset allocates only when constructed; from then on it doesn't throw. Note, though, that accessing bits out of range is undefined behavior. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:32:36 +03:00
Benny Halevy	d911d03344	utils: dynamic_bitset: delete unused members Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:32:08 +03:00
Benny Halevy	da87a4a248	logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload To maintain the const chain from segment via segment_store to segment_pool. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:28:21 +03:00
Benny Halevy	947f71ce91	logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const Maintain the const chain by returning a const segment* from segment_from_idx() const overload. And add a respective mutable overload to return a mutable segment*. This is done for a similar change in idx_from_segment. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:27:30 +03:00
Benny Halevy	17902da66c	logalloc: segment_store: make can_allocate_more_segments const Add a const noexcept overload of `find_empty()` so that can_allocate_more_segments can be const noexcept as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:26:07 +03:00
Benny Halevy	2ae61d5209	logalloc: segment_store: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:19:58 +03:00
Benny Halevy	852c23b97a	logalloc: segment_descriptor: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:18:15 +03:00
Benny Halevy	a49619a601	logalloc: occupancy_stats: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:17:43 +03:00
Benny Halevy	721e94dcf1	min_max_tracker: mark functions noexcept Based on tracked types being nothrow copy and move constructible. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:17:27 +03:00
Benny Halevy	b5f9a3d44e	gc_clock, db_clock: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:17:01 +03:00
Benny Halevy	724692e7f4	dirty_memory_manager: region_group: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:16:02 +03:00
Benny Halevy	6aaec0928a	dirty_memory_manager: region_group: make simple constructor noexcept By std::moving its sstring name arg. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:13:33 +03:00
Benny Halevy	c386339730	dirty_memory_manager: region_group_reclaimer mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:06:32 +03:00
Mikołaj Sielużycki	9f5655bb97	mutation: Add test if mutations are consumed in order It explicitly interleaves clustering rows with range tombstones and ensures the last clustering row is dummy.	2022-07-27 11:22:55 +02:00
Mikołaj Sielużycki	9c43f1266a	test: Move validating_consumer to test/lib/mutation_assertions.hh	2022-07-27 11:19:50 +02:00
Anna Stuchlik	b3648c1403	doc: language improvements in the doc's README	2022-07-27 10:28:59 +02:00
Anna Stuchlik	d4bc030705	doc: reorganize the content in the doc's README	2022-07-27 10:18:01 +02:00
Anna Stuchlik	e989eed75b	doc: update the Prerequisites section in the doc's README	2022-07-27 10:03:09 +02:00
Anna Stuchlik	48703adf59	doc: remove redundant information from README in the docs folder	2022-07-27 09:56:24 +02:00
Anna Stuchlik	3f097f3285	doc: add key information to the introduction in README in the docs folder	2022-07-27 09:50:13 +02:00
Mikołaj Sielużycki	09da47d87e	mutation: Ignore dummy rows when consuming clustering fragments consume_clustering_fragments already ignores dummy rows, but does it in the wrong place. Currently they're ignored after comparing them with range tombstones. This change skips them before any useful work is done with them. Consider a simplified mutation reversal scenario scenario (ckp is clustering key prefix, -1, 0, 1 are bound_weights): schema_ptr s = schema_builder{"ks", "cf"} .with_column("pk", bytes_type, column_kind::partition_key) .with_column("ck1", bytes_type, column_kind::clustering_key) .build(); Range tombstones: range_tombstone rt1{ckp{}, bound_kind::incl_start, ckp{1}, bound_kind::incl_end, tombstone{ts + 0, tp}}; range_tombstone rt2{ckp{1}, bound_kind::excl_start, ckp{}, bound_kind::incl_end, tombstone{ts + 1, tp}}; Input range tombstone positions: {clustered, ckp{}, before} {clustered, ckp{1}, after} Clustering rows: {clustered, ckp{2}, equal} {clustered, ckp{}, after} // dummy row During reversal, clustering rows are read backwards, and reversed range tombstone positions are read forwards (because the range tombstones are reversed and applied backwards). Position of rows is not reversed, as regular rows always have equal positions (which does not hold for dummy rows, which causes the problem in this case). The read order in the example above is: Reversed range tombstone positions: 1: {clustered, ckp{}, before} 2: {clustered, ckp{1}, before} Clustering rows read backwards: 3: {clustered, ckp{}, after} // dummy row 4: {clustered, ckp{2}, equal} Then we effectively do the merge part of merge sort, trying to put all fragments in order according to their positions from the two lists above. However, the dummy row is used in the comparison, and it compares to be gt each of the reversed range tombstone positions. Then we try to emit the clustering row, but only at that point we notice it's dummy and should be skipped. 
Subsequent row with ckp{2} is compared to the last used range tombstone position and the fragments are out of order (in reversed schema, ckp{2} should come before ckp{1}). The solution is to move the logic skipping the dummy clustering rows to the beginning of the loop, so they can be ignored before they're used.	2022-07-27 09:32:56 +02:00
Benny Halevy	a6356539bf	logalloc: lsa_buffer: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 10:22:35 +03:00
Botond Dénes	81e20ceaab	Merge 'logalloc, dirty_memory_manager: move region_groups to dirty_memory_manager' from Avi Kivity logalloc manages regions of log-structured allocated memory, and region_groups containing such regions and other region_groups. region_groups were introduced for accounting purposes - first to limit the amount of memory in memtables, then to match new dirty memory allocation rate with memtable flushing rate so we never hit a situation where allocation rate exceeded flush rate, and we exceed our limit. The problem is that the abstraction is very weak - if we want to change anything in memtable flush control we'll need to change region_groups too - and also expensive to maintain. The solution is to break the abstraction and move region_groups to memtable dirty memory management code. Instead introduce a new, simpler abstraction, the region_listener, which communicates changes in region memory consumption to an external piece of code, which can then choose to do with it what it likes. The long term plan is to completely remove region_groups and fold them into dirty_memory_manager: - make each memtable a region_listener so it gets called back after size changes - make memtables inform their dirty_memory_manager about the size so dirty_memory_manager can decide to throttle writes and which memtable to pick to flush Closes #10839 * github.com:scylladb/scylla: logalloc: drop region_impl public accessors logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc logalloc: relax lifetime rules around region_listener logalloc, dirty_memory_manager: move region_group and associated code logalloc: expose tracker_reclaimer_lock logalloc: reimplement tracker_reclaim_lock to avoid using hidden classes logalloc: reduce friendship between region and region_group logalloc: decouple region_group from region memtable: stop using logalloc::region::group() to test for flushed memtables	2022-07-26 17:08:37 +03:00
Amnon Heiman	72414b613b	Split the timed_rate_moving_average into data and timer This patch splits the timed_rate_moving_average functionality into two: a data class, rates_moving_average, and a wrapper class, timed_rate_moving_average, that uses a timer to update the rates periodically. To make the transition as simple as possible, timed_rate_moving_average takes the original API. A new helper class meter_timer was introduced to handle the timer update functionality. This change required minimal code adaptation in some other parts of the code. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-26 15:59:33 +03:00
Amnon Heiman	5bf51ed4af	utils/histogram.hh: should_sample should use a bitmask This patch fixes a bug in should_sample that uses its bitmask incorrectly. basic_ihistogram has a feature that allows it to sample values instead of taking a timer each time. To decide if it should sample or not, it uses a bitmask. The bitmask is of the form 2^n-1, which means 1 out of 2^n will be sampled. For example, if the mask is 0x1 (2^1-1) 1 out of 2 will be sampled. If the mask is 0x7 (2^3-1) 1 out of 8 will be sampled. There was a bug in the should_sample() method. The correct form is (value&mask) == mask Ref #2747 It does not solve all of #2747, just the bug part of it. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-26 15:59:33 +03:00
Amnon Heiman	99bc6d882b	estimated_histogram: add missing getter method This patch adds the square bracket operator method that was missing.	2022-07-26 15:59:33 +03:00
Anna Stuchlik	844c875f15	doc: add info about the time-consuming step due to resharding	2022-07-26 14:52:11 +02:00
Nadav Har'El	cb8a67dc98	Merge 'Allow materialized views to be synchronous' from Piotr Sarna This pull request introduces a "synchronous mode" for global views. In this mode, all view updates are applied synchronously as if the view was local. Marking a view as synchronous can be done using `CREATE MATERIALIZED VIEW` and `ALTER MATERIALIZED VIEW`. E.g.: ```cql ALTER MATERIALIZED VIEW ks.v WITH synchronous_updates = true; ``` Marking a view as synchronous is implemented using tags (originally used by alternator). No big modifications in the view's code were needed. Fixes: https://github.com/scylladb/scylla/issues/10545 Closes #11013 * github.com:scylladb/scylla: cql-pytest: extend synchronous mv test with new cases cql-pytest: allow extra parameters in new_materialized_view docs: add a paragraph on view synchronous updates test/boost/cql_query_test: add test setting synchronous updates property test: cql-pytest: add a test for synchronous mode materialized views db: view: react to synchronous updates tag cql3: statements: cf_prop_defs: apply synchronous updates tag alternator, db: move the tag code to db/tags cql3: statements: add a synchronous_updates property	2022-07-26 15:42:51 +03:00
Alejo Sanchez	5014bd0d51	test.py: add missing CQL test discovery Add missing build_test_list() to CQLApproval test. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11124	2022-07-26 15:40:04 +03:00
David Garcia	18c7006ac5	doc: Remove unused redirects Closes #11112	2022-07-26 14:15:45 +03:00
Asias He	6152f5b858	locator: Speed up abstract_replication_strategy::get_address_ranges To get the list of tokens for a given node, we loop through all the tokens and calculate the nodes that are responsible for the token. In case of the everywhere_topology, we know any node that is part of the ring will be responsible for all tokens. This patch adds a fast path for everywhere_topology to avoid calculating natural endpoints. Refs #10337 Refs #10817 Refs #10836 Refs #10837	2022-07-26 18:53:09 +08:00
Asias He	9a8a80527b	locator: Speed up simple_strategy::calculate_natural_endpoint If the number of nodes in the cluster is smaller than the desired replication factor, we should return from the loop once endpoints already contains all the nodes in the cluster, because no more nodes can be added to the endpoints list. Refs #10337 Refs #10817 Refs #10836 Refs #10837	2022-07-26 18:53:09 +08:00
Asias He	4c714dfe3b	token_metadata: Speed up count_normal_token_owners Currently, a set of nodes is built from _token_to_endpoint_map to get the number of nodes in _token_to_endpoint_map. To make it faster so we can call it on a fast path in the following patch, a _nr_normal_token_owners member is introduced to track the number. Refs #10337 Refs #10817 Refs #10836 Refs #10837	2022-07-26 18:53:09 +08:00
Pavel Emelyanov	40d6ea973c	snitch: Remove reconnectable snitch helper It's now no-op Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-26 13:51:05 +03:00
Pavel Emelyanov	b91f7e9ec4	snitch, storage_service: Move reconnect to internal_ip kick The same thing as in previous patch -- when gossiper issues on_join/_change notification, storage service can kick messaging service to update its internal_ip cache and reconnect to the peer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-26 13:48:46 +03:00
Pavel Emelyanov	1bf8b0dd92	snitch, storage_service: Move system.peers preferred_ip update Currently the INTERNAL_IP state is updated using reconnectable helper by subscribing on on_join/on_change events from gossiper. The same subscription exists in storage service (it's a bit more elaborated by checking if the node is the part of the ring which is OK). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-26 13:48:46 +03:00
Pavel Emelyanov	0abd2c1e52	snitch: Export prefer-local The boolean bit says whether "the system" should prefer connecting to the address gossiped around via INTERNAL_IP. Currently only gossiping property file snitch allows to tune it and ec2-multiregion snitch prefers internal IP unconditionally. So exporting consists of 2 pieces: - add prefer_local() snitch method that's false by default or returns the (existing) _prefer_local bit for production snitch base - set the _prefer_local to true by ec2-multiregion snitch While at it the _prefer_local is moved to production_snitch_base for uniformity with the new prefer_local() call Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-26 13:48:04 +03:00
Piotr Sarna	abc5a7b7ec	forward_service: remove redundant optional from forward_service This commit refactors the code to get rid of unnecessary std::optional usage in forward_result, since now it's possible to merge empty results with each other, both ways (#11064).	2022-07-26 12:02:55 +02:00
Alejo Sanchez	2e39642728	test.py: fix log handling on error for Python and CQL tests Fix mixing of log filename and log summary in error reporting for CQLApprovalTest and PythonTest. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11125	2022-07-26 11:38:14 +03:00
Alejo Sanchez	302b703efe	test.py: remove unused global random_tables gets the keyspace from caller so remove leftover counter. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11123	2022-07-26 11:35:40 +03:00
Anna Stuchlik	38c2c4f7df	doc: update the info about metrics in 2022.1 compared to 5.0	2022-07-26 10:25:26 +02:00
Avi Kivity	5b541bed72	logalloc: drop region_impl public accessors With the region heap handle removed from logalloc::region, there is nothing remaining there that needs violation of the abstraction boundary, so we can drop these hacks.	2022-07-26 11:12:10 +03:00
Avi Kivity	2cb5f79e9d	logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc The region_group mechanism used an intrusive heap handle embedded in logalloc::region to allow region_group:s to track the largest region. But with region_group moved out of logalloc, the handle is out of place. Move it out, introducing a new intermediate class size_tracked_region to hold the heap handle. We might eventually merge the new class into memtable (which derives from it), but that requires a large rearrangement of unit tests, so defer that.	2022-07-26 11:12:10 +03:00
Avi Kivity	ee720fa23b	logalloc: relax lifetime rules around region_listener Currently, a region_listener is added during construction and removed during destruction. This was done to mimic the old region(region_group&) constructor, as region_listener replaces region_group. However, this makes moving the binomial heap handle outside logalloc difficult. The natural place for the handle is in a derived class of logalloc::region (e.g. memtable), but members of this derived class will be destroyed earlier than the logalloc::region here. We could play tricks with an earlier base class but it's better to just decouple region lifecycle from listener lifecycle. Do that by adding listen()/unlisten() methods. Some small awkwardness remains in that merge() implicitly unlistens (see comment in region::unlisten). Unit tests are adjusted.	2022-07-26 11:12:10 +03:00
Avi Kivity	fbe8ea7727	logalloc, dirty_memory_manager: move region_group and associated code region_group is an abstraction that allows accounting for groups of regions, but the cost/benefit ratio of maintaining the abstraction is poor. Each time we need to change the decision algorithm of memtable flushing (admittedly rarely), we need to distill that into an abstraction for region_groups and then use it. An example is virtual region groups; we wanted to account for the partially flushed memtables and had to invent region groups to stand in their place. Rather than continuing to invest in the abstraction, break it now and move it to the memtable dirty memory manager which is responsible for making those decisions. The relevant code is moved to dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and a new unit test file is added as well. A downside of the change is that unit testing will be more difficult.	2022-07-26 11:12:10 +03:00
Avi Kivity	bffee2540f	logalloc: expose tracker_reclaimer_lock tracker_reclaimer_lock is used by region_group, which is being moved out of logalloc, so expose it.	2022-07-26 11:12:10 +03:00
Avi Kivity	4ba0658670	logalloc: reimplement tracker_reclaim_lock to avoid using hidden classes Right now tracker_reclaim_lock uses tracker::impl::reclaiming_lock, which won't be visible if we want to expose tracker_reclaim_lock and use it from another translation unit. However, it's simple to switch to an implementation that doesn't require an unknown-size data member, and instead increment a counter via a pointer, so do that.	2022-07-26 11:12:10 +03:00
Avi Kivity	652ab6f4a2	logalloc: reduce friendship between region and region_group - add conversions between region and region_impl - add accessor for the binomial heap handle - add accessor for region_impl::id() - remove friend declarations This helps in moving region_group to a different source file, where the definitions of region_impl will not be visible.	2022-07-26 11:12:10 +03:00
Avi Kivity	c91ee9d04e	logalloc: decouple region_group from region As a first step in moving region_group away from logalloc, decouple communications between region and region_group. We introduce region_listener, that listens for the events that region passed directly to region_group. A region_group now installs a region_listener in a region, instead of having region know about the region_group directly. This decoupling is still leaky: - merge() chooses to forget the merged-from region's region_listener. This happens to be suitable for the only user of merge(). - We're still embedding the binomial heap handle, used by region_group to keep track of region sizes, in regions. A complete decoupling would transfer that responsibility to region_group.	2022-07-26 11:12:03 +03:00
Avi Kivity	cb1251199a	memtable: stop using logalloc::region::group() to test for flushed memtables Currently, the memtable reader uses logalloc::region::group() to test for whether a memtable has been flushed. If a memtable doesn't belong to a region group (from dirty_memory_manager), it is flushed. This is quite tortuous - logalloc::region::merge() makes the merged-from region identical to the merged-to region. The merged-to region, the cache, doesn't have a group, so the check works. Since we're making region groups part of dirty_memory_manager, the cache will no longer have this indirect way of communication with memtable. But instead we can use a direct callback it already has - on_detach_from_region_group(). Use that to set a flag, and examine it in the read path.	2022-07-26 11:07:25 +03:00
David Garcia	5067de6d3f	docs: Fix broken links Closes #11092	2022-07-26 10:53:17 +03:00
Piotr Sarna	626fb75949	forward_service: open-code running a Seastar thread Previous interface forced the caller to allocate forward_aggregates in order to be able to conditionally run the merging code inside a Seastar thread, which is suboptimal. By open-coding the condition, it's possible to drop the do_with, saving an allocation.	2022-07-26 08:10:47 +02:00
Piotr Sarna	e8f2565371	forward_service: add requires_thread helper It will be needed later to be able to decide if seastar thread is needed for merging forward service results.	2022-07-26 08:10:47 +02:00
Avi Kivity	29c28dcb0c	Merge 'Unstall get_range_to_address_map' from Benny Halevy Prevent stalls in this path as seen in performance testing. Also, add a respective rest_api test. Fixes #11114 Closes #11115 * github.com:scylladb/scylla: storage_service: reserve space in get_range_to_address_map and friends storage_service: coroutinize get_range_to_address_map and friends storage_service: pass replication map to get_range_to_address_map and friends storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer test: rest_api: test range_to_endpoint_map and describe_ring	2022-07-25 18:06:28 +03:00
Piotr Sarna	c195ce1b82	query: allow merging non-empty forward_result with an empty one Merging empty results was already allowed, but in one way only: empty.merge(nonempty, r); // was permitted nonempty.merge(empty, r); // not permitted With this commit, both methods are permitted. In order to remove copying, the other result is now taken by rvalue reference, with all call sites being updated accordingly. Fixes #10446 Fixes #10174 Closes #11064	2022-07-25 18:06:28 +03:00
Benny Halevy	bc5f6cf45d	storage_service: reserve space in get_range_to_address_map and friends To reduce the chance of reallocation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Avi Kivity	4fde9414dc	Merge 'logalloc reclaim_timer improvements' from Michael Livshin * round up reported time to microseconds * add backtrace if stall detected * add call site name (hierarchical when timers are nested) * put timers in more places * reduce possible logspam in nested timers by making sure to report on things only once and to not report on durations smaller than those already reported on Closes #10576 * github.com:scylladb/scylla: utils: logalloc: fix indentation utils: logalloc: split the reclaim_timer in compact_and_evict_locked() utils: logalloc: report segment stats if reclaim_segments() times out utils: logalloc: reclaim_timer: add optional extra log callback utils: logalloc: reclaim_timer: report non-decreasing durations utils: logalloc: have reclaim_timer print reserve limits utils: logalloc: move reclaim timer destructor for more readability utils: logalloc: define a proper bundle type for reclaim_timer stats utils: logalloc: add arithmetic operations to segment_pool::stats utils: logalloc: have reclaim timers detect being nested utils: logalloc: add more reclaim_timers utils: logalloc: move reclaim_timer to compact_and_evict_locked utils: logalloc: pull reclaim_timer definition forward utils: logalloc: reclaim_timer make tracker optional utils: logalloc: reclaim_timer: print backtrace if stall detected utils: logalloc: reclaim_timer: get call site name utils: logalloc: reclaim_timer: rename set_result utils: logalloc: reclaim_timer: rename _reserve_segments member utils: logalloc: reclaim_timer round up microseconds	2022-07-25 18:06:28 +03:00
Benny Halevy	5eb31eff64	storage_service: coroutinize get_range_to_address_map and friends And add calls to maybe_yield to prevent stalls in this path as seen in performance testing. Also, add a respective rest_api test. Fixes #11114 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Tomasz Grabiec	76d20aeb96	Merge 'Refactor group 0 operations (joining, leaving, removing).' from Kamil Braun A series of refactors to the `raft_group0` service. Read the commits in topological order for best experience. This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through. Closes #11024 * github.com:scylladb/scylla: service: storage_service: additional assertions and comments service/raft: raft_group0: additional logging, assertions, comments service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0` service/raft: raft_group0: rewrite `remove_from_group0` service/raft: raft_group0: rewrite `leave_group0` service/raft: raft_group0: split `leave_group0` from `remove_from_group0` service/raft: raft_group0: introduce `setup_group0` service/raft: raft_group0: introduce `load_my_addr` service/raft: raft_group0: make some calls abortable service/raft: raft_group0: remove some temporary variables service/raft: raft_group0: refactor `do_discover_group0`. service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0` service/raft: raft_group0: extract `start_server_for_group0` function service/raft: raft_group0: create a private section service/raft: discovery: `seeds` may contain `self`	2022-07-25 18:06:28 +03:00
Benny Halevy	3d62a1592f	storage_service: pass replication map to get_range_to_address_map and friends Before they are made asynchronous in the next patch, so they work on a coherent snapshot of the token_metadata and replication map as their caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Petr Gusev	52142bb8b3	raft_group_registry, is_alive for non-existent server_id We could yield between updating the list of servers in raft/fsm and updating the raft_address_map, e.g. in case of a set_configuration. If tick_leader happens before the raft_address_map is updated, is_alive will be called with server_id that is not in the map yet. Fix: scylladb/scylla-dtest#2753 Closes #11111	2022-07-25 18:06:28 +03:00
Benny Halevy	0b474866a3	storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer It is only needed for the "storage_service/describe_ring" api and service/storage_service shouldn't bother with it. It's an api sugar coating. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Yaron Kaikov	c42c5111eb	SCYLLA-VERSION-GEN: use semver-compatible version Setting Scylla to use semantic versioning. (Ref: https://semver.org/) Closes: https://github.com/scylladb/scylla/issues/9543 Closes #10957	2022-07-25 18:06:28 +03:00
Benny Halevy	429f110110	test: rest_api: test range_to_endpoint_map and describe_ring Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Nadav Har'El	85688a7a7e	Merge 'cql3: grammar: make the whereClause production return a single expression' from Avi Kivity Currently, the WHERE clause grammar is constrained to a conjunction of relations: `WHERE a = ? AND b = ? AND c > ?`. The restriction happens in three places: 1. the grammar will refuse to parse anything else 2. our filtering code isn't prepared for generic expressions 3. the interface between the grammar and the rest of the cql3 layer is via a vector of terms rather than an expression While most of the work will be in extending the filtering code, this series tackles the interface; it changes the `whereClause` production to return an expression rather than a vector. Since much of cql3 layer is interested in terms, a new boolean_factors() function is introduced to convert an expression to its boolean terms. Closes #11105 * github.com:scylladb/scylla: cql3: grammar: make where clause return an expression cql3: util: deinline where clause utilities cql3: util: change where clause utilities to accept a single expression rather than a vector of terms cql3: statement_restrictions: accept a single expression rather than a vector cql3: statement_restrictions: merge `if` and `for` cql3: select_statement: remove wrong but harmless std::move() in prepare_restrictions cql3: expr: add boolean_factors() function to factorize an expression cql3: expression: define operator==() for expressions cql3: values: add operator==() for raw_value	2022-07-25 18:06:28 +03:00
Anna Stuchlik	ff5c4a33f5	doc: add the new KB to the toctree	2022-07-25 14:29:33 +02:00
Anna Stuchlik	f1daef4b1b	doc: add a KB about updating the mode in perftune.yaml after upgrade	2022-07-25 14:22:02 +02:00
Piotr Sarna	277aa30965	cql-pytest: extend synchronous mv test with new cases The new cases cover: - a materialized view created with synchronous updates from the start - a materialized view created with synchronous updates, but then altered to no longer have synchronous updates	2022-07-25 10:00:28 +02:00
Piotr Sarna	52f5ba16dc	cql-pytest: allow extra parameters in new_materialized_view The extra parameters can include a WITH clause.	2022-07-25 10:00:28 +02:00
Piotr Sarna	43c09eb9e6	docs: add a paragraph on view synchronous updates The paragraph explains what synchronous view updates are and how to set them up.	2022-07-25 10:00:28 +02:00
Michał Sala	c7b78cfd81	test/boost/cql_query_test: add test setting synchronous updates property The test checks if a synchronous_updates property can be set via ALTER MATERIALIZED VIEW or CREATE MATERIALIZED VIEW statements.	2022-07-25 09:53:33 +02:00
Michał Sala	2993bbc33b	test: cql-pytest: add a test for synchronous mode materialized views The test verifies if a synchronous updates code path was triggered in a view that had synchronous_updates property set to true. Done by inspecting query traces.	2022-07-25 09:53:33 +02:00
Michał Sala	d573ab0b58	db: view: react to synchronous updates tag Code that waited for all remote view updates was already there. This commit modifies the conditions of this wait to take into account the "synchronous mode" (enabled when db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY is set).	2022-07-25 09:53:33 +02:00
Michał Sala	128806f022	cql3: statements: cf_prop_defs: apply synchronous updates tag This commit defines a new tag key (SYNCHRONOUS_VIEW_UPDATES_TAG_KEY) to be used for marking "synchronous mode" views. This key is used in `cf_prop_defs::apply_to_builder` if the properties contain KW_SYNCHRONOUS_UPDATES.	2022-07-25 09:53:33 +02:00
Michał Sala	041cb77ad0	alternator, db: move the tag code to db/tags Tags are a useful mechanism that could be used outside of alternator namespace. My motivation to move tags_extension and other utilities to db/tags/ was that I wanted to use them to mark "synchronous mode" views. I have extracted the `get_tags_of_table`, `find_tag` and `update_tags` methods to db/tags/utils.cc and moved alternator/tags_extension.hh to db/tags/. The signature of `get_tags_of_table` was changed from `const std::map<sstring, sstring>&` to `const std::map<sstring, sstring>*`. The original behavior of this function was to throw an `alternator::api_error` exception. This was undesirable, as it introduced a dependency on the alternator module. I chose to change it to return a potentially null value, and added a wrapper function to the alternator module - `get_tags_of_table_or_throw` - to keep the previous throwing behavior.	2022-07-25 09:53:33 +02:00
Michał Sala	494e7fc5f5	cql3: statements: add a synchronous_updates property This property can be used with CREATE MATERIALIZED VIEW and ALTER MATERIALIZED VIEW statements. Setting it allows global views to enter "synchronous mode". In this mode, all view updates are also applied synchronously as if the view was local. This may reduce their availability, but has the benefit of propagating a potential inconsistency risk (in form of a write error) to the user, who can respond to it appropriately (e.g. retry the write or fix the view later).	2022-07-25 09:53:33 +02:00
Botond Dénes	b673b4bee3	Merge 'let scylla-gdb.py recognize coroutines' from Michael Livshin "scylla task_histogram" and "scylla fiber" will now show coroutine "promises". Refs #10894 Closes #11071 * github.com:scylladb/scylla: test: gdb: test that "task_histogram -a" finds some coroutines scylla-gdb.py: recognize coroutine-related symbols as task types scylla-gdb.py: whitelist the .text section for task "vtables" scylla-gdb.py: fix an error message	2022-07-25 06:48:45 +03:00
Nadav Har'El	f1e3494a10	cql-pytest: fix a test to not fail on very slow machines The cql-pytest cassandra_tests/validation/operations/select_test.py:: testSelectWithAlias uses a TTL but not because it wants to test the TTL feature - it just wants to check the SELECT aliasing feature. The test writes a TTL of 100 and then reads it back using an alias. We would normally expect to read back 100 or 99, but to guard against a very slow test machine, the test verified that we read back something between 70 and 100. I thought that allowing a ridiculous 30 second delay between the write and the read requests was more than enough. But in one run of the aarch64 debug build, this ridiculous 30 seconds wasn't ridiculous enough - the delay ended up 35 seconds, and the test failed! So in this patch, I just make it even more ridiculous - we write 1000 and expect to read something over 100 - allowing a 900 second delay in the test. Note that neither the earlier 30-second nor the current 900-second delay slows down the test in any way - this test will normally complete in milliseconds. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11085	2022-07-24 21:17:59 +03:00
Avi Kivity	9823e75d16	cql3: grammar: make where clause return an expression In preparation of the relaxation of the grammar to return any expression, change the whereClause production to return an expression rather than terms. Note that the expression is still constrained to be a conjunction of relations, and our filtering code isn't prepared for more. Before the patch, if the WHERE clause was optional, the grammar would pass an empty vector of expressions (which is exactly correct). After the patch, it would pass a default-constructed expression. Now that happens to be an empty conjunction, which is exactly what's needed, but it is too accidental, so the patch changes optional WHERE clauses to explicitly generate an empty conjunction if the WHERE clause wasn't specified.	2022-07-22 20:14:48 +03:00
Avi Kivity	a037f9a086	cql3: util: deinline where clause utilities Some where clause related functions were unnecessarily inline; another was just recently de-templated. Move them to .cc.	2022-07-22 20:14:48 +03:00
Avi Kivity	fd663bcb94	cql3: util: change where clause utilities to accept a single expression rather than a vector of terms Conversion to terms happens internally via boolean_factors().	2022-07-22 20:14:48 +03:00
Avi Kivity	a5dd588465	cql3: statement_restrictions: accept a single expression rather than a vector Move closer to the goal of accepting a generic expression for WHERE clause by accepting a generic expression in statement_restrictions. The various callers will synthesize it from a vector of terms.	2022-07-22 20:14:48 +03:00
Avi Kivity	43aca25496	cql3: statement_restrictions: merge `if` and `for` A `for` loop does nothing on an empty container, so no need for an extra `if` for that condition. Drop the `if`.	2022-07-22 20:14:48 +03:00
Avi Kivity	4aa0a03b7e	cql3: select_statement: remove wrong but harmless std::move() in prepare_restrictions std::move(_where_clause) is wrong, because _where_clause is used later (when analyzing GROUP BY), but also harmless (because the statement_restrictions constructor accepts it by const reference). To avoid confusion in the next patch where we'll pass _where_clause to a different function, remove the bad std::move() in advance here.	2022-07-22 20:14:48 +03:00
Avi Kivity	8085b9f57a	cql3: expr: add boolean_factors() function to factorize an expression When analyzing a WHERE clause, we want to separate individual factors (usually relations), and later partition them into partition key, clustering key, and regular column relations. The first step is separation, for which this helper is added. Currently, it is not required since the grammar supplies the expression in separated form, but this will not work once it is relaxed to allow any expression in the WHERE clause. A unit test is added.	2022-07-22 20:14:48 +03:00
Avi Kivity	1efb2fecbe	cql3: expression: define operator==() for expressions This is useful for tests, to check that expression manipulations yield the expected results.	2022-07-22 20:14:48 +03:00
Avi Kivity	eec441d365	cql3: values: add operator==() for raw_value This is useful for implementing operator==() for expressions, which in turn require comparing constants, which contain raw_values. Note that this is not CQL comparison (that would be implemented in cql3::expr::evaluate() and would return a CQL boolean, not a C++ boolean), but a traditional C++ value comparison.	2022-07-22 20:13:49 +03:00
Anna Stuchlik	f46b207472	doc: update the links that are false external links and result in 404 Closes #11086	2022-07-22 14:17:42 +03:00
Anna Stuchlik	4bb9060268	doc: minor formatting and language fixes	2022-07-22 12:50:53 +02:00
Anna Stuchlik	2ade33f317	doc: add the new guide to the toctree	2022-07-22 12:36:43 +02:00
Anna Stuchlik	c9b0c6fdbf	doc: add the upgrade guide from 5.0 to 2022.1	2022-07-22 12:32:08 +02:00
Botond Dénes	d12d429c47	Merge 'doc: add the upgrade guides from 2022.x.y to 2022.x.z' from Anna Stuchlik Fix https://github.com/scylladb/scylla-docs/issues/4041 I've added the upgrade guides from 2022.x.y to 2022.x.z. They are based on the previous upgrade guides for patch releases. Closes #11104 * github.com:scylladb/scylla: doc: add the new upgrade guide to the toctree doc: add the upgrade guides from 2022.x.y to 2022.x.z	2022-07-22 09:06:16 +03:00
Michael Livshin	0f1a884c90	test: gdb: test that "task_histogram -a" finds some coroutines Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-21 19:12:21 +03:00
Michael Livshin	6cbb367ba7	scylla-gdb.py: recognize coroutine-related symbols as task types The criterion is too permissive because coroutine symbols (those without the "[clone .resume]" part at the end, anyway) look like normal function names; hopefully this won't give so many false positives as to become a problem. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-21 19:12:21 +03:00
Michael Livshin	f2c37b772d	scylla-gdb.py: whitelist the .text section for task "vtables" Actual vtables do not reside there, but coroutine object vptrs point at the actual coroutine code, which does. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-21 19:12:21 +03:00
Michael Livshin	080bd7c481	scylla-gdb.py: fix an error message Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-21 19:12:21 +03:00
Gleb Natapov	f1f1176963	service: raft: do not allow downgrading non expiring entry to expiring one in raft_address_map Expiring entries are added when a message is received from an unknown host. If the host is later added to the raft configuration they become non expiring. After that they can only be removed when the host is dropped from the configuration, but they should never become expiring again. Refs #10826	2022-07-21 17:40:04 +02:00
Anna Stuchlik	23515c8695	doc: add the new upgrade guide to the toctree	2022-07-21 17:05:28 +02:00
Anna Stuchlik	bf5bf44ddd	doc: add the upgrade guides from 2022.x.y to 2022.x.z	2022-07-21 16:46:06 +02:00
Asias He	39db15d2cb	misc_services: Fix cache hitrate update This patch avoids unnecessary CACHE_HITRATES updates through gossip. After this patch: Publish CACHE_HITRATES in case: - We haven't published it at all - The diff is bigger than 1% and we haven't published in the last 5 seconds - The diff is really big (10%) Note: A peer node can know the cache hitrate through the read_data, read_mutation_data and read_digest RPC verbs, which have cache_temperature in the response. So there is no need to update CACHE_HITRATES through gossip at high frequency. We do the recalculation faster if the diff is bigger than 0.01. It is useful to do the calculation even if we do not publish the CACHE_HITRATES through gossip, since the recalculation will call the table->set_global_cache_hit_rate to set the hitrate. Fixes #5971 Closes #11079	2022-07-21 11:31:30 +03:00
Nadav Har'El	5faf3c711d	doc, alternator: document the possibility of write reordering In issue #10966, a user noticed that Alternator writes may be reordered (a later write to an item is ignored with the earlier write to the same item "winning") if Scylla nodes do not have synchronized time and if always_use_lwt write isolation mode is not used. In this patch I add to docs/alternator/compatibility.md a section about this issue, what causes it, and how to solve or at least mitigate it. Fixes #10966 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11094	2022-07-21 09:22:56 +02:00
Kamil Braun	4e42aeb0df	service: storage_service: additional assertions and comments	2022-07-20 19:39:29 +02:00
Kamil Braun	25bb8384af	service/raft: raft_group0: additional logging, assertions, comments Move some rare logs from TRACE to INFO level. Add some assertions. Write some more comments, including FIXMEs and TODOs. Remove unnecessary `_shutdown_gate.hold()` (this is not a background task).	2022-07-20 19:39:29 +02:00
Kamil Braun	c9f1ec1268	service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0` Group 0 discovery would internally fetch the seed list from gossiper. Gossiper would return the seed list from conf/scylla.yaml. This seed list is proper for the bootstrapping scenario - we specify the initial contact points for a node that joins a cluster. We'll have to use a different list of seeds for group 0 discovery for the upgrade scenario. Prepare for that by taking the seed list as a parameter. In the bootstrap scenario we'll pass the seed list down from `storage_service::join_cluster`. Additionally, `join_group0` now takes an `as_voter` flag, which is `false` in the bootstrap scenario (we initially join as a non-voter) but will be `true` in the upgrade scenario.	2022-07-20 19:39:29 +02:00
Kamil Braun	684d8171ca	service/raft: raft_group0: rewrite `remove_from_group0` See previous commit. `remove_from_group0` had a similar problem as `leave_group0`: it would handle the case where `raft_group0::_group0` variant was not `raft::group_id` (i.e. we haven't joined group 0), but RAFT local feature was enabled - i.e. the yet-unimplemented upgrade case - by running discovery and calling `send_group0_modify_config`. Instead, if we see that we've joined group 0 before, assume that we're still a member and simply use the Raft `modify_config` API to remove another server. If we're not a member it means we either decommissioned or were removed by someone else; then we have no business trying to remove others. There's also the unimplemented upgrade case but that will come in another pull request. Finally, add some logic for handling an edge case: suppose we joined group 0 recently and we still didn't fully update our RPC address map (it's being updated asynchronously by Raft's io_fiber). Thus we may fail to find a member of group 0 in the address map. To handle this, ensure we're up-to-date by performing a Raft read barrier. State some assumptions in a comment. Add a TODO for handling failures. Remove unnecessary `_shutdown_gate.hold()` (this is not a background task).	2022-07-20 19:39:29 +02:00
Kamil Braun	eeeef0bc50	service/raft: raft_group0: rewrite `leave_group0` One of the following cases is true: 1. RAFT local feature is disabled. Then we don't do anything related to group 0. 2. RAFT local feature is enabled and when we bootstrapped, we joined group 0. Then `raft_group0::_group0` variant holds the `raft::group_id` alternative. 3. RAFT local feature is enabled and when we bootstrapped we didn't join group 0. This means the RAFT local feature was disabled when we bootstrapped and we're in the (unimplemented yet) upgrade scenario. `raft_group0::_group0` variant holds the `std::monostate` alternative. The problem with the previous implementation was that it checked for the conditions of the third case above - that RAFT local feature is enabled but `_group0` does not hold `raft::group_id` - and if those conditions were true, it executed some logic that didn't really make sense: it ran the discovery algorithm and called `send_group0_modify_config` RPC. In this rewrite I state some assumptions that `leave_group0` makes: - we've finished the startup procedure. - we're being run during decommission - after the node entered LEFT status. In the new implementation, if `_group0` does not hold `raft::group_id` (checked by the internal `joined_group0()` helper), we simply return. This is the yet-unimplemented upgrade case left for a follow-up PR. Otherwise we fetch our Raft server ID (at this point it must be present - otherwise it's a fatal error) and simply call `modify_config` from the `raft::server` API. Remove unnecessary call to `_shutdown_gate.hold()` (this is not a background task).	2022-07-20 19:39:29 +02:00
Kamil Braun	75608bcd2f	service/raft: raft_group0: split `leave_group0` from `remove_from_group0` `leave_group0` was responsible for both removing a different node from group 0 and removing ourselves (leaving) group 0. The two scenarios are a bit different and the handling will be rewritten in following commits. Split `leave_group0` into two functions. Remove the incorrect comment about idempotency - saying that the procedure is idempotent is an oversimplification, one could argue it's incorrect since the second call simply hangs, at least in the case of leaving group 0; following commits will state what's happening more precisely. Add some additional logging and assertions where the two functions are called in `storage_service`.	2022-07-20 19:39:29 +02:00
Kamil Braun	ee0219dfe3	service/raft: raft_group0: introduce `setup_group0` Contains all logic for deciding to join (or not join) group 0. Prepare for the case where we don't want to join group 0 immediately on startup - the upgrade scenario (will be implemented in a follow-up). Move the group 0 setup step earlier in `storage_service::join_cluster`. `join_group0()` is now a private member of `raft_group0`. Some more comments were written.	2022-07-20 19:39:29 +02:00
Kamil Braun	4b0db59671	service/raft: raft_group0: introduce `load_my_addr` Compared to `load_or_create_my_addr` this function assumes that the address is already present on disk; if not, it's a fatal error. Use it in places where it would indeed be a fatal error if the address was missing.	2022-07-20 19:39:29 +02:00
Kamil Braun	f0f9aa5c7d	service/raft: raft_group0: make some calls abortable There are some calls to `modify_config` which should react to aborts (e.g. when we shutdown Scylla). There are also calls to `send_group0_modify_config` which should probably also react to aborts, but the functions don't take an abort_source parameter. This is fixable but I left TODOs for now.	2022-07-20 19:39:29 +02:00
Kamil Braun	ab8c3c6742	service/raft: raft_group0: remove some temporary variables Make the code a bit shorter.	2022-07-20 19:39:29 +02:00
Kamil Braun	b193ea8ec0	service/raft: raft_group0: refactor `do_discover_group0`. The function no longer accesses the `_group0` variant directly, instead it is made a member of `service::persistent_discovery`; the caller guarantees that `persistent_discovery` is not destroyed before the function finishes. The function is now named `run`. A short comment was written at the declaration site. Make some members of `persistent_discovery` private, as they are only used by `run`. Simplify `struct tracker`, store the discovery output separately (`struct tracker` is now responsible for a single thing). Enclose the `parallel_for_each` over requests in a common coroutine which keeps alive all the necessary things for the loop body and performs the last step which was previously inside a `then`.	2022-07-20 19:39:29 +02:00
Kamil Braun	6d9d493e2a	service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`	2022-07-20 19:39:28 +02:00
Kamil Braun	54d9219257	service/raft: raft_group0: extract `start_server_for_group0` function Extract part of the code from `join_group0`. Add some comments. This part will be reused.	2022-07-20 19:38:53 +02:00
Kamil Braun	dca1ce52ed	service/raft: raft_group0: create a private section Move member functions and fields used internally by the `raft_group0` class into a private section. Write some comments.	2022-07-20 19:38:53 +02:00
Kamil Braun	d28170b1a5	service/raft: discovery: `seeds` may contain `self` The set of seeds passed to the discovery algorithm may contain `self`. The implementation will filter the `self` out (it calls `step(seeds)`; `step` iterates over the given list of peers and ignores `_self`). Specify this at the `discovery` constructor declaration site. Simplify the code constructing `persistent_discovery` in `raft_group0::discover_group0` using this assumption.	2022-07-20 19:38:53 +02:00
Wojciech Mitros	5590493abd	wasm: test instances reuse Add a test for a wasm aggregate function which uses the new metrics to check if the cache has been hit at least once. Also check that the cache can get reused on different queries, by testing that the number of queries is higher than the number of cache misses. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-07-20 18:19:25 +02:00
Wojciech Mitros	9281ba3919	wasm: reuse UDF instances When executing a wasm UDF, most of the time is spent on setting up the instance. To minimize its cost, we reuse the instance using wasm::instance_cache. This patch adds a wasm instance cache, that stores a wasmtime instance for each UDF and scheduling group. The instances are evicted using LRU strategy. The cache may store some entries for the UDF after evicting the instance, but they are evicted when the corresponding UDF is dropped, which greatly limits their number. The size of stored instances is estimated using the size of their WASM memories. In order to be able to read the size of memory, we require that the memory is exported by the client. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-07-20 18:19:22 +02:00
Wojciech Mitros	d7a933068a	schema_tables: simplify merge_functions and avoid extra compilation Currently, we have 2 merge_functions methods, where one only calls the other. We can replace them with a simple one. The merge_functions method compiles a UDF (using create_func) only to read its signature. We can avoid that by reading it from the row ourselves. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-07-20 18:10:21 +02:00
Nadav Har'El	59f684d2c3	Merge 'doc: fix the links to Alternator' from Anna Stuchlik The Scylla Alternator documentation is now part of the Scylla user documentation (previously it was dev documentation). This PR updates the links to the Alternator documentation. Closes #11089 * github.com:scylladb/scylla: doc: update the link to Alternator for DynamoDB users doc: fix the links to Alternator	2022-07-20 18:38:08 +03:00
Anna Stuchlik	c1122c8f54	doc: update the link to Alternator for DynamoDB users	2022-07-20 17:02:48 +02:00
Avi Kivity	13a64d8ab2	Merge 'Remove all remaining restrictions classes' from Jan Ciołek This PR removes all code that used classes `restriction`, `restrictions` and their children. There were two fields in `statement_restrictions` that needed to be dealt with: `_clustering_columns_restrictions` and `_nonprimary_key_restrictions`. Each function was reimplemented to operate on the new expression representation and eventually these fields weren't needed anymore. After that the restriction classes weren't used anymore and could be deleted as well. Now all of the code responsible for analyzing WHERE clause and planning a query works on expressions. Closes #11069 * github.com:scylladb/scylla: cql3: Remove all remaining restrictions code cql3: Move a function from restrictions class to the test cql3: Remove initial_key_restrictions cql3: expr: Remove convert_to_restriction cql3: Remove _new from _new_nonprimary_key_restrictions cql3: Remove _nonprimary_key_restrictions field cql3: Reimplement uses of _nonprimary_key_restrictions using expression cql3: Keep a map of single column nonprimary key restrictions cql3: Remove _new from _new_clustering_columns_restrictions cql3: Remove _clustering_columns_restrictions from statement_restrictions cql3: Use a variable instead of dynamic cast cql3: Use the new map of single column clustering restrictions cql3: Keep a map of single column clustering key restrictions cql3: Return an expression in get_clustering_columns_restrctions() cql3: Reimplement _clustering_columns_restrictions->has_supporting_index() cql3: Don't create single element conjunction cql3: Add expr::index_supports_some_column cql3: Reimplement has_unrestricted_components() cql3: Reimplement _clustering_columns_restrictions->need_filtering() cql3: Reimplement num_prefix_columns_that_need_not_be_filtered cql3: Use the new clustering restrictions field instead of ->expression cql3: Reimplement _clustering_columns_restrictions->size() using expressions cql3: Reimplement 
_clustering_columns_restrictions->get_column_defs() using expressions cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions cql3: expr: Add has_only_eq_binops function cql3: Reimplement _clustering_columns_restrictions->empty() using expressions	2022-07-20 18:01:15 +03:00
Anna Stuchlik	ee53105e12	doc: fix the links to Alternator	2022-07-20 16:52:52 +02:00
Avi Kivity	89a935625d	Merge 'doc: create a new CQL reference section to aggregate CQL information - V2' from Anna Stuchlik This PR is V2 of https://github.com/scylladb/scylla/pull/11065. The scope of updates: - Created a _/cql/_ folder. - Moved all the CQL-related pages from _/getting-started/_ to _/cql/_ . - Moved the _cql-extensions.md_ file from _/dev/_ to _/cql/_ . - Removed the outdated files and references. - Updated the links to the CQL-related pages. Closes #11083 * github.com:scylladb/scylla: doc: update the links following the content reorganization doc: remove the outdated cql pages and delete them from the indexes doc: add index.rst for the cql folder and add it to toctree doc: move cql-extensions.md from the dev docs to the cql folder doc: move the CQL pages from getting-started to cql doc: add redirections for the CQL pages	2022-07-20 17:28:24 +03:00
Benny Halevy	6fd479b151	Update seastar submodule * seastar 6d4a0cb7a3...1d4432ed28 (11): > rpc: Ignore failed future in connection::send() > install-dependencies: centos-{7,8}: use {DTS,GTS}-11 instead of {DTS,GTS}-9 > coroutine: change access specifier of seastar::task member > Merge 'build: try to enable io_uring if it is not specified' from Kefu Chai > build: find_package() only if necessary > Merge "rpc: handle connection negotiation error during stream sink creation " from Gleb > build: try to enable io_uring if it is not specified > test: rpc: add test that inject error during stream connection negotiation. > test: rpc: inject errors only on streaming connections > test: rpc: allow specifying after what limit a connection start producing errors > rpc: do not destroy stream connection without stopping in case of negotiation failure Fixes #10943 Closes #11082	2022-07-20 16:25:21 +03:00
Anna Stuchlik	4f2b12becc	doc: update the links following the content reorganization	2022-07-20 13:07:51 +02:00
Anna Stuchlik	969a7b44e9	doc: remove the outdated cql pages and delete them from the indexes	2022-07-20 12:34:09 +02:00
Botond Dénes	014c5b56a3	query-result: move last_pos up to query::result query_result was the wrong place to put last position into. It is only included in data-responses, but not on digest-responses. If we want to support empty pages from replicas, both data and digest responses have to include the last position. So hoist up the last position to the parent structure: query::result. This is a breaking change inter-node ABI wise, but it is fine: the current code wasn't released yet. Closes #11072	2022-07-20 13:28:09 +03:00
Anna Stuchlik	8915c5df0b	doc: add index.rst for the cql folder and add it to toctree	2022-07-20 12:25:13 +02:00
Tomasz Grabiec	04f9a150be	Merge 'raft: split `can_vote` field from `server_address` to separate struct' from Kamil Braun Whether a server can vote in a Raft configuration is not part of the address. `server_address` was used in many contexts where `can_vote` is irrelevant. Split the struct: `server_address` now contains only `id` and `server_info` as it did before `can_vote` was introduced. Instead we have a `config_member` struct that contains a `server_address` and the `can_vote` field. Also remove an "unsafe" constructor from `server_address` where `id` was provided but `server_info` was not. The constructor was used for tests where `server_info` is irrelevant, but it's important not to forget about the info in production code. Replace the constructor with helper functions which specify in comments that they are supposed to be used in tests or in contexts where `info` doesn't matter (e.g. when checking presence in an `unordered_set`, where the equality operator and hash operate only on the `id`). Closes #11047 * github.com:scylladb/scylla: raft: fsm: fix `entry_size` calculation for config entries raft: split `can_vote` field from `server_address` to separate struct serializer_impl: generalize (de)serialization of `unordered_set` to_string: generalize `operator<<` for `unordered_set`	2022-07-20 12:20:52 +02:00
Anna Stuchlik	4f897f149d	doc: move cql-extensions.md from the dev docs to the cql folder	2022-07-20 12:20:25 +02:00
Anna Stuchlik	2dda3dbfb5	doc: move the CQL pages from getting-started to cql	2022-07-20 12:18:54 +02:00
Anna Stuchlik	862bf306ad	doc: add redirections for the CQL pages	2022-07-20 12:12:53 +02:00
Asias He	482ee369d0	storage_service: Increase watchdog_interval for node ops The node operations using node_ops_cmd have the following procedure: 1) Send node_ops_cmd::replace_prepare to all nodes 2) Send node_ops_cmd::replace_heartbeat to all nodes In a large cluster 1) might take a long time to finish, as a result when the node starts to perform 2), the heartbeat timer on the peer nodes which is 30s might have already timed out. This fails the whole node operation. We have patches to make 1) more efficient and faster. https://github.com/scylladb/scylla/pull/10850 https://github.com/scylladb/scylla/pull/10822 In addition to that, this patch increases the heartbeat timeout to reduce the false positive of timeout. Refs #10337 Refs #11078 Closes #11081	2022-07-20 12:56:17 +03:00
Jan Ciolek	599bcd6ea7	cql3: Remove all remaining restrictions code The classes restriction, restrictions and its children aren't used anywhere now and can be safely removed. Some includes need to be modified for the code to compile. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:31 +02:00
Jan Ciolek	bff0b87c18	cql3: Move a function from restrictions class to the test statement_restrictions_test uses a function that is defined in multi_column_restriction.hh. This file will be removed soon and for the test to still work the function is moved to the test source. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:31 +02:00
Jan Ciolek	b269e5a24d	cql3: Remove initial_key_restrictions initial_key restrictions was a class used by statement_restrictions to represent empty restrictions of different types and simplify restriction merging logic. They are not used anymore and can be removed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:31 +02:00
Jan Ciolek	4f92c64e1b	cql3: expr: Remove convert_to_restriction This function isn't used anywhere anymore and can be removed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:31 +02:00
Jan Ciolek	d7e954307f	cql3: Remove _new from _new_nonprimary_key_restrictions The _new prefix was used to distinguish the new field from the old representation. Now the new field has fully replaced the old one and _new can be removed from its name. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:31 +02:00
Jan Ciolek	b6ae72f095	cql3: Remove _nonprimary_key_restrictions field All code that made use of _nonprimary_key_restrictions has been modified to use _new_nonprimary_key_restrictions instead. The field can be removed. Additionally the old code responsible for adding new restrictions can be fully removed, everything is now done using add_restriction. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:31 +02:00
Jan Ciolek	9d1ba07471	cql3: Reimplement uses of _nonprimary_key_restrictions using expression All parts of the code that use _nonprimary_key_restrictions are changed to use _new_nonprimary_key_restrictions instead. I decided not to split this into multiple commits, as there isn't a lot of changes and they are analogous to the ones done before for partition and clustering columns. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:30 +02:00
Jan Ciolek	2c28554390	cql3: Keep a map of single column nonprimary key restrictions Keep a map of extracted restrictions for each restricted nonprimary column. This map will be useful, just like the ones for clustering and partition columns. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:30 +02:00
Jan Ciolek	0e8f437f24	cql3: Remove _new from _new_clustering_columns_restrictions The _new was used to distinguish from the old field during transition. Now the old field has been deleted and the new one can take its place. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 09:10:27 +02:00
Botond Dénes	6e20cb3255	Merge 'database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard' from Benny Halevy Currently, all the mutations this test generates are applied on shard 0. In rare cases, this may lead to the following crash, when the flushed sstable doesn't contain any key that belongs to the current shard, as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log ``` WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}} ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db) ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). 
Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322 ``` Instead, generate random keys and apply them on their owning shard, and truncate all database shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11066 * github.com:scylladb/scylla: database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard table: try_flush_memtable_to_sstable: consume: close reader on error	2022-07-20 09:06:07 +03:00
Jan Ciolek	4fac3be535	cql3: Remove _clustering_columns_restrictions from statement_restrictions All code using the _clustering_columns_restrictions field has been modified to instead use _new_clustering_columns_restrictions expression representation. The old field can now be removed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 00:41:22 +02:00
Jan Ciolek	bf3f00413e	cql3: Use a variable instead of dynamic cast There is a dynamic cast used to determine whether clustering columns are restricted by a multi column restriction. Instead of doing that we can just use the _has_multi_column variable. It's also used a few lines higher, which means that it should be already initialized. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 00:41:22 +02:00
Jan Ciolek	a0884760ab	cql3: Use the new map of single column clustering restrictions Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-20 00:41:07 +02:00
Avi Kivity	5a30f9b789	Merge 'Distributed aggregate query' from Michał Jadwiszczak This PR extends #9209. It consists of 2 main points: To enable parallelization of user-defined aggregates, reduction function was added to UDA definition. Reduction function is optional and it has to be scalar function that takes 2 arguments with type of UDA's state and returns UDA's state. All currently implemented native aggregates got their reducible counterpart, which return their state as final result, so it can be reduced with other result. Hence all native aggregates can now be distributed. Local 3-node cluster made with current master. `node1` updated to this branch. Accessing node with `ccm <node-name> cqlsh` I've tested the things below from both old and new node: - creating UDA with reduce function - not allowed - selecting count() - distributed - selecting other aggregate function - not distributed Fixes: #10224 Closes #10295 github.com:scylladb/scylla: test: add tests for parallelized aggregates test: cql3: Add UDA REDUCEFUNC test forward_service: enable multiple selection forward_service: support UDA and native aggregate parallelization cql3:functions: Add cql3::functions::functions::mock_get() cql3: selection: detect parallelize reduction type db,cql3: Move part of cql3's function into db selection: detect if selectors factory contains only simple selectors cql3: reducible aggregates DB: Add `scylla_aggregates` system table db,gms: Add SCYLLA_AGGREGATES schema features CQL3: Add reduce function to UDA gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature	2022-07-19 19:05:19 +03:00
Avi Kivity	1f21c1ecc8	Merge "Add IO throttling to streaming class" from Pavel E " Same thing was done for compaction class some time ago, now it's time for streaming to keep repair-generated IO in bounds. This set mostly resembles the one for compaction IO class with the exception that boot-time reshard/reshape currently runs in streaming class, but that's not great if the class is throttled, so the set also moves boot-time IO into default IO class. " * 'br-streaming-class-throttling-2' of https://github.com/xemul/scylla: distributed_loader: Populate keyspaces in default class streaming: Maintain class bandwidth streaming: Pass db::config& to manager constructor config: Add stream_io_throughput_mb_per_sec option sstables: Keep priority class on sstable_directory	2022-07-19 17:10:25 +03:00
Jan Ciolek	9a03a09422	cql3: Keep a map of single column clustering key restrictions Having this map is useful in a bunch of places. To keep code simple it could be created from scratch each time, but it's also used in do_filter, so this could actually affect performance. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-19 16:02:01 +02:00
Jan Ciolek	2b7ffd57fb	cql3: Return an expression in get_clustering_columns_restrctions() get_clustering_columns_restrctions() used to return a shared pointer to the clustering_restrictions class. Now everything is being converted to expression, so it should return an expression as well. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-19 16:02:01 +02:00
Jan Ciolek	ebbbc3291a	cql3: Reimplement _clustering_columns_restrictions->has_supporting_index() The code is copied from the corresponding restrictions classes. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-19 16:01:40 +02:00
Pavel Emelyanov	07460761fb	Merge "Make compaction_static_shares and memtable_flush_static_shares live updateable" from Igor Ribeiro Barbosa Duarte (3): Currently, after updating the static shares it's necessary to restart the cluster. This patch series makes compaction_static_shares and memtable_flush_static_shares live updateable so that this restart isn't necessary anymore. dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_liveupdate_compaction_static_shares ci: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1412/ * https://github.com/igorribeiroduarte/scylla/tree/make_compaction_static_shares_live_updateable: memtable_flush: Make memtable_flush_static_shares liveupdateable compaction: Make compaction_static_shares liveupdateable backlog_controller: Unify backlog_controller constructors	2022-07-19 16:55:55 +03:00
Benny Halevy	1c26d49fba	database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard Currently, all the mutations this test generates are applied on shard 0. In rare cases, this may lead to the following crash, when the flushed sstable doesn't contain any key that belongs to the current shard, as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log ``` WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}} ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db) ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). 
Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322 ``` Instead, generate random keys and apply them on their owning shard, and truncate all database shards. Fixes #11076 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-19 16:55:11 +03:00
Jan Ciolek	991fd5e4db	cql3: Don't create single element conjunction In case the expression is empty and we want to merge it with a new restriction we can just set the expression to the new restriction. Later this will make it easier to distinguish which case of multi column restrictions we are dealing with. IN and EQ can only have a single binary operator, but slice might have two. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-19 15:38:33 +02:00
Jan Ciolek	c7495fa59e	cql3: Add expr::index_supports_some_column Add a function that checks if there is an index which supports one of the columns present in the given expression. This functionality will soon be needed for clustering and nonprimary columns so it's good to separate into a reusable function. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-19 15:38:20 +02:00
Benny Halevy	f60ff44fdf	table: try_flush_memtable_to_sstable: consume: close reader on error If an exception is thrown in `consume` before write_memtable_to_sstable is called or if the latter fails, we must close the reader passed to it. Fixes #11075 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-19 16:35:59 +03:00
Igor Ribeiro Barbosa Duarte	3b19bcf1a1	memtable_flush: Make memtable_flush_static_shares liveupdateable This patch makes memtable_flush_static_shares liveupdateable to avoid having to restart the cluster after updating this config. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte	8dd0f4672d	compaction: Make compaction_static_shares liveupdateable This patch makes compaction_static_shares liveupdateable to avoid having to restart the cluster after updating this config. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte	c2ee6492e6	backlog_controller: Unify backlog_controller constructors This patch adds the _static_shares variable to the backlog_controller so that instead of having to use a separate constructor when controller is disabled, we can use a single constructor and periodically check on the adjust method if we should use the static shares or the controller. This will be useful on the next patches to make compaction_static_shares and memtable_flush_static_shares live updateable. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-07-19 10:06:12 -03:00
Takuya ASADA	752be6536a	rename relocatable packages Currently, we use the following naming convention for relocatable package filenames: ${package_name}-${arch}-package-${version}.${release}.tar.gz But this is very different from standard Linux packaging systems such as .rpm and .deb. Let's align the convention to .rpm style, so the new convention should be: ${package_name}-${version}-${release}.${arch}.tar.gz Closes #9799 Closes #10891 * tools/java de8289690e...d0143b447c (1): > build_reloc.sh: rename relocatable packages * tools/jmx fe351e8...06f2735 (1): > build_reloc.sh: rename relocatable packages * tools/python3 e48dcc2...bf6e892 (1): > reloc/build_reloc.sh: rename relocatable packages	2022-07-19 15:46:49 +03:00
David Garcia	dcb5550bc3	doc: update url to docs.scylladb.com Closes #11050	2022-07-19 13:42:25 +03:00
Pavel Emelyanov	85d32485d9	config: Mark compaction_throughput_mb_per_sec option as Used Otherwise it's not shown in the --help output. Should've been the part of `868c3be0` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220716085221.26634-1-xemul@scylladb.com>	2022-07-19 13:18:17 +03:00
Pavel Emelyanov	55d4fa49f7	distributed_loader: Populate keyspaces in default class The streaming class throughput can be limited with the respective option. Doing boot-time reshard/reshape doesn't need to obey it, as the node is not yet up but instead should get there as soon as possible. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:21:13 +03:00
Pavel Emelyanov	96d6be7daf	streaming: Maintain class bandwidth Same as was done in `b112a983` for compaction manager Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:19:56 +03:00
Pavel Emelyanov	a246b6d3eb	streaming: Pass db::config& to manager constructor The stream_manager will bookkeep the streaming bandwidth option, to subscribe on its changes it needs the config reference. It would be better if it was stream_manager::config, but currently subscription on db::config::<stuff> updates is not very shard-friendly, so we need to carry the config reference itself around. Similar trouble is there for compaction_manager. The option is passed through its own config, but the config is created on each shard by database code. Stream manager config would be created once by main code on shard 0. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:18:08 +03:00
Pavel Emelyanov	7d0110cd31	config: Add stream_io_throughput_mb_per_sec option It's going to control the bandwidth for the streaming prio class. For now it's just added but doesn't work for real Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:14:41 +03:00
Pavel Emelyanov	a56e2c83f3	sstables: Keep priority class on sstable_directory Current code accepts priority class as an argument to various functions that need it and all its callers use streaming class. Next patches will need to sometimes use default class, but it will require heavy patching of the distributed loader. Things get simpler if the priority class is kept on sstable_directory on start. This change also simplifies the ongoing effort on unification of sched and IO classes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:14:41 +03:00
Pavel Emelyanov	2a63c6f647	scylla-gdb: Don't show empty smp queues When collecting a histogram of smp-queues population empty queues also count, but it makes the output very long and not very informative. Skipping empty queues increases signal / noise ratio. v2: - print the number of omitted empty queues Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220718180912.2931-1-xemul@scylladb.com>	2022-07-19 11:33:24 +03:00
Gleb Natapov	d40106d3a9	raft: remove unused code Message-Id: <YtUh8Hs+nQQ8+hLY@scylladb.com>	2022-07-18 21:21:45 +03:00
Botond Dénes	af31a89afa	scylla-gdb.py: scylla fiber: reverse backward fiber We want to print the backwards fiber in reverse, starting with the furthest-away task in the chain. For this, the task list returned by `_walk()` has to be reversed. Closes #11062	2022-07-18 20:20:37 +03:00
Kamil Braun	7c377cc457	raft: fsm: fix `entry_size` calculation for config entries We forgot about `can_vote`. Stumbled on this while separating `can_vote` to separate struct. Note that `entry_size` is still inaccurate (#11068) but the patch is an improvement. Refs: #11068	2022-07-18 18:24:50 +02:00
Kamil Braun	daf9c53bb8	raft: split `can_vote` field from `server_address` to separate struct Whether a server can vote in a Raft configuration is not part of the address. `server_address` was used in many contexts where `can_vote` is irrelevant. Split the struct: `server_address` now contains only `id` and `server_info` as it did before `can_vote` was introduced. Instead we have a `config_member` struct that contains a `server_address` and the `can_vote` field. Also remove an "unsafe" constructor from `server_address` where `id` was provided but `server_info` was not. The constructor was used for tests where `server_info` is irrelevant, but it's important not to forget about the info in production code. The constructor was used for two purposes: - Invoking set operations such as `contains`. To solve this we use C++20 transparent hash and comparator functions, which allow invoking `contains` and similar functions by providing a different key type (in this case `raft::server_id` in set of addresses, for example). - constructing addresses without `info`s in tests. For this we provide helper functions in the test helpers module and use them.	2022-07-18 18:22:10 +02:00
Kamil Braun	f5d274d866	serializer_impl: generalize (de)serialization of `unordered_set` Be able to (de)serialize sets with different Hash or KeyEqual specializations.	2022-07-18 18:20:33 +02:00
Kamil Braun	5907049ecc	to_string: generalize `operator<<` for `unordered_set` Be able to print sets with different Hash or KeyEqual specializations. Use a variadic template, as it was done for `operator<<` for `unordered_map`.	2022-07-18 18:20:33 +02:00
Jan Ciolek	c2d20adc49	cql3: Reimplement has_unrestricted_components() The code is copied from: clustering_key_restrictions::has_unrestricted_components Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:49:23 +02:00
Jan Ciolek	85ebe99eb5	cql3: Reimplement _clustering_columns_restrictions->need_filtering() The code is copied from: single_column_primary_key_restrictions<clustering_key>::needs_filtering Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:49:09 +02:00
Jan Ciolek	d3a2a77b99	cql3: Reimplement num_prefix_columns_that_need_not_be_filtered The code is copied from: single_column_primary_key_restrictions<clustering_key> ::num_prefix_columns_that_need_not_be_filtered Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:48:55 +02:00
Jan Ciolek	1914d21f7b	cql3: Use the new clustering restrictions field instead of ->expression Instead of writing _clustering_columns_restrictions->expression It's better to use the new field: _new_clustering_columns_restrictions These expressions should be the same. It removes another use of the unwanted restrictions field. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:48:36 +02:00
Jan Ciolek	360087c580	cql3: Reimplement _clustering_columns_restrictions->size() using expressions Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:46:14 +02:00
Jan Ciolek	92df275868	cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:45:50 +02:00
Jan Ciolek	88da7ae0dc	cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions Use the freshly added function to replace old calls to ->is_all_eq(). Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:45:35 +02:00
Jan Ciolek	6cf0981aa6	cql3: expr: Add has_only_eq_binops function Add a function which checks that an expression contains only binary operators with '='. Right now this check is done only in a single place, but soon the same check will have to be done for clustering columns as well, so the code is moved to a separate function to prevent duplication. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:45:06 +02:00
Jan Ciolek	b84787efac	cql3: Reimplement _clustering_columns_restrictions->empty() using expressions All occurrences of _clustering_columns_restrictions->empty() have been replaced with code that operates on the new expression representation: _new_clustering_columns_restrictions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-18 17:44:50 +02:00
Anna Stuchlik	274691f45e	doc: fix the example of UDT in the SELECT statement Closes #11067	2022-07-18 18:26:06 +03:00
Pavel Emelyanov	62d95f09de	view: De-futurize make_view_update_builder() It doesn't sleep, just returns ready future with builder tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1384 it's red because e-mail notification is broken (scylla-pkg#2988) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220718132529.30751-1-xemul@scylladb.com>	2022-07-18 17:15:48 +03:00
Jadw1	7497fda370	test: add tests for parallelized aggregates	2022-07-18 15:25:42 +02:00
Jadw1	c95a0a9fe6	test: cql3: Add UDA REDUCEFUNC test Adds test checking if reduction function is correctly assigned to UDA and if the information is stored in `scylla_aggregates` system table.	2022-07-18 15:25:41 +02:00
Jadw1	182438c5f8	forward_service: enable multiple selection Enables parallelization of query like `SELECT MIN(x), MAX(x)`. Compatibility is ensured under the same cluster feature as UDA and native aggregates parallelization. (UDA_NATIVE_PARALLELIZED_AGGREGATION)	2022-07-18 15:25:41 +02:00
Jadw1	29a0be75da	forward_service: support UDA and native aggregate parallelization Enables parallelization of UDA and native aggregates. The way the query is parallelized is the same as in #9209. Separate reduction type for `COUNT(*)` is left for compatibility reason.	2022-07-18 15:25:41 +02:00
Jadw1	a0a6d87c1b	cql3:functions: Add cql3::functions::functions::mock_get() `mock_get` was created only for forward_service use, thus it only checks for aggregate functions if no declared function was found. The reason for this function is that there is no serialization of `cql3::selection::selection`, so the functions lying underneath these selections have to be found again. Most of this code is copied from `functions::get()`; however, `functions::get()` is not used because it requires mocking or serializing expressions, and `functions::find()` is not enough because it does not search for dynamic aggregate functions	2022-07-18 15:25:41 +02:00
Jadw1	6d977fcf88	cql3: selection: detect parallelize reduction type Detects type of reduction if it is possible. Separate case for `COUNT(*)` is left for compatibility reason. By now only single selection is supported.	2022-07-18 15:25:41 +02:00
Jadw1	59498caeca	db,cql3: Move part of cql3's function into db Moving `function`, `function_name` and `aggregate_function` into db namespace to avoid including cql3 namespace into query-request. For now, only minimal subset of cql3 function was moved to db.	2022-07-18 15:25:41 +02:00
Jadw1	6b63417bc8	selection: detect if selectors factory contains only simple selectors Because `selection` is not serializable and it has to be sent via network to parallelize query, we have to mock the selection. To simplify the mocking, for now only single selectors for aggregate's arguments are allowed (no casting or other functions as arguments).	2022-07-18 15:25:41 +02:00
Jadw1	0f08c8e099	cql3: reducible aggregates Introduces reducible aggregates which don't return final result but accumulator, that can be later reduced.	2022-07-18 15:25:41 +02:00
Jadw1	d13f347621	DB: Add `scylla_aggregates` system table Saving information about UDA's reduce function to `scylla_aggregates` table and distributing it across cluster.	2022-07-18 15:25:37 +02:00
Jadw1	2c46222e31	db,gms: Add SCYLLA_AGGREGATES schema features This schema feature will be used to guard system_schema.scylla_aggregates schema table.	2022-07-18 14:18:48 +02:00
Jadw1	d8f3461147	CQL3: Add reduce function to UDA Add optional field to UDA, that describes reduce function to allow parallelization of UDA aggregates.	2022-07-18 14:18:48 +02:00
Jadw1	346fb08680	gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature Feature that indicates whether the cluster supports optional UDA parameter (reduction function) and parallelization of UDA and native aggregates.	2022-07-18 14:18:48 +02:00
Botond Dénes	9afd2dc428	Merge 'Make compaction manager switch to table abstraction ' from Raphael "Raph" Carvalho This work gets us a step closer to compaction groups. Everything in compaction layer but compaction_manager was converted to table_state. After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any. Closes #11028 * github.com:scylladb/scylla: compaction: remove forward declaration of replica::table compaction_manager: make add() and remove() switch to table_state compaction_manager: make run_custom_job() switch to table_state compaction_manager: major: switch to table_state compaction_manager: scrub: switch to table_state compaction_manager: upgrade: switch to table_state compaction: table_state: add get_sstables_manager() compaction_manager: cleanup: switch to table_state compaction_manager: offstrategy: switch to table_state() compaction_manager: rewrite_sstables(): switch to table_state compaction_manager: make run_with_compaction_disabled() switch to table_state compaction_manager: compaction_reenabler: switch to table_state compaction_manager: make submit(T) switch to table_state compaction_manager: task: switch to table_state compaction: table_state: Add is_auto_compaction_disabled_by_user() compaction: table_state: Add on_compaction_completion() compaction: table_state: Add make_sstable() compaction_manager: make can_proceed switch to table_state compaction_manager: make stop compaction procedures switch to table_state compaction_manager: make get_compactions() switch to table_state compaction_manager: change task::update_history() to use table_state instead compaction_manager: make can_register_compaction() switch to table_state compaction_manager: make get_candidates() switch to table_state compaction_manager: make propagate_replacement() switch to table_state compaction: Move table::in_strategy_sstables() and switch to table_state compaction: table_state: Add maintenance sstable set compaction_manager: make has_table_ongoing_compaction() switch to table_state compaction_manager: make compaction_disabled() switch to table_state compaction_manager: switch to table_state for mapping of compaction_state compaction_manager: move task ctor into source	2022-07-18 15:18:29 +03:00
Nadav Har'El	a02db6f928	Merge 'doc: add the upgrade guide from 2021.1 to 2022.1' from Anna Stuchlik Fix https://github.com/scylladb/scylla-docs/issues/4040 Fix https://github.com/scylladb/scylla-docs/issues/4128 This PR adds the upgrade guides from ScyllaDB Enterprise 2021.1 to 2022.1. They are based on the previous guides. Closes #11036 * github.com:scylladb/scylla: doc: add the description of the new metrics in 2022.1 doc: remove the upgrade guide for Ubuntu 16.04 (no longer supported in version 2022.1) doc: remove the outdated warning Update docs/upgrade/_common/upgrade-guide-from-2021.1-to-2022.1-ubuntu-and-debian.rst Update docs/upgrade/_common/upgrade_to_2022_warning.rst doc: add a space on line 60 to fix the warning doc: document metric update for 2022.1 doc: add the upgrade guide from 2021.1 to 2022.1	2022-07-18 14:14:57 +03:00
Avi Kivity	94b6aebf01	Merge 'Remove dropped table directory' from Benny Halevy This series adds removal of dropped table directory when it has no remaining snapshots. There are 2 code paths that take care of that: 1. when the table is dropped and there are no active snapshots for it (typically when auto_snapshot is disabled). 2. or when the last snapshot is cleared, leaving no other snapshot for a dropped table. Unit tests were extended to cover these scenarios. Fixes #10896 Closes #11001 * github.com:scylladb/scylla: legacy_schema_migrator: simplify drop_legacy_tables database: clear_snapshot: remove dropped table directory when it has no remaining snapshots database: clear_snapshot: make it a coroutine and use thread database_test: add clear_multiple_snapshots test database: make drop_column_family private schema_tables: merge_tables_and_views: use drop_table_on_all_shards database_test: drop_table_with_snapshots: test auto_snapshot database_test: populate_from_quarantine_works: pass optional db:config to do_with_some_data database: drop_table_on_all_shards: remove table directory having no snapshots sstables: define table_subdirectories sstables: officially define pending_delete_dir database: add drop_table_on_all_shards	2022-07-18 13:41:35 +03:00
Benny Halevy	3f0402db68	legacy_schema_migrator: simplify drop_legacy_tables There is no need for utils::make_joinpoint now that the function calls replica::database::drop_table_on_all_shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-18 10:28:18 +03:00
Benny Halevy	bbbbea65fb	database: clear_snapshot: remove dropped table directory when it has no remaining snapshots Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	c70a675d77	database: clear_snapshot: make it a coroutine and use thread and use an async thread around `directory_lister` rather than `lister::scan_dir` to simplify the implementation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	e710fe527c	database_test: add clear_multiple_snapshots test Based on the `clear_snapshot` test. Test with multiple snapshots and different combinations of parameters to database::clear_snapshot. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	d7564b9081	database: make drop_column_family private Now that all users are converted to use the public entry point - drop_table_on_all. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	71aad45757	schema_tables: merge_tables_and_views: use drop_table_on_all_shards So that the dropped table's directory can be removed after it has been dropped on all shards if it has no snapshots. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	ae3b1b5a64	database_test: drop_table_with_snapshots: test auto_snapshot Refactor test_drop_table_with_auto_snapshot out of drop_table_with_snapshots, adding an auto_snapshot param, controlling how to configure the cql_test_env db::config::auto_snapshot, so we can test both cases - auto_snapshot enabled and disabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	af6805dd75	database_test: populate_from_quarantine_works: pass optional db:config to do_with_some_data Instead of just `tmpdir_for_data`, so we can easily set auto_snapshot for `drop_table_with_snapshots` in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	2e37dcf62a	database: drop_table_on_all_shards: remove table directory having no snapshots If the table to remove has no snapshots then completely remove its directory on storage as the left-over directory slows down operations on the keyspace and makes searching for live tables harder. Fixes #10896 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	dd481e9f58	sstables: define table_subdirectories Define a constexpr array of all official table sub-directories. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	c4a42c3a3f	sstables: officially define pending_delete_dir Rather than using the "pending_delete" string in `pending_delete_dir_basename()`, so it can be orderly removed in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Benny Halevy	e005629afb	database: add drop_table_on_all_shards Runs drop_column_family on all database shards. Will be extended later to consider removing the table directory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-17 14:33:34 +03:00
Raphael S. Carvalho	246e945086	compaction: remove forward declaration of replica::table compaction_manager.cc still cannot stop including replica/database.hh because upgrade and scrub still take replica::database as param, but I'll remove it soon in another series. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	a94d974835	compaction_manager: make add() and remove() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	31655acb5e	compaction_manager: make run_custom_job() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	9a1efc69d0	compaction_manager: major: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	cebe6e22cb	compaction_manager: scrub: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	d29f7070d9	compaction_manager: upgrade: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	c2678ca661	compaction: table_state: add get_sstables_manager() That will be needed for retrieving sstable manager in perform_sstable_upgrade(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	bdd049afd6	compaction_manager: cleanup: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	f547e0f2fb	compaction_manager: offstrategy: switch to table_state() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	538d412fba	compaction_manager: rewrite_sstables(): switch to table_state rewrite_sstables() is used by maintenance compactions that perform an operation on a single file at a time. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	79e385057f	compaction_manager: make run_with_compaction_disabled() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	79f91fe61e	compaction_manager: compaction_reenabler: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	7c1d178f4e	compaction_manager: make submit(T) switch to table_state Now that submit() switched to table_state, compaction_reenabler and friends can switch to table_state too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	a176022272	compaction_manager: task: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	43136a3ca7	compaction: table_state: Add is_auto_compaction_disabled_by_user() auto_compaction_disabled_by_user is a configuration that can be enabled or disabled on a particular table. We're adding this interface to avoid having to push the configuration for every compaction_state, which would result in redundant information as the configuration value is the same for all table states. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	1deeeff825	compaction: table_state: Add on_compaction_completion() The idea is that we'll have a single on-completion interface for both "in-strategy" and off-strategy compactions, so not to pollute table_state with one interface for each. replica::table::on_compaction_completion is being moved into private namespace. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	1520580212	compaction: table_state: Add make_sstable() compaction_manager needs this interface when setting the sstable creation lambda in compaction_descriptor, which is then forwarded into the actual compaction procedure. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	956c3997cb	compaction_manager: make can_proceed switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	7a9908dbf1	compaction_manager: make stop compaction procedures switch to table_state They're used to stop all ongoing compaction on behalf of a given table T. Today, each table has a single table_state representing it, but after we implement compaction groups, we'll need to call the procedure for each group in a table. But the discussion doesn't belong here, as compaction group work will only come later. For the time being, we're only making compaction manager fully switch to table_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	b6126395e1	compaction_manager: make get_compactions() switch to table_state The only external user of get_compactions() doesn't use any filtering, so after table_state switch, one will be allowed to get all jobs running associated with a table_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	309d73c584	compaction_manager: change task::update_history() to use table_state instead Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	598ede607f	compaction_manager: make can_register_compaction() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	61510af62a	compaction_manager: make get_candidates() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	b5417096e2	compaction_manager: make propagate_replacement() switch to table_state propagate_replacement is used by incremental compaction to notify ongoing compaction about sstable list updates, such that the ongoing job won't hold reference to exhausted sstables. So it needs to switch to table_state, too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	cb05142d58	compaction: Move table::in_strategy_sstables() and switch to table_state in_strategy_sstables() doesn't have to be implemented in table, as it's simply about main set with maintenance and staging files filtered out. Also, let's make it switch to table_state as part of ongoing work. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	23e21ed5bc	compaction: table_state: Add maintenance sstable set Needed for off-strategy compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	e4d9cdf284	compaction_manager: make has_table_ongoing_compaction() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	ff9e9524e6	compaction_manager: make compaction_disabled() switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	b47ed727c7	compaction_manager: switch to table_state for mapping of compaction_state manager stores a state for each table. As we're transitioning towards table_state, the mapping of a table to compaction state will now use table_state ptr as key. table_state ptr is stable and its lifetime is the same as table. we're temporarily adding a ptr to compaction_state, as there's lots of dependency on replica::table, but we'll get rid of it once we complete the transition. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	45a4f8d1fa	compaction_manager: move task ctor into source That's to be able to get table_state from table in subsequent patch, as table only has a forward declaration to it in compaction_manager.hh to avoid including database.hh. Once everything is moved to table_state, then ctor can be moved back into header. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	4bfcead2ba	compaction_manager: stop using infinite loop in run_offstrategy_compaction() we can have a better flow than infinite loop -> break for exit condition. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11045	2022-07-16 16:39:58 +03:00
Botond Dénes	b3227ee9b4	CODEOWNERS: add @psarna and @nyh as owners for docs/alternator Closes #11048	2022-07-16 11:39:04 +03:00
Aleksandra Martyniuk	7871989551	api: list of the user keyspaces contains only user keyspaces storage_service/keyspaces?type=user along with user keyspaces returned the keyspaces that were internal but non-system. The list of the keyspaces for the user option (storage_service/keyspaces?type=user) contains neither system nor internal but only user keyspaces. Fixes: #11042 Closes #11049	2022-07-15 20:42:30 +02:00
Michael Livshin	ca21ce8e6f	utils: logalloc: fix indentation Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	bcb7404a0e	utils: logalloc: split the reclaim_timer in compact_and_evict_locked() (Into one for the compact part and one for the evict part) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	007d8fb5c9	utils: logalloc: report segment stats if reclaim_segments() times out Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	1d700442ae	utils: logalloc: reclaim_timer: add optional extra log callback The idea is to let the caller add arbitrary extra info to the timeout report. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	abd7b9f01c	utils: logalloc: reclaim_timer: report non-decreasing durations The hope is that this reduces logspam without losing utility. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	07fdcb268e	utils: logalloc: have reclaim_timer print reserve limits Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	256b911fbd	utils: logalloc: move reclaim timer destructor for more readability Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	c15e384507	utils: logalloc: define a proper bundle type for reclaim_timer stats And define/use arithmetics on it. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	0eefbfa3cc	utils: logalloc: add arithmetic operations to segment_pool::stats Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Michael Livshin	3fced65542	utils: logalloc: have reclaim timers detect being nested Make sure that inner timers don't waste CPU measuring anything. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	76ca93b779	utils: logalloc: add more reclaim_timers Measure stalls at higher resolution. Refs #6189 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	42db63d012	utils: logalloc: move reclaim_timer to compact_and_evict_locked track compact_and_evict_locked timing from all call paths, not only from compact_and_evict. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	fd2b4a4b7d	utils: logalloc: pull reclaim_timer definition forward So it can be used in functions defined earlier in the source file in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	33785d261e	utils: logalloc: reclaim_timer make tracker optional Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	acd82d3b25	utils: logalloc: reclaim_timer: print backtrace if stall detected Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	239992f16c	utils: logalloc: reclaim_timer: get call site name Before adding even more call sites, print the call site name in the report. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	c4d64c3bf7	utils: logalloc: reclaim_timer: rename set_result Rename set_result to set_memory_released to make it clearer what the result means. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	5ce0038e6a	utils: logalloc: reclaim_timer: rename _reserve_segments member Rename reclaim_timer::_reserve_segments to _segments_to_release as it is clearer and more suitable for later patches that will add reclaim_timers in more functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Benny Halevy	c34d1a7705	utils: logalloc: reclaim_timer round up microseconds better report 29000 us than 28999 us. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-14 19:40:09 +03:00
Avi Kivity	1cb64de8d8	Merge 'Codeowners update' from Botond Dénes Remove some stale entries, add new entries for docs/. Closes #11046 * github.com:scylladb/scylla: CODEOWNERS: add owners for docs/ CODEOWNERS: remove @haaawk	2022-07-14 15:51:08 +03:00
Botond Dénes	a7a36c5189	CODEOWNERS: add owners for docs/ User documentation was recently migrated to scylla.git, and this is maintained by non scylla-core people. Add entries for docs/ so they are notified when somebody submits changes to docs/.	2022-07-14 15:43:04 +03:00
Botond Dénes	2af12bbaaa	CODEOWNERS: remove @haaawk He is no longer with the company.	2022-07-14 15:41:26 +03:00
Anna Stuchlik	68750b7612	doc: migrate the update about dropping tables from the scylla-docs repo Closes #11032	2022-07-14 14:08:54 +03:00
Anna Stuchlik	95725c04d1	doc: remove the info about Katacoda and the link to the nonexistent lab Closes #11031	2022-07-14 14:05:59 +03:00
Anna Stuchlik	3254378375	doc: add the description of the new metrics in 2022.1	2022-07-14 12:29:46 +02:00
Avi Kivity	e69e485396	Merge 'Preparatory work for compaction manager switch to table state' from Raphael "Raph" Carvalho These are cleanups needed for upcoming series that will make manager switch to table abstraction. Closes #11037 * github.com:scylladb/scylla: compaction_manager: remove unused variable in rewrite_sstable() table: remove ref from on_compaction_completion() signature table: use compaction_completion_desc to describe changes for off-strategy compaction_manager: rename table_state's get_sstable_set to main_sstable_set	2022-07-14 13:08:38 +03:00
Anna Stuchlik	e756bf5067	doc: remove the upgrade guide for Ubuntu 16.04 (no longer supported in version 2022.1)	2022-07-14 12:02:58 +02:00
Anna Stuchlik	68eae4d4e0	doc: remove the outdated warning	2022-07-14 11:55:15 +02:00
Petr Gusev	86299ad194	raft: server: fix comment for set_configuration follow-up to https://github.com/scylladb/scylla/pull/10905 as discussed in the comments. Closes #11035	2022-07-14 11:37:35 +02:00
Tomasz Grabiec	cfd785a02b	utils: memory_data_sink: Override mandatory buffer_size() The default implementation aborts. The class has bit rot because it was unused.	2022-07-14 11:56:20 +03:00
David Garcia	0ee5b50bac	doc: create migration redirections Update redirects Closes #11022	2022-07-14 11:53:41 +03:00
Anna Stuchlik	773be9fc02	Update docs/upgrade/_common/upgrade-guide-from-2021.1-to-2022.1-ubuntu-and-debian.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-07-14 10:24:27 +02:00
Anna Stuchlik	bca51c9bb9	Update docs/upgrade/_common/upgrade_to_2022_warning.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2022-07-14 09:17:51 +02:00
Benny Halevy	dc93564247	storage_proxy: abstract_read_resolver: swallow gate_closed exception Like other errors triggered on shutdown, this one is triggered by #8995. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11029	2022-07-14 09:26:34 +03:00
Avi Kivity	98aa3ec99b	Update seastar submodule * seastar 7d8d846b26...6d4a0cb7a3 (18): > io: Adjust IO latency goal on fair-queue level Fixes #10927 > coroutine: exception: deprecate return_exception(exception_ptr) > Merge "Make fair-queue class manipulations noexcept" from Pavel E > lowres_timers: Put timeout to infinity if no timers armed > util/conversion: support IEC prefix like "Ki" > util/conversion: use string_view instead of string > thread: fix backtrace termination for s390x on clang > *: add fmt::ostream_formatter<> so {fmt} can use operator<< > net: Remove operator<< for ipv4_addr > rpc-impl: Log "caught exception" when catching exception > rpc: Don't format non-trivial types with format specifier > sharded: use std::invoke() to call mapper function > Merge 'Avoid false-positive warnings in Gcc 12.1.1' from Nadav Har'El > tls_test: Remove dns bottle neck + improve read loop in google connect test > Revert "sstring: restore compatibility with std::string" > sstring: restore compatibility with std::string > tls_test: Make google https connect routine loop buffer reads > coroutine: add buffer support to async generator Closes #11033	2022-07-13 18:34:15 +03:00
Anna Stuchlik	ca8c5fd2c4	doc: add a space on line 60 to fix the warning	2022-07-13 16:37:54 +02:00
Raphael S. Carvalho	f6ab220c2a	compaction_manager: remove unused variable in rewrite_sstable() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-13 11:26:57 -03:00
Raphael S. Carvalho	d3d9b13d9d	table: remove ref from on_compaction_completion() signature Now update_sstable_lists_on_off_strategy_completion() and on_compaction_completion() can be called from the same unified interface. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-13 11:25:51 -03:00
Anna Stuchlik	ee1eb85ebc	doc: document metric update for 2022.1	2022-07-13 16:24:30 +02:00
Anna Stuchlik	b3566391e6	doc: add the upgrade guide from 2021.1 to 2022.1	2022-07-13 16:17:51 +02:00
Raphael S. Carvalho	ca58054485	table: use compaction_completion_desc to describe changes for off-strategy To make it possible to add a single interface in table_state for updating sstable list on behalf of both off-strategy and in-strategy compactions, update_sstable_lists_on_off_strategy_completion() will work with compaction_completion_desc too for describing sstable set changes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-13 11:16:19 -03:00
Raphael S. Carvalho	f52ad722f3	compaction_manager: rename table_state's get_sstable_set to main_sstable_set With compaction_manager switching to table_state, we'll need to introduce a method in table_state to return maintenance set. So better to have a descriptive name for main set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-13 11:12:33 -03:00
Raphael S. Carvalho	7d97e15c43	bytes_ostream: Avoid waste by rounding up allocation size to power-of-two - bytes_ostream has a default initial chunk size of 512. - let's say we call bytes_ostream::write() to write 500 bytes. - as next_alloc_size() takes into account space to hold chunk metadata (24 bytes) + chunk data, then 512 bytes is not enough, so it returns 500 + 24 instead to be allocated. - when allocating next chunk, next_alloc_size() will use the size of existing chunk, which is 500 bytes (without metadata) and multiply it by 2 (growth factor), so 1000 bytes is allocated for it. So allocations can be non power-of-two, resulting in memory waste. When seastar is allocating from small pools, the waste is not terrible (although accumulated small wastes can be problematic), but once allocations pass the large threshold (16k), then alignment is 4k (page size) and the waste is not negligible. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11027	2022-07-13 16:51:13 +03:00
Konstantin Osipov	a645bb3622	test.py: recursively search for pytests cql-pytest contains subdirectories with tests ported from Cassandra. It's desirable to preserve the same layout and file names for these tests as in the original source tree. To do that, add support for recursive search of tests to PythonTestSuite. The log files for the tests which are found recursively are created in subdirs of the test tmpdir. While implementing the feature, switch to using pathlib, since a) it supports rglob (recursive glob) and b) it was requested in one of the earlier reviews. Closes #11018	2022-07-13 14:59:29 +03:00
Nadav Har'El	eaf3579c15	test/alternator: several more simple tests for UpdateItem This patch adds several more tests for Alternator's UpdateItem operation. These tests verify a few simple cases that, surprisingly, never had test coverage. The new tests pass (on both DynamoDB and Alternator) so did not expose any bug. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11025	2022-07-12 21:48:33 +02:00
Avi Kivity	e1e4a73793	Merge 'Always register fully expired sstables for compaction' from Benny Halevy If the compaction_descriptor returned by `time_window_compaction_strategy::get_sstables_for_compaction` is marked with `has_only_fully_expired::yes` it should always be compacted since `time_window_compaction_strategy::get_sstables_for_compaction` is not idempotent. It sets `_last_expired_check` and if compaction is postponed and retried before `expired_sstable_check_frequency` has passed, it will not look for those fully-expired sstables again. Plus, compacting them is the cheapest possible as it does not require reading anything, just deleting the input sstables, so there's no reason to postpone it. Also, extend `max_ongoing_compaction_test` to test serialization of compaction jobs with the same weight. Fixes #10989 Closes #10990 * github.com:scylladb/scylla: compaction_manager: always register descriptor with fully expired sstables for compaction test: max_ongoing_compaction_test: test serialization of regular compaction with same weight test: max_ongoing_compaction_test: reindent refactored code test: max_ongoing_compaction_test: define compact_all_tables lambda test: max_ongoing_compaction_test: refactor make_table_with_single_fully_expired_sstable test: max_ongoing_compaction_test: reduce number of tables	2022-07-12 18:40:01 +03:00
Nadav Har'El	761ca88aa8	Merge 'Add test cases for granting and revoking data permissions' from Piotr Sarna This series adds the infrastructure needed for testing user permissions, like the ability to create temporary roles and CQL sessions which log in as different users, and a few initial test cases for granting and revoking permissions. Closes #10998 * github.com:scylladb/scylla: cql-pytest: add a case for granting/revoking data permissions cql-pytest: add new_user and new_session utils cql-pytest: speed up permissions refresh period for tests	2022-07-12 18:31:33 +03:00
Avi Kivity	ea4a907090	Merge 'mutation_compactor: remove emit only live rows parameter' from Botond Dénes Said parameter is a convenience so downstream consumers of the mutation compactors don't have to check the `bool is_live` already passed to them. This convenience however causes a template parameter and additional logic for the compactor. As the most prominent of these consumers (the query result builder) will soon have to switch to `emit_only_live_rows::no` for other reasons anyway (it will want to count tombstones), we take the opportunity to switch everybody to ::no. This can be done with very little additional complexity to these consumers -- basically an additional if or two. With everybody using the `::no` variant of the compactor, we can remove this template parameter and the logic associated with it altogether. Closes #10931 * github.com:scylladb/scylla: multishard_mutation_query: remove now pointless compact_for_result_state typedef mutation_compactor: remove only-live related logic mutation_compactor: remove emit_only_live_rows template parameter mutation_compactor: remove unused compact_mutation_state::parameters querier: remove {data,mutation}_querier aliases querier: remove now pointless emit_only_live_rows template parameter tree: use emit_only_live_rows::no querier: querier_cache: de-override insert() methods	2022-07-12 17:30:46 +03:00
Takuya ASADA	23973f9591	Support installing pip provided command symlinks to /usr/bin This is part of the support for installing executables from a PIP package. We now support installing an executable from a PIP package, but it is installed under /opt/scylladb/python3/bin. To call these commands without specifying the full path, we also need to install a symlink to /usr/bin. To do this, we need a new list which specifies the command names for the symlinks. Closes #10748	2022-07-12 17:26:05 +03:00
David Garcia	0c2a18af2d	doc: enable faster builds Closes #11023	2022-07-12 16:33:38 +03:00
Nadav Har'El	15ed0a441e	Merge 'scylla-gdb.py: assortment of task filtering improvements, scylla fiber going backwards' from Botond Dénes This series includes an assortment of loosely related improvements developed for a recent investigation. The changes include: * Fix broken `std_deque` wrapper. * Make `scylla smp-queues` fast. * Teach `scylla smp-queues` to filter for both sender CPU (`--from`) and receiver CPU (`--to`) or both. * Teach `scylla smp-queues` to make histogram over content of the queues -- i.e. the type of tasks in the smp queues. * Teach `scylla smp-queues` to filter for tasks belonging to a certain scheduling group. * Teach `scylla task_histogram` to include only tasks in the histogram. * Teach `scylla task_histogram` to filter for tasks belonging to a certain scheduling group. * Teach `scylla-fiber` to walk in both directions. And some refactoring. Fixes: https://github.com/scylladb/scylla/issues/7059 Closes #11019 * github.com:scylladb/scylla: docs/dev/debugging.md: update continuation chain traversal guide scylla-gdb.py: scylla fiber: walk continuation chain in both directions scylla-gdb.py: scylla fiber: allow passing analyzed pointers to _probe_pointer() scylla-gdb.py: scylla fiber: hoist preparatory code out of _walk() scylla-gdb.py: scylla task_histogram: add --scheduling-groups option scylla-gdb.py: scylla task_histogram: add --filter-tasks option scylla-gdb.py: scylla task_histogram: use histogram class scylla-gdb.py: scylla-fiber: extract symbol matching logic scylla-gdb.py: histogram: add limit feature scylla-gdb.py: histogram: handle formatting errors scylla-gdb.py: intrusive_slist: avoid infinite recursion in __len__() scylla-gdb.py: scylla smp-queues: add --scheduling-group option scylla-gdb.py: scylla smp-queues: add --content switch scylla-gdb.py: smp-queue: add filtering capability scylla-gdb.py: make scylla smp-queues fast scylla-gdb.py: fix disagreement between std_deque len() and iter()	2022-07-12 15:38:23 +03:00
Piotr Sarna	fcd8dfa694	cql-pytest: add a case for granting/revoking data permissions The test case checks if permissions set for a non-superuser user are enforced.	2022-07-12 13:44:21 +02:00
Benny Halevy	6332816ccf	compaction_manager: always register descriptor with fully expired sstables for compaction If the compaction_descriptor returned by time_window_compaction_strategy::get_sstables_for_compaction is marked with has_only_fully_expired::yes it should always be compacted since time_window_compaction_strategy::get_sstables_for_compaction is not idempotent. It sets _last_expired_check and if compaction is postponed and retried before expired_sstable_check_frequency has passed, it will not look for those fully-expired sstables again. Plus, compacting them is the cheapest possible as it does not require reading anything, just deleting the input sstables, so there's no reason to postpone it. Fixes #10989 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-12 12:04:04 +03:00
Benny Halevy	cfc7a5065a	test: max_ongoing_compaction_test: test serialization of regular compaction with same weight Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-12 12:04:03 +03:00
Benny Halevy	65a5e0a7bb	test: max_ongoing_compaction_test: reindent refactored code Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-12 12:03:32 +03:00
Benny Halevy	5212e81475	test: max_ongoing_compaction_test: define compact_all_tables lambda To test both expired and non-expired sstables scenarios we need to pass this helper function the expected number of sstables before compaction and after compaction. When compacting a set of fully-expired sstables, we expect none to remain, while when the set of sstables is not fully expired, we'll expect 1 output sstable after compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-12 12:00:27 +03:00
Benny Halevy	fe4a59372e	test: max_ongoing_compaction_test: refactor make_table_with_single_fully_expired_sstable So we can use the lower-level building blocks to test compaction serialization of both fully-expired and non-fully-expired sstables scenarios in the following patches. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-12 11:56:41 +03:00
Benny Halevy	d18fc6a7ed	test: max_ongoing_compaction_test: reduce number of tables There is no need to test 100 tables. 10 tables are enough, and the test completes faster. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-12 11:53:01 +03:00
Konstantin Osipov	4a2d645a0f	scylla-gdb: fix "scylla netw" command The command wasn't tested fully, and when tested, started failing scylla-gdb test. Before Raft, the list of connections to print in a single-node setup was always empty, so a mistake in the gdb script command 'scylla netw' didn't lead to a test failure. With raft, there is always an RPC connection to self after initial bootstrap, and the test begins to print connections (and fail, because there is a bug in the printing code). Fix that bug. Closes #11012	2022-07-12 10:10:40 +03:00
Botond Dénes	ac9935b645	multishard_mutation_query: remove now pointless compact_for_result_state typedef No need to switch on the now defunct emit_only_live_rows.	2022-07-12 08:44:33 +03:00
Botond Dénes	17509e9664	mutation_compactor: remove only-live related logic We removed the template parameter in the previous patch, now we can remove the logic related to it.	2022-07-12 08:44:32 +03:00
Botond Dénes	4d2ce5c304	mutation_compactor: remove emit_only_live_rows template parameter Now that we use emit_only_live_rows::no everywhere we can remove this template parameter. Only the template parameter is removed, the internal logic around it is left in place (will be removed in a next patch), by hard-wiring `only_live()`.	2022-07-12 08:43:49 +03:00
Botond Dénes	9ee8ef5930	mutation_compactor: remove unused compact_mutation_state::parameters	2022-07-12 08:41:51 +03:00
Botond Dénes	f912f5f373	querier: remove {data,mutation}_querier aliases They now both mean the same thing: querier.	2022-07-12 08:41:51 +03:00
Botond Dénes	c77fe427c5	querier: remove now pointless emit_only_live_rows template parameter	2022-07-12 08:41:51 +03:00
Botond Dénes	bedc82e52c	tree: use emit_only_live_rows::no emit_only_live_rows is a convenience so downstream consumers of the mutation compactors don't have to check the `bool is_live` already passed to them. This convenience however causes a template parameter and additional logic for the compactor. As the most prominent of these consumers (the query result builder) will soon have to switch to emit_only_live_rows::no for other reasons anyway (it will want to count tombstones), we take the opportunity to switch everybody to ::no. This can be done with very little additional complexity to these consumers -- basically an additional if or two. This prepares the ground for removing this template parameter and the associated logic from the compactor.	2022-07-12 08:41:51 +03:00
Botond Dénes	742dc10185	querier: querier_cache: de-override insert() methods Soon, the currently two distinct types of queriers will be merged, as the template parameter differentiating them will be gone. This will make using type based overload for insert() impossible, as 2 out of the 3 types will be the same. Use different names instead.	2022-07-12 08:41:48 +03:00
Botond Dénes	5c56125187	docs/dev/debugging.md: update continuation chain traversal guide `scylla fiber` is the way to traverse in both directions now.	2022-07-12 07:27:45 +03:00
Botond Dénes	89595f5b12	scylla-gdb.py: scylla fiber: walk continuation chain in both directions Parameterize _walk() with a method that does the actual walking. This is a trivial change as it was already delegating all the walking logic to _do_walk(). The latter is renamed to _walk_forward() and we add a new method called _walk_backward() which implements walking the continuation chain backwards (towards tasks waited on by the queried task). The starting task is now printed at index #0, tasks waited on by the starting task have negative indexes, tasks waiting on the starting task have positive indexes (like before). With this scylla fiber can be used to dump an entire fiber (barring any difficulties detecting following more special tasks like threads).	2022-07-12 07:08:54 +03:00
Botond Dénes	1d5547ae22	scylla-gdb.py: scylla fiber: allow passing analyzed pointers to _probe_pointer() A future caller will have pre-analyzed pointers to pass to said method, in which case we want to avoid re-running the expensive process.	2022-07-12 06:21:03 +03:00
Botond Dénes	b41493d165	scylla-gdb.py: scylla fiber: hoist preparatory code out of _walk() We soon want to teach _walk() to walk in both directions. In preparation to that, we extract all generic preparatory code that is related to the starting task and combining arguments. This now resides in invoke(), _walk() should only be concerned with traversing the continuation chain.	2022-07-12 06:14:44 +03:00
Nadav Har'El	f5ff687b64	Merge 'cql3: Reorganize expr::to_restriction' from Jan Ciołek This PR introduces improvements to `expr::to_restriction` and prepares the validation part for restriction classes removal. `expr::to_restriction` is currently used to take a restriction from the WHERE clause, prepare it, perform some validation checks and finally convert it to an instance of the restriction class. Soon we will get rid of the restriction class. In preparation for that `expr::to_restriction` is split into two independent parts: * The part that prepares and validates a binary_operator * The part that converts a binary_operator to restriction Thanks to this split getting rid of restriction class will be painless, we will just stop using the second part. `to_restriction.cc` is replaced by `restrictions.hh/cc`. In the future we can put all the restriction expressions code there to avoid clutter in `expression.hh/cc`. This change made it much easier to fix #10631, so I did that as well. Fixes: #10631 Closes #10979 * github.com:scylladb/scylla: cql-pytest: Test that IS NOT only accepts NULL cql-pytest: Enable testInvalidCollectionNonEQRelation cql3: Move single element IN restrictions handling cql3: Check for disallowed operators early cql3: Simplify adding restrictions cql3: Reorganize to_restriction code cql3: Fix IS NOT NULL check in to_restriction cql3: Swap order of arguments in error message	2022-07-12 00:26:34 +03:00
Avi Kivity	53e0dc7530	bytes_ostream: base on managed_bytes bytes_ostream is an incremental builder for a discontiguous byte container. managed_bytes is a non-incremental (size must be known up front) byte container, that is also compatible with LSA. So far, conversion between them involves copying. This is unfortunate, since query_result is generated as a bytes_ostream, but is later converted to managed_bytes (today, this is done in cql3::expr::get_non_pk_values() and compound_view_wrapper::explode()). If the two types could be made compatible, we could use managed_bytes_view instead of creating new objects and avoid a copy. It's also nicer to have one less vocabulary type. This patch makes bytes_ostream use managed_bytes' internal representation (blob_storage instead of bytes_ostream::chunk) and provides a conversion to managed_bytes. All bytes_ostream users are left in place, but the goal is to make bytes_ostream a write-only type with the only observer a conversion to managed_bytes. It turns out to be relatively simple. The internal representations were already similar. I made blob_storage::ref_type self-initializing to reduce churn (good practice anyway) and added a private constructor to managed_bytes for the conversion. Note that bytes_ostream can only be used to construct a non-LSA managed_bytes, but LSA uses of managed_bytes are very strictly controlled (the entry points to memtable and cache) so that's not a problem. A unit test is added. Closes #10986	2022-07-12 00:23:29 +03:00
Pavel Emelyanov	5526738794	view: Fix trace-state pointer use after move It's moved into .mutate_locally() but it is captured and used in its continuation. It works well just because moved-from pointer looks like nullptr and all the tracing code checks for it to be non-such. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/ (CI job failed on post-actions thus it's red) Fixes #11015 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220711134152.30346-1-xemul@scylladb.com>	2022-07-11 17:20:51 +03:00
Avi Kivity	34886ce1a1	Merge 'Allow regular compaction during major' from Benny Halevy After acquiring the _compaction_state write lock, select all sstables using get_candidates and register them as compacting, then unlock the _compaction_state lock to let regular compaction run in parallel. Also, run major compaction in maintenance scheduling group. We should separate the scheduling groups used for major compaction from the regular compaction scheduling group so that the latter can be affected by the backlog tracker in case backlog accumulates during a long running major compaction. Fixes #10961 Closes #10984 * github.com:scylladb/scylla: compaction_manager: major_compaction_task: run in maintenance scheduling groupt compaction_manager: allow regular compaction to run in parallel to major	2022-07-11 17:11:51 +03:00
Jan Ciolek	012f7d5b1a	cql-pytest: Test that IS NOT only accepts NULL The IS_NOT operator can only be used during materialized view creation and it can only be used to express IS NOT NULL. Trying to write something like IS NOT 42 should cause an error. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:16 +02:00
Jan Ciolek	22e605f823	cql-pytest: Enable testInvalidCollectionNonEQRelation The wrong error message has been fixed and now the test passes. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:16 +02:00
Jan Ciolek	38e115edf7	cql3: Move single element IN restrictions handling Restrictions like col IN (1) get converted to col = 1 as an optimization/simplification. This used to be done in prepare_binary_operator, but it fits way better inside of validate_and_prepare_new_restriction. When it was being done in prepare_binary_operator the conversion happened before validation checks and the error messages would describe an equality restriction despite the user making an IN restriction. Now the conversion happens after all validation is finished, which ensures that all checks are being done on the original expression. Fixes: #10631 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:16 +02:00
Jan Ciolek	cb504b2d6e	cql3: Check for disallowed operators early Move checking for disallowed operators earlier in the code flow. This is needed to pass some tests that expect one error message instead of the other. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:16 +02:00
Jan Ciolek	62155846bc	cql3: Simplify adding restrictions The code that adds restrictions in statement_restrictions.cc is unnecessarily convoluted. The code to handle IS NOT NULL is actually repeated twice, once in the constructor and once in add_is_not_restriction. I missed this when I originally modified this code. There is no need to keep duplicate code, we can just use the new add_is_not_restriction. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:16 +02:00
Jan Ciolek	debd7399fd	cql3: Reorganize to_restriction code expr::to_restriction is currently used to take a restriction from the WHERE clause, prepare it, perform some validation checks and finally convert it to an instance of the restriction class. Soon we will get rid of the restriction class. In preparation for that expr::to_restriction is split into two independent parts: * The part that prepares and validates a binary_operator * The part that converts a binary_operator to restriction Thanks to this split getting rid of restriction class will be painless, we will just stop using the second part. This commit splits expr::to_restriction into two functions: * validate_and_prepare_new_restriction * convert_to_restriction that handle each of those parts. All helper validation methods in the anonymous namespace are copied from the to_restriction.cc file. to_restriction.cc isn't the best filename for the new functionality, so it has been renamed to restrictions.hh/cc. In the future all the code regarding restrictions could be put there to reduce clutter in expression.hh/cc Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:16 +02:00
Jan Ciolek	5be574fe51	cql3: Fix IS NOT NULL check in to_restriction expr::to_restriction performs a check to see if the restriction is of form: `col IS NOT NULL` There is a mistake in this check. It uses is<null>(prepared_binop.rhs) to determine if the right hand side of binary operator is a null, but the binary operator is already prepared. During preparation expr::null is converted to expr::constant and that wouldn't be detected by this check. The check has been changed to check for null constant instead of expr::null. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:15 +02:00
Jan Ciolek	4142b27d85	cql3: Swap order of arguments in error message The error message displays two arguments in a specific order, but the tests actually expect them to be swapped. Swap the arguments to match the expected error messages in tests. It wasn't detected earlier because the check was never reached, but this will change soon in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-11 15:47:13 +02:00
Botond Dénes	c11875d025	scylla-gdb.py: scylla task_histogram: add --scheduling-groups option Causing the histogram to be made from the scheduling groups of the found tasks. Allows for finding out which scheduling group dominates in-memory tasks. This currently cannot be determined, scylla task-queues only includes ready tasks.	2022-07-11 16:39:49 +03:00
Botond Dénes	2c3c9563b6	scylla-gdb.py: scylla task_histogram: add --filter-tasks option Allowing to include only task objects in the histogram. Leads to histograms with less noise but might exclude potentially important items due to the filtering being inexact.	2022-07-11 16:35:09 +03:00
Botond Dénes	b3fd03f02d	scylla-gdb.py: scylla task_histogram: use histogram class Instead of open-coded alternative. Spares some lines of code and makes future patching easier.	2022-07-11 16:17:29 +03:00
Botond Dénes	46eeb874fc	scylla-gdb.py: scylla-fiber: extract symbol matching logic Into its own class. Soon there will be another user for it.	2022-07-11 16:17:29 +03:00
Botond Dénes	79d5a4cccb	scylla-gdb.py: histogram: add limit feature Limiting the number of printed lines to the top "limit" ones if provided.	2022-07-11 16:08:29 +03:00
Botond Dénes	a80f5f2ba4	scylla-gdb.py: histogram: handle formatting errors Both the content and the formatter method are caller-provided. Mistakes are easy to come by. Instead of aborting the entire operation, just print an error if an item fails to format.	2022-07-11 16:06:25 +03:00
Botond Dénes	753c1608dd	scylla-gdb.py: intrusive_slist: avoid infinite recursion in __len__() Said method currently uses a list() to iterate over all elements, determining the length. Passing `self` to `list()` will however make it call `len()` first, causing infinite recursion.	2022-07-11 15:53:01 +03:00
Botond Dénes	935f9f7ac4	scylla-gdb.py: scylla smp-queues: add --scheduling-group option Allowing for filtering for async work items belonging to a certain scheduling group.	2022-07-11 15:41:29 +03:00
Botond Dénes	fe0f4d4dc1	scylla-gdb.py: scylla smp-queues: add --content switch When present on the command line, the histogram is created over the content of the queues, rather than the number of items in them. It is possible to filter in combination with --content. In particular it can be used to see the content of a single queue when all three of `--to`, `--from` and `--content` are present on the command line.	2022-07-11 15:41:02 +03:00
Nadav Har'El	a504d120d0	Merge 'docs: migrate the docs from the scylla-docs repo' from Anna Stuchlik This PR migrates the ScyllaDB end-user documentation from the [scylla-docs](https://github.com/scylladb/scylla-docs/) repository, according to the [migration plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit?usp=sharing). All the files are added to the `docs` subfolder. This PR does not cover any content changes. How to test this PR: 1. Go to `scylla/docs`. 2. Run `make preview`. The docs should build without any warnings. 3. Open http://127.0.0.1:5500/ in your browser. You should see the documentation landing page: ![image](https://user-images.githubusercontent.com/37244380/177358869-af9f1b78-e528-4d0d-9479-cc69e25f3b67.png) Closes #10976 * github.com:scylladb/scylla: doc: fix errors -fix the indent in the conf.py file doc: fix the path to Alternator doc: fix errors - add Alternator to the toctree doc: fix errors- update the conf.py file doc: fix errors - remove the CNAME file doc: add the CNAME and robots files doc: move index and README from scylla-docs repo doc: move the documentation from the scylla-docs repo doc: remove the old index file	2022-07-11 15:26:06 +03:00
Botond Dénes	abc77d07d5	scylla-gdb.py: smp-queue: add filtering capability Allow filtering for the from/to cpu (or both). Useful when looking for queues going to a certain CPU.	2022-07-11 15:19:13 +03:00
Botond Dénes	73563d9800	scylla-gdb.py: make scylla smp-queues fast Currently scylla smp-queues has O(count(vobjects)) time complexity as it works by scanning all objects with a vptr and searching them for a pointer to one of the smp message queues. This is very inefficient and unnecessary. It is much better to just look at the queues themselves and sum up the number of items in them. This completes in 1-2 seconds on a core where the old algorithm didn't complete in 2h+.	2022-07-11 15:19:13 +03:00
Botond Dénes	966373fcfb	scylla-gdb.py: fix disagreement between std_deque len() and iter() std_deque implementation was broken, with __len__() and __iter__() disagreeing about the size of the container. Turns out both are wrong in certain situations. Fix the iteration logic and re-base both __len__() and __iter__() on the same node iteration code to prevent future disagreements.	2022-07-11 15:19:13 +03:00
Avi Kivity	957bf48eb2	Merge 'Don't throw exceptions on the replica side when handling single partition reads and writes' from Piotr Dulikowski This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions: 1. `try_catch` allows inspecting the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This is based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260. This function allows replacing the current `try..catch` chains which inspect the exception type and account it in the metrics. Example: ```c++ // Before try { std::rethrow_exception(eptr); } catch (std::runtime_error& ex) { // 1 } catch (...) { // 2 } // After if (auto* ex = try_catch<std::runtime_error>(eptr)) { // 1 } else { // 2 } ``` 2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself. Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write). 
Results from `perf_simple_query`: ``` Before (`719724e4df`): Writes: Normal: 127841.40 tps ( 56.2 allocs/op, 13.2 tasks/op, 50042 insns/op, 0 errors) Timeouts: 94770.81 tps ( 53.1 allocs/op, 5.1 tasks/op, 78678 insns/op, 1000000 errors) Reads: Normal: 138902.31 tps ( 65.1 allocs/op, 12.1 tasks/op, 43106 insns/op, 0 errors) Timeouts: 62447.01 tps ( 49.7 allocs/op, 12.1 tasks/op, 135984 insns/op, 936846 errors) After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f): Writes: Normal: 127359.12 tps ( 56.2 allocs/op, 13.2 tasks/op, 49782 insns/op, 0 errors) Timeouts: 163068.38 tps ( 52.1 allocs/op, 5.1 tasks/op, 40615 insns/op, 1000000 errors) Reads: Normal: 151221.15 tps ( 65.1 allocs/op, 12.1 tasks/op, 43028 insns/op, 0 errors) Timeouts: 192094.11 tps ( 41.2 allocs/op, 12.1 tasks/op, 33403 insns/op, 960604 errors) ``` Closes #10368 * github.com:scylladb/scylla: database: avoid rethrows when handling exceptions from commitlog database: convert throw_commitlog_add_error to use make_nested_exception_ptr utils: add make_nested_exception_ptr storage_proxy: don't rethrow when inspecting replica exceptions on write path database: don't rethrow rate_limit_exception storage_proxy: don't rethrow the exception in abstract_read_resolver::error utils/exceptions.cc: don't rethrow in is_timeout_exception utils/exceptions: add try_catch utils: add abi/eh_ia64.hh storage_proxy: don't rethrow exceptions from replicas when accounting read stats message: get rid of throws in send_message{,_timeout,_abortable} database/{query,query_mutations}: don't rethrow read semaphore exceptions	2022-07-11 14:01:41 +03:00
Anna Stuchlik	0bf99b25c9	doc: fix errors -fix the indent in the conf.py file	2022-07-11 12:31:59 +02:00
Anna Stuchlik	ae9ed315d1	doc: fix the path to Alternator	2022-07-11 12:31:59 +02:00
Anna Stuchlik	a1d9f0f0c8	doc: fix errors - add Alternator to the toctree	2022-07-11 12:31:30 +02:00
Anna Stuchlik	81949bbc7a	doc: fix errors- update the conf.py file	2022-07-11 12:18:47 +02:00
Anna Stuchlik	2e95bd0ed1	doc: fix errors - remove the CNAME file	2022-07-11 12:17:33 +02:00
Anna Stuchlik	7b5dfde56a	doc: add the CNAME and robots files	2022-07-11 12:16:53 +02:00
Anna Stuchlik	8d86dfa929	doc: move index and README from scylla-docs repo	2022-07-11 12:14:40 +02:00
Anna Stuchlik	6e97b83b60	doc: move the documentation from the scylla-docs repo	2022-07-11 12:14:02 +02:00
Anna Stuchlik	bb41457f73	doc: remove the old index file	2022-07-11 12:12:15 +02:00
Piotr Sarna	23acc2e848	cql-pytest: add new_user and new_session utils These helpers can be used to create a new user and connect to the cluster using custom credentials to log in.	2022-07-11 10:49:15 +02:00
Piotr Sarna	cf57b369e7	cql-pytest: speed up permissions refresh period for tests The default refresh period for permissions in both Scylla and Cassandra is 2 seconds, which is usually perfectly fine for production environments, but it introduces a significant delay in automatic test cases. The refresh period is hereby set to 100ms, which allows test_permissions.py cases to run in around 1s for Scylla instead of tens of seconds.	2022-07-11 10:30:01 +02:00
Nadav Har'El	cc69177dcc	config: fix printing of experimental feature list Recently we noticed a regression where with certain versions of the fmt library, SELECT value FROM system.config WHERE name = 'experimental_features' returns string numbers, like "5", instead of feature names like "raft". It turns out that the fmt library keeps changing its overload resolution order when there are several ways to print something. For enum_option<T> we happen to have two conflicting ways to print it: 1. We have an explicit operator<<. 2. We have an implicit converter to the type held by T. We were hoping that the operator<< always wins. But in fmt 8.1, there is special logic that if the type is convertible to an int, this is used before operator<<()! For experimental_features_t, the type held in it was an old-style enum, so it is indeed convertible to int. The solution I used in this patch is to replace the old-style enum in experimental_features_t by the newer and more recommended "enum class", which does not have an implicit conversion to int. I could have fixed it in other ways, but it wouldn't have been much prettier. For example, dropping the implicit converter would require us to change a bunch of switch() statements over enum_option (and not just experimental_features_t, but other types of enum_option). Going forward, all uses of enum_option should use "enum class", not "enum". tri_mode_restriction_t was already using an enum class, and now so does experimental_features_t. I changed the examples in the comments to also use "enum class" instead of enum. This patch also adds to the existing experimental_features test a check that the feature names are words that are not numbers. Fixes #11003. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11004	2022-07-11 09:17:30 +02:00
Nadav Har'El	4a4d9ec9c0	cql-pytest: remove "xfail" mark from two passing tests Fix two cql-pytest tests that have been "XPASS"ing (unexpectedly passing) by removing the "xfail" (expecting failure) mark from them: One test was for an issue that has already been fixed (refs #10081). The second test was a translated Cassandra test that should never have failed because it doesn't trigger the issue that supposedly failed it (that test sets a large value for a non-indexed column, so doesn't trigger the problem we have with large values in an indexed column). Closes #11006	2022-07-11 08:34:19 +03:00
Nadav Har'El	0a71151bc4	test/cql-pytest: avoid deprecation message When running test/cql-pytest, pytest prints one warning at the end: /home/nyh/scylla/test/cql-pytest/test_secondary_index.py:82: DeprecationWarning: ResultSet indexing support will be removed in 4.0. Consider using ResultSet.one() to get a single row. assert any([index_name in event.description for event in cql.execute(query, trace=True).get_query_trace().events]) So in this patch I do exactly what the warning recommends - use one(). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11002	2022-07-11 08:01:23 +03:00
Nadav Har'El	2581b54ea0	test/{alternator,redis}: stop using deprecated "distutils" package Python has deprecated the distutils package. In several places in the Alternator and Redis test suites, we used distutils.version to check if the library is new enough for running the test (and skip the test if it's too old). On new versions of Python, we started getting deprecation warnings such as: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives PEP 632 recommends using packaging.version instead of distutils.version, and indeed it works well. After applying this patch, Alternator and Redis test runs no longer end in silly deprecation warnings. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11007	2022-07-11 08:00:45 +03:00
Benny Halevy	7e2d2cf1c1	table: snapshot: coroutine::return_exception_ptr Otherwise, we lose the returned exception_ptr type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11000	2022-07-10 17:56:24 +03:00
Nadav Har'El	2437a42b64	Merge 'cql-pytest: add test_permissions.py' from Piotr Sarna This new test suite is expected to gather all kinds of permissions tests - granting, revoking, authorizing, and so on. Right now it contains a single minimal test which ensures that the default superuser can be granted applicable permissions, which they already have anyway. The test suite added in this pull request will also be useful when developing #10633 - permissions for UDF/UDA infrastructure. Closes #10991 * github.com:scylladb/scylla: cql-pytest: add initial permissions test suite cql-pytest: enable CassandraAuthorizer for Scylla and Cassandra	2022-07-10 09:30:10 +03:00
cvybhu	80dda2bb97	cql3: expr: Fix handling reversed types in limits() There was a bug which caused incorrect results of limits() for columns with reversed clustering order. Such columns have reversed_type as their type and this needs to be taken into account when comparing them. It was introduced in `6d943e6cd0`. This commit replaced uses of get_value_comparator with type_of. The difference between them is that get_value_comparator applied ->without_reversed() on the result type. Because the type was reversed, comparisons like 1 < 2 evaluated to false. This caused the test testIndexOnKeyWithReverseClustering to fail, but sadly it wasn't caught by CI because the CI itself has a bug that makes it skip some tests. The test passes now, although it has to be run manually to check that. Fixes: #10918 Signed-off-by: cvybhu <jan.ciolek@scylladb.com> Closes #10994	2022-07-10 09:24:06 +03:00
Nadav Har'El	a7fa29bceb	cross-tree: fix header file self-sufficiency Scylla's coding standard requires that each header is self-sufficient, i.e., it includes whatever other headers it needs - so it can be included without having to include any other header before it. We have a test for this, "ninja dev-headers", but it isn't run very frequently, and it turns out our code deviated from this requirement in a few places. This patch fixes those places, and after it "ninja dev-headers" succeeds again. Fixes #10995 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10997	2022-07-08 12:59:14 +03:00
Avi Kivity	3b20407f25	Merge 'db: Avoid memtable flush latency on schema merge' from Tomasz Grabiec Currently, applying schema mutations involves flushing all schema tables so that on restart commit log replay is performed on top of latest schema (for correctness). The downside is that schema merge is very sensitive to fdatasync latency. Flushing a single memtable involves many syncs, and we flush several of them. It was observed to take as long as 30 seconds on GCE disks under some conditions. This patch changes the schema merge to rely on a separate commit log to replay the mutations on restart. This way it doesn't have to wait for memtables to be flushed. It has to wait for the commitlog to be synced, but this cost is well amortized. We put the mutations into a separate commit log so that schema can be recovered before replaying user mutations. This is necessary because regular writes have a dependency on schema version, and replaying on top of latest schema satisfies all dependencies. Without this, we could get loss of writes if we replay a write which depends on the latest schema on top of old schema. Also, if we have a separate commit log for schema we can delay schema parsing for after the replay and avoid complexity of recognizing schema transactions in the log and invoking the schema merge logic. I reproduced bad behavior locally on my machine with a tired (high latency) SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement). Without load, it is 300ms vs 50ms. 
Fixes #8272 Fixes #8309 Fixes #1459 Closes #10333 * github.com:scylladb/scylla: config: Introduce force_schema_commit_log option config: Introduce unsafe_ignore_truncation_record db: Avoid memtable flush latency on schema merge db: Allow splitting initialization of system tables db: Flush system.scylla_local on change migration_manager: Do not drop system.IndexInfo on keyspace drop Introduce SCHEMA_COMMITLOG cluster feature frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations db/commitlog: Improve error messages in case of unknown column mapping db/commitlog: Fix error format string to print the version db: Introduce multi-table atomic apply()	2022-07-07 16:03:50 +03:00
Benny Halevy	acae3cc223	treewide: stop use of deprecated coroutine::make_exception Convert most use sites from `co_return coroutine::make_exception` to `co_await coroutine::return_exception{,_ptr}` where possible. In cases this is done in a catch clause, convert to `co_return coroutine::exception`, generating an exception_ptr if needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10972	2022-07-07 15:02:16 +03:00
Piotr Sarna	2e61e50e97	cql-pytest: add initial permissions test suite This new test suite is expected to gather all kinds of permissions tests - granting, revoking, authorizing, and so on. Right now it contains a single minimal test which ensures that the default superuser can be granted applicable permissions, which they already have anyway.	2022-07-07 13:45:26 +02:00
Piotr Sarna	1dc116f4dc	cql-pytest: enable CassandraAuthorizer for Scylla and Cassandra In order to be able to test permissions, an authorizer different than AllowAllAuthorizer (default) must be set. CassandraAuthorizer is thus enabled - it works on default user/password pair, so it doesn't introduce any regressions to the test suite.	2022-07-07 13:45:26 +02:00
Avi Kivity	bfc521ee9c	Merge "Activate compaction_throughput_mb_per_sec option" from Pavel E " The option controls the IO bandwidth of the compaction sched class. It's now set to be 16MB/s, but is unused. This set makes it 0 by default (which means unlimited), live-updateable and plugs it to the seastar sched group IO throttling. branch: https://github.com/xemul/scylla/tree/br-compaction-throttling-3 tests: unit(dev), v2: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1010/ , v2: manual config update " * 'br-compaction-throttling-3-a' of https://github.com/xemul/scylla: compaction_manager: Add compaction throughput limit updateable_value: Support dummy observing serialized_action: Allow being observer for updateable_value config: Tune the config option	2022-07-07 13:14:07 +03:00
Tomasz Grabiec	6622e3369a	config: Introduce force_schema_commit_log option	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	b8d20335a4	config: Introduce unsafe_ignore_truncation_record The node now refuses to boot if schema tables were truncated. This adds a config option to ignore truncation records as a workaround if user truncated them manually.	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	6b316f267f	db: Avoid memtable flush latency on schema merge Currently, applying schema mutations involves flushing all schema tables so that on restart commit log replay is performed on top of latest schema (for correctness). The downside is that schema merge is very sensitive to fdatasync latency. Flushing a single memtable involves many syncs, and we flush several of them. It was observed to take as long as 30 seconds on GCE disks under some conditions. This patch changes the schema merge to rely on a separate commit log to replay the mutations on restart. This way it doesn't have to wait for memtables to be flushed. It has to wait for the commitlog to be synced, but this cost is well amortized. We put the mutations into a separate commit log so that schema can be recovered before replaying user mutations. This is necessary because regular writes have a dependency on schema version, and replaying on top of latest schema satisfies all dependencies. Without this, we could get loss of writes if we replay a write which depends on the latest schema on top of old schema. Also, if we have a separate commit log for schema we can delay schema parsing for after the replay and avoid complexity of recognizing schema transactions in the log and invoking the schema merge logic. One complication with this change is that replay_position markers are commitlog-domain specific and cannot cross domains. They are recorded in various places which survive node restart: sstables are annotated with the maximum replay position, and they are present inside truncation records. The former annotation is used by "truncate" operation to drop sstables. To prevent old replay positions from being interpreted in the context in the new schema commitlog domain, the change refuses to boot if there are truncation records, and also prohibits truncation of schema tables. The boot sequence needs to know whether the cluster feature associated with this change was enabled on all nodes. 
Features are stored in system.scylla_local. Because we need to read it before initializing schema tables, the initialization of tables now has to be split into two phases. The first phase initializes all system tables except schema tables, and later we initialize schema tables, after reading stored cluster features. The commitlog domain is switched only when all nodes are upgraded, and only after a new node is restarted. This is so that we don't have to add risky code to deal with hot-switching of the commitlog domain. Cold switching is safer. This means that after upgrade there is a need for yet another rolling restart round. Fixes #8272 Fixes #8309 Fixes #1459	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	c5ad05c819	db: Allow splitting initialization of system tables We will need some system tables to be initialized earlier in the boot so that system.scylla_local can be read before schema tables are initialized.	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	9b3f96047f	db: Flush system.scylla_local on change So that it can be read before commit log replay. SCHEMA_COMMITLOG feature relies on that.	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	609bf1d547	migration_manager: Do not drop system.IndexInfo on keyspace drop It's not needed anymore because system.IndexInfo is a virtual table calculated from view info. The drop accesses a table which is outside the system_schema keyspace and so crosses the commit log domain. This will trigger an internal error from database::apply() on schema merge once the code switches to use the schema commit log and requires that all mutations which are part of the schema change belong to a single commit log domain. We could theoretically move system.IndexInfo to the schema commit log domain. It's not easy though because table initialization at boot needs to be split, and current functions for initialization work at keyspace granularity, not table granularity.	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	62df9f446c	Introduce SCHEMA_COMMITLOG cluster feature	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	6c112bf854	frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	4eb4689d8c	db/commitlog: Improve error messages in case of unknown column mapping Include the table id, and also add a debug-level log line with replay pos which is similar to the one logged when no error happens.	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	f62eb186b4	db/commitlog: Fix error format string to print the version It always printed {} instead.	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	6444d959dc	db: Introduce multi-table atomic apply() Will be used to apply schema mutations atomically.	2022-07-06 22:08:56 +02:00
Benny Halevy	e3f561db31	compaction_manager: major_compaction_task: run in maintenance scheduling group We should separate the scheduling groups used for major compaction from the regular compaction scheduling group so that the latter can be affected by the backlog tracker in case backlog accumulates during a long running major compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-06 18:18:45 +03:00
Benny Halevy	a9dc7b1841	compaction_manager: allow regular compaction to run in parallel to major After acquiring the _compaction_state write lock, select all sstables using get_candidates and register them as compacting, then unlock the _compaction_state lock to let regular compaction run in parallel. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-06 14:44:27 +03:00
Petr Gusev	6cdd5b9ff5	raft, set_configuration fix: don't use dummy entries A leader which ceases to be a leader as a result of an execute_modify_config cannot wait for a dummy record to be committed because io_fiber aborts current waiters as soon as it detects a loss of leadership. This commit excludes dummy entries from the configuration change procedure. A special promise is set on io_fiber when it gets a non-joint configuration, and set_configuration just waits for the corresponding future instead of a dummy record. Fixes: #10010 Closes #10905	2022-07-06 11:26:59 +02:00
Avi Kivity	419fe65259	Revert "Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki" This reverts commit `aa8f135f64`, reversing changes made to `9a88bc260c`. The patch causes hangs during flush. Also reverts parts of `411231da75` that impacted the unit test. Fixes #10897.	2022-07-06 12:19:02 +03:00
Asias He	a33c370f9a	gossip: Speed up wait for gossip settle In a large cluster, a node would receive frequent and periodic gossip application state updates like CACHE_HITRATES or VIEW_BACKLOG from peer nodes. Those states are not critical. They should not be counted for the _msg_processing counter which is used to decide if gossip is settled. This patch fixes the long settle on every restart issue reported by users. Refs #10337 Closes #10892	2022-07-06 11:26:32 +03:00
Pavel Emelyanov	b112a98318	compaction_manager: Add compaction throughput limit Re-use existing compaction_throughput_mb_per_sec option, push it down to compaction manager via config and update the underlying compaction sched class when the option is (live)updated. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-06 08:17:08 +03:00
Pavel Emelyanov	b86d11cf67	updateable_value: Support dummy observing An updateable_value() may come without source attached. One of the options how this can happen is if the value sits on a service config. It's a good option to make the config have some default initialization for the option, but in this case observe()ing an option by the service would step on null pointer dereference. Said that, if a value without source is tried to be observed -- assume that it's OK, but the value would never change, so a dummy observer is to be provided. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-06 08:17:08 +03:00
Pavel Emelyanov	dd0f60ef24	serialized_action: Allow being observer for updateable_value Live-updating an option may involve running some action when the option changes, not just getting its new value into somewhere. The action is nice to be run as serialized action to batch config updates. Said that, here's a sugar to write serialized_action _foo = [this] { return foo(); }; observer<> _o = option.observe(_foo.make_observer()); instead of serialized_action _foo = [this] { return foo(); }; observer<> _o = option.observe([this] { // waited with .join on stop (void)_foo.trigger(); }); Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-06 08:17:08 +03:00
Mikołaj Grzebieluch	ea3da23f3b	cql3: refactor of native types parsing Native types were parsed directly to data_type, where varchar and text were parsed to utf8_type. To get the name of the type there was a call to the data_type method thus getting the name of the varchar type returns "text". To fix this, added new nonterminal type_unreserved_keyword, which parses native types to their names. It replaced native_or_internal_type in unreserved_function_keyword. unreserved_function_keyword is also used to parse usernames, keyspace names, index names, column identifiers, service levels and role names, so this bug was repaired also in them. Fixes: #10642 Closes #10960	2022-07-05 18:09:17 +02:00
Piotr Dulikowski	b2504a707c	database: avoid rethrows when handling exceptions from commitlog The database::do_apply and database::apply_with_commitlog are now changed so that they don't rethrow exceptions returned from the commitlog.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	5264fdb3d0	database: convert throw_commitlog_add_error to use make_nested_exception_ptr Now, throw_commitlog_add_error is renamed to wrap_commitlog_add_error. Instead of wrapping the currently executing exception and rethrowing it, it takes an std::exception_ptr, wraps it and also returns std::exception_ptr.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	1b8aacfee1	utils: add make_nested_exception_ptr The utils::make_nested_exception_ptr function works similar to std::throw_with_nested, but instead of storing the currently thrown exception as the nested exception and then immediately throwing the new exception, it receives the nested exception as an std::exception_ptr and also returns an std::exception_ptr. If the standard library supports it, the function does not perform any throws. Otherwise the fallback logic performs two throws.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	2008db58c4	storage_proxy: don't rethrow when inspecting replica exceptions on write path Now, storage_proxy::send_to_live_endpoints doesn't rethrow exceptions received from the replica logic when inspecting them.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	eff462a0e7	database: don't rethrow rate_limit_exception Now, utils::try_catch is used to detect whether the write operation failed due to a rate limit exception.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	ffb95c4840	storage_proxy: don't rethrow the exception in abstract_read_resolver::error Now, the abstract_read_resolver::error uses the utils::try_catch utility to analyse the error received from replica instead of rethrowing it.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	969a2b4b47	utils/exceptions.cc: don't rethrow in is_timeout_exception Now, is_timeout_exception doesn't need to rethrow the exception in order to determine whether it's a timeout exception.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	18f43fa00e	utils/exceptions: add try_catch Introduces a utility function which allows obtaining a pointer to the exception data held behind an std::exception_ptr if the data matches the requested type. It can be used to implement manual but concise try..catch chains. The `try_catch` has the best performance when used with libstdc++ as it uses the stdlib specific functions for simulating a try..catch without having to actually throw. For other stdlibs, the implementation falls back to a throw surrounded by an actual try..catch.	2022-07-05 16:41:09 +02:00
Nadav Har'El	a0ffbf3291	test/cql-pytest: fix test that started failing after error message change Recently a change to Scylla's expression implementation changed the standard error message copied from Cassandra: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING In the special case where the filter is on the partition key, we changed the message to: Only EQ and IN relation are supported on the partition key (unless you use the token() function or allow filtering) We had a cql-pytest test translated from Cassandra's unit test that checked the old message, and started to fail. Unfortunately nobody noticed because a bug in test.py caused it to stop running these translated unit tests. So in this patch, we trivially fix the test to pass again. Instead of insisting on the old message, we check just for the string "allow filtering", in lowercase or uppercase. After this patch, the test passes as expected on both Scylla and Cassandra. Refs #10918 (this test failing is one of the failures reported there) Refs #10962 (test.py stopped running this test) Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10964	2022-07-05 15:24:20 +03:00
Takuya ASADA	ce87e15ecf	scylla_prepare: fix Exception when SET_NIC_AND_DISKS=no and SET_CLOCKSOURCE=yes We shouldn't call get_tune_mode() when NIC tuning is disabled. fixes #10412 Closes #10959	2022-07-05 14:52:52 +03:00
Takuya ASADA	7501465b7c	scylla_util.py: change debug log directory to /var/tmp/scylla The current debug log is a bit difficult to collect in CI: to find the debug log we must know which script caused the Exception, because the filename does not include a prefix and the specified directory is shared with other programs. To make things easier, let's change the debug log directory to /var/tmp/scylla. Closes #10730	2022-07-05 14:49:00 +03:00
Avi Kivity	74b02b9719	Merge 'storage_service: track restore_replica_count' from Benny Halevy This mini-series adds an _async_gate to storage_service that is closed on stop() and it performs restore_replica_count under this gate so it can be orderly waited on in stop() Fixes #10672 Closes #10922 * github.com:scylladb/scylla: storage_service: handle_state_removing: restore_replica_count under _async_gate storage_service: add async_gate for background work	2022-07-05 13:18:59 +03:00
Michał Chojnowski	152eff249c	dbuild: fix --security-opt syntax A recent change added `--security-opt label:disable` to the docker options. There are examples of this syntax on the web, but podman and docker manuals don't mention it and it doesn't work on my machine. Fix it into `--security-opt label=disable`, as described by the manuals. Closes #10965	2022-07-05 10:50:31 +03:00
Piotr Dulikowski	c1ac116eb9	utils: add abi/eh_ia64.hh Adds a header for utility functions/structures, based on the Itanium ABI for C++, necessary for us to inspect exceptions behind std::exception_ptr without having to actually rethrow the exception.	2022-07-04 19:27:06 +02:00
Piotr Dulikowski	491cc2a8df	storage_proxy: don't rethrow exceptions from replicas when accounting read stats Now, make_{data,mutation_data,digest}_requests don't rethrow the exception received from replicas when increasing the error count metric.	2022-07-04 19:27:06 +02:00
Piotr Dulikowski	da571ed93b	message: get rid of throws in send_message{,_timeout,_abortable} Now, those functions neither rethrow existing exceptions nor throw new ones.	2022-07-04 19:27:06 +02:00
Piotr Dulikowski	902f1b7cfe	database/{query,query_mutations}: don't rethrow read semaphore exceptions Now, read semaphore exceptions are propagated from database::query and database::query_mutations without rethrowing them.	2022-07-04 19:26:02 +02:00
Avi Kivity	33fe28b0c5	Merge 'commitlog allocation/deletion/flush request rate counters + footprint projection' from Calle Wilund Adds measuring the apparent delta vector of footprint added/removed within the timer time slice, and potentially includes this (if influx is greater than data removed) in threshold calculation. The idea is to anticipate crossing usage threshold within a time slice, so request a flush slightly earlier, hoping this will give all involved more time to do their disk work. Obviously, this is very akin to just adjusting the threshold downwards, but the slight difference is that we take actual transaction rate vs. segment free rate into account, not just static footprint. Note: this is a very simplistic version of this anticipation scheme, we just use the "raw" delta for the timer slice. A more sophisticated approach would perhaps do either a lowpass filtered rate (adjust over longer time), or a regression or whatnot. But again, the default period of 10s is something of an eternity, so maybe that is superfluous... Closes #10651 * github.com:scylladb/scylla: commitlog: Add (internal) measurement of byte rates add/release/flush-req commitlog: Add counters for # bytes released/flush requested commitlog: Keep track of last flush high position to avoid double request commitlog: Fix counter descriptor language	2022-07-04 16:26:17 +03:00
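The footprint projection described in the merge above might look roughly like the sketch below. It assumes byte counters sampled once per timer slice; the function and parameter names are hypothetical and the comparison used by the real commitlog code may differ.

```python
def should_request_flush(footprint: int, added: int, released: int,
                         threshold: int) -> bool:
    """Hypothetical sketch of the anticipation scheme.

    footprint: bytes currently held on disk
    added/released: bytes added and released during the last timer slice
    threshold: usage level at which a flush is requested
    """
    delta = added - released
    # Only project forward when the influx exceeds what was released;
    # otherwise fall back to the static footprint.
    projected = footprint + delta if delta > 0 else footprint
    return projected >= threshold
```

With a footprint of 90, 20 bytes added and 5 released against a threshold of 100, the projected footprint is 105, so the flush is requested one slice earlier than a static check would ask for it.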
Botond Dénes	553538392e	Merge "Improve shutdown logging" from Pavel Emelyanov " On stop there's a rather long log-less gap in the middle of storage_service::drain_on_shutdown(). This set adds log in interesting places and while at it tosses the patched code. refs: #10941 " * 'br-shutdown-logging' of https://github.com/xemul/scylla: batchlog_manager: Add drain and stop logging batchlog_manager: Coroutinize drain and stop batchlog_manager: Drain it with shared future commitlog: Add shutdown message database: Move flushing logging compaction_manager: Add logging around drain compaction_manager: Coroutinize drain storage_service: Sanitize stop_transport()	2022-07-04 13:50:16 +03:00
Pavel Emelyanov	5a4e15f65d	Add .mailmap Google group started replacing sender email with the group email recently. Here's the list of spoiled entries combined from seastar and scylla repos Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220701160252.11967-1-xemul@scylladb.com>	2022-07-04 13:44:28 +03:00
Pavel Emelyanov	98ff779676	batchlog_manager: Add drain and stop logging Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-04 13:42:46 +03:00
Pavel Emelyanov	e2007cd317	batchlog_manager: Coroutinize drain and stop This is not identical change, if drain() resolves with exception we end up skipping the gate closing, but since it's stop why bother Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-04 13:42:46 +03:00
Pavel Emelyanov	8a03683671	batchlog_manager: Drain it with shared future The .drain() method can be called from several places, each needs to wait for its completion. Now this is achieved with the help of a gate, but there's a simpler way Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-04 13:42:45 +03:00
Pavel Emelyanov	2e1ec36efd	commitlog: Add shutdown message It happens in database::drain(), we know when it starts after keyspaces are flushed, now it's good to know when it completes Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-04 13:42:45 +03:00
Pavel Emelyanov	ea820e13b3	database: Move flushing logging Now it happens before calling database::drain() but drain is not only flushing it does lots of other things. More elaborated logging is better Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-04 13:42:45 +03:00
Avi Kivity	f949612620	Update tools/python3 submodule (install pip executables) * tools/python3 3471634...e48dcc2 (1): > Support installing executables from PIP package	2022-07-04 13:02:51 +03:00
Nadav Har'El	09d0ca7c75	test/cql-pytest: make run-cassandra work on new systems... with several Java versions The test/cql-pytest/run-cassandra script runs our cql-pytest tests against Cassandra. Today, Cassandra can only run correctly on Java 8 or 11 (see https://issues.apache.org/jira/browse/CASSANDRA-16895) but recent Linux distributions have switched to newer versions of Java - e.g., on my Fedora 36 installation, the default "java" is Java 17. Which can't run Cassandra. So what I do in this patch is to check if "java" has the right version, and if it doesn't, it looks at several additional locations if it can find a Java of the right version. By the way, we are sure that Java 8 must be installed because our install-dependencies.sh installs it. After this patch, test/cql-pytest/run-cassandra resumes working on Fedora 36. Fixes #10946 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10947	2022-07-04 12:00:46 +02:00
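The version check described in the commit above can be sketched as a pure function over the text printed by `java -version`. This is an illustration only, not the run-cassandra script itself: the function names and the candidate paths in the usage note are hypothetical.

```python
import re
from typing import Mapping, Optional

# Per the commit message, Cassandra runs correctly only on Java 8 or 11.
SUPPORTED_MAJORS = {8, 11}


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if m is None:
        return None
    first = int(m.group(1))
    # Legacy numbering: "1.8.0_332" means Java 8.
    if first == 1 and m.group(2) is not None:
        return int(m.group(2))
    return first


def pick_java(candidates: Mapping[str, str]) -> Optional[str]:
    """Return the first path whose version output is a supported Java.

    candidates maps an executable path to its `java -version` output.
    """
    for path, output in candidates.items():
        if java_major_version(output) in SUPPORTED_MAJORS:
            return path
    return None
```

For example, given a default `java` reporting `"17.0.3"` and a second candidate reporting `"1.8.0_332"`, `pick_java` skips the first and returns the second path.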
Aleksandra Martyniuk	bf34589fc1	cql3: create tokens out of null values properly The method responsible for creating a token of given values is not meant to be used with empty optionals. Thus, requesting a token of columns containing null values resulted in an exception being thrown. This behaviour was not compatible with the one applied in Cassandra. To fix this, before the computation of a token, it is checked whether any null value is contained. If any value in the processed vector is null, a null value is returned. Fixes: #10594 Closes #10942	2022-07-04 10:42:23 +02:00
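The null handling described in the commit above can be sketched as below. This is a simplified model, not Scylla's implementation: `token_of` is a hypothetical name, and SHA-256 is used purely as a stand-in hash rather than the partitioner's real hash function.

```python
import hashlib
from typing import Optional, Sequence


def token_of(values: Sequence[Optional[bytes]]) -> Optional[int]:
    """Return a token for the given values, or None if any value is null."""
    # Check for nulls before computing anything, instead of letting the
    # hashing step fail on an empty optional.
    if any(v is None for v in values):
        return None
    # Stand-in hash for illustration only.
    digest = hashlib.sha256(b"\x00".join(values)).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```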
Avi Kivity	719724e4df	scripts: pull_github_pr.sh: support recovering from a failed cherry-pick If a single-patch pull request fails cherry-picking, it's still possible to recover it (if it's a simple conflict). Give the maintainer the option by opening a subshell and instructing them to either complete the cherry-pick or abort it. Closes #10949	2022-07-04 09:26:45 +03:00
Avi Kivity	2eedb9bf7c	Update seastar submodule * seastar 9c016aeebf...7d8d846b26 (16): > Merge 'coroutine: exception: retain exception_ptr type' from Benny Halevy > core: log in on_internal_error even when throwing > sched_group: Report the sched group that exceeded the limit Fixes #8226. > Add .mailmap > prometheus: make the help string optional > core: lw_shared_ptr: allow defining `lw_shared_ptr<T>` class member without knowing the definition of `T` > ci: build and test in debug and dev modes > Merge 'Added summaries, remove empty, and aggregation to Prometheus' from Amnon Heiman > Merge 'net/tls: vec_push: call on_internal_error if _output_pending already failed' from Benny Halevy Fixes #10127 > Merge 'CI: build and test with both gcc and clang ' from Beni Peled > Merge "Initialize lowres_clock::_now earlier" from Pavel E Ref #10743 > reactor: don't count make_exception_future etc. in cpp_exceptions metric > file: Deprecate file lifetime hint calls > foreign_ptr: fix doc. > cmake: fix mention of FindLibUring.cmake in install target > semaphore: derive named_semaphore_aborted exception from semaphore_aborted Fixes #10666. Closes #10951	2022-07-04 00:24:47 +03:00
Avi Kivity	973d2a58d0	Merge 'docs: move docs to docs/dev folder' from David Garcia In order to allow our Scylla OSS customers the ability to select a version for their documentation, we are migrating the Scylla docs content to the Scylla OSS repository. This PR covers the following points of the [Migration Plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit#): 1. Creates a subdirectory for dev docs: /docs/dev 2. Moves the existing dev doc content in the scylla repo to /docs/dev, but keep Alternator docs in /docs. 3. Flattens the structure in /docs/dev (remove the subfolders). 4. Adds redirects from `scylla.docs.scylladb.com/<version>/<document>` to `https://github.com/scylladb/scylla/blob/master/docs/dev/<document>.md` 5. Excludes publishing docs for /docs/devs. 1. Enter the docs folder with `cd docs`. 2. Run `make redirects`. 3. Enter the docs folder and run `make preview`. The docs should build without warnings. 4. Open http://127.0.0.1:5500 in your browser. You should only see the alternator docs. 5. Open http://127.0.0.1:5500/stable/design-notes/IDL.html in your browser. It should redirect you to https://github.com/scylladb/scylla/blob/master/docs/dev/IDL.md and raise a 404 error since this PR is not merged yet. 6. Surf the `docs/dev` folder. It should have all the scylla project internal docs without subdirectories. Closes #10873 * github.com:scylladb/scylla: Update docs/conf.py Update docs/dev/protocols.md Update docs/dev/README.md Update docs/dev/README.md Update docs/conf.py Fix broken links Remove source folder Add redirections Move dev docs to docs/dev	2022-07-03 20:37:11 +03:00
Wojciech Mitros	bfa3c0e734	test: move codes of UDFs compiled to WASM to test/resource After compiling to WASM, UDFs become much larger than the source code. When they're included in test_wasm.py, it becomes difficult to navigate in the file. Moving them to another place does not make understanding the test scripts harder, because the source code is still included. This problem will become even more severe when testing UDFs using WASI. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #10934	2022-07-03 17:37:21 +03:00
Nadav Har'El	5f98b81cb3	dbuild: disable selinux instead of relabeling By default, Docker uses SELinux to prevent malicious code in the container from "escaping" and touching files outside the container: The container is only allowed to touch files with a special SELinux label, which the outside files simply do not have. However, this means that if you want to "mount" outside files into the container, Docker needs to add the special label to them. This is why one needs to use the ":z" option when mounting an outside file inside docker - it asks docker to "relabel" the directory to be usable in Docker. But this relabeling process is slow and potentially harmful if done to large directories such as your home directory, where you may theoretically have SELinux labels for other reasons. The relabeling is also unnecessary - we don't really need the SELinux protection in dbuild. Dbuild was meant to provide a common toolchain - it was never meant to protect the build host from a malicious build script. The alternative we use in this patch is "--security-opt label=disable". This allows the container to access any file in the host filesystem, but as usual - only if it's explicitly "mounted" into the container. All ":z" we added in the past can be removed. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10945	2022-07-03 16:20:07 +03:00
Avi Kivity	c290976185	Merge 'cql3: Remove some restrictions classes' from Jan Ciołek This PR removes some restrictions classes and replaces them with expression. * `single_column_restriction` has been removed altogether. * `partition_key_restrictions` field inside `statement_restrictions` has been replaced with `expression` `clustering_key_restrictions` are not replaced yet, but this PR already has 30 commits so it's probably better to merge this before adding any more changes. Luckily most of these commits are implementations of small helper functions. `single_column_restriction` was pretty easy to remove. This class holds the `expression` that describes the restriction and `column_definition` of the restricted column. It inherits from `restriction` - the base class of all restrictions. I wasn't able to replace it with plain `expression` just yet, because a lot of times a `shared_ptr<single_column_restriction>` is being cast to `shared_ptr<restriction>`. Instead I replaced all instances of `single_column_restriction` with `restriction`. To decide if a `restriction` is a `single_column_restriction` we can use a helper method that works on expressions. Same with acquiring the restricted `column_definition`. This change has two advantages: * One less restriction class -> moving towards 0 * Preparing towards one generic `restriction/expression` type and using functions to distinguish the type of expression that we're dealing with. `partition_key_restrictions` is a class used to keep restrictions on the partition key inside `statement_restrictions`. Removing it required two major steps. First I had to implement taking all the binary operators and making sure that they are valid together. Before the change this was the `merge_to` method. It ensures that for example there are no token and regular restrictions occurring at the same time. This has been implemented as `statement_restrictions::add_restriction`. 
It detects which case it's dealing with and mimics `merge_to` from the right restrictions class. Then I implemented all methods of `partition_key_restrictions` but operating on plain `expressions`. While doing that I was able to gradually shift the responsibility to the brand new functions. Finally `partition_key_restrictions` wasn't used anywhere at all and I was able to remove it. Here's the inheritance tree of all restriction classes for context: ![image](https://user-images.githubusercontent.com/36861778/176141470-f96f6189-e650-44c2-9648-2a840b4c89c0.png) For now this is marked as a draft. I just put all this together in a readable way and wanted to put it out for you to see. I will have another look at the code and maybe do some improvements. Closes #10910 * github.com:scylladb/scylla: cql3: Remove _new from _new_partition_key_restrictions cql3: Remove _partition_key_restrictions from statement_restrictions cql3: Use expression for index restrictions cql3: expr: Add contains_multi_column_restriction cql3: Add expr::value_for cql3: Use the new restrictions map in another place cql3: use the new map in get_single_column_partition_key_restrictions cql3: Keep single column restrictions map inside statement restrictions cql3: Use expression instead of _partition_key_restrictions in the remaining code cql3: Replace partition_key_restrictions->has_supporting_index() cql3: Replace statement_restrictions->get_column_defs() cql3: Replace partition_key_restrictions->needs_filtering() cql3: Replace partition_key_restrictions->size() cql3: Replace partition_key_restrictions->is_all_eq() cql3: Replace parition_key_restriction->has_unrestricted_components() cql3: Replace parition_key_restrictions->empty() cql3: Keep restrictions as expressions inside statement_restrictions cql3: Handle single value INs inside prepare_binary_operator cql3: Add get_columns_in_commons cql3: expr: Add is_empty_restriction cql3: Replicate column sorting functionality using expressions cql3: 
Remove single_column_restriction class cql3: Replace uses of single_column_restriction with restriction cql3: expr: Add get_the_only_column cql3: expr: Add is_single_column_restriction cql3: expr: Add for_each_expression cql3: Remove some unsued methods	2022-07-03 16:11:25 +03:00
Jan Ciolek	f37094959d	cql3: Remove _new from _new_partition_key_restrictions _new_partition_key_restrictions was a temporary name used during the transition from restrictions to expressions. Now that restrictions aren't used anymore it can be changed back to _partition_key_restrictions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	854ffd7bd8	cql3: Remove _partition_key_restrictions from statement_restrictions Now that all functionality of partition_key_restrictions has been implemented using expressions we can remove this field from statement_restrictions. _new_partition_key_restrictions will be used for everything instead. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	76bf75a9d3	cql3: Use expression for index restrictions Restrictions that might be used by an index are currently being kept as shared_ptr<restrictions>. This stands in the way of replacing _partition_key_restrictions with an expression as an expression can't be cast to shared_ptr<restriction>. Change shared_ptr<restriction> to expression everywhere where necessary in index operations. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	83f27fc8c1	cql3: expr: Add contains_multi_column_restriction Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	fd0798c8a2	cql3: Add expr::value_for value_for is a method from the restriction class which finds the value for a given column. Under the hood it makes use of possible_lhs_values. It will be needed to implement some functionality that was implemented using restrictions before. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	8c8a03aad1	cql3: Use the new restrictions map in another place Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	4026042cbc	cql3: use the new map in get_single_column_partition_key_restrictions Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:11 +02:00
Jan Ciolek	3916bf1168	cql3: Keep single column restrictions map inside statement restrictions Some parts of the code make use of a map keeping single column restrictions for each partition key column. One of these places is inside do_filter, so it could be a performance problem to create such a map from scratch each time. After adding all restrictions from the where clause the new map is created and can be used for various purposes. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:10 +02:00
Jan Ciolek	1339ff1c79	cql3: Use expression instead of _partition_key_restrictions in the remaining code There are still some places that use partition_key_restrictions instead of _new_partition_key_restrictions in statement_restrictions. Change them to use the new representation Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:10 +02:00
Jan Ciolek	0bb49e423a	cql3: Replace partition_key_restrictions->has_supporting_index() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:10 +02:00
Jan Ciolek	418ed0b802	cql3: Replace statement_restrictions->get_column_defs() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:10 +02:00
Jan Ciolek	103f8e3d05	cql3: Replace partition_key_restrictions->needs_filtering() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:10 +02:00
Jan Ciolek	de99c0a0fa	cql3: Replace partition_key_restrictions->size() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:10 +02:00
Jan Ciolek	16e94aaa91	cql3: Replace partition_key_restrictions->is_all_eq() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
Jan Ciolek	4d376eb84f	cql3: Replace parition_key_restriction->has_unrestricted_components() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
Jan Ciolek	7f620cfa29	cql3: Replace parition_key_restrictions->empty() To remove partition_key_restrictions all of its methods have to be implemented using the new expression representation. The first to go is empty() as it's easy to implement. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
Jan Ciolek	e6e502e6e8	cql3: Keep restrictions as expressions inside statement_restrictions Currently restrictions on partition, clustering and nonprimary columns are kept inside special purpose restriction objects. We want to remove all the restrictions classes so these objects will be removed as well. In the future each of these restrictions will be kept in an expression. Add new fields to statement_restrictions class which will keep the right restrictions. Currently restrictions from where clause are added one by one using merge_to method of the restrictions class. This functionality will be replaced by statement_restrictions::add_restriction. Functions for adding restrictions perform validation and add new restrictions to the right field inside the class. The checks that are done in add_*_restriction methods correspond to the checks performed by merge_to in respective restriction classes. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
Jan Ciolek	9b6b1f69aa	cql3: Handle single value INs inside prepare_binary_operator Currently expr::to_restriction is the only place where prepare_binary_operator is called. In case of a single-value IN restriction like: mycol IN (1) this expression is converted to mycol = 1 by expr::to_restriction. Once restriction is removed expr::to_restriction will be removed as well so its functionality has to be moved somewhere else. Move handling single value INs inside prepare_binary_operator. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
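The single-value IN rewrite described in the commit above can be sketched on a toy representation. This is an illustration only: the `(column, operator, rhs)` tuple and the function name are hypothetical and stand in for the real expression types handled by prepare_binary_operator.

```python
def simplify_single_value_in(column, op, rhs):
    """Rewrite `col IN (x)` as `col = x`; leave every other restriction as is.

    The restriction is modelled as a hypothetical (column, operator, rhs)
    tuple, where rhs is a list for IN and a plain value otherwise.
    """
    if op == "IN" and isinstance(rhs, (list, tuple)) and len(rhs) == 1:
        return (column, "=", rhs[0])
    return (column, op, rhs)
```

So `mycol IN (1)` becomes `mycol = 1`, while `mycol IN (1, 2)` is returned unchanged.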
Jan Ciolek	24b0a61d51	cql3: Add get_columns_in_commons Add a function that finds common columns between two expressions. It's used in error messages in the original restrictions code so it must be included in the new code as well for compatibility. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
Jan Ciolek	177ba9b9db	cql3: expr: Add is_empty_restriction Add a function to check whether an expression restricts anything at all. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:29:09 +02:00
Jan Ciolek	228b344d9c	cql3: Replicate column sorting functionality using expressions Restrictions code keeps restrictions for each column in a map sorted by their position in the schema. Then there are methods that allow to access the restricted column in the correct order. To replicate this in upcoming code we need functions that implement this functionality. The original comparator can be found in: cql3/restrictions/single_column_restrictions.hh For primary key columns this comparator compares their positions in the schema. For non-primary columns the position is assumed to be clustering_key_size(), which seems pretty random. To avoid passing the schema to the comparator for nonprimary columns I just assume the position is u32::max(). This seems to be as good of a choice as clustering_key_size(). Originally Cassandra used -1: `bc8a260471/src/java/org/apache/cassandra/config/ColumnDefinition.java (L79-L86)` We never end up comparing columns of different kinds using this comparator anyway. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 16:28:41 +02:00
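The ordering rule described in the commit above can be sketched as a sort key. This is a simplified model under stated assumptions: the `Column` dataclass and function names are hypothetical, and only the rule itself (schema position for primary key columns, the u32 maximum for everything else) comes from the commit message.

```python
from dataclasses import dataclass
from typing import List

U32_MAX = 2**32 - 1  # position assumed for non-primary-key columns


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column definition."""
    name: str
    is_primary_key: bool
    position: int = 0  # position within the schema, for primary key columns


def schema_position(col: Column) -> int:
    """Sort key: real position for primary key columns, U32_MAX otherwise."""
    return col.position if col.is_primary_key else U32_MAX


def sort_columns(cols: List[Column]) -> List[Column]:
    """Order columns the way the restrictions map did, without the schema."""
    return sorted(cols, key=schema_position)
```

Using a constant for non-primary columns means the comparator never needs the schema, which is the point the commit makes about avoiding clustering_key_size().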
Pavel Emelyanov	af026e423e	compaction_manager: Add logging around drain Now we know when it starts and whe^w if it finishes Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-01 17:17:53 +03:00
Pavel Emelyanov	a9d6e5cfb6	compaction_manager: Coroutinize drain It's short enough to fix indentation right at once Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-01 17:17:53 +03:00
Pavel Emelyanov	b5c4553a66	storage_service: Sanitize stop_transport() It generates ignored future that can be avoided if using forwarding to shared_future<>'s promise Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-01 17:17:53 +03:00
Jan Ciolek	e37ddd5b89	cql3: Remove single_column_restriction class Now that all uses of this class have been replaced by the generic restriction the class is not used anywhere and can be removed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 15:53:19 +02:00
Jan Ciolek	3e3d2f939c	cql3: Replace uses of single_column_restriction with restriction single_column_restriction is a class used to represent restrictions in a single column. The class is very simple - it's basically an expression with some additional information. As a step towards removing all restriction classes all uses of this class are replaced by uses of the generic restriction class. All functionality of this class has been implemented using free standing functions operating on expressions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 15:52:10 +02:00
Jan Ciolek	afc482f0a5	cql3: expr: Add get_the_only_column Add a function that gets the only column from a single column restriction expression. The code would be very similar to is_single_column_restriction, so a new function is introduced to reduce duplication. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 15:50:48 +02:00
Jan Ciolek	9c3b0299a1	cql3: expr: Add is_single_column_restriction Add a function that checks whether an expression contains restrictions on exactly one column. This is a "single_column_restriction" in the same way that instances of "class single_column_restriction" are. It will be used to distinguish cases once this class is removed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-07-01 15:49:37 +02:00
Wojciech Mitros	4fd289a78c	wasm: fix freeing in wasm UDFs using WASI To call a UDF that is using WASI, we need to properly configure the wasmtime instance that it will be called on. The configuration was missing from udf_cache::load(), so we add it here. The free function does not return any value, so we should use a calling method that does not expect any returns. This patch adds such a method and uses it. A test that did not pass without this fix and does pass after is added. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #10935	2022-07-01 07:57:45 +02:00
Piotr Sarna	42f51b2f7b	Merge 'alternator: use position-in-partition in paging... cookie only when reading CQL tables' from Botond Dénes Recently, we added full position-in-partition support to alternator's paging cookie, so it can support stopping at arbitrary positions. This support however is only really needed when tables have range tombstones and alternator tables never have them. So to avoid having to make the new fields in 'ExclusiveStartKey' reserved, we avoid filling these in when reading an alternator table, as in this case it is safe to assume the position is `after_key($clustring_key)`. We do include these new members however when reading CQL tables through alternator. As this is only supported for system tables, we can also be sure that the elaborate names we used for these fields are enough to avoid naming clashes. Fixes: https://github.com/scylladb/scylla/issues/10903 Closes #10920 * github.com:scylladb/scylla: alternator: use position-in-partition in paging cookie only when reading CQL tables alternator: make is_alternator_keyspace() a standalone method	2022-06-30 20:24:40 +02:00
Nadav Har'El	29a0a2d694	test/scylla-gdb: detect Scylla compiled without debugging information test/scylla-gdb tests Scylla's gdb debugging tools, and cannot work if Scylla was compiled without debug information (i.e., the "dev" build mode). In the past, test/scylla-gdb/run detected this case and printed a clear error: Scylla executable was compiled without debugging information (-g) so cannot be used to test gdb. Please set SCYLLA environment variable. Unfortunately, since recently this detection fails, because even when Scylla is compiled without debug information we link into it a library (libwasmtime.a) which has some debug information. As a result, instead of one clear error message, we get all scylla-gdb tests running - and each of them failing separately. This is ugly and unhelpful. Each of the tests fails because our "gdb" test fixture tries to load scylla-gdb.py and fails when the symbols it needs (e.g., "size_t") cannot be found. So in this patch, we check once for the existence of this symbol - and if missing we exit pytest instead of failing each individual test. Moreover, if loading scylla-gdb.py fails for some other unexpected reason, let's exit the test as well, instead of failing each individual test. Fixes #10863. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10937	2022-06-30 20:22:50 +02:00
David Garcia	fc59ebc2c3	Update docs/conf.py	2022-06-30 19:12:08 +01:00
David Garcia	a1b990075e	Update docs/dev/protocols.md	2022-06-30 19:11:09 +01:00
David Garcia	d555febdd3	Update docs/dev/README.md Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>	2022-06-30 19:10:41 +01:00
David Garcia	350abdd72d	Update docs/dev/README.md Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>	2022-06-30 19:10:33 +01:00
David Garcia	66a77f74d2	Update docs/conf.py Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>	2022-06-30 19:10:23 +01:00
Botond Dénes	d80256f4dd	dirty_memory_manager: move db ctor out-of-line To facilitate further patching.	2022-06-30 17:26:18 +03:00
Botond Dénes	3a19412237	Merge 'Various improvements to perf_row_cache_upgrade test' from Tomasz Grabiec Closes #10930 * github.com:scylladb/scylla: test: perf_row_cache_update: Flush std output after each line test: perf_row_cache_update: Drain background cleaner before starting the test test: perf_row_cache_update: Measure memtable filling time test: perf_row_cache_update: Respect preemption when applying mutations test: perf_row_cache_update: Drop unused pk variable	2022-06-30 15:57:29 +03:00
Botond Dénes	2b6eeadc07	alternator: use position-in-partition in paging cookie only when reading CQL tables Recently, we added full position-in-partition support to alternator's paging cookie, so it can support stopping at arbitrary positions. This support however is only really needed when tables have range tombstones and alternator tables never have them. So to avoid having to make the new fields in 'ExclusiveStartKey' reserved, we avoid filling these in when reading an alternator table, as in this case it is safe to assume the position is `after_key($clustring_key)`. We do include these new members however when reading CQL tables through alternator. As this is only supported for system tables, we can also be sure that the elaborate names we used for these fields are enough to avoid naming clashes. The condition in the code implementing this is actually even more general: it only includes the region/weight members when the position differs from that of a normal alternator one.	2022-06-30 15:10:30 +03:00
Botond Dénes	52058ea974	alternator: make is_alternator_keyspace() a standalone method	2022-06-30 14:18:29 +03:00
Jan Ciolek	cb3b179945	cql3: expr: Add for_each_expression for_each_expression is a function that can be used to iterate over all expressions inside an expression recursively and perform some operation on each of them. For example: for_each_expression<column_value>(e, [](const column_value& cval) {std::cout << cval << '\n';}); Will print all column values in an expression. It's awkward to do this using recurse_until or find_in_expression because these functions are meant for slightly different purposes. Having a dedicated function for this purpose will make the code cleaner and easier to understand. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-06-30 10:03:53 +02:00
Pavel Emelyanov	868c3be01f	config: Tune the config option The option is used, but is not implemented. If attaching implementation to it right at once the compaction will slow down to 16MB/s on all nodes. Make it zero (unbound) by default and mark live-updateable while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-30 09:55:52 +03:00
Tomasz Grabiec	8f3349b407	test: lib: flat_mutation_reader_assertion: Add trace-level logging of read fragments Message-Id: <20220629153926.137824-1-tgrabiec@scylladb.com>	2022-06-30 08:43:30 +03:00
Tomasz Grabiec	6d753bd01c	gdb: Make robust in case there is no global storage_proxy or database instance Some unit tests don't initialize these. Message-Id: <20220629152743.134296-1-tgrabiec@scylladb.com>	2022-06-30 08:41:57 +03:00
Nadav Har'El	8024da10f0	test/cql-pytest: avoid leaving behind temporary files Before this patch, the test cql-pytest/test_tools.py left behind a temporary file in /tmp. It used pytest's "tmp_path_factory" feature, but it doesn't remove temporary files it creates. This patch removes the temporary file when the fixture using it ends, but moreover, it puts the temporary file not in /tmp but rather next to Scylla's data directory. That directory will be eventually removed entirely, so even if we accidentally leave a file there, it will eventually be deleted. Fixes #10924 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10929	2022-06-30 07:35:55 +03:00
Tomasz Grabiec	a6aef60b93	memtable: Fix missing range tombstones during reads under certain rare conditions There is a bug introduced in `e74c3c8` (4.6.0) which makes memtable reader skip a range tombstone for a certain pattern of deletions and under a certain sequence of events. _rt_stream contains the result of deoverlapping range tombstones which had the same position, which were sipped from all the versions. The result of deoverlapping may produce a range tombstone which starts later, at the same position as a more recent tombstone which has not been sipped from the partition version yet. If we consume the old range tombstone from _rt_stream and then refresh the iterators, the refresh will skip over the newer tombstone. The fix is to drop the logic which drains _rt_stream so that _rt_stream is always merged with partition versions. For the problem to trigger, there have to be multiple MVCC versions (at least 2) which contain deletions of the following form: [a, c] @ t0 [a, b) @ t1, [b, d] @ t2 c > b The proper sequence for such versions is (assuming d > c): [a, b) @ t1, [b, d] @ t2 Due to the bug, the reader will produce: [a, b) @ t1, [b, c] @ t0 The reader also needs to be preempted right before processing [b, d] @ t2 and iterators need to get invalidated so that lsa_partition_reader::do_refresh_state() is called and it skips over [b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it does emit the proper range tombstone, it's possible that it will violate fragment order in the stream if _rt_stream accumulated remainders (possible with 3 MVCC versions). The problem goes away once MVCC versions merge. Fixes #10913 Fixes #10830 Closes #10914	2022-06-29 19:02:23 +03:00
Tomasz Grabiec	3a222c1db3	test: perf_row_cache_update: Flush std output after each line	2022-06-29 17:36:02 +02:00
Tomasz Grabiec	daf0f041be	test: perf_row_cache_update: Drain background cleaner before starting the test	2022-06-29 17:36:02 +02:00
Tomasz Grabiec	5d4bd5d6d5	test: perf_row_cache_update: Measure memtable filling time	2022-06-29 17:36:02 +02:00
Tomasz Grabiec	9d9bf8c196	test: perf_row_cache_update: Respect preemption when applying mutations Otherwise, once preemption is signalled, memtable::apply() will keep creating MVCC snapshots, which will slow the test down.	2022-06-29 17:36:02 +02:00
Tomasz Grabiec	46a5a606c4	test: perf_row_cache_update: Drop unused pk variable	2022-06-29 17:36:02 +02:00
Pavel Emelyanov	85033ea6ae	Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0. They are mostly refactors which don't affect the behavior of the system, except one: the commit `4d439a16b3` causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because: 1. eventually, we want this to be the default behavior 2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary. Closes #10864 * github.com:scylladb/scylla: service/raft: raft_group_registry: add assertions when fetching servers for groups service/raft: raft_group_registry: remove `_raft_support_listener` service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map service/raft: raft_group0: move group 0 RPC handlers from `storage_service` service/raft: messaging: extract raft_addr/inet_addr conversion functions service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster` treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls test/boost: memtable_test: perform schema operations on shard 0 test/boost: cdc_test: remove test_cdc_across_shards message: rename `send_message_abortable` to `send_message_cancellable` message: change parameter order in `send_message_oneway_timeout`	2022-06-29 16:51:54 +03:00
Pavel Emelyanov	b60f2c220b	test_tools: Do not create type if it exists There effectively are several test-cases in this test, each calls the scylla_sstable() to prepare, thus each creates a type in the same scylla instance. The 2nd attempt ends up with the "already exists" error: E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="A user type of name cql_test_1656396925652.type1 already exists" tests: unit(dev) https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1075/ fixes: #10872 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220628081459.12791-1-xemul@scylladb.com>	2022-06-29 14:31:57 +03:00
Calle Wilund	aab7794c31	commitlog_test: Change timeout handling to do abort() Refs #10805 To help debug spurious failures, ensure to do abort() for debugger/core ease. Closes #10843.	2022-06-29 13:26:51 +03:00
Nadav Har'El	fc0243a43a	Merge 'Implement a number of improvements in test.py' from Konstantin Osipov A number of improvements in test.py as requested by maintainers: * don't capture pytest output * stick to the specific server in control connections * support --log-level option and pass it to logging module * when checking if CQL is up, ignore timeout errors * no longer force schema migration when starting the server * use test uname, not id, in log output * improve logging of ScyllaServer * log what cluster is used for a test * extend xml output with logs On the same token, remove mypy warnings and make linter pass on test.py, as well as add some type checking. Fixes #10871 Fixes #10785 Closes #10902 * github.com:scylladb/scylla: test.py: extend xml output with logs test.py: log what cluster is used for a test test.py: improve logging of ScyllaServer test.py: use test uname, not id, in log output test.py: support --log-level option and pass it to logging module test.py: make ScyllaServer more reliable and fast test.py: don't capture pytest output test.py: add type annotations test.py: convert log_filename to pathlib test.py: please linter test.py: remove mypy warnings	2022-06-29 13:06:07 +03:00
Pavel Emelyanov	3a753068be	Merge "Make permissions cache live updateable and add an API for resetting authorization cache" from Igor Ribeiro Barbosa Duarte Currently, for users who have permissions_cache configs set to very high values (and thus can't wait for the configured times to pass) having to restart the service every time they make a change related to permissions or prepared_statements cache (e.g. Adding a user and changing their permissions) can become pretty annoying. This patch series make permissions_validity_in_ms, permissions_update_interval_in_ms and permissions_cache_max_entries live updateable so that restarting the service is not necessary anymore for these cases. It also adds an API for flushing the cache to make it easier for users who don't want to modify their permissions_cache config. branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/ dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache * https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable: loading_cache_test: Test loading_cache::reset and loading_cache::update_config api: Add API for resetting authorization cache authorization_cache: Make permissions cache and authorized prepared statements cache live updateable auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct utils/loading_cache.hh: Add update_config method utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh utils/loading_cache.hh: Add reset method	2022-06-29 11:14:13 +03:00
Benny Halevy	cb0b728ed1	storage_service: handle_state_removing: restore_replica_count under _async_gate Track the background restore_replica_count fiber so it can be awaited in stop() by closing the _async_gate. Fixes #10672 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-29 10:51:26 +03:00
Benny Halevy	1b1c02b243	storage_service: add async_gate for background work To be used for tracking restore_replica_count and waiting for it on stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-29 10:47:49 +03:00
Igor Ribeiro Barbosa Duarte	8cc2de5fe0	loading_cache_test: Test loading_cache::reset and loading_cache::update_config Validate that the size of the cache is zero after calling the reset method and that the config is being updated correctly after calling update_config. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:58:06 -03:00
Igor Ribeiro Barbosa Duarte	a23c3d6338	api: Add API for resetting authorization cache For cases where we have very high values set to permissions_cache validity and update interval (E.g.: 1 day), whenever a change to permissions is made it's necessary to update scylla config and decrease these values, since waiting for all this time to pass wouldn't be viable. This patch adds an API for resetting the authorization cache so that changing the config won't be mandatory for these cases. Usage: $ curl -X POST http://localhost:10000/authorization_cache/reset Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:58:06 -03:00
Igor Ribeiro Barbosa Duarte	b9051c79bc	authorization_cache: Make permissions cache and authorized prepared statements cache live updateable Currently, for users who have permissions_cache configs set to very high values (and thus can't wait for the configured times to pass), having to restart the service every time they make a change related to the permissions or prepared_statements cache (e.g. adding a user) can become pretty annoying. This patch makes permissions_validity_in_ms, permissions_update_interval_in_ms and permissions_cache_max_entries live updateable so that restarting the service is no longer necessary in these cases. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:58:06 -03:00
Igor Ribeiro Barbosa Duarte	c8c48a98fa	auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct This patch makes authorized_prepared_statements_cache accept a config struct, similarly to permissions_cache. This will make it easier to make this cache live updateable in the next patch. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:57:52 -03:00
Igor Ribeiro Barbosa Duarte	d02cd5e8bc	utils/loading_cache.hh: Add update_config method This patch adds an update_config method in order to allow live updating the config for permissions_cache. This method is going to be used in the next patches after making permissions_cache config live updateable. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:46:58 -03:00
Igor Ribeiro Barbosa Duarte	667840a7eb	utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh This patch renames the permissions_cache_config struct to loading_cache_config and moves it to utils/loading_cache.hh. This will make it easier to handle config updates to the authorization caches on the next patches Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:46:22 -03:00
Nadav Har'El	630959bb77	Merge 'test.py async and schema changes' from Alecco Change tests to use async mode and add helpers and tests for schema changes. These test series will be expanded with topology changes. Closes #10550 * github.com:scylladb/scylla: test.py topology: repro for issue #1207 test.py: port fixture fails_without_raft test.py topology: table methods to add/remove index test.py topology: add/drop table column helpers test.py topology: insert sequential row test.py: remove deprecated test test_null test.py: managed random tables test.py: test_keyspace fixture async test.py: rename fixture test_keyspace to keyspace test.py topology: test with asyncio	2022-06-28 23:25:18 +03:00
Avi Kivity	37780c6521	Merge 'test: perf: allow testing timeouts in perf_simple_query' from Piotr Dulikowski This PR adds necessary modifications to perf_simple_query so that it can be used to test performance of the timeout handling path. With an appropriate combination of flags, it is possible to consistently trigger timeouts on every operation. The following flags are added: - `--stop-on-error` - if true (which is the default), the test stops after encountering the first exception and reports it; otherwise it causes errors to be counted and reported at the end. - `--timeout <x>` - allows to use `USE TIMEOUT <x>` in the benchmark query/statement. - `--bypass-cache` - uses `BYPASS CACHE` in the benchmark query (relevant only to reads). Examples: ``` ./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write 131023.65 tps ( 56.2 allocs/op, 13.2 tasks/op, 49784 insns/op, 0 errors) ./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write --stop-on-error=false --timeout=0s 97163.73 tps ( 53.1 allocs/op, 5.1 tasks/op, 78687 insns/op, 1000000 errors) ./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 154060.36 tps ( 63.1 allocs/op, 12.1 tasks/op, 42998 insns/op, 0 errors) ./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --stop-on-error=false --flush --bypass-cache --timeout=0s 30127.43 tps ( 48.2 allocs/op, 14.3 tasks/op, 312416 insns/op, 1000000 errors) ``` Refs: #2363 Closes #10899 * github.com:scylladb/scylla: test: perf: add bypass cache argument test: perf: add timeout argument test: perf: count errors and report the count in results test: perf: add stop-on-error argument test: perf: coroutinize run_worker() test: perf: fix crash on exception in time_parallel_ex	2022-06-28 19:28:22 +03:00
Tomasz Grabiec	3bb147ae95	db: mutation_cleaner: Enqueue new snapshots at the back This fixes a quadratic behavior in case lots of snapshots with range tombstones are queued for merging. Before the change, new snapshots were inserted at the front, which is also where the worker looks. Merging a version has a linear component in its complexity function which depends on the number of range tombstones. If we merge snapshots starting from the latest to the oldest, then the whole process becomes quadratic because the version which is merged accumulates an increasing amount of tombstones, ones which were already merged before. We should instead merge starting from the oldest snapshots; this way each tombstone is applied exactly once during merge. This bug got worse after `4bd4aa2e88`, which makes merging tombstones more expensive. Closes #10916	2022-06-28 18:29:29 +03:00
Nadav Har'El	a8b02f7965	test: set sanitizer options in run scripts When the run scripts for tests of cql-pytest, alternator, redis, etc., run Scylla, they should set the UBSAN_OPTIONS and ASAN_OPTIONS so that if the executable is built with sanitizers enabled, it will ignore false positives that we know about, and fail on real errors. The change in this patch affects all test/*/run scripts which use this shared Scylla-starting code. test.py already had the same settings, and it affected the tests that it knows to run directly (unit tests, cql-pytest, etc.). Fixes #10904 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10915	2022-06-28 18:24:48 +03:00
Konstantin Osipov	d5d748ae86	test.py: extend xml output with logs Add test and server logs, as well as the unidiff, to XML output. This makes jenkins reports nicer. While on it, debug & fix bugs in handling of flaky tests: - the reset would reset a flaky test even after the last attempt fails, so it would be impossible to see what happened to it - the args needed to be reset as well, since execution modifies them - we would say that we're going to retry the flaky test when in fact it was the last attempt to run it and no more retries were planned	2022-06-28 18:22:01 +03:00
Konstantin Osipov	22d30abed8	test.py: log what cluster is used for a test	2022-06-28 18:22:01 +03:00
Konstantin Osipov	dbdfac7a0f	test.py: improve logging of ScyllaServer	2022-06-28 18:22:01 +03:00
Konstantin Osipov	c4bee2860b	test.py: use test uname, not id, in log output Clarify "Test was cancelled" error message.	2022-06-28 18:22:01 +03:00
Konstantin Osipov	b502dd3c4e	test.py: support --log-level option and pass it to logging module scylla-python driver and scylla_server.py can be more verbose at higher log levels, allow specifying the log level from the command line.	2022-06-28 18:22:01 +03:00
Konstantin Osipov	ad7423649f	test.py: make ScyllaServer more reliable and fast 1) Stick to the specific server in control connections. It could happen that, when starting a cluster and checking if a specific node is up, the check would actually execute against an already running node. Prevent this from happening by setting a white list connection balancing policy for control connections. 2) When checking if CQL is up, ignore timeout errors Scylla in debug mode can easily time out on a DDL query, and the timeout error at start up would lead to the entire cluster marked as broken. This is too harsh, allow timeouts at start. 3) No longer force schema migration when starting the server By default, Raft is on, so the nodes are getting schema through Raft leader. Schema migration significantly slows down cluster start in debug mode (60 seconds -> 100 seconds), and even though it was a great test that helped discover several bugs in Scylla, it shouldn't be part of normal cluster boot, so disable it.	2022-06-28 18:21:25 +03:00
Konstantin Osipov	867d5b4eda	test.py: don't capture pytest output Let print()s inside pytest tests go into pytest logs. This simplifies debugging, especially if someone is not familiar with pytest.	2022-06-28 17:46:47 +03:00
Konstantin Osipov	fd3d08e560	test.py: add type annotations Add type annotations where possible.	2022-06-28 17:46:47 +03:00
Konstantin Osipov	2470b1d888	test.py: convert log_filename to pathlib	2022-06-28 17:46:47 +03:00
Konstantin Osipov	20070e2d89	test.py: please linter Rename the static method to not collide with a member variable with the same name.	2022-06-28 17:46:46 +03:00
Konstantin Osipov	fc232099bb	test.py: remove mypy warnings	2022-06-28 17:46:46 +03:00
David Garcia	b85843b9cc	Fix broken links	2022-06-28 15:19:36 +01:00
David Garcia	b87537c767	Remove source folder	2022-06-28 15:07:35 +01:00
Alejo Sanchez	c478a53d9c	test.py topology: repro for issue #1207 Repro for bug in concurrent schema changes for many tables and indexing involved. Do alter tables by doing in parallel new table creation, alter a table (_alter), and index other tables (_index). Original repro had sets of 20 of those and slept for 20 seconds to settle. This repro does it for Scylla with just 1 set and 1 second. This issue goes away once Raft is enabled. https://github.com/scylladb/scylla/issues/1207 Originally at https://issues.apache.org/jira/browse/CASSANDRA-10250 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	4228bfef84	test.py: port fixture fails_without_raft Port fails_without_raft to higher level conftest file for future use in topology pytests. While there, make it async. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	e2cc35b768	test.py topology: table methods to add/remove index Add helper methods to add and drop indexes on a given column. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	d80857e26e	test.py topology: add/drop table column helpers Helper to add/drop a specified or random column. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	e8e6a8e85a	test.py topology: insert sequential row For each table keep a counter and insert rows with sequential values generated correspondingly by each column's type. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	0624be6d58	test.py: remove deprecated test test_null With test_schema there's no need for test_null. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	ed140f98d8	test.py: managed random tables Helpers to create a keyspace and manage randomized tables. The fixture drops all created tables still active after the test finishes. Includes helper methods to verify schema consistency. These helpers will be used in Raft schema changes tests coming later. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:27 +02:00
Alejo Sanchez	df1a032c04	test.py: test_keyspace fixture async Make test_keyspace fixture async. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:26 +02:00
Alejo Sanchez	fda69a0773	test.py: rename fixture test_keyspace to keyspace Name makes better sense as it's not a test but a fixture for tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:26 +02:00
Alejo Sanchez	00648342a6	test.py topology: test with asyncio Run test async using a wrapper for Cassandra python driver's future. The wrapper was suggested by a user and brought forward by @fruch. It's based on https://stackoverflow.com/a/49351069 . Redefine pytest event_loop fixture to avoid issues with fixtures with scope bigger than function (like keyspace). See https://github.com/pytest-dev/pytest-asyncio/issues/68 Convert sample test_null to async. More useful test cases will come afterwards. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-28 15:07:26 +02:00
David Garcia	8e7ebea335	Merge remote-tracking branch 'upstream/master' into move-dev-docs	2022-06-28 11:02:38 +01:00
David Garcia	5adb5875f1	Add redirections	2022-06-28 09:39:14 +01:00
Jan Ciolek	1824878e9f	cql3: Remove some unused methods They are removed because they are not used anywhere and they contain code that would have to be modified in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-06-28 08:10:21 +02:00
Botond Dénes	6c818f8625	Merge 'sstables: generation_type tidy-up' from Michael Livshin - Use `sstables::generation_type` in more places - Enforce conceptual separation of `sstables::generation_type` and `int64_t` - Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible Fixes #10796. Closes #10844 * github.com:scylladb/scylla: sstables: make generation_type an actual separate type sstables: use generation_type more soundly extremum_tracker: do not require default-constructible value types	2022-06-28 08:50:12 +03:00
Calle Wilund	688fd31e64	commitlog: Add counters for actual pending allocations + segment wait Fixes #9367 The CL counters pending_allocations and requests_blocked_memory are exposed in Grafana (etc.) and often referred to as metrics on whether we are blocking on commit log. But they don't really show this, as they only measure whether or not we are blocked on the memory bandwidth semaphore that provides rate back pressure (fixed num bytes/s, sort of). However, actual tasks in allocation or segment wait are not exposed, so if we are blocked on disk IO or waiting for segments to become available, we have no visible metrics. While the "old" counters certainly are valid, I have yet to ever see them be non-zero in modern life. Closes #9368	2022-06-28 08:36:27 +03:00
Nadav Har'El	e22364dcc5	doc, alternator: split "experimental" features from "unimplemented" ones Currently in docs/alternator/compatibility.md experimental features and unimplemented features are bunched together under one heading ("unimplemented features"). In this patch we separate them into two sections. This makes the "unimplemented features" section shorter, and also allows us to link to the new "experimental features" section separately. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10893	2022-06-28 08:08:50 +03:00
Tomasz Grabiec	1a9d1d380a	Reads from cache lack preemption check when scanning over range tombstones A scan over range tombstones will ignore preemption, which may cause reactor stalls or read failure due to std::bad_alloc. This is a regression introduced in `5e97fb9fc4`. _lower_bound_changed was always set to false, which is later checked at preemption point and inhibits yielding. Closes #10900	2022-06-28 06:58:48 +03:00
Piotr Dulikowski	6c69606702	test: perf: add bypass cache argument Adds the "--bypass-cache" argument which adds a "BYPASS CACHE" clause to the query being run in the benchmark. It only affects the read mode.	2022-06-27 22:14:29 +02:00
Piotr Dulikowski	fdd0a4146f	test: perf: add timeout argument Adds the "--timeout" argument which allows specifying a timeout used in all operations. It works by inserting "USING TIMEOUT <timeout>" in the appropriate place in the query. The flag is most useful when set to zero: with an appropriate combination of other flags (flush, bypass cache) it guarantees that each operation will time out, so the performance of the timeout handling logic can be measured.	2022-06-27 22:14:29 +02:00
Piotr Dulikowski	b9250a43e3	test: perf: count errors and report the count in results Now, exceptions encountered during the test are counted as errors, and the error count is reported at the end of the test.	2022-06-27 22:14:29 +02:00
Piotr Dulikowski	21612f97b0	test: perf: add stop-on-error argument Adds the "--stop-on-error" argument to perf_simple_query. When enabled (and it is enabled by default), the benchmark will propagate exceptions if any occur in the tested function. Otherwise, errors will be ignored.	2022-06-27 22:14:29 +02:00
Piotr Dulikowski	d3bc946859	test: perf: coroutinize run_worker() Converts the executor::run_worker() method to a coroutine. This will allow extending the function in further commits without having to allocate continuations.	2022-06-27 22:14:29 +02:00
Piotr Dulikowski	33b22e78be	test: perf: fix crash on exception in time_parallel_ex The `time_parallel_ex` function creates a sharded<executor> and uses it to run the benchmark on multiple shards in parallel. However, if the benchmarking function throws an exception, the sharded<executor> will be destroyed without being stopped, which triggers an assertion in sharded<T> destructor. This commit makes sure that the executor is stopped before being destroyed by putting `exec.stop()` into a `seastar::defer`.	2022-06-27 22:14:29 +02:00
Avi Kivity	7b37e02aa7	Update seastar submodule * seastar ff46af9ae0...9c016aeebf (8): > Merge "Handle overflow in token bucket replenisher" from Pavel E Fixes #10743 Fixes #10846 > abort_source: request_abort: restore legacy no-args method > configure.py: do not use distutils > configure.py: drop unused "import sys" > Revert "Use recv syscall instead of read in do_read_some()" > Use recv syscall instead of read in do_read_some() > Merge 'Add initial support for websocket protocol' from Andrzej Stalke > Merge 'abort_source: request_abort: allow passing exception to subscribers' from Benny Halevy Closes #10898	2022-06-27 23:11:56 +03:00
Kamil Braun	ff4ecfa182	dht: boot_strapper: check if keyspace still exists in `bootstrap` While we're iterating over the fetched keyspace names, some of these keyspaces may get dropped. Handle that by checking if the keyspace still exists. Also, when retrieving the replication strategy from the keyspace, store the pointer (which is an `lw_shared_ptr`) to the strategy to keep it alive, in case the keyspace that was holding it gets dropped. Closes #10861	2022-06-27 19:13:46 +02:00
Asias He	d3c6e72c69	repair: Allow abort repair jobs in early stage Consider this: - User starts a repair job with http api - User aborts all repair - The repair_info object for the repair job is created - The repair job is not aborted In this patch, the repair uuid is recorded before repair_info object is created, so that repair can now abort repair jobs in the early stage. Fixes #10384 Closes #10428	2022-06-27 16:39:36 +03:00
Pavel Emelyanov	f3841c1b45	exceptions: Define operator<< for exception_code Otherwise cql_transport::additional_options_for_proto_ext() complains about inability to format the enum class value Introduced by `efc3953c` (transport: add rate_limit_error) Fmt version 8.1.1-5.fc35, fresher one must have it out of the box Fixes #10884 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220627052703.32024-1-xemul@scylladb.com>	2022-06-27 14:49:58 +03:00
Avi Kivity	3131cbea62	Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones. This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters. Refs: https://github.com/scylladb/scylla/issues/3672 Refs: https://github.com/scylladb/scylla/issues/7689 Refs: https://github.com/scylladb/scylla/issues/7933 Tests: manual upgrade test. 
I wrote a data set with: ``` ./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000 ``` This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload: ``` ./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100 ``` I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs. Closes #10829 * github.com:scylladb/scylla: query: have replica provide the last position idl/query: add last_position to query_result mutlishard_mutation_query: propagate compaction state to result builder multishard_mutation_query: defer creating result builder until needed querier: use full_position instead of ad-hoc struct querier: rely on compactor for position tracking mutation_compactor: add current_full_position() convenience accessor mutation_compactor: s/_last_clustering_pos/_last_pos/ mutation_compactor: add state accessor to compact_mutation introduce full_position idl: move position_in_partition into own header service/paging: use position_in_partition instead of clustering_key for last row alternator/serialization: extract value object parsing logic service/pagers/query_pagers.cc: fix indentation position_in_partition: add to_string(partition_region) and parse_partition_region() mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh	2022-06-27 12:23:21 +03:00
Benny Halevy	81fa1ce9a1	Revert 'Compact staging sstables' This patch reverts the following patches merged in `78750c2e1a` "Merge 'Compact staging sstables' from Benny Halevy" > `597e415c38` "table: clone staging sstables into table dir" > `ce5bd505dc` "view_update_generator: discover_staging_sstables: reindent" > `59874b2837` "table: add get_staging_sstables" > `7536dd7f00` "distributed_loader: populate table directory first" The feature causes regressions seen with e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/41/testReport/materialized_views_test/TestMaterializedViews/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split011___test_base_replica_repair/ ``` AssertionError: Expected [[0, 0, 'a', 3.0]] from SELECT * FROM t_by_v WHERE v = 0, but got [] ``` Where views aren't updated properly. Apparently since `table::stream_view_replica_updates` doesn't exclude the staging sstables anymore and since they are cloned to the base table as new sstables it seems to the view builder that no view updates are required since there's no changes comparing to the base table. Reopens #9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10890	2022-06-27 12:18:48 +03:00
Benny Halevy	8bccd5e9c5	compaction_manager: task: acquire_semaphore: handle abort_requested_exception Change `8f39547d89` added `handle_exception_type([] (const semaphore_aborted& e) {})`, but it turned out that `named_semaphore_aborted` isn't derived from `semaphore_aborted`, but rather from `abort_requested_exception` so handle the base exception instead. Fixes #10666 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10881	2022-06-27 09:47:48 +03:00
Pavel Emelyanov	708d3a1ea4	cql-test: Initialize cluster port as integer Otherwise it complains like this: _________ ERROR at setup of test_allow_filtering_indexed_no_filtering __________ request = <SubRequest 'cql' for <Function test_allow_filtering_indexed_no_filtering>> @pytest.fixture(scope="session") def cql(request): [...<snip>...] > return cluster.connect() test/cql-pytest/conftest.py:66: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cassandra/cluster.py:1708: in cassandra.cluster.Cluster.connect ??? cassandra/cluster.py:1765: in cassandra.cluster.Cluster._new_session ??? cassandra/cluster.py:2563: in cassandra.cluster.Session.__init__ ??? cassandra/pool.py:203: in cassandra.pool.Host.__str__ ??? _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > ??? E TypeError: %d format: a real number is required, not str tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-27 09:00:48 +03:00
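The failure quoted in the commit above can be reproduced without a cluster. The following is a minimal sketch, assuming the port reaches the driver as a string (for example from a command-line option). `describe_host` and `port_is_accepted` are hypothetical stand-ins for illustration, not cassandra-driver code.

```python
def describe_host(address, port):
    # Hypothetical stand-in for a formatter that uses "%d" for the port,
    # as the traceback above suggests Host.__str__ does.
    return "%s:%d" % (address, port)


def port_is_accepted(port):
    # Returns True when the port can be formatted with "%d".
    try:
        describe_host("127.0.0.1", port)
        return True
    except TypeError:
        return False


print(port_is_accepted("9042"))       # port left as a string
print(port_is_accepted(int("9042")))  # port converted to an integer first
```

Converting the value once, where the cluster object is configured, keeps every later use of the port numeric.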
Benny Halevy	751eceb2e6	types: time_point_to_string: use numeric formatting rather than chrono-format specifiers As reported in #10867, newer versions of the fmt library format %Y using 4-characters width, 0-padding the prefix when needed, while older versions don't do that. This change moves away from using %Y and friends fmt specifiers to using explicit numeric-based formatting conforming to ISO 8601 and making sure the year field has at least 4 digits and is zero padded. When negative, the width is upped to 5 so it would show as -0001 rather than -001. The unit test was updated respectively. Fixes #10867 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10870	2022-06-27 08:28:56 +03:00
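The year-width rule described in the commit above can be sketched in a few lines. This is an illustration in Python rather than the C++ change itself, and `format_year` is a hypothetical helper name. It assumes only what the message states: at least 4 zero-padded digits, with the width raised to 5 for negative years because the sign occupies one column.

```python
def format_year(year):
    # The sign counts toward the field width, so a negative year needs
    # width 5 to keep four digits after the minus sign.
    width = 5 if year < 0 else 4
    return f"{year:0{width}d}"


for y in (1, 2022, -1, 12345):
    print(y, format_year(y))
```

With a fixed width of 4, a negative year would lose one digit to the sign, which is the `-001` case the commit message mentions.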
Benny Halevy	9c231ad0ce	repair_reader: construct _reader_handle before _reader Currently, the `_reader` member is explicitly initialized with the result of the call to `make_reader`. And `make_reader`, as a side effect, assigns a value to the `_reader_handle` member. Since C++ initializes class members sequentially, in the order they are defined, the assignment to `_reader_handle` in `make_reader()` happens before `_reader_handle` is initialized. This patch fixes that by changing the definition order, and consequently, the member initialization order in the constructor so that `_reader_handle` will be (default-)initialized before the call to `make_reader()`, avoiding the undefined behavior. Fixes #10882 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10883	2022-06-26 20:17:47 +03:00
Amnon Heiman	6b9b76c919	main.cc: add trailing backslash to the API directories The API uses the http server to serve two directories: the api_ui_dir where the swagger-ui directory is found and the api_doc_dir where the swagger definition files are found. Internally, the API uses the httpd::directory_handler, which appends the files it gets from the path to the base directory name. A user can override the default configuration and set a directory name that does not end with a backslash. This results in files not being found. This patch checks whether that backslash is missing and, if it is, adds it to the API configuration. Fixes #10700 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #10877	2022-06-26 20:05:37 +03:00
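The effect of the missing separator can be shown with plain string concatenation. This is a sketch only: it assumes the separator in question is the POSIX `/`, and `resolve`, `ensure_trailing_separator` and the example paths are hypothetical, not the Seastar `directory_handler` code.

```python
def resolve(base_dir, file_name):
    # Plain concatenation, as the commit message describes the handler doing.
    return base_dir + file_name


def ensure_trailing_separator(directory, sep="/"):
    # Append the separator only when it is missing.
    return directory if directory.endswith(sep) else directory + sep


configured = "/opt/example/api-doc"  # hypothetical user-supplied directory
print(resolve(configured, "storage_service.json"))
print(resolve(ensure_trailing_separator(configured), "storage_service.json"))
```

Normalizing the directory once at configuration time means the handler itself does not need to change.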
David Garcia	bb21c3c869	Move dev docs to docs/dev	2022-06-24 18:07:08 +01:00
Piotr Sarna	f2bb676d27	docs: mention python in debugging.md Evaluating Python code from within gdb is priceless, especially that all helper classes and functions sourced from scylla-gdb.py can be used in there. This commit adds a paragraph in debugging.md mentioning this tool. Closes #10869	2022-06-24 15:16:43 +03:00
Pavel Emelyanov' via ScyllaDB development	a78af050fd	cql: Constify select_statement restrictions It is in fact immutable (both the pointer and the object it points to), so is the pointer copy returned by get_restrictions() method, so are those propagated to filtering stuff. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1028 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220624083351.24970-1-xemul@scylladb.com>	2022-06-24 12:27:36 +03:00
Nadav Har'El	905088ce7a	cql: improve error message for static column in materialized view Static columns are not currently allowed in a materialized view. If the base table has a static column and one tries to create a view with a "SELECT *", the following error message is printed today: Unable to include static column 'ColumnDefinition{name=s, type=org.apache.cassandra.db.marshal.Int32Type, kind=STATIC, componentIndex=null, droppedAt=-9223372036854775808}' which would be included by Materialized View SELECT * statement It is completely unnecessary to include all these details about the column definition - just its name would have sufficed. In other words, we should print def.name_as_text(), not the entire def. This is what other error messages in the same file do as well. After this patch the error message becomes nicer and clearer: Unable to include static column 's' which would be included by Materialized View SELECT * statement Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10854	2022-06-24 11:19:33 +03:00
Botond Dénes	78750c2e1a	Merge 'Compact staging sstables' from Benny Halevy This series decouples the staging sstables from the table's sstable set. The current behavior keeps the sstables in the staging directory until view building is done. They are readable as any other sstable, but fenced off from compaction, so they don't go away in the meanwhile. Currently, when views are built, the sstables are moved into the main table directory where they will then be compacted normally. The problem with this design is that the staging sstables are never compacted, in particular they won't get cleaned up or scrubbed. The cleanup scenario opens a backdoor for data resurrection when the staging sstables are moved after view building while possibly containing stale partitions (#9559) which will not be cleaned up until next time cleanup compaction is performed. With this series, SSTables that are created in or moved to the staging sub-directory are "cloned" into the base table directory by hard-linking the components there and creating a new sstable object which loads the cloned files. The former, in the staging directory, is used solely for view building and is not added to the table's sstable set, while the latter, its clone, behaves like any other sstable and is added either to the regular or maintenance set and is read and compacted normally. When view building is done, instead of moving the staging sstable into the table's base directory, it is simply unlinked. If its "clone" wasn't compacted away yet, then it will just remain where it is, exactly like it would be after it was moved there in the present state of things. If it was already compacted and no longer exists, then unlinking will then free its storage. Note that snapshot is based on the sstables listed by the table, which do not include the staging sstables with this change. 
But that shouldn't matter since even today, the sstables in the snapshot have no notion of a "staging" directory and it is expected that the MV's are either updated via `nodetool refresh` if restoring sstables from snapshot using the uploads dir, or if restoring the whole table from backup - MV's are effectively expected to be rebuilt from scratch (they are not included in automatic snapshots anyway since we don't have snapshot-coherency across tables). A fundamental infrastructure change was done to achieve that which is to change the sstable_list which was a std::unordered_set<shared_sstable> into a std::unordered_map<generation_type, shared_sstable> that keeps the shared_sstable objects indexed by generation number (that must be unique). With this model, sstables are supposed to be searched by the generation number, not by their pointer, since when the staging sstable is cloned, there will be 2 shared_sstable objects with the same generation (and different `dir()`) and we must distinguish between them. Special care was taken to throw a runtime_error exception if, when looking up a shared sstable, another one with the same generation is found, since they must never exist in the same sstable_map. Fixes #9559 Closes #10657 * github.com:scylladb/scylla: table: clone staging sstables into table dir view_update_generator: discover_staging_sstables: reindent table: add get_staging_sstables view_update_generator: discover_staging_sstables: get shared table ptr earlier distributed_loader: populate table directory first sstables: time_series_sstable_set: insert: make exception safe sstables: move_to_new_dir: fix debug log message	2022-06-24 08:05:38 +03:00
Botond Dénes	1f4f8ba773	Merge 'compaction_manager: track if off-startegy compaction was performed in run_offstrategy_compaction' from Benny Halevy This series moves the logic to not perform off-strategy compaction if the maintenance set is empty from the table layer down to the compaction_manager layer since it is the one that needs to make the decision. With that compaction_manager::perform_offstrategy will return a future<bool> which resolves to true iff off-strategy compaction was required and performed. The sstable_compaction_test was adjusted and a new compaction_manager_for_testing class was added to make sure the compaction manager is enabled when constructed (it wasn't so test_offstrategy_sstable_compaction didn't perform any off-strategy compactions!) and stopped before destroyed. Closes #10848 * github.com:scylladb/scylla: table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager compaction_manager: offstrategy_compaction_task: refactor log printouts test: sstable_compaction: compaction_manager_for_testing	2022-06-24 08:04:02 +03:00
Avi Kivity	dab56b82fa	Merge 'Per-partition rate limiting' from Piotr Dulikowski Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode. This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension: ``` ALTER TABLE ks.tbl WITH per_partition_rate_limit = { 'max_writes_per_second': 100, 'max_reads_per_second': 200 }; ``` Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead. Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all. The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here. 
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`: - Write mode: ``` `8f690fdd47` (PR base): 129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op) This PR: 125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op) ``` - Read mode: ``` `8f690fdd47` (PR base): 150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op) This PR: 151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op) ``` Manual upgrade test: - Start 3 nodes, 4 shards each, Scylla version `8f690fdd47` - Create a keyspace with scylla-bench, RF=3 - Start reading and writing with scylla-bench with CL=QUORUM - Manually upgrade nodes one by one to the version from this PR - Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded - Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected Fixes: #4703 Closes #9810 * github.com:scylladb/scylla: storage_proxy: metrics for per-partition rate limiting of reads storage_proxy: metrics for per-partition rate limiting of writes database: add stats for per partition rate limiting tests: add per_partition_rate_limit_test config: add add_per_partition_rate_limit_extension function for testing cf_prop_defs: guard per-partition rate limit with a feature query-request: add allow_limit flag storage_proxy: add allow rate limit flag to get_read_executor storage_proxy: resultize return type of get_read_executor storage_proxy: add per partition rate limit info to read RPC storage_proxy: add per partition rate limit info to query_result_local(_digest) storage_proxy: add allow rate limit flag to mutate/mutate_result storage_proxy: add allow rate limit flag to mutate_internal storage_proxy: add allow rate limit flag to mutate_begin storage_proxy: choose the right per partition rate limit info in write handler storage_proxy: resultize return types of write handler creation path storage_proxy: add per 
partition rate limit to mutation_holders storage_proxy: add per partition rate limit info to write RPC storage_proxy: add per partition rate limit info to mutate_locally database: apply per-partition rate limiting for reads/writes database: move and rename: classify_query -> classify_request schema: add per_partition_rate_limit schema extension db: add rate_limiter storage_proxy: propagate rate_limit_exception through read RPC gms: add TYPED_ERRORS_IN_READ_RPC cluster feature storage_proxy: pass rate_limit_exception through write RPC replica: add rate_limit_exception and a simple serialization framework docs: design doc for per-partition rate limiting transport: add rate_limit_error	2022-06-24 01:32:13 +03:00
Kamil Braun	a3d2f54806	service/raft: raft_group_registry: add assertions when fetching servers for groups Better than dereferencing null-pointers or null-opts.	2022-06-23 16:14:41 +02:00
Kamil Braun	bb58ee0b2e	service/raft: raft_group_registry: remove `_raft_support_listener` It did nothing. It will be readded in `raft_group0` and it will do something, stay tuned. With this we can remove the `feature_service` reference from `raft_group_registry`.	2022-06-23 16:14:41 +02:00
Kamil Braun	0f78d81573	service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map For better observability during testing or debugging.	2022-06-23 16:14:41 +02:00
Kamil Braun	8e907cbf57	service/raft: raft_group0: move group 0 RPC handlers from `storage_service` And generate the boilerplate from IDL declarations. Simplifies the code, and the code now resides where it belongs.	2022-06-23 16:14:41 +02:00
Kamil Braun	4f0feee43e	service/raft: messaging: extract raft_addr/inet_addr conversion functions Don't repeat yourself.	2022-06-23 16:14:41 +02:00
Kamil Braun	5da163e0b8	service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster` `raft_group0` was constructed at the beginning of `join_cluster`, which required passing references to 3 additional services to `join_cluster` used only for that purpose (group 0 client, raft group registry, and query processor). Now we initialize `raft_group0` in main - like all other services - and pass a reference to `join_cluster` so `storage_service` can store a pointer to group 0. We initialize `raft_group0` before we start listening for RPCs in `messaging_service`. In a later commit we'll move the initialization of group 0 related verbs to the constructor of `raft_group0` from `storage_service`, so they will be initialized before we start listening for RPCs.	2022-06-23 16:14:41 +02:00
Kamil Braun	4d439a16b3	treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls In schema_altering_statement: we will bounce statements to shard 0 whether Raft is enabled or not. In migration_manager, when we're sending a group 0 snapshot: well, if we're sending a group 0 snapshot, Raft must be enabled; the check is redundant.	2022-06-23 16:14:41 +02:00
Kamil Braun	411231da75	test/boost: memtable_test: perform schema operations on shard 0 Will be a prerequisite with Raft enabled.	2022-06-23 16:14:41 +02:00
Kamil Braun	3be376f6c5	test/boost: cdc_test: remove test_cdc_across_shards The test checked if creating a table with CDC enabled on a shard other than 0 would create the CDC log table as well; it was a regression test for #5582. However we will soon bounce all schema change requests to shard 0, so the test's purpose is gone. I need to remove this test because `cquery_nofail` does not handle the bouncing correctly: it silently accepts the bounce message, assumes that the query was successful and returns. So after we change the code to start bouncing all requests to shard 0, if a query was run inside test code using `cquery_nofail` on a shard different than 0 it would do nothing and following queries executed on shard 0 would fail because they depended on the effect of the aforementioned query.	2022-06-23 16:14:41 +02:00
Kamil Braun	c030d03893	message: rename `send_message_abortable` to `send_message_cancellable` It's not possible to abort an RPC call entirely, since the remote part continues running (if the message got out). Calling the provided abort source does the following: 1. if the message is still in the outgoing queue, drop it, 2. resolve waiter callbacks exceptionally. Using the word "cancellable" is more appropriate. Also write a small comment at `send_message_cancellable`.	2022-06-23 16:14:41 +02:00
Kamil Braun	07fe3e4a99	message: change parameter order in `send_message_oneway_timeout` Make it consistent with the other 'send message' functions. Simplify code generation logic in idl-compiler. Interestingly this function is not used anywhere so I didn't have to fix any call sites.	2022-06-23 16:14:41 +02:00
Benny Halevy	597e415c38	table: clone staging sstables into table dir clone staging sstables so their content may be compacted while views are built. When done, the hard-linked copy in the staging subdirectory will be simply unlinked. Fixes #9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Benny Halevy	ce5bd505dc	view_update_generator: discover_staging_sstables: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Benny Halevy	59874b2837	table: add get_staging_sstables We don't have to go over all sstables in the table to select the staging sstables out of them, we can get it directly from the _sstables_staging map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Benny Halevy	b8b14d76b3	view_update_generator: discover_staging_sstables: get shared table ptr earlier It's potentially a bit more efficient since t.get_sstables is called only once, while t.shared_from_this() is called per staging sstable. Also, prepare for the following patches that modify this function further. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Benny Halevy	7536dd7f00	distributed_loader: populate table directory first So we can clone staging sstables into it later when populating the table from the staging_dir Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Benny Halevy	cd68b04fbf	sstables: time_series_sstable_set: insert: make exception safe Need to erase the shared sstable from _sstables if insertion to _sstables_reversed fails. Fixes #10787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Benny Halevy	9d41676116	sstables: move_to_new_dir: fix debug log message Remove extraneous `old_dir` arg. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 16:55:27 +03:00
Raphael S. Carvalho' via ScyllaDB development	32600f60f3	scylla-gdb: Fix scylla_compaction_tasks Make it account for all the changes done in the compaction manager recently. 5.0 is not affected. So does not merit a backport. (gdb) scylla compaction-tasks 1 type=sstables::compaction_type::Reshard, state=compaction_manager::task::state::active, "keyspace1"."standard1" Total: 1 instances of compaction_manager::task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220621225600.20359-1-raphaelsc@scylladb.com>	2022-06-23 16:17:31 +03:00
Piotr Sarna	026f58f2a4	scylla-gdb: document scylla_shard The command is quite straightforward, but it didn't offer any documentation when calling `help scylla shard`, so it's hereby added. As a small bonus, a more comprehensive message is printed when the argument is not an integer. Message-Id: <9b958a4befce1c7baa6f86504ab74b93840b37e9.1655984258.git.sarna@scylladb.com>	2022-06-23 15:24:36 +03:00
Piotr Sarna	280e54c3e6	scylla-gdb: add printing registers from seastar::thread `scylla thread` command is extended with a non-intrusive option for dumping saved registers from the jmp_buf structure in an unmangled form. It can later be useful, e.g. for peeking into thread's instruction pointer or reasoning about its stack. Example debugging session: (gdb) scylla threads [shard 1] (seastar::thread_context) 0x6010000d9e00, stack: 0x601004f00000 [shard 1] (seastar::thread_context) 0x6010000daf00, stack: 0x601004e00000 (gdb) scylla thread --print-regs 0x6010000d9e00 rbx: 0x601004f1fd00 rbp: 0x601004f1fc20 r12: 0x6010000d9e20 r13: 0x6010002a3190 r14: 0x601004f1fd08 r15: 0x6010000d9e10 rsp: 0x601004f1fbb0 rip: 0x2f0aea6 (gdb) disassemble 0x2f0aea6 Dump of assembler code for function _ZN7seastar12jmp_buf_link10switch_outEv: 0x0000000002f0ae90 <+0>: push %rax 0x0000000002f0ae91 <+1>: mov 0xc8(%rdi),%rax 0x0000000002f0ae98 <+8>: mov %rax,%fs:0xfffffffffffe5dc8 0x0000000002f0aea1 <+17>: call 0x30333d0 <_setjmp@plt> 0x0000000002f0aea6 <+22>: test %eax,%eax 0x0000000002f0aea8 <+24>: je 0x2f0aeac <_ZN7seastar12jmp_buf_link10switch_outEv+28> 0x0000000002f0aeaa <+26>: pop %rax 0x0000000002f0aeab <+27>: ret 0x0000000002f0aeac <+28>: mov %fs:0xfffffffffffe5dc8,%rdi 0x0000000002f0aeb5 <+37>: mov $0x1,%esi 0x0000000002f0aeba <+42>: call 0x30333c0 <longjmp@plt> End of assembler dump. Message-Id: <553c1ed76987776916d5261ed13866650e84df34.1655984258.git.sarna@scylladb.com>	2022-06-23 15:24:36 +03:00
Piotr Sarna	01d281442e	test: extend view filtering test case In order to cover more code paths, the test case now places filtering on various combinations of base columns, including both primary keys and regular columns. It also makes the test scylla_only, as filtering is an extension not supported in Cassandra right now. Closes #10860	2022-06-23 14:19:41 +03:00
Botond Dénes	fd5f8f2275	query: have replica provide the last position Use the recently introduced query-result facility to have the replica set the position where the query should continue from. For now this is the same as what the implicit position would have been previously (last row in result), but it opens up the possibility to stop the query at a dead row.	2022-06-23 13:36:24 +03:00
Botond Dénes	009d2fe2f7	idl/query: add last_position to query_result To be used to allow the replica to specify the last position in the stream, where the query was left off. Currently this is always the same as the implicit position -- the last row in the result-set -- but this requires only stopping the read on a live row, which is a requirement we want to lift: we want to be able to stop on a tombstone. As tombstones are not included in the query result, we have to allow the replica to overwrite the last seen position explicitly. This patch introduces the new field in the query-result IDL but it is not written to yet, nor is it read, that is left for the next patches.	2022-06-23 13:36:24 +03:00
Botond Dénes	7b6b7a49cd	multishard_mutation_query: propagate compaction state to result builder Not used in this patch, facilitates further patching.	2022-06-23 13:36:24 +03:00
Botond Dénes	738cb99c53	multishard_mutation_query: defer creating result builder until needed Currently the result builder is created two frames above the method in which it is actually needed. Push down a factory method instead and create it where it is actually used. This allows us to pass it arguments that are present only in the method which uses it.	2022-06-23 13:36:24 +03:00
Botond Dénes	5575f8a55a	querier: use full_position instead of ad-hoc struct	2022-06-23 13:36:24 +03:00
Botond Dénes	58d53b66c1	querier: rely on compactor for position tracking For some time now the compactor tracks its own position. The querier can make use of this instead of duplicating this effort.	2022-06-23 13:36:24 +03:00
Botond Dénes	9beef08a1b	mutation_compactor: add current_full_position() convenience accessor	2022-06-23 13:36:24 +03:00
Botond Dénes	a3cd235de2	mutation_compactor: s/_last_clustering_pos/_last_pos/ Generalize position tracking to track non-clustering positions too. Also add an accessor for it.	2022-06-23 13:36:24 +03:00
Botond Dénes	5a6e807a1c	mutation_compactor: add state accessor to compact_mutation	2022-06-23 13:36:24 +03:00
Botond Dénes	e0cf7cec27	introduce full_position A simple struct containing a full position, including a partition key and a position in partition. Two variants are introduced: an owning version and a view. This is to replace all the ad-hoc structures introduced for the same purpose: std::pair() and std::tuple() of partition key and clustering key, and other similar small structs scattered around the code. This patch does not replace any of the above mentioned constructs with the new full_position, it merely introduces it to enable incremental standardization.	2022-06-23 13:36:24 +03:00
Botond Dénes	119be5d5db	idl: move position_in_partition into own header So it can be used without pulling in all of partition_checksum.idl.hh.	2022-06-23 13:36:24 +03:00
Botond Dénes	2b0bc11f2e	service/paging: use position_in_partition instead of clustering_key for last row The former allows for expressing more positions, like a position before/after a clustering key. This practically enables the coordinator side paging logic, for a query to be stopped at a tombstone (which can have said positions).	2022-06-23 13:36:20 +03:00
Avi Kivity	3c33fe93df	install-dependencies.sh: upgrade node_exporter to 1.3.1 New features and bugfixes. Closes #10859	2022-06-23 11:47:13 +03:00
Botond Dénes	adabe3b5a3	alternator/serialization: extract value object parsing logic To make it reusable by a method added by the next patch.	2022-06-23 11:33:18 +03:00
Botond Dénes	cb0146e372	service/pagers/query_pagers.cc: fix indentation Broken since forever.	2022-06-23 11:19:55 +03:00
Botond Dénes	a7d467d794	position_in_partition: add to_string(partition_region) and parse_partition_region() And rebase operator<<(partition_region) on top of the former.	2022-06-23 11:19:55 +03:00
Botond Dénes	ab0f3512c8	mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh Where its definition lives.	2022-06-23 11:19:55 +03:00
Nadav Har'El	51fbc89df3	util/chunked_vector: more complete comment chunked_vector was headed by a short comment which didn't really explain why it exists and how and why it really differs from std::deque. Moreover, it made the vague claim that it "limits" contiguous allocations, which it really doesn't (at least not in the asymptotic sense). In this patch I wrote a much longer comment, which I hope will clearly explain exactly what chunked_vector is, how it really differs in its contiguous allocations from std::deque, and what it guarantees and doesn't guarantee. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10857	2022-06-23 10:33:35 +03:00
Takuya ASADA	3a51e7820a	scylla_cpuset_setup: stop deleting perftune.yaml and skip update cpuset.conf when same parameter specified To make scylla setup scripts easier to handle in Ansible, stop deleting perftune.yaml and detect cpuset.conf changes by mtime of the file. Also, skip update cpuset.conf when same parameter specified. Fixes #10121 Closes #10312	2022-06-23 10:28:36 +03:00
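The "skip update when same parameter specified" behavior in the commit above can be sketched as a write-only-if-changed helper. This is an illustration under stated assumptions: `write_if_changed` is a hypothetical name, and neither it nor the file content used in the test is taken from `scylla_cpuset_setup`.

```python
def write_if_changed(path, content):
    """Write content to path only when it differs from the current file.

    Returns True if the file was written, False if it was left untouched.
    Leaving the file untouched also leaves its mtime unchanged, so a tool
    that watches the mtime sees no change.
    """
    try:
        with open(path) as f:
            if f.read() == content:
                return False
    except FileNotFoundError:
        pass  # no existing file: fall through and create it
    with open(path, "w") as f:
        f.write(content)
    return True
```

Because the mtime moves only when the content really changes, a configuration manager such as Ansible can report the step as unchanged on repeated runs.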
Botond Dénes	080ed590bf	Merge "Obtain dc/rack from topology, not snitch" from Pavel Emelyanov " The way dc/rack info is maintained is very intricate. The dc/rack strings originate at snitch, get propagated via gossiper, get notified to storage service which, in turn, stores them into the system keyspace and token metadata. Code that needs to get dc/rack for a given endpoint calls snitch which tries to get the data from gossiper and, if that fails, goes and loads it from system keyspace cache. Also there's "internal IP" thing hanging around that loops messaging service in both -- updating and getting the info. The plan is to make topology (that currently sits on token metadata) stay the only "source of truth" regarding the endpoints' dc/rack and internal IP info. The dc/rack mappings are put into topology already, but it cannot yet fully replace snitch for two reasons: - it doesn't map internal IP to endpoint - it doesn't get data stored in system keyspace So what this patch set does is patch most of the dc/rack getters to call topology methods. The topology is temporarily patched to just call the respective snitch methods. This removes a big portion of calls for global snitch instance. 
After the set the places that still explicitly rely on snitch to provide dc/rack are - messaging service: needs internal IP knowledge on topology - db/consistency_level: is all "global", needs heavier patching - tests: just later " * 'br-get-dc-rack-from-topology-2' of https://github.com/xemul/scylla: proxy stats: Get rack/datacenter from topology proxy stats: Push topology arg to get_ep_stats api: Get rack/datacenter from topology hints: Remove snitch dependency hints: Get rack/datacenter from topology alternator: Get rack/datacenter from topology range_streamer: Get rack/datacenter from topology repair: Get rack/datacenter from topology view: Get rack/datacenter from topology storage_service: Get rack/datacenter from topology proxy: Get rack/datacenter from topology topology: Add get_rack/_datacenter methods	2022-06-23 10:01:36 +03:00
Benny Halevy	a65ed19edc	table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager compaction_manager needs to decide about running off-strategy compaction or not based on the maintenance_set, not partly in table::trigger_offstrategy_compaction and partly in the compaction_manager layer as it is done today. So move the logic down to perform_offstrategy that now returns future<bool> to return true iff it performed offstrategy compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 08:18:17 +03:00
Benny Halevy	9079c98db0	compaction_manager: offstrategy_compaction_task: refactor log printouts Move logging from run_offstrategy_compaction to do_run so that in the next patch we can skip run_offstrategy_compaction if the maintenance set is empty (but still log it, for the sake of dtests). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 08:02:44 +03:00
Benny Halevy	34e9391587	test: sstable_compaction: compaction_manager_for_testing Make the compaction manager for testing using this class. Makes sure to enable the compaction manager and to stop it before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 08:02:44 +03:00
Igor Ribeiro Barbosa Duarte	277f5a4009	utils/loading_cache.hh: Add reset method This patch adds a reset method which is going to be used in the next patches for updating the loading_cache config and also to be able to flush the cache without having to update scylla config. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-23 01:18:22 -03:00
Piotr Dulikowski	442901f14a	storage_proxy: metrics for per-partition rate limiting of reads Adds a metric "read_rate_limited" which indicates how many times a read operation was rejected due to per-partition rate limiting. The metric differentiates between reads rejected by the coordinator and reads rejected by replicas.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	6e5d486970	storage_proxy: metrics for per-partition rate limiting of writes Adds a metric "write_rate_limited" which indicates how many times a write operation was rejected due to per-partition rate limiting. The metric differentiates between writes rejected by the coordinator and writes rejected by replicas.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	13a5022499	database: add stats for per partition rate limiting Adds statistics which count how many times a replica has decided to reject a write ("total_writes_rate_limited") or a read ("total_reads_rate_limited").	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	bc50163016	tests: add per_partition_rate_limit_test Adds the per_partition_rate_limit_test.cc file. Currently, it only contains a test which verifies that the feature correctly switches off rate limiting for internal queries (!allow_limit || internal sg).	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	761a037afb	config: add add_per_partition_rate_limit_extension function for testing ...and use it in cql_test_env to enable the per_partition_rate_limit extension for all tests that use it.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	1a36029ab5	cf_prop_defs: guard per-partition rate limit with a feature The per-partition rate limit feature requires all nodes in the cluster to support it in order to work well. This commit adds a check which disallows creating/altering tables with per-partition rate limit until the node is sure that all nodes in the cluster support it.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	a7ad70600d	query-request: add allow_limit flag Adds allow_limit flag to the read_command. The flag decides whether rate limiting of this operation is allowed.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	c691e94190	storage_proxy: add allow rate limit flag to get_read_executor Adds a flag to get_read_executor which decides whether the read should be rate limited or not. The read executors were modified to choose the appropriate per partition rate limit info parameter and send it to the replicas.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	3357066387	storage_proxy: resultize return type of get_read_executor Now, get_read_executor is able to return coordinator exceptions without throwing them. In an upcoming commit, it will start returning rate limit exception in some cases and it is preferable to return them without throwing.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	d3d9add219	storage_proxy: add per partition rate limit info to read RPC Now, the read RPC accepts the per partition rate limit info parameter. It is passed on to query_result_local(_digest) methods.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	e8e8ada4b4	storage_proxy: add per partition rate limit info to query_result_local(_digest) The query_result_local and query_result_local_digest methods were updated to accept db::per_partition_rate_limit::info structure and pass it on to database::accept.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	e6beab3106	storage_proxy: add allow rate limit flag to mutate/mutate_result Now, mutate/mutate_result accept a flag which decides whether the write should be rate limited or not. The new parameter is mandatory and all call sites were updated.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	1f65c4e001	storage_proxy: add allow rate limit flag to mutate_internal Now, mutate_internal accepts a flag which decides whether the write should be rate limited or not.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	1e4e92ed8b	storage_proxy: add allow rate limit flag to mutate_begin Now, mutate_begin accepts a flag which decides whether given write should be rate limited or not.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	76e95e7ae8	storage_proxy: choose the right per partition rate limit info in write handler Now, write response handler calculates the appropriate rate limit info parameter and passes it to the mutation holder.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	2a7ba76c3e	storage_proxy: resultize return types of write handler creation path The mutate_prepare and create_write_response_handler(_helper) functions are modified to be able to return exceptions without throwing them. In an upcoming commit, create_write_response_handler will sometimes return rate limit exception, and it is preferable to return them without throwing.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	3f88ecdea6	storage_proxy: add per partition rate limit to mutation_holders Now, `apply_locally` and `apply_remotely` accept the per partition rate limit info parameter.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	02469e0b15	storage_proxy: add per partition rate limit info to write RPC Adds db::per_partition_rate_limit::info parameter to the write RPC. The rate limit info controls the behavior of the rate limiter on the replica.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	c06376b383	storage_proxy: add per partition rate limit info to mutate_locally Now, mutate_locally accepts a parameter that controls the rate limiter behavior on the replica.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	cc9a2ad41f	database: apply per-partition rate limiting for reads/writes Adds the `db::rate_limiter` to the `database` class and modifies the `query` and `apply` methods so that they account the read/write operations in the rate limiter and optionally reject them.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	ec635ba170	database: move and rename: classify_query -> classify_request Moves the classify_query higher and renames it to classify_request. The function will be reused in further commits to protect non-user queries from accidentally being rate limited.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	dccb8a5729	schema: add per_partition_rate_limit schema extension Adds the new `per_partition_rate_limit` schema extension. It has two parameters: `max_writes_per_second` and `max_reads_per_second`. In the future commits they will control how many operations of given type are allowed for each partition in the given table.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	0fe8b55427	db: add rate_limiter Introduces the rate_limiter, a replica-side data structure meant for tracking the frequency with which each partition is being accessed (separately for reads and writes) and deciding whether the request should be accepted and processed further or rejected. The limiter is implemented as a statically allocated hashmap which keeps track of the frequency with which partitions are accessed. Its entries are incremented when an operation is admitted and are decayed exponentially over time. If a partition is detected to be accessed more than its limit allows, requests are rejected with a probability calculated in such a way that, on average, the number of accepted requests is kept at the limit. The structure currently weighs a bit above 1MB and each shard is meant to keep a separate instance. All operations are O(1), including the periodic timer.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	2162bb9f3b	storage_proxy: propagate rate_limit_exception through read RPC This commit modifies the read RPC and the storage_proxy logic so that the coordinator knows whether a read operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` if that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	000f417d23	gms: add TYPED_ERRORS_IN_READ_RPC cluster feature We would like to extend the read RPC to return an optional, second value which indicates an exception - seastar type-erases exception on the RPC handler boundary and we need to differentiate rate_limit_exception from others. However, it may happen that a replica with an up-to-date version of Scylla tries to return an exception in this way to a coordinator with an old version and the coordinator will drop the error, thinking that the request succeeded. In order to protect from that, we introduce the `TYPED_ERRORS_IN_READ_RPC` feature. Only after it is enabled will replicas start returning exceptions in the new way, and until then all exceptions will be reported using seastar's type-erasure mechanism.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	51546b0609	storage_proxy: pass rate_limit_exception through write RPC This commit modifies the storage_proxy logic so that the coordinator knows whether a write operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` when that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	621b7f35e2	replica: add rate_limit_exception and a simple serialization framework Introduces `replica::rate_limit_exception` - an exception that is supposed to be thrown/returned on the replica side when the request is rejected due to exceeding the per-partition rate limit. Additionally, introduces the `exception_variant` type which allows transporting the new exception over RPC while preserving the type information. This will be useful in later commits, as the coordinator will have to know whether a replica has failed due to rate limit being exceeded or another kind of error. The `exception_variant` currently can only either hold "other exception" (std::monostate) or the aforementioned `rate_limit_exception`, but can be extended in a backwards-compatible way in the future to be able to hold more exceptions that need to be handled in a different way.	2022-06-22 20:07:58 +02:00
Piotr Dulikowski	a55d7ad46d	docs: design doc for per-partition rate limiting	2022-06-22 20:07:58 +02:00
Piotr Dulikowski	efc3953c0a	transport: add rate_limit_error Adds a CQL protocol extension which introduces the rate_limit_error. The new error code will be used to indicate that the operation failed due to it exceeding the allowed per-partition rate limit. The error code is supposed to be returned only if the corresponding CQL extension is enabled by the client - if it's not enabled, then Config_error will be returned in its stead.	2022-06-22 20:07:58 +02:00
Piotr Sarna	bc3a635c42	view: exclude using static columns in the view filter The code which applied view filtering (i.e. a condition placed on a view column, e.g. "WHERE v = 42") erroneously used a wildcard selection, which also assumes that static columns are needed, if the base table contains any such columns. The filtering code currently assumes that no such columns are fetched, so the selection is amended to only ask for regular columns (primary key columns are sent anyway, because they are enabled via slice options, so no need to ask for them explicitly). Fixes #10851 Closes #10855	2022-06-22 15:55:45 +03:00
Pavel Emelyanov' via ScyllaDB development	b0b29edcd7	distributed-loader: Remove ensure_system_table_directories It looks like the exactly same code is called few steps above via distributed_loader::init_system_keyspace `- distributed_loader::populate_keyspace While at it -- move the supervisor::notify("loading system sstables") handing around in the more suitable location. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/981/ Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220621165313.31284-1-xemul@scylladb.com>	2022-06-22 13:59:00 +03:00
Nadav Har'El	cf289ad538	Merge 'types: time_point_to_string: harden against out of range timestamps' from Benny Halevy The time point is multiplied by an adjustment factor of 1000 for boost::posix_time::time_duration::ticks_per_second() = 1000000 when calling boost::posix_time::milliseconds(count) and that may lead to integer overflow as reported by the UndefinedBehaviorSanitizer. See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187 This change checks for possible overflow in advance and prints the raw counter value in this case, along with an explanation. Refs #10830 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10831 * github.com:scylladb/scylla: test: types: add test cases for timestamp_type to_string format types: time_point_to_string: harden against out of range timestamps	2022-06-22 14:01:06 +03:00
Pavel Emelyanov	f0cafc35fd	proxy stats: Get rack/datacenter from topology The reference is already at hand. The get_ep_stats() calls another helper that also maps endpoint to datacenter, but it can get the obtained dc sstring via argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:27 +03:00
Pavel Emelyanov	8ffe249430	proxy stats: Push topology arg to get_ep_stats The latter will need it to get dc info from. All the callers are either storage proxy or have storage proxy pointer/reference to get topology from. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:27 +03:00
Pavel Emelyanov	3ab7c9320c	api: Get rack/datacenter from topology The http_ctx already has token metadata on board, it's possible to get topology from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:27 +03:00
Pavel Emelyanov	820be06ac1	hints: Remove snitch dependency After previous patch hints manager class gets unused dependency on snitch. While removing it, it turns out that several unrelated places get needed headers indirectly via host_filter.hh -> snitch_base.hh inclusion. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	9b6312687b	hints: Get rack/datacenter from topology The topology reference is obtained from the proxy anchor pointer sitting on manager. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	98a4d41e31	alternator: Get rack/datacenter from topology It's needed in two places, both can get topology from the proxy's token metadata. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	5e2fa32c8c	range_streamer: Get rack/datacenter from topology It's needed in source filter classes so range-streamer passes the topology reference into its methods. Nice side effect -- snitch header goes away from range-streamer one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	b28db0294c	repair: Get rack/datacenter from topology Repair gets token metadata from its local database reference. Not perfect, repair should better have its own private token meta reference, but it's OK for now. The change obsoletes static get_local_dc helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	17128eb54b	view: Get rack/datacenter from topology The view code already gets token metadata from global proxy instance. Do the same to get topology object. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	894cbeacc5	storage_service: Get rack/datacenter from topology Same as in previous patch -- storage service has token metadata to get topology from. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	507db73586	proxy: Get rack/datacenter from topology Proxy has shared token metadata from which it can get the topology. This change obsoletes static get_local_dc() helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Pavel Emelyanov	b6f7c8da8b	topology: Add get_rack/_datacenter methods For now they just forward the request to snitch. Once topology is properly updated boot-time dc/rack info and knows internal IP it will be able to serve request on its own. For convenience overloads without arguments return dc/rack for current node. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Avi Kivity	e499f45593	Update seastar submodule * seastar 443e6a9b77...ff46af9ae0 (15): > rpc: Take care of client::send() future in send_helper > test: futures: add test_get_on_exceptional_promise > compile_commands.json generation in configure > condition-variable: use an empty loop for spinning CPU > byteorder: use boost::endian to do the conversion. > Merge "Replace RPC outgoing queue with continuation chain" from Pavel E > test_runner: use std::endl to ensure messages are flushed > memory: realloc: defer to malloc if ptr is null > cmake: require boost 1.73 for building with C++20 > reactor: backend: io_uring: disable on old kernels if RAID devices exist > Move function in invoke_on_all > core/loop: drop unused parameters > net/api: add connected_socket::operator bool() > fix cpuset count is zero after shift > docker: add pandoc package Closes #10845	2022-06-22 00:39:24 +03:00
Michael Livshin	d7c90b5239	sstables: make generation_type an actual separate type Now that `generation_type` is used properly (at least in some places), we turn to the compiler to help keep the generation/value separation intact. Fixes #10796. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-21 20:08:01 +03:00
Konstantin Osipov	c59a730c1e	test: re-enable but mark as flaky cdc_with_lwt_test Running flaky tests prevents regressions from sneaking in, which is possible if the test is disabled. Closes #10832	2022-06-21 16:36:49 +03:00
Benny Halevy	e5b7ce4cb7	test: types: add test cases for timestamp_type to_string format Following the previous patch that changed time_point_to_string we should cement the different edge cases for the next time this function changes. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-21 14:11:35 +03:00
Avi Kivity	88f75f91ae	cql3: grammar: move semantic check for incompatible UPDATEs to prepare() code The grammar now checks that UPDATEs don't clash (for example, updates to the same column). The checks are good, but the grammar isn't the right place for them - better to concentrate all the checks in the prepare() code so it's easy to see all the checks. Move the checks to raw::update_statement::prepare_internal(). This exposes that the checks are quadratic, so add a comment. It could be fixed with a stable_sort() first, but that is left to later. Closes #10820	2022-06-21 11:42:11 +02:00
Botond Dénes	2b62f67593	tools/scylla-types: escape {} chars in description So fmt::format() doesn't interpret them as substitutions and doesn't error-out because there is no argument for them. Closes #10803	2022-06-21 11:58:13 +03:00
Michael Livshin	1e7360ef6d	checksum_utils_test: supply valid input to crc32_combine() If the len2 argument to crc32_combine() is zero, then the crc2 argument must also be zero. fast_crc32_combine() explicitly checks for len2==0, in which case it ignores crc2 (which is the same as if it were zero). zlib's crc32_combine() used to have that check prior to version 1.2.12, but then lost it, making it necessary for callers to be more careful. Also add the len2==0 check to the dummy fast_crc32_combine() implementation, because it delegates to zlib's. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10731	2022-06-21 11:58:13 +03:00
Botond Dénes	5b50725c45	Merge 'partition_snapshot_row_cursor: avoid unnecessary row cloning in row()' from Michał Chojnowski Due to implementation details, all `deletable_row`s used in `row()` are copied twice, even though they only need to be copied/applied once. This is unnecessary work. `perf_simple_query_g --enable-cache=1 --flush --smp 1 --duration 30` Before: median 158516.17 tps ( 64.1 allocs/op, 12.1 tasks/op, 45010 insns/op) After: median 164307.76 tps ( 62.1 allocs/op, 12.1 tasks/op, 43220 insns/op) Closes #10509 * github.com:scylladb/scylla: partition_snapshot_row_cursor: construct the clustering_row directly in row() mutation_fragment: add a "from deletable_row" constructor to clustering_row mutation_fragment: pass the applied row by reference in clustering_row::apply()	2022-06-21 11:58:13 +03:00
Botond Dénes	121900e377	Merge "Sanitize compaction manager construction and stopping" from Pavel Emelyanov " In order to wire-in the compaction_throughput_mb_per_sec the compaction creation and stopping will need to be patched. Right now both places are quite hairy, this set coroutinizes stop() for simpler adding of stopping bits, unifies all the compaction manager constructors and adds the compaction_manager::config for simpler future extending. As a side effect the backlog_controller class gets an "abstract" sched group it controls which in turn will facilitate seastar sched groups unification some day. " * 'br-compaction-manager-start-stop-cleanup' of https://github.com/xemul/scylla: compaction_manager: Introduce compaction_manager::config backlog_controller: Generalize scheduling groups database: Keep compound flushing sched group compaction_manager: Swap groups and controller compaction_manager: Keep compaction_sg on board compaction_manager: Unify scheduling_group structures compaction_manager: Merge static/dynamic constructors compaction_manager: Coroutinuze really_do_stop() compaction_manager: Shuffle really_do_stop() compaction_manager: Remove try-catch around logger	2022-06-21 11:58:13 +03:00
Raphael S. Carvalho	aa667e590e	sstable_set: Fix partitioned_sstable_set constructor The sstable set param isn't being used anywhere, and it's also buggy as the sstable run list isn't being updated accordingly, so it could happen that the set contains sstables but the run list is empty, introducing inconsistency. We're fortunate that the bug wasn't activated as it would've been a hard one to catch. Found this while auditing the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>	2022-06-21 11:58:13 +03:00
Benny Halevy	59acc58920	test: error_injection: test_inject_noop: do not rely on wall clock timing This unit test may fail in debug mode since .then may yield, even if inject returns a ready future, so the wall clock timing has no basis. As seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/932/testReport/junit/boost.error_injection_test.debug/test_boost_error_injection_test/test_inject_noop/ ``` [Exception] - critical check wait_time.count() < sleep_msec.count() has failed [47 >= 10] == [File] - test/boost/error_injection_test.cc == [Line] -45 ``` Instead, just verify that inject returns a successful, ready future, when using a non-existing error injection name. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10842	2022-06-21 11:58:13 +03:00
Benny Halevy	87ee6c3722	types: time_point_to_string: harden against out of range timestamps The time point is multiplied by an adjustment factor of 1000 for boost::posix_time::time_duration::ticks_per_second() = 1000000 when calling boost::posix_time::milliseconds(count). That may lead to integer overflow as reported by the UndefinedBehaviorSanitizer. See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187 This change uses gmtime_r to convert seconds since unix epoch to std::tm and the fmt library to format the iso representation of the time_point to avoid exceptions and undefined behavior. gmtime_r may still detect an overflow "when the year does not fit into an integer" (see ctime(3)). In this case we return a backward compatible representation of "{count} milliseconds (out of range)". Refs #10830 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-21 08:08:57 +03:00
Michael Livshin	ab13127761	sstables: use generation_type more soundly `generation_type` is (supposed to be) conceptually different from `int64_t` (even if physically they are the same), but at present Scylla code still largely treats them interchangeably. In addition to using `generation_type` in more places, we provide (no-op) `generation_value()` and `generation_from_value()` operations to make the smoke-and-mirrors more believable. The churn is considerable, but all mechanical. To avoid even more (way, way more) churn, unit test code is left untreated for now, except where it uses the affected core APIs directly. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-20 19:37:31 +03:00
Michael Livshin	ef06f92631	extremum_tracker: do not require default-constructible value types The requirement is just an unintended artifact of the implementation. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-20 19:37:31 +03:00
Calle Wilund	8b49718203	commitlog: Add (internal) measurement of byte rates add/release/flush-req Adds measuring the apparent delta vector of footprint added/removed within the timer time slice, and potentially include this (if influx is greater than data removed) in threshold calculation. The idea is to anticipate crossing usage threshold within a time slice, so request a flush slightly earlier, hoping this will give all involved more time to do their disk work. Obviously, this is very akin to just adjusting the threshold downwards, but the slight difference is that we take actual transaction rate vs. segment free rate into account, not just static footprint. Note: this is a very simplistic version of this anticipation scheme, we just use the "raw" delta for the timer slice. A more sophisticated approach would perhaps do either a lowpass filtered rate (adjust over longer time), or a regression or whatnot. But again, the default period of 10s is something of an eternity, so maybe that is superfluous...	2022-06-20 15:58:36 +00:00
Calle Wilund	6921210bf5	commitlog: Add counters for # bytes released/flush requested Adds "bytes_released" and "bytes_flush_requested", representing total bytes released from disk as a result of segment release (as allocation bytes + overhead - not counting unused "waste"), resp. total size we've requested flush callbacks to release data, also counted as actual used bytes in segments we request be made released. These counters, together with bytes_written, should in ideal use cases be at an equilibrium (actually equal), thus observing them should give an idea on whether we are imbalanced in managing to release bytes in same rate as they are allocated (i.e. transaction rate).	2022-06-20 15:58:36 +00:00
Calle Wilund	336383c87e	commitlog: Keep track of last flush high position to avoid double request Apparent mismerge or something. We already have an unused "_flush_position", intended to keep track of the last requested high rp. Now actually update and use it. The latter to avoid sending requests for segments/cf id:s we've already requested external flush of. Also enables us to ensure we don't do double bookkeep here.	2022-06-20 15:58:26 +00:00
Calle Wilund	c904b3cf35	commitlog: Fix counter descriptor language Remove superfluous "a"	2022-06-20 15:54:20 +00:00
Takuya ASADA	13caac7ae6	install.sh: install files with correct permission in strict umask setting To avoid failing to run scripts in non-root user, we need to set permission explicitly on executables. Fixes #10752 Closes #10840	2022-06-20 17:52:03 +03:00
Avi Kivity	a8507a6d28	Merge 'docs/contribute/maintainer.md: expand with merging guidelines' from Botond Dénes The current maintainer.md lacks any guidelines on what patches to accept/reject. Instead maintainers are expected to observe the unwritten rules as exercised by more senior maintainers, as well as use their own judgement or ask when in doubt. This has worked well as maintainers are all people who either worked at the company for a long time and hence had time to observe how things work, and/or have previous experience maintaining open-source projects. Nevertheless, many times I have wished we had a guideline I could glance at to make sure I considered all the angles and to make sure I did not forget some important unwritten rule. This series attempts to concisely summarize these unwritten rules in the form of a checklist, without attempting to cover all exceptions and corner-cases. This should already be enough for a maintainer-in-doubt to be able to quickly go over the checklist and see if they forgot to check anything (especially when evaluating backports). /cc @scylladb/scylla-maint Closes #10806 * github.com:scylladb/scylla: docs/contribute/maintainer.md: add merging and backporting guidelines docs/contribute/CONTRIBUTING.md: add reference to review checklist: docs/contribute/review-checklist.md: add section about patch organization docs/contribute/maintainer.md: expand section on git submodule sync	2022-06-20 17:20:52 +03:00
Botond Dénes	c3e7c1cf59	tools/schema_loader: load_schemas(): add note about CDC table names Explaining how the code determines what tables are CDC tables when parsing schema statements. Closes #10788	2022-06-20 17:16:33 +03:00
Michał Chojnowski	5570354f44	partition_snapshot_row_cursor: construct the clustering_row directly in row() Currently row() creates an empty clustering_row, then applies deletable_rows from the cursor to the empty clustering_row. But the apply logic is unnecessary for the first apply(), and it's cheaper to simply copy the row.	2022-06-20 15:45:19 +02:00
Michał Chojnowski	52c963b331	mutation_fragment: add a "from deletable_row" constructor to clustering_row Currently, construction of clustering_row from deletable_row is done by applying the deletable_row to an empty clustering_row. Direct construction is a slightly cheaper alternative.	2022-06-20 15:45:19 +02:00
Michał Chojnowski	a061eb9e76	mutation_fragment: pass the applied row by reference in clustering_row::apply() Currently, clustering_row::apply() takes deletable_row by reference, but copies it before passing it to deletable_row::apply(). This is more expensive than passing the reference down (by about 1800 instructions for perf_simple_query rows).	2022-06-20 15:22:17 +02:00
Avi Kivity	f8d84e3aaf	Update tools/java submodule (sync to Cassandra 3.11.13, deps update) * tools/java d4133b54c9...de8289690e (1): > Merge 'Sync with Cassandra 3.11.13 and update a few dependencies (v2)' from Piotr Grabowski	2022-06-20 13:25:17 +03:00
Asias He	72797bf516	token_metadata: Shortcut zero leaving nodes case in calculate_pending_ranges_for_leaving If there are zero leaving nodes, no need to calculate anything. This saves time for calculating pending ranges in large clusters significantly to avoid unnecessary calculation. Refs #10337 Closes #10822	2022-06-20 13:19:58 +03:00
Piotr Sarna	bbbd1f4edd	Merge 'alternator: make BatchGetItem group reads by partition' from Nadav Har'El This small series improves Alternator's BatchGetItem performance by grouping requests to the same partition together (Fixes #10753) and also improves error checking when the same item is requested more than once (Fixes #10757). Closes #10834 * github.com:scylladb/scylla: alternator: make BatchGetItem group reads by partition test/alternator: additional test for BatchGetItem	2022-06-20 10:07:19 +02:00
Raphael S. Carvalho	f15a6ce41a	tests: Introduce optional RNG seed for boost suite Today, if you want to reproduce a rare condition using the same RNG seed reported, you cannot use test.py which provides useful infrastructure and will have to run the tests manually instead. So let's extend test.py to allow optional forwarding of RNG seed to boost tests only, as other suites don't support the seed option. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220615223657.142110-1-raphaelsc@scylladb.com>	2022-06-20 07:19:08 +03:00
Nadav Har'El	3aca1ca572	alternator: make BatchGetItem group reads by partition DynamoDB API's BatchGetItem invokes a number (up to 25) of read requests in parallel, returning when all results are available. Alternator naively implemented this by sending all read requests in parallel, no matter which requests these were. That implementation was inefficient when all the requests are to different items (clustering rows) of the same partition. In a multi-node setup this will end up sending 25 separate requests to the same remote node(s). Even on a single-node setup, this may result in reading from disk more than once, and even if the partition is cached - doing an O(logN) search in it multiple times. What we do in this patch, instead, is to group all the BatchGetItem requests aimed at the same partition into a single read request asking for a (sorted) list of clustering keys. This is similar to an "IN" request in CQL. As an example of the performance benefit of this patch, I tried a BatchGetItem request asking for 20 random items from a 10-million item partition. I measured the latency of this request on a single-node Scylla. Before this patch, I saw a latency of 17-21 ms (the lower number is when the request is retried and the requested items are already in the cache). After this patch, the latency is 10-14 ms. The performance improvement on multi-node clusters is expected to be even higher. Unfortunately the patch is less trivial than I hoped it would be, because some of the old code was organized under the assumption that each read request only returned one item (and if it failed, it means only one item failed), so this part of the code had to be reorganized (and, for making the code more readable, coroutinized). An unintended benefit of the code reorganization is that it also gave me an opportunity to fail an attempt to ask BatchGetItem for the same item more than once (issue #10757). 
The patch also adds a few more corner cases in the tests, to be even more sure that the code reorganization doesn't introduce a regression in BatchGetItem. Fixes #10753 Fixes #10757 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-19 14:47:57 +03:00
Pavel Emelyanov	85263b2d02	trace-state: Remove unused fields ... and one friendship declaration tests: compilation Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220616094224.30676-1-xemul@scylladb.com>	2022-06-17 15:02:51 +03:00
Geoffrey Beausire	ee9841b138	Ensure gossip is enabled on all shards before starting the failure_detector_loop Previously, a race condition was possible where the failure_detector_loop is started before gossiper._enabled is set to true on every shard. This change ensures that _enabled is set to true before moving forward. Closes #10548	2022-06-17 14:10:45 +03:00
Avi Kivity	4d587e0c3d	cql3: raw_value: deduplicate view() and to_view() Commit `e739f2b779` ("cql3: expr: make evaluate() return a cql3::raw_value rather than an expr::constant") introduced raw_value::view() as a synonym to raw_value::to_view() to reduce churn. To fix this duplication, we now remove raw_value::to_view(). raw_value::to_view() was picked for removal because it has fewer call sites, reducing churn again. Closes #10819	2022-06-17 09:32:58 +02:00
Avi Kivity	19a6e69001	cql3: accept and type-check reused named bind variables A named bind-variable can be reused: SELECT * FROM tab WHERE a = :var AND b = :var Currently, the grammar just ignores the possibility and creates a new variable with the same name. The new variable cannot be referenced by name since the first one shadows it. Catch variable reuse by maintaining a map from bind variable names to indexes, and check that when reusing a bind variable the types match. A unit test is added. Fixes #10810 Closes #10813	2022-06-17 09:09:49 +02:00
Konstantin Osipov	670b2562a1	lwt: Cassandra compatibility when incarnating a row for UPDATE When evaluating an LWT condition involving both static and non-static cells, and matching no regular row, the static row must be used UNLESS the IF condition is IF EXISTS/IF NOT EXISTS, in which case special rules apply. Before this fix, Scylla used to assume a row doesn't exist if there is no matching primary key. In Cassandra, if there is a non-empty static row in the partition, a regular row based on the static row's cell values is created in this case, and then this row is used to evaluate the condition. This problem was reported as gh-10081. The reason for Scylla's behaviour before the patch was that when implementing LWT I tried to converge Cassandra data model (or lack thereof) with a relational data model, and assumed a static row is a "shared" portion of a regular row, i.e. a storage level concept intended to save space, and doesn't have independent existence. This was an oversimplification. This patch fixes gh-10081, making Scylla semantics match those of Cassandra. I will now list other known examples when a static row has its own independent existence as part of a table, for cataloguing purposes. SELECT * from a partition which has a partition key and a static cell set returns 1 row. If later a regular row is added to the partition, the SELECT would still return 1 row, i.e. the static row will disappear, and a regular row will appear instead. Another example showing a static row has an independent existence below: CREATE TABLE t (p int, c int, s int static, PRIMARY KEY(p, c)); INSERT INTO t (p, c) VALUES(1, 1); INSERT INTO t (p, s) VALUES(1, 1) IF NOT EXISTS; In Cassandra (and Scylla), IF NOT EXISTS evaluates to TRUE, even though both the regular row and the partition exist. But the static cells are not set, and the insert only provides a partition key, so the database assumes the insert is operating against a static row. 
It would be wrong to assume that a static row exists when the partition key exists: INSERT INTO t (p, c, s) VALUES(1, 1, 1) IF NOT EXISTS; [applied] \| p \| c \| s -----------+---+---+------ False \| 1 \| 1 \| null evaluates to False, i.e. the regular row does exist when p and c exist. Issue CREATE TABLE t (p INT, c INT, r INT, s INT static, PRIMARY KEY(p, c)) INSERT INTO t (p, s) VALUES (1, 1); UPDATE t SET s=2, r=1 WHERE p=1 AND c=1 IF s=1 and r=null; - in this case, even though the regular row doesn't exist, the static row does, and should be used for condition evaluation. In other words, IF EXISTS/IF NOT EXISTS have contextual semantics. They apply to the regular row if clustering key is used in the WHERE clause, otherwise they apply to static row. One analogy for static rows is that it is like a static member of C++ or Java class. It's an attribute of the class (assuming class = partition), which is accessible through every object of the class (object = regular row). It is also present if there are no objects of the class, but the class itself exists: i.e. a partition could have no regular rows, but some static cells set, in this case it has a static row. Unlike C++/Java static class members a static row is an optional attribute of the partition. A partition may exist, but the static row may be absent (e.g. no static cell is set). If the static row does exist, all regular rows share its contents, even if they do not exist. A regular row exists when its clustering key is present in the table. A static row exists when at least one static cell is set. Tests are updated because now when no matching row is found for the update we show the value of the static row as the previous value, instead of a non-matching clustering row. Changes in v2: - reworded the commit message - added select tests Closes #10711	2022-06-16 19:23:46 +03:00
Nadav Har'El	0be06e0bdf	test/alternator: additional test for BatchGetItem Our simple test for BatchGetItem on a table with sort keys still has requests with just one sort key per partition, so if BatchGetItem has a bug with requesting multiple sort keys from the same partition, such a bug won't be caught by the simple tests. So in this patch we add a test that does. This will be useful for the next patch, where we are planning to refactor BatchGetItem's handling of multiple sort keys in the same partition - so it will be useful to have more regression tests. The tests test_batch_get_item_large and test_batch_get_item_partial would actually also catch such bugs, but they are more elaborate tests and it's nice to have smaller tests more focused on checking specific features. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-16 18:19:20 +03:00
Pavel Emelyanov	0c8abca75e	compaction_manager: Introduce compaction_manager::config This is to make it constructible in a way most other services are -- all the "scalar" parameters are passed via a config. With this it will be much shorter to add compaction bandwidth throttling option by just extending the config itself, not the list of constructor arguments (and all its callers). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	997a34bf8c	backlog_controller: Generalize scheduling groups Make struct scheduling_group be sub-class of the backlog controller. Its new meaning is now -- the group under controller maintenance. Both database and compaction manager derive their sched groups from this one. This makes backlog controller construction simpler, prepares the ground for sched groups unification in seastar and facilitates next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	12b2d6400d	database: Keep compound flushing sched group Similar to previous patch that made the same for compaction manager. The newly introduced private scheduling_group class is temporary and will go away in next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	0fef2e0273	compaction_manager: Swap groups and controller To have groups initialized before controller. Makes next patch shorter Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	fbb59fc920	compaction_manager: Keep compaction_sg on board This is mainly to make next patch simpler. Also this makes the backlog controller API smaller by removing its sg() method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	0662036d27	compaction_manager: Unify scheduling_group structures There are two of them with identical content and meaning Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	41f1044d3c	compaction_manager: Merge static/dynamic constructors The only difference between those two are in the way backlog controller is created. It's much simpler to have the controller construction logic in compaction manager instead. Similar "trick" is used to construct flush controller for the database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	2dbf0b5248	compaction_manager: Coroutinize really_do_stop() This way it's more compact and easier to extend. Also it's small enough to fix indentation right away. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	bbd9fc26cd	compaction_manager: Shuffle really_do_stop() Make it the future-returning method and setup the _stop_future in its only caller. Makes next patch much simpler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Pavel Emelyanov	b19b8c9e5b	compaction_manager: Remove try-catch around logger Logging functions are all noexcept already Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-16 17:40:19 +03:00
Israel Fruchter	d2ca2455db	scripts/scylla_util.py: introduce back user/group arguments for out() Since #10467 removed the user/group parameters needed for the housekeeping call, we need to introduce them back. Fixes: #10804 Closes #10818	2022-06-16 13:50:17 +03:00
Petr Gusev	d606966597	cql3::column_condition.cc: fix _in_marker handling The commit scylladb@5dee55d introduced a regression: type of in_list_receiver was taken from receiver instead of value_spec as it was before. This regression was caught by dtest test_lwt_update_prepared_listlike_and_tuples. This commit reverts to original behavior and adds a specific boost-test for this scenario. Fixes: #10821 Closes #10812	2022-06-16 10:57:12 +03:00
Botond Dénes	1718ed9e9f	docs/contribute/maintainer.md: add merging and backporting guidelines	2022-06-16 10:29:26 +03:00
Botond Dénes	e9c9ca4a8a	docs/contribute/CONTRIBUTING.md: add reference to review checklist: It serves as a good resource for aspiring contributors to see what reviewers will be looking for in submitted patches.	2022-06-16 10:29:26 +03:00
Botond Dénes	4542486f23	docs/contribute/review-checklist.md: add section about patch organization	2022-06-16 10:29:26 +03:00
Botond Dénes	25f4ad1543	docs/contribute/maintainer.md: expand section on git submodule sync git submodule sync is only dangerous if one is using certain workflows. Explain what the dangers are and when it is safe to use.	2022-06-16 10:29:21 +03:00
Botond Dénes	2734c70803	Merge 'Batchlog replay for decommission fix and cleanup' from Asias He This patch set - adds a log before and after batch log replay - removes a duplicated call to trigger batch log replay - removes an obsolete log Closes #10800 * github.com:scylladb/scylla: storage_service: Remove obsolete log storage_service: Do not call do_batch_log_replay again in unbootstrap storage_service: Add log for start and stop of batchlog replay	2022-06-16 08:55:05 +03:00
Botond Dénes	0b80b5850f	Merge 'allow view snapshots when automatic' from Michael Livshin A pre-scrub view snapshot cannot be attributed to user error, so no call to bail out. Closes #10760. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10783 * github.com:scylladb/scylla: api-doc: correct spelling allow pre-scrub snapshots of materialized views and secondary indices	2022-06-16 08:47:33 +03:00
Botond Dénes	4bd4aa2e88	Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652. Closes #10807 * github.com:scylladb/scylla: memtable: Add counters for tombstone compaction memtable, cache: Eagerly compact data with tombstones memtable: Subtract from flushed memory when cleaning mvcc: Introduce apply_resume to hold state for partition version merging test: mutation: Compare against compacted mutations compacting_reader: Drop irrelevant tombstones mutation_partition: Extract deletable_row::compact_and_expire() mvcc: Apply mutations in memtable with preemption enabled test: memtable: Make failed_flush_prevents_writes() immune to background merging	2022-06-15 18:12:42 +03:00
Nadav Har'El	665e8c1a23	test/cql-ptest: add tests for collection indexing This patch adds an extensive array of tests for the Cassandra feature that Scylla hasn't implemented yet (issues #2962, #8745, #10707) of indexing the keys, values or entries of a collection column. The goal of these tests is to explicitly exercise every corner case I could think of by looking at the documentation of this feature and considering its possible implementation - and as usual, making sure that the tests actually pass on Cassandra. These tests overlap some of the existing unit tests that we translated from Cassandra, as well as some randomized tests that do not necessarily cover the same edge cases as these tests cover. All tests added in this patch pass on Cassandra, but currently fail on Scylla due to the above issues. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10771	2022-06-15 16:10:36 +02:00
Michael Livshin	43f2c55c5d	configure.py: speed up and simplify compdb generation The most time-consuming part is invoking "ninja -t compdb", and there is no need to repeat that for every mode. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10733	2022-06-15 16:40:52 +03:00
Botond Dénes	6242f3fef8	Merge 'process_sstables_dir: close directory_lister on error' from Benny Halevy `sstable_directory::process_sstables_dir` may hit an exception when calling `handle_component`. In this case we currently destroy the `sstable_dir_lister` variable without closing the `directory_lister` first - leading to terminate in `~directory_lister` as seen in #10697. This mini-series handles this exception and always closes the `directory_lister`. Add unit test to reproduce this issue. Fixes #10697 Closes #10754 * github.com:scylladb/scylla: sstable_directory: process_sstable_dir: fixup indentation sstable_directory: process_sstable_dir: close directory_lister on error	2022-06-15 16:40:30 +03:00
Tomasz Grabiec	3bec1cc19f	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries. Fixes #10801 Refs #10793 (cherry picked from commit `0e78ad50ea`) Closes #10802	2022-06-15 14:33:19 +02:00
Mikołaj Sielużycki	db5b05948b	compaction: Clarify comment. Closes #10799	2022-06-15 15:09:44 +03:00
Avi Kivity	aa8f135f64	Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki If we reach a situation where flush rate exceeds compaction rate, we may end up with arbitrarily large number of sstables on disk. If a read is executed in such case, the amount of memory required is proportional to the number of sstables for the given shard, which in extreme cases can lead to OOM. In the wild, this was observed in 2 scenarios: - A node with >10 shards creates a keyspace with thousands of tables, drops the keyspace and shuts down before compaction finishes. Dropping keyspace drops tables, and each dropped table is smp::count writes to system.local table with flush after write, which creates tens of thousands of sstables. Bootstrap read from system.local will run OOM. - A failure to agree on table schema (due to a code bug) between nodes during repair resulted in excessive flushing of small sstables which compaction couldn't keep up with. In the unit test introduced in this patch series it can be proved that even hard setting maximum shares for compaction and minimum shares for flushing doesn't tilt the balance towards compaction enough to prevent the problem. Since it's a fast producer, slow consumer problem, the remaining solution is to block producer until the consumer catches up. If there are too many table runs originating from memtable, we block the current flush until the number of sstables is reduced (via ongoing compaction or a truncate operation). Fixes https://github.com/scylladb/scylla/issues/4116 Changelog: v5: - added a nicer way of timing the stalls caused by waiting for flush - added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups - added comment why we trigger compaction before waiting for sstable count reduction - removed unnecessary cv.signal from table::stop v4: - removed conversion of table::stop to coroutines. 
It's an orthogonal change and doesn't need to go into this patchset v3: - removed unnecessary change to scheduling groups from v2 - moved sstables_changed signalling to suggested place in table::stop - added log how long the table flush was blocked for - changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <= v2: - Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well. - Converted table::stop to coroutines. - Reordered commits so that test is committed after fix, so it doesn't trip up bisection. Closes #10717 * github.com:scylladb/scylla: table: Add test where compaction doesn't keep up with flush rate. random_mutation_generator: Add option to specify ks_name and cf_name table: Prevent creating unbounded number of sstables	2022-06-15 14:51:08 +03:00
Benny Halevy	89c5e8413f	sstable_directory: process_sstable_dir: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-15 13:56:11 +03:00
Benny Halevy	6cafd83e1c	sstable_directory: process_sstable_dir: close directory_lister on error Otherwise, if we don't consume all lister's entries, ~directory_lister terminates since the directory_lister is destroyed without being closed. Add unit test to reproduce this issue. Fixes #10697 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-15 13:56:10 +03:00
Tomasz Grabiec	169025d9b4	memtable: Add counters for tombstone compaction	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	94f9109bea	memtable, cache: Eagerly compact data with tombstones When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	53026f3ba6	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	cd523214a2	mvcc: Introduce apply_resume to hold state for partition version merging Partition version merging is preemptable. It may stop in the middle and be resumed later. Currently, all state is kept inside the versions themselves, in the form of elements in the source version which are yet to be moved. This will change once we add compaction (tombstones with rows) into the merging algorithm. There, state cannot be encoded purely within versions. Consider applying a partition tombstone over a large number of rows. This patch introduces the apply_resume object to hold the state necessary to ensure forward progress in case of preemption. No change in behavior yet.	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	02c92d5ea2	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalent after applying compaction.	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	570b76bc5b	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	44bb9d495b	mutation_partition: Extract deletable_row::compact_and_expire()	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	a4e96960b8	mvcc: Apply mutations in memtable with preemption enabled Prerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be deferred, we need to involve snapshots to kick-off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-15 11:29:43 +02:00
Tomasz Grabiec	c682521ac7	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries.	2022-06-15 11:29:43 +02:00
Mikołaj Sielużycki	25407a7e41	table: Add test where compaction doesn't keep up with flush rate. The test simulates a situation where 2 threads issue flushes to 2 tables. Both issue small flushes, but one has injected reactor stalls. This can lead to a situation where lots of small sstables accumulate on disk, and, if compaction never has a chance to keep up, resources can be exhausted.	2022-06-15 10:57:28 +02:00
Mikołaj Sielużycki	b5684aa96d	random_mutation_generator: Add option to specify ks_name and cf_name	2022-06-15 10:57:28 +02:00
Mikołaj Sielużycki	4cd42f97d0	table: Prevent creating unbounded number of sstables If we reach a situation where flush rate exceeds compaction rate, we may end up with arbitrarily large number of sstables on disk. If a read is executed in such case, the amount of memory required is proportional to the number of sstables for the given shard, which in extreme cases can lead to OOM. In the wild, this was observed in 2 scenarios: - A node with >10 shards creates a keyspace with thousands of tables, drops the keyspace and shuts down before compaction finishes. Dropping keyspace drops tables, and each dropped table is smp::count writes to system.local table with flush after write, which creates tens of thousands of sstables. Bootstrap read from system.local will run OOM. - A failure to agree on table schema (due to a code bug) between nodes during repair resulted in excessive flushing of small sstables which compaction couldn't keep up with. In the unit test introduced in this patch series it can be proved that even hard setting maximum shares for compaction and minimum shares for flushing doesn't tilt the balance towards compaction enough to prevent the problem. Since it's a fast producer, slow consumer problem, the remaining solution is to block producer until the consumer catches up. If there are too many table runs originating from memtable, we block the current flush until the number of sstables is reduced (via ongoing compaction or a truncate operation).	2022-06-15 10:57:28 +02:00
Pavel Emelyanov	9a88bc260c	Merge 'various group0 start/stop issues' from Gleb The series fixes a couple of crashes that were found during starting and stopping Scylla with raft while doing ddl operations. Most of them related to shutdown order between different components. Also in scylla-dev gleb/group0-fixes-v1 CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/ * origin-dev/gleb/group0-fixes-v1: migration manager: remove unused code db/system_distributed_keyspace: do not announce empty schema main: stop raft before the migration manager storage_service: do not pass the raft group manager to storage_service constructor main: destroy the group0_client after stopping the group0	2022-06-15 11:44:03 +03:00
Michael Livshin	28d44ce6db	api-doc: correct spelling Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-15 11:30:58 +03:00
Michael Livshin	aab4cd850c	allow pre-scrub snapshots of materialized views and secondary indices Previously, any attempt to take a materialized view or secondary index snapshot was considered a mistake and caused the snapshot operation to abort, with a suggestion to snapshot the base table instead. But an automatic pre-scrub snapshot of a view cannot be attributed to user error, so the operation should not be aborted in that case. (It is an open question whether the more correct thing to do during pre-scrub snapshot would be to silently ignore views. Or perhaps they should be ignored in all cases except when the user explicitly asks to snapshot them, by name) Closes #10760. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-15 11:30:58 +03:00
Avi Kivity	e739f2b779	cql3: expr: make evaluate() return a cql3::raw_value rather than an expr::constant An expr::constant is an expression that happens to represent a constant, so it's too heavyweight to be used for evaluation. Right now the extra weight is just a type (which causes extra work by having to maintain the shared_ptr reference count), but it will grow in the future to include source location (for error reporting) and maybe other things. Prior to `e9b6171b5` ("Merge 'cql3: expr: unify left-hand-side and right-hand-side of binary_operator prepares' from Avi Kivity"), we had to use expr::constant since there was not enough type information in expressions. But now every expression carries its type (in programming language terms, expressions are now statically typed), so carrying types in values is not needed. So change evaluate() to return cql3::raw_value. The majority of the patch just changes that. The rest deals with some fallout: - cql3::raw_value gains a view() helper to convert to a raw_value_view, and is_null_or_unset() to match with expr::constant and reduce further churn. - some helpers that worked on expr::constant and now receive a raw_value now need the type passed via an additional argument. The type is computed from the expression by the caller. - many type checks during expression evaluation were dropped. This is a consequence of static typing - we must trust the expression prepare phase to perform full type checking since values no longer carry type information. Closes #10797	2022-06-15 08:47:24 +02:00
Avi Kivity	398a86698d	Update tools/python3 submodule (/usr/lib/sysimage filtering) * tools/python3 f725ec7...3471634 (1): > create-relocatable-package.py: filter out /usr/lib/sysimage	2022-06-15 09:27:06 +03:00
Avi Kivity	8f690fdd47	Update seastar submodule * seastar 1424d34c93...443e6a9b77 (5): > reactor: re-raise fatal signals Ref #9242 > test: initialize _earliest_started and _latest_finished > reactor: add io_uring backend > semaphore: add semaphore_unit operator bool > Merge 'map reduce: save mapper' from Benny Halevy io_uring is disabled since the frozen toolchain's liburing is too old. Closes #10794	2022-06-15 08:36:08 +03:00
Asias He	6f4bfea994	storage_service: Remove obsolete log In unbootstrap, we do not really stream hints here. Remove the log about it.	2022-06-15 08:28:06 +08:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Benny Halevy	5bd2e0ccce	test: memtable_test: failed_flush_prevents_writes: validate flush using min_memtable_timestamp active_memtable().empty() becomes true once seal_active_memtable succeeds with _memtables->add_memtable(), not when it is able to flush the (once active) memtable. In contrast, min_memtable_timestamp() returns api::max_timestamp only if there is no data in any memtable. Fixes #10793 Backport notes: - Introduced in `f6d9d6175f` (currently in branch-5.0) - backport requires also `0e78ad50ea` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10798	2022-06-14 16:13:35 +03:00
Piotr Sarna	61ae0a46e3	Merge 'Three small fixes to Alternator's handling of GSIs... and LSIs' from Nadav Har'El This series includes three small fixes (and of course, tests) for various edge cases of GSI and LSI handling in Alternator: 1. We add the IndexArn that were missing in DescribeTable for indexes (GSI and LSI) 2. We forbid the same name to be used for both GSI and LSI (allowing it was a bug, not a feature) 3. We improve the error handling when trying to tag a GSI or LSI, which is not currently allowed (it's also not allowed in DynamoDB). Closes #10791 * github.com:scylladb/scylla: alternator: improve error handling when trying to tag a GSI or LSI alternator: forbid duplicate index (LSI and GSI) names alternator: add ARN for indexes (LSI and GSI)	2022-06-14 07:39:44 +02:00
Nadav Har'El	c0a09669c1	Merge 'cql3: expr: unify binary operator left-hand-side and right-hand-side evaluation' from Avi Kivity The left-hand-side of a binary_operator is currently evaluated via a get_value() function that receives the row values. On the other hand, the right hand side is evaluated via evaluate(), which receives query_options in order to resolve bind variables. This series unifies the two paths into evaluate(), and standardizes the different inputs into a new evaluation_inputs struct. The old hacks column_value_eval_bag and column_maybe_subscripted are removed. Closes #10782 * github.com:scylladb/scylla: cql3: expr: drop column_maybe_subscripted cql3: expr: possible_lhs_values(): open-code get_value_comparator() cql3: expr: rationalize lhs/rhs argument order cql3: expr: don't rely on grammar when comparing tuples cql3: expr: wire column_value and subscript to evaluate() cql3: get_value(subscript): remove gratuitous pointer cql3: expr: reindent get_value(subscript) cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted) cql3: raw_value: add missing conversion from managed_bytes_opt&& cql3: prepare_expr: prepare subscript type cql3: expr: drop internal 'column_value_eval_bag' cql3: expr: change evalute() to accept evaluation_inputs cql3: expr: make evaluate(<expression subtype>) static cql3: expr: push is_satisfied_by regular and static column extraction to callers cql3: expr: convert is_satisfied_by() signature to evaluation_inputs cql3: expr: introduce evaluation_inputs	2022-06-13 23:07:57 +03:00
Nadav Har'El	e20233dab1	alternator: improve error handling when trying to tag a GSI or LSI In issue #10786, we raised the idea of maybe allowing to tag (with TagResource) GSIs and LSIs, not just base tables. However, currently, neither DynamoDB nor Scylla allows it. So in this patch we add a test that confirms this. And while at it, we fix Alternator to return the same error message as DynamoDB in this case. Refs #10786. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-13 18:14:42 +03:00
Nadav Har'El	8866c326de	alternator: forbid duplicate index (LSI and GSI) names Adding an LSI and GSI with the same name to the same Alternator table should be forbidden - because if both exists only one of them (the GSI) would actually be usable. DynamoDB also forbids such duplicate name. So in this patch we add a test for this issue, and fix it. Since the patch involves a few more uses of the IndexName string, we also clean up its handling a bit, to use std::string_view instead of the old-style std::string&. Fixes #10789 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-13 18:14:42 +03:00
Nadav Har'El	00866a75d8	alternator: add ARN for indexes (LSI and GSI) DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base table, and INDEXNAME is the name of the LSI or the GSI. These ARNs should be returned by DescribeTable as part of its description of each index, and this patch adds that missing IndexArn field. The ARN we're adding here is hardly useful (e.g., as explained in issue #10786, it can't be used to add tags to the index table), but nevertheless should exist for compatibility with DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-13 18:14:42 +03:00
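The ARN layout described in the commit above (BASEARN/index/INDEXNAME) can be sketched as a one-line helper. This is only an illustration of the string format, not Alternator's actual code; the function name `index_arn` and the sample base ARN are hypothetical.

```python
def index_arn(base_arn: str, index_name: str) -> str:
    """Build an index ARN of the form BASEARN/index/INDEXNAME."""
    return f"{base_arn}/index/{index_name}"


# Hypothetical base-table ARN, used only to show the resulting shape.
example_base_arn = "arn:example:table/my_table"
example_index_arn = index_arn(example_base_arn, "my_gsi")
```

The same helper covers both LSIs and GSIs, since the commit describes a single format for the two index kinds.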
Benny Halevy	8f39547d89	compaction_manager: task: convert semaphore_aborted to compaction_stopped exception Fixes #10666 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10686	2022-06-13 16:20:39 +03:00
Botond Dénes	b820aad3e0	Merge 'test/cql-pytest: skip another test on older, buggy, drivers' from Nadav Har'El Older versions of the Python Cassandra driver had a bug where a single empty page aborts a scan. The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately tiny pages, so it turns out that some of them are empty, so the test breaks on buggy versions of the driver, which cause the test to fail when run by developers who happen to have old versions of the driver. So in this small series we skip this test when running on a buggy version of the driver. Fixes #10763 Closes #10766 * github.com:scylladb/scylla: test/cql-pytest: skip another test on older, buggy, drivers test/cql-pytest: de-duplicate code checking for an old buggy driver	2022-06-13 16:06:11 +03:00
Avi Kivity	8edb79ea80	Merge 'Reduce compaction serialization' from Mikołaj Sielużycki update_history can take a long time compared to compaction, as a call issued on shard S1 can be handled on shard S2. If the other shard is under heavy load, we may unnecessarily block kicking off a new compaction. Normally it isn't a problem, as compactions aren't super frequent, but there were edge cases where the described behaviour caused compaction to fail to keep up with excessive flushing, leading to too many sstables on disk and OOM during a read. There is no need to wait with next compaction until history is updated, so release the weight earlier to remove unnecessary serialization. Changelog: v3: - explicitly call deregister instead of moving the weight RAII object to release weight - mark compaction as finished when sstables are compacted, without waiting for history to update v2: - Split the patches differently for easier review - Rebased against newer master, which contains fixes that failed the debug version of the test - Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717) Closes #10507 * github.com:scylladb/scylla: compaction: Release compaction weight before updating history. compaction: Inline compact_sstables_and_update_history call. compaction: Extract compact_sstables function compaction: Rename compact_sstables to compact_sstables_and_update_history compaction: Extract update_history function compaction: Extract should_update_history function. compaction: Fetch start_size from compaction_result compaction: Add tracking start_size in compaction_result.	2022-06-13 16:04:20 +03:00
Takuya ASADA	5643c6de56	scylla_util.py: fix "systemctl is-active" causes error In `48b6aec16a` we mistakenly allowed check=True in systemd_unit.is_active(); it should be check=False. We check a unit's status from the output string of "systemctl is-active", which is "active" or "inactive". But the systemctl command returns a non-zero status when it prints "inactive", so we get an exception here. To fix this, we need a new option "ignore_error=True" for out(), and use it in systemd_unit.is_active(). Fixes #10455 Closes #10467	2022-06-13 13:45:50 +03:00
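The failure mode in the commit above can be sketched with the standard `subprocess` module: `systemctl is-active` exits non-zero for an inactive unit, so the status has to be read from the output string rather than enforced through `check=True`. This is a minimal sketch and not the real `scylla_util.py` helper; the standalone `is_active` function and its injectable `run` parameter are assumptions made so the example can be exercised without systemd.

```python
import subprocess


def is_active(unit: str, run=subprocess.run) -> bool:
    """Return True if `systemctl is-active <unit>` prints "active"."""
    # check=False: a non-zero exit status (reported for an inactive unit)
    # must not raise CalledProcessError; only the printed state matters.
    result = run(
        ["systemctl", "is-active", unit],
        check=False,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() == "active"
```

With `check=True`, the same call would raise as soon as the unit is inactive, which is the exception the commit describes.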
Botond Dénes	43d23d797d	scylla-gdb.py: make scylla-threads more flexible Currently the scylla-threads command just lists all threads on all shards. This is usually more than what one wants. This patch adds support for listing threads only on the current shard (new default), on a specific shard or all shards (old default). Also, optionally the thread functor's vtable symbol can be added to the listing. This makes the listing more informative but much more bloated as well. Which is why it's opt-in. Example listing (no vtable symbols): (gdb) scylla threads [shard 5] (seastar::thread_context) 0x60100035c900, stack: 0x60100bd20000 [shard 5] (seastar::thread_context) 0x6010084a3800, stack: 0x601008c80000 [shard 5] (seastar::thread_context) 0x60100037c900, stack: 0x60100c640000 [shard 5] (seastar::thread_context) 0x60100035d200, stack: 0x60100d9e0000 [shard 5] (seastar::thread_context) 0x60100372d980, stack: 0x60100ad60000 [shard 5] (seastar::thread_context) 0x601000110d80, stack: 0x601009be0000 [shard 5] (seastar::thread_context) 0x6010084cd680, stack: 0x60100a160000 [shard 5] (seastar::thread_context) 0x6010000dc780, stack: 0x60100a2e0000 [shard 5] (seastar::thread_context) 0x6010084cca80, stack: 0x60100a1c0000 [shard 5] (seastar::thread_context) 0x6010084cc000, stack: 0x60100ab40000 [shard 5] (seastar::thread_context) 0x60100038ca80, stack: 0x601009860000 [shard 5] (seastar::thread_context) 0x60100037db00, stack: 0x60100a820000 Example listing with vtable symbols: (gdb) scylla threads -v [shard 5] (seastar::thread_context) 0x60100035c900, stack: 0x60100bd20000 vtable: 0x478520 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}>(seastar::thread_attributes, 
sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x6010084a3800, stack: 0x601008c80000 vtable: 0x478520 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}>(seastar::thread_attributes, sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x60100037c900, stack: 0x60100c640000 vtable: 0x478520 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}>(seastar::thread_attributes, sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x60100035d200, stack: 0x60100d9e0000 vtable: 0x4784f0 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_1>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> 
>)::$_1&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x60100372d980, stack: 0x60100ad60000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x601000110d80, stack: 0x601009be0000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x6010084cd680, stack: 0x60100a160000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x6010000dc780, stack: 0x60100a2e0000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x6010084cca80, stack: 0x60100a1c0000 vtable: 0x582ca8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::view::view_update_generator::start()::$_0>(seastar::thread_attributes, db::view::view_update_generator::start()::$_0&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x6010084cc000, stack: 0x60100ab40000 vtable: 0x5649a8 <seastar::noncopyable_function<void 
()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x60100038ca80, stack: 0x601009860000 vtable: 0x8c6088 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<std::_Bind<void (seastar::tls::reloadable_credentials_base::reloading_builder::(seastar::tls::reloadable_credentials_base::reloading_builder))()>>(seastar::thread_attributes, std::_Bind<void (seastar::tls::reloadable_credentials_base::reloading_builder::(seastar::tls::reloadable_credentials_base::reloading_builder))()>&&)::{lambda()#1}>::s_vtable> [shard 5] (seastar::thread_context) 0x60100037db00, stack: 0x60100a820000 vtable: 0x569bd8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::space_watchdog::start()::$_2>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_2&&)::{lambda()#1}>::s_vtable> Closes #10776	2022-06-13 13:05:27 +03:00
Avi Kivity	ee2420ff43	messaging: add boilerplate to rpc_protocol_impl.hh License, copyright, #pragma once. The copyright is set to 2021 since that was when the file was created. Closes #10778	2022-06-13 07:29:32 +02:00
Piotr Sarna	ddf83f6ddc	scripts: make pull_github_pr.sh more universally usable After `93b765f655`, our pull_github_pr.sh script tries to detect a non-orthodox remote repo name, but it also adds an assumption which breaks on some configurations (by some I mean mine). Namely, the script tries to parse the repo name from the upstream branch, assuming that current HEAD actually points to a branch, which is not the way some users (by some I mean me) work with remote repositories. Therefore, to make the script also work with detached HEAD, it now has two fallback mechanisms: 1. If parsing @{upstream} failed, the script tries to parse master@{upstream}, under the assumption that the master branch was at least once used to track the remote repo. 2. If that fails, `origin/master` is used as last resort solution. This patch allows some users (guess who) to get back to using scripts/pull_github_pr.sh again without using a custom patched version. Closes #10773	2022-06-13 08:15:40 +03:00
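The two-step fallback described in the commit above can be sketched as a small function. The real script is a shell script; this Python version only illustrates the order of the attempts. The `resolve` callback is hypothetical and stands in for whatever the script uses to turn a ref such as `@{upstream}` into a remote branch name, returning `None` when that fails (for example, on a detached HEAD).

```python
from typing import Callable, Optional


def pick_remote_ref(resolve: Callable[[str], Optional[str]]) -> str:
    """Return the first upstream ref that resolves, else 'origin/master'."""
    # 1. the current branch's upstream, 2. master's upstream.
    for candidate in ("@{upstream}", "master@{upstream}"):
        ref = resolve(candidate)
        if ref:
            return ref
    # Last-resort fallback named in the commit message.
    return "origin/master"
```

Passing the resolver in as a parameter keeps the fallback order separate from how refs are actually looked up.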
Avi Kivity	06a62b150d	sstables: processing_result_generator: prefer standard coroutines over the technical specification with clang 14 Clang up to version 13 supports the coroutines technical specification (in std::experimental). 15 and above support standard coroutines (in namespace std). Clang 14 supports both, but with a warning for the technical specification coroutines. To avoid the warning, change the threshold for selecting standard coroutines from clang 15 to clang 14. This follows seastar commit 070ab101e2. Closes #10647	2022-06-12 20:05:28 +03:00
Avi Kivity	6d943e6cd0	cql3: expr: drop column_maybe_subscripted column_maybe_subscripted is a variant<column_value, subscript> that existed for two reasons: 1. evaluation of subscripts and of columns took different paths. 2. calculation of the type of column or column[sub] took different paths. Now that all evaluations go through evaluate(), and the types are present in the expression itself, there is no need for column_maybe_subscripted and it is replaced with plain expressions.	2022-06-12 19:21:28 +03:00
Avi Kivity	2aa9199e9a	cql3: expr: possible_lhs_values(): open-code get_value_comparator() get_value_comparator() is going away soon, so open-code it here. It's not doing much anyway.	2022-06-12 19:14:50 +03:00
Avi Kivity	b1c12073b1	cql3: expr: rationalize lhs/rhs argument order Some functions accept the right-hand-side as the first argument and the left-hand-side as the second argument. This is now confusing, but at least safe-ish, as the arguments have different types. It's going to become dangerous when we switch to expressions for both sides, so let's rationalize it by always starting with lhs. Some parameters were annotated with _lhs/_rhs when it was not clear.	2022-06-12 18:55:24 +03:00
Avi Kivity	9beac1df53	cql3: expr: don't rely on grammar when comparing tuples The grammar only allows comparing tuples of clustering columns, which are non-null, but let's not rely on that deep in expression evaluation as it can be relaxed.	2022-06-12 18:41:03 +03:00
Avi Kivity	9a4f2a8cc3	cql3: expr: wire column_value and subscript to evaluate() With everything standardized on evaluation_inputs(), it's a matter of calling get_value().	2022-06-12 18:21:04 +03:00
Avi Kivity	30721fdc4a	cql3: get_value(subscript): remove gratuitous pointer While extracting get_value(subscript) we inherited a pointer due to the calling convention, we can now remove it.	2022-06-12 18:18:59 +03:00
Avi Kivity	dd2fec9cb1	cql3: expr: reindent get_value(subscript) Whitespace only change.	2022-06-12 18:04:12 +03:00
Avi Kivity	31b9e2a565	cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted) We wish to wire get_value(subscript) into evaluate (and get rid of column_maybe_subscripted).	2022-06-12 18:03:03 +03:00
Avi Kivity	844756d4fe	cql3: raw_value: add missing conversion from managed_bytes_opt&& Conversion from const managed_bytes_opt& already exists.	2022-06-12 17:55:48 +03:00
Avi Kivity	248433d7e0	cql3: prepare_expr: prepare subscript type The type of a subscript expression is the value comparator of the expression (column) being subscripted, according to our weird naming.	2022-06-12 17:39:08 +03:00
Avi Kivity	b5287db8ea	cql3: expr: drop internal 'column_value_eval_bag' is_satisfied_by() used an internal column_value_eval_bag type that was more awkwardly named (and more awkward to use due to more nesting) than evaluation_inputs. Drop it and use evaluation_inputs throughout. The thunk is_satisfied_by(evaluation_inputs) that just called is_satisfied_by(column_value_eval_bag) is dropped.	2022-06-12 17:12:41 +03:00
Avi Kivity	55085906ca	cql3: expr: change evalute() to accept evaluation_inputs Currently, evaluate() accepts only query_options, which makes it not useful to evaluate columns. As a result some callers (column_condition) have to call it directly on the right-hand-side of binary expressions instead of evaluating the binary expression itself. Change it to accept evaluation_inputs as a parameter, but keep the old signature too, since it is called from many places that don't have rows.	2022-06-12 16:51:42 +03:00
Avi Kivity	2ecdb219fb	cql3: expr: make evaluate(<expression subtype>) static They aren't called from anywhere outside expression.cc, and we're playing with the signatures, so hide them to avoid rebuilds.	2022-06-12 16:13:20 +03:00
Avi Kivity	c80999fab4	cql3: expr: push is_satisfied_by regular and static column extraction to callers is_satisfied_by() rearranges the static and regular columns from query::result_row_view form (which is a use-once iterator) to std::vector<managed_bytes_opt> (which uses the standard value representation, and allows random access which expression evaluation needs). Doing it in is_satisfied_by() means that it is done every time an expression is evaluated, which is wasteful. It's also done even if the expression doesn't need it at all. Push it out to callers, which already eliminates some calls. We still pass cql3::expr::selection, which is a layering violation, but that is left to another time. Note that in view.cc's check_if_matches(), we should have been able to move static_and_regular_columns calculation outside the loop. However, we get crashes if we do. This is likely due to a preexisting bug (which the zero iterations loop avoids). However, in selection.cc, we are able to avoid the computation when the code claims it is only handling partition keys or clustering keys.	2022-06-12 16:12:41 +03:00
Avi Kivity	4b715226fe	cql3: expr: convert is_satisfied_by() signature to evaluation_inputs Callers are converted, but the internals are kept using the old conventions until more APIs are converted. Although the new API allows passing no query_options, the view code keeps passing dummy query_options and improvement is left as a FIXME.	2022-06-12 12:53:44 +03:00
Avi Kivity	7a9b645d64	cql3: expr: introduce evaluation_inputs An expression may refer to values provided externally: the partition and clustering keys, the static and regular row (all providing column values), and the query options (providing values for bind variables). Currently, different evaluation functions (evaluate(), get_value(), and is_satisfied_by()) receive different subsets of these values. As a first step towards unifying the various ways to evaluate an expression, collect the parameters in a single structure. Since different evaluation contexts have different subsets, make everything optional (via a pointer). Note that callers are expected to verify using the grammar or prepare phase that they don't refer to values that are not provided. The cql3::selection::selection parameter is provided to translate from query::result_row_view to schema column indexes. This is pretty bad since it means the translation needs to be done for every evaluation and is therefore a candidate for removal, but is kept here since that's how it's currently done.	2022-06-12 12:47:23 +03:00
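The idea of one structure with all-optional members, as described in the commit above, can be sketched in Python with a dataclass. The real `evaluation_inputs` is a C++ struct that uses pointers for optionality; every field name below, the `dict` standing in for query options, and the `bind_variable` helper are hypothetical and only show the shape of the design.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EvaluationInputs:
    # Every member is optional: different evaluation contexts supply
    # different subsets of these values.
    partition_key: Optional[Sequence[bytes]] = None
    clustering_key: Optional[Sequence[bytes]] = None
    static_and_regular_columns: Optional[Sequence[Optional[bytes]]] = None
    options: Optional[dict] = None  # stands in for query options (bind variables)


def bind_variable(inputs: EvaluationInputs, name: str) -> bytes:
    """Look up a bind variable; the caller must have provided options."""
    if inputs.options is None:
        raise ValueError("expression refers to bind variables, but no options were provided")
    return inputs.options[name]
```

In the commit, the check that an expression only refers to provided values is the caller's responsibility; the `ValueError` here merely makes a violation visible in the sketch.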
Kamil Braun	e87ca733f0	Merge 'test.py: fix bugs, add support for flaky tests' from Konstantin Osipov Marking a test as flaky allows to keep running it in CI rather than disable it when it's discovered that a test is flaky. Flaky tests, if they fail, show up as flaky in the output, but don't fail the CI. ``` kostja@hulk:~/work/scylla/scylla$ ./test.py cdc_with --repeat=30 --verbose Found 30 tests. ================================================================================ [N/TOTAL] SUITE MODE RESULT TEST ------------------------------------------------------------------------------ [1/30] cql debug [ FLKY ] cdc_with_lwt_test.2 9.36s [2/30] cql debug [ FLKY ] cdc_with_lwt_test.1 9.53s [3/30] cql debug [ PASS ] cdc_with_lwt_test.7 9.37s [4/30] cql debug [ PASS ] cdc_with_lwt_test.8 9.41s [5/30] cql debug [ PASS ] cdc_with_lwt_test.10 9.76s [6/30] cql debug [ FLKY ] cdc_with_lwt_test.9 9.71s ``` Closes #10721 * github.com:scylladb/scylla: test.py: add support for flaky tests test.py: make Test hierarchy resettable test.py: proper suite name in the log test.py: shutdown cassandra-python connection before exit	2022-06-10 19:00:36 +02:00
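The reporting rule in the merge above (a failing flaky test shows up as flaky but does not fail CI) can be sketched as follows. This is not test.py's implementation; the function names and the `FAIL` label are assumptions, while `PASS` and `FLKY` are taken from the sample output in the message.

```python
from typing import Iterable


def classify(passed: bool, marked_flaky: bool) -> str:
    """Map one test outcome to a result label."""
    if passed:
        return "PASS"
    # A failure of a test marked as flaky is reported, but kept separate
    # from a real failure.
    return "FLKY" if marked_flaky else "FAIL"


def ci_succeeded(labels: Iterable[str]) -> bool:
    """CI fails only on real failures, not on flaky ones."""
    return all(label != "FAIL" for label in labels)
```

For example, a run containing only `PASS` and `FLKY` results would still count as a successful CI run under this rule.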
Konstantin Osipov	8036d19b84	test.py: add support for flaky tests The idea is that a flaky test can be marked as flaky rather than disabled to make sure it passes in CI. This reduces chances of a regression being added while the flakiness is being resolved and the number of disabled tests doesn't grow.	2022-06-10 14:10:21 +03:00
Konstantin Osipov	4cf63efe6c	test.py: make Test hierarchy resettable Introduce reset() hierarchy, which is similar to __init__(), i.e. allows to reset test execution state before retrying it. Useful for retrying flaky tests.	2022-06-10 14:10:21 +03:00
Konstantin Osipov	2b92d96c87	test.py: proper suite name in the log Use a nice suite name rather than an internal Python object key in the log. Fixes a regression introduced when addressing a style-related review remark.	2022-06-10 14:10:21 +03:00
Konstantin Osipov	950d606e38	test.py: shutdown cassandra-python connection before exit Shutdown cassandra-python connections before exit, to avoid warnings/exceptions at shutdown. Cassandra-python runs a thread pool and if connections are not shut down before exit, there could be a warning that the thread pool is not destroyed before exiting main.	2022-06-10 14:10:21 +03:00
Kamil Braun	082f9889b4	Merge 'tools/schema_loader: add support for CDC tables' from Botond Dénes CDC tables use a custom partitioner, which is not reflected in schema dumps (`CREATE TABLE ...`) and currently it is not possible to fix this properly, as we have no syntax to set the partitioner for a table. To work around this, the schema loader determines whether a table is a cdc table based on its name (does it end with `_scylla_cdc_table`) and sets the partitioner manually if it is the case. Fixes: https://github.com/scylladb/scylla/issues/9840 Closes #10774 * github.com:scylladb/scylla: tools/schema_loader: add support for CDC tables cdc/log.hh: expose is_log_name()	2022-06-10 13:04:38 +02:00
Kamil Braun	aeba88cc29	Merge 'test.py: fixes for connection handling' from Alecco Change port type passed to Cassandra Python driver to int to avoid format errors in exceptions. Manually shutdown connections to avoid reconnects after tests are done (required by upcoming async pytests). Tests: (dev) Closes #10722 * github.com:scylladb/scylla: test.py: shutdown connection manually test.py: fix port type passed to Cassandra driver	2022-06-10 11:40:47 +02:00
Botond Dénes	b3d6a182e4	tools/schema_loader: add support for CDC tables CDC tables use a custom partitioner, which is not reflected in schema dumps (`CREATE TABLE ...`) and currently it is not possible to fix this properly, as we have no syntax to set the partitioner for a table. To work around this, the schema loader determines whether a table is a cdc table based on its name (does it end with `_scylla_cdc_table`) and sets the partitioner manually if it is the case.	2022-06-10 10:57:55 +03:00
Botond Dénes	f8a8fe41d6	cdc/log.hh: expose is_log_name() Allow outside code to use it to determine whether a table is cdc or not. This is currently the most reliable method if the custom partitioner is not set on the schema of the investigated table.	2022-06-10 10:57:12 +03:00
Botond Dénes	1c8c693ff7	Merge "Redefine Leveled compaction backlog" from Raphael S. Carvalho " This series is a consequence of the work started by: "compaction: LCS: Fix inefficiency when pushing SSTables to higher levels" `9de7abdc80` "Redefine Compaction Backlog to tame compaction aggressiveness" `d8833de3bb` The backlog definition for leveled is incorrectly built on the assumption that the world must reach the state of zero amplification, i.e. everything in the last level. The actual goal is space amplification of 1.1. In reality, LCS just wants that for every level L, level L is fan_out=10 times larger than L-1. See more in commit `9de7abdc80` which adjusts LCS to conform to this goal. If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1 But today, LCS calculates high backlog for the layout above, as it will only be satisfied once everything is promoted to the maximum level. That's completely disconnected from what the strategy actually wants. Therefore, a mismatch. With today's definition, the backlog for any SSTable is: sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out where Lmax = maximum level, and fan_out = LCS' fan out which is 10 by default That's essentially calculating the total cost for data in the SSTable to climb up to the maximum level. Of course, if a SSTable is at the maximum level, (Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero. Take a look at this example: If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G 0.16G = LCS' default fragment size Maximum level (Lmax in formula) can be easily 3 as: log10 of (30G/0.16G=~187 sstables)) = ~2.27 ~2.27 means that data has exceeded level 2 capacity and so needs 3 levels. So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk memory ratio), that's normalized backlog of ~15, which translates into additional ~500 shares. 
That's halfway to full compaction speed. With more files in higher levels, we can easily get to a normalized backlog above 30, resulting in 1k shares. The suboptimal backlog definition causes either table using LCS or coexisting tables to run with more shares than needed, causing compaction to steal resources, resulting in higher latency and reduced throughput. To solve this problem, a new formula is used which will basically calculate the amount of work needed to achieve the layout goal. We no longer want to promote everything to the last level, but instead we'll incrementally calculate the backlog in each level L, which is the amount of work needed such that the next level L + 1 is at least fan_out times bigger. Fixes #10583. Results ===== image: https://user-images.githubusercontent.com/1409139/168713675-d5987d09-7011-417c-9f91-70831c069382.png The patched version correctly clears the backlog, meaning that once LCS is satisfied, backlog is 0. Therefore, next compaction either from this table or another won't run unnecessarily aggressive. p99 read and write latency have clearly improved. throughput is also more stable. " * 'LCS_backlog_revamp' of https://github.com/raphaelsc/scylla: tests: sstable_compaction_test: Adjust controller unit test for LCS compaction: Redefine Leveled compaction backlog	2022-06-10 09:21:13 +03:00
Raphael S. Carvalho	079283193a	tests: sstable_compaction_test: Adjust controller unit test for LCS The controller unit test for LCS was only creating level 0 SSTables. As level 0 falls back to STCS controller, it means that we weren't actually testing LCS controller. So let's adjust the unit test to account for LCS fan_out, which is 10 instead of 4, and also allow creation of SSTables on higher levels. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-06-09 14:21:40 -03:00
Raphael S. Carvalho	b27a1d88fe	compaction: Redefine Leveled compaction backlog The backlog definition for leveled is incorrectly built on the assumption that the world must reach the state of zero amplification, i.e. everything in the last level. The actual goal is space amplification of 1.1. In reality, LCS just wants that for every level L, level L is fan_out=10 times larger than L-1. See more in commit `9de7abdc80` which adjusts LCS to conform to this goal. If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1 But today, LCS calculates high backlog for the layout above, as it will only be satisfied once everything is promoted to the maximum level. That's completely disconnected from what the strategy actually wants. Therefore, a mismatch. With today's definition, the backlog for any SSTable is: sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out where Lmax = maximum level, and fan_out = LCS' fan out which is 10 by default That's essentially calculating the total cost for data in the SSTable to climb up to the maximum level. Of course, if a SSTable is at the maximum level, (Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero. Take a look at this example: If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G 0.16G = LCS' default fragment size Maximum level (Lmax in formula) can be easily 3 as: log10 of (30G/0.16G=~187 sstables)) = ~2.27 ~2.27 means that data has exceeded level 2 capacity and so needs 3 levels. So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk memory ratio), that's normalized backlog of ~15, which translates into additional ~500 shares. That's halfway to full compaction speed. With more files in higher levels, we can easily get to a normalized backlog above 30, resulting in 1k shares. 
The suboptimal backlog definition causes either table using LCS or coexisting tables to run with more shares than needed, causing compaction to steal resources, resulting in higher latency and reduced throughput. To solve this problem, a new formula is used which will basically calculate the amount of work needed to achieve the layout goal. We no longer want to promote everything to the last level, but instead we'll incrementally calculate the backlog in each level L, which is the amount of work needed such that the next level L + 1 is at least fan_out times bigger. Fixes #10583. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-06-09 14:21:40 -03:00
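The arithmetic in the commit message above can be checked with a small sketch. It reproduces only the pre-patch formula and the layout goal as the message states them; the function names (`old_backlog`, `levels_needed`, `space_amplification`, `meets_layout_goal`) are hypothetical and are not taken from the ScyllaDB source, and the new per-level backlog calculation is not modeled here.

```python
import math

FAN_OUT = 10  # LCS default fan-out, as stated in the commit message


def old_backlog(sstable_size_gb, level, max_level, fan_out=FAN_OUT):
    """Pre-patch definition: sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out."""
    return sstable_size_gb * (max_level - level) * fan_out


def levels_needed(total_gb, fragment_gb, fan_out=FAN_OUT):
    """Levels needed for the data set, e.g. log10(30G / 0.16G) = ~2.27, rounded up to 3."""
    return math.ceil(math.log(total_gb / fragment_gb, fan_out))


def space_amplification(level_sizes_gb):
    """Total data size divided by the size of the largest level."""
    return sum(level_sizes_gb) / max(level_sizes_gb)


def meets_layout_goal(level_sizes_gb, fan_out=FAN_OUT):
    """True if each level is at least fan_out times larger than the level below.

    level_sizes_gb is ordered from level 0 upwards (hypothetical convention).
    """
    return all(upper >= fan_out * lower
               for lower, upper in zip(level_sizes_gb, level_sizes_gb[1:]))


if __name__ == "__main__":
    lmax = levels_needed(30, 0.16)
    print("levels needed:", lmax)
    print("old backlog of one L0 sstable (G):", old_backlog(0.16, 0, lmax))
    layout = [1, 10, 100, 1000]  # level 0 .. level 3, in G
    print("space amplification:", space_amplification(layout))
    print("layout goal met:", meets_layout_goal(layout))
```

Under the old formula a single 0.16G level-0 SSTable already contributes about 4.8G of backlog, even though the 1G/10G/100G/1000G layout satisfies the fan-out goal and so, per the message, should report none.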
Asias He	e0aca10bcd	streaming: Enable auto off strategy compaction trigger for all rbno ops Since commit `3dc9a81d02` (repair: Repair table by table internally), tables are always repaired one after another. This means a table is repaired in a continuous manner. Unlike before, when a table would be repaired again only after the other tables had finished the same range. ``` for range in ranges for table in tables repair(range, table) ``` The wait interval could be large, so we could not rely on the assumption that if there is no repair traffic, the whole table is finished. After commit `3dc9a81d02`, we can utilize the fact that a table is repaired continuously and trigger off strategy automatically when no repair traffic for a table is present. This is especially useful for decommission operation with multiple tables. Currently, we only notify the peer node that the decommission is done and ask the peer to trigger off strategy compaction. With this patch, the peer node will trigger automatically after a table is finished, reducing the number of temporary sstables on disk. Refs #10462 Closes #10761	2022-06-09 17:10:14 +03:00
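To illustrate why the iteration order matters for the automatic trigger, here is a minimal sketch comparing the range-major loop quoted in the commit message with a table-major loop. The table-major order is an assumption about what "table by table" means, and all names (`range_major_order`, `table_major_order`, `finished_after`) are hypothetical, not ScyllaDB APIs.

```python
def range_major_order(ranges, tables):
    """Order from the quoted pseudocode: for each range, repair every table."""
    return [(r, t) for r in ranges for t in tables]


def table_major_order(ranges, tables):
    """Assumed table-by-table order: all ranges of one table, then the next table."""
    return [(r, t) for t in tables for r in ranges]


def finished_after(order, table):
    """Index of the last step touching `table`; the table is only done after it."""
    return max(i for i, (_, t) in enumerate(order) if t == table)


if __name__ == "__main__":
    ranges = [0, 1, 2]
    tables = ["a", "b"]
    for name, order in (("range-major", range_major_order(ranges, tables)),
                        ("table-major", table_major_order(ranges, tables))):
        print(name, order, "table 'a' done after step", finished_after(order, "a"))
```

In the range-major order the first table is not finished until the final range is reached, so a pause in its repair traffic says little; in the table-major order its steps are contiguous, which is the property the commit relies on.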
Avi Kivity	afc06f0017	messaging: forward-declare types in messaging_service.hh messaging_service.hh is a switchboard - it includes many things, and many things include it. Therefore, changes in the things it includes affect many translation units. Reduce the dependencies by forward-declaring as much as possible. This isn't pretty, but it reduces compile time and recompilations. Other headers adjusted as needed so everything (including `ninja dev-headers`) still compiles. Closes #10755	2022-06-09 15:52:12 +03:00
Nadav Har'El	e9b6171b51	Merge 'cql3: expr: unify left-hand-side and right-hand-side of binary_operator prepares' from Avi Kivity Currently, preparing the left-hand-side of a binary operator and the right-hand-side use different code paths. The left-hand-side derives the type of the expression from the expression itself, while the right-hand-side imposes the type on the expression (allowing the types of bind variables to be inferred). This series unifies the two, by making the imposed type (the "receiver") optional, and by allowing prepare to fail gracefully if we were not able to infer the type. The old prepare_binop_lhs() is removed and replaced with prepare_expression, already used for the right hand side. There is one step remaining, and that is to replace prepare_binary_operator with prepare_expression, but that is more involved and is left for a follow-up. Closes #10709 * github.com:scylladb/scylla: cql3: expr: drop prepare_binop_lhs() cql3: expr: move implementation of prepare_binop_lhs() to try_prepare_expression() cql3: expr: use recursive descent when preparing subscripts cql3: expr: allow prepare of tuple_constructor with no receiver cql3: expr: drop no longer used printable_relation parameter from prepare_binop_lhs() cql3: expr: print only column name when failing to resolve column cql3: expr: pass schema to prepare_expression cql3: expr: prepare_binary_operator: drop unused argument ctx cql3: expr: stub type inference for prepare_expression cql3: expr: introduce type_of() to fetch the type of an expression cql3: expr: keep type information in casts cql3: expr: add type field to subscript, field_selection, and null expressions cql3: expr: cast: use data_type instead of cql3_type for the prepared form	2022-06-09 15:38:50 +03:00
Nadav Har'El	b2444b6e9f	test/cql-pytest: skip another test on older, buggy, drivers Older versions of the Python Cassandra driver had a bug, detected by the driver_bug_1 fixture, where a single empty page aborts a scan. The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately tiny pages, so it turns out that some of them are empty, so the test breaks on buggy versions of the driver, which causes the test to fail when run by developers who happen to have old versions of the driver. So in this patch we use the driver_bug_1 fixture, to skip this test when running on a buggy version of the driver. Fixes #10763 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-09 14:37:45 +03:00
Nadav Har'El	c8a3d0758a	test/cql-pytest: de-duplicate code checking for an old buggy driver We have in test_filtering.py two tests which fail when running on an old version of the Python driver which has a specific bug, so we skip those tests if the buggy driver is installed. But the code to check the driver version is duplicated twice, so in this patch we move the version-checking-and-skipping code to a fixture, which we can use twice. The motivation is that in the next patch we will want to introduce a third use of the same code - and a fixture is cleaner than a third duplicate. This patch is supposed to be code-movement only, without functional changes. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-09 14:23:40 +03:00
Gleb Natapov	2fa0519991	migration manager: remove unused code	2022-06-09 09:40:55 +03:00
Gleb Natapov	727a9071d8	db/system_distributed_keyspace: do not announce empty schema	2022-06-09 09:40:55 +03:00
Gleb Natapov	6e100d1ea3	main: stop raft before the migration manager Since group0 uses the migration manager to apply commands, we need to stop raft before stopping the migration manager.	2022-06-09 09:40:55 +03:00
Gleb Natapov	70b7b2b4d6	storage_service: do not pass the raft group manager to storage_service constructor Reduce the storage_service's dependency on the raft group manager. The group manager is needed only during bootstrap and in an rpc handler, so pass it to those functions directly.	2022-06-09 09:40:55 +03:00
Gleb Natapov	89fe305888	main: destroy the group0_client after stopping the group0 The group0_client uses the group0 internally and cannot be destroyed until the group0 is stopped, to guarantee there are no ongoing calls into it by the group0_client.	2022-06-09 09:23:53 +03:00
Nadav Har'El	75c2bd78ae	test/alternator: reproducer for GetBatchItem duplicate keys It turns out that DynamoDB forbids requesting the same item more than once in a GetBatchItem request. Trying to do it would obviously be a waste, but DynamoDB outright refuses it - and Alternator currently doesn't (refs #10757). The test currently passes on DynamoDB and fails on Alternator, so it is marked xfail. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10758	2022-06-09 07:04:50 +02:00
Asias He	05a8d382b6	storage_service: Do not call do_batch_log_replay again in unbootstrap We have called do_batch_log_replay in the beginning of unbootstrap. No need to call it again.	2022-06-09 10:07:33 +08:00
Asias He	29c3f10ee4	storage_service: Add log for start and stop of batchlog replay It might take a long time to finish. Log batchlog replay event so we know if batchlog replay goes wrong. Refs #10756	2022-06-09 09:52:18 +08:00
Piotr Sarna	e5956fee8a	Merge 'cql3: column_condition cleanups' from Avi Kivity Cosmetic cleanups for column_condition. Closes #10716 * github.com:scylladb/scylla: cql3: column_condition: deinline constructor cql3: column_condition: rename `column` member	2022-06-08 17:48:07 +02:00
Botond Dénes	f060129223	Merge 'compaction::setup: yield to prevent stalls with large number of sstables' from Benny Halevy As seen in https://github.com/scylladb/scylla/issues/10738, compaction::setup might stall when processing a large number of sstables. Make it a coroutine and maybe_yield to prevent those stalls. Closes #10750 * github.com:scylladb/scylla: compaction: setup: reserve space for _input_sstable_generations compaction: coroutinize setup and maybe yield	2022-06-08 14:54:46 +03:00
Botond Dénes	3d8cd72c97	Merge 'multishard_mutation_query: use coroutine::as_future' from Benny Halevy This series converts try/catch blocks in coroutines for multishard_mutation_query to use coroutine::as_future to get and handle errors, reducing exception handling costs (that are expected on timeouts). It was previously sent to the mailing list. This version (v2) is just a rebase of the v1 series, with one patch dropped as it was already merged to master independently. Closes #10727 * github.com:scylladb/scylla: multishard_mutation_query: do_query: couroutinize save_readers lambda multishard_mutation_query: do_query: prevent exceptions using coroutine::as_future multishard_mutation_query: read_page: prevent exceptions using coroutine::as_future multishard_mutation_query: save_readers: fixup indentation multishard_mutation_query: coroutinize save_readers multishard_mutation_query: lookup_readers: make noexcept multishard_mutation_query: optimize lookup_readers	2022-06-08 14:26:55 +03:00
Konstantin Osipov	29f8ba2c5e	raft: add Raft design nodes to the docs Closes #10504	2022-06-08 12:33:51 +02:00
Benny Halevy	593a192664	compaction: setup: reserve space for _input_sstable_generations We know in advance the maximum number of sstable generations to track, so reserve space for it to prevent vector reallocation for large number of sstables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 10:18:24 +03:00
Benny Halevy	4fac6e0b27	compaction: coroutinize setup and maybe yield To prevent reactor stalls with large number of sstables. Fixes #10738 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 10:12:41 +03:00
Benny Halevy	5babc609c6	multishard_mutation_query: do_query: couroutinize save_readers lambda To keep it simple. It is unlikely to throw. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:31:17 +03:00
Benny Halevy	921092955b	multishard_mutation_query: do_query: prevent exceptions using coroutine::as_future Optimize error handling by preventing exception try/catch using coroutine::as_future. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:31:17 +03:00
Benny Halevy	7a76ba4038	multishard_mutation_query: read_page: prevent exceptions using coroutine::as_future Optimize error handling by preventing exception try/catch using coroutine::as_future to get query::consume_page's result. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:31:15 +03:00
Benny Halevy	817a0f316a	multishard_mutation_query: save_readers: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:23:14 +03:00
Benny Halevy	804d727b8b	multishard_mutation_query: coroutinize save_readers And use smp::invoke_on_all rather than a home-brewed version of parallel_for_each over all shard ids. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:23:14 +03:00
Benny Halevy	22e5352cc2	multishard_mutation_query: lookup_readers: make noexcept So it can be co_awaited efficiently using coroutine::as_future; otherwise, any exceptions will escape `as_future`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:23:14 +03:00
Benny Halevy	ea3935507e	multishard_mutation_query: optimize lookup_readers No need to call _db.invoke_on inside a parallel_for_each loop over all shards. Just use _db.invoke_on_all instead. Besides that, there's no need for a .then continuation for assigning the per-shard reader in _readers[shard]. It can be done by the functor running on each db shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-08 09:23:14 +03:00
Botond Dénes	74587cb1b5	Merge 'Fix stalls during repair with rbno' from Asias He Found with scylla --blocked-reactor-notify-ms 1 during a replace operation with rbno turned on. The stalls seen without this patch set are gone once it is applied. Closes #10737 * github.com:scylladb/scylla: repair: Avoid stall in working_row_hashes repair: Avoid stall in apply_rows_on_master_in_thread	2022-06-08 06:41:51 +03:00
Avi Kivity	836cbf4e86	Update seastar submodule * seastar 2be9677d6e...1424d34c93 (22): > Use tls socket to retrieve distinguished name > perftune.py: remove duplicates in 'append' parameters when we dump an options file > rpc: add an option for an asynchronous connection isolation function > Merge "Add more facilities to RPC tester" from Pavel E > json: wait for writing final characters of a json document > Revert "Use tls socket to retrieve distinguished name" > future.hh: drop unused parameters > core/scollected: initialize _buf explicitly > rpc: remove recursion in do_unmarshall() > coroutine: Fix generator clang compilation > core: Reduce the default blocked-reactor-notify-ms to 25ms > build: group "CMAKE_CXX_*" options together > doc: s/c++dialect/c++-standard/ > test: coroutines: adjust coroutine generator test for gcc > Use tls socket to retrieve distinguished name > coroutine: add an async generator > net/api: s/server_socket::is_listening()/operator bool()/ > net/api: let "server_socket::local_address()" always return an addr > tls_test: Remove unsupported prio string from test case > Merge 'abort_source: assert request_abort called exactly once' from Benny Halevy > coroutines/all: stop using std::aligned_union_t > coroutines/all: ensure the template argument deduction work with clang-15 Closes #10739	2022-06-07 21:54:47 +03:00
Benny Halevy	1daa7820c9	main: shutdown: do not abort on storage_io_error Do not abort in defer_verbose_shutdown if the callback throws storage_io_error, similar and in addition to the system errors handling that was added in `132c9d5933` As seen in https://github.com/scylladb/scylla/issues/9573#issuecomment-1148238291 Fixes #9573 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10740	2022-06-07 16:55:08 +03:00
Mikołaj Sielużycki	4143878558	compaction: Release compaction weight before updating history. update_history can take a long time compared to compaction, as a call issued on shard S1 can be handled on shard S2. If the other shard is under heavy load, we may unnecessarily block kicking off a new compaction. Normally it isn't a problem, as compactions aren't super frequent, but there were edge cases where the described behaviour caused compaction to fail to keep up with excessive flushing, leading to too many sstables on disk and OOM during a read. There is no need to wait with next compaction until history is updated, so release the weight earlier to remove unnecessary serialization. Compaction is marked as finished as soon as sstables are compacted (without waiting for history update).	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	5ce1fd1574	compaction: Inline compact_sstables_and_update_history call. This commit introduces no functional changes and exists solely for clarity of the change in the subsequent commit.	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	533552273a	compaction: Extract compact_sstables function	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	33c5802957	compaction: Rename compact_sstables to compact_sstables_and_update_history	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	9572520d0d	compaction: Extract update_history function	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	537819b7f8	compaction: Extract should_update_history function.	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	447bd8a2e0	compaction: Fetch start_size from compaction_result The start size is calculated during compaction and returned from sstables::compact_sstables, so there is no need to do it twice.	2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki	2edf137f61	compaction: Add tracking start_size in compaction_result.	2022-06-07 12:55:28 +02:00
Petr Gusev	0450974057	cql3_type::raw_collection: handle unknown types first The issue is about handling errors when the user specifies something strange instead of a type, e.g. CREATE TABLE try1 (a int PRIMARY KEY, b list<zzz>): * the error message only talks about collections, while zzz could also be a UDT; * the same error message is given even when zzz is not a valid collection or UDT name. The first point has already been fixed, now Scylla says 'Non-frozen user types or collections are not allowed inside collections: list<zzz>'. This commit fixes the second. Whether the type is a valid UDT or not is checked in cql3_type::raw_ut::prepare_internal, but the 'non-frozen' check triggers first in cql3_type::raw_collection::prepare_internal, before we recursively get to the argument types of the collection. The patch reverses the order here: first we recurse and ensure that the collection argument types are valid, and only then we apply the collection checks. A side effect of this is that the error messages of the checks in raw_collection will include the keyspace name, because it will now be assigned in raw_ut::prepare_internal before them. The patch affects the validation order, so in case of list<zzz<xxx>> the message could be different, but it doesn't seem to be possible according to the CQL grammar. Examples: create type ut2 (a int, b list<ut1>); --> error('Unknown type ks.ut1') create type ut1 (a int); create type ut2 (a int, b list<ut1>); --> error('Non-frozen user types or collections are not allowed inside collections: list<ks.ut1>') create type ut2 (a int, b list<frozen<ut1>>); --> OK Fixes: scylladb#3541 Closes #10726	2022-06-07 11:16:12 +02:00
Avi Kivity	1e7cece837	tools: toolchain: prepare: use buildah multi-arch build instead of bash hacks In `69af7a830b` ("tools: toolchain: prepare: build arch images in parallel"), we added parallel image generation. But it turns out that buildah can do this natively (with the --platform option to specify architectures and --jobs parameter to allow parallelism). This is simpler and likely has better error handling than an ad-hoc bash script, so switch to it. Closes #10734	2022-06-07 11:51:13 +03:00
Asias He	f2c05e21ee	repair: Avoid stall in working_row_hashes Fix the following stall during repair: ``` Reactor stalled for 1 ms on shard 0. Backtrace: [Backtrace #11] {build/release/scylla} 0x4c6deb2: void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./ (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:772 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:791 {build/release/scylla} 0x4c6cb10: seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1366 {build/release/scylla} 0x4c6ddc0: seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1108 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1125 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1349 {build/release/scylla} 0x7f75551bfa1f: ?? 
??:0 {build/release/scylla} 0x37abf12: repair_hash::operator<(repair_hash const&) const at ././repair/hash.hh:30 (inlined by) std::less<repair_hash>::operator()(repair_hash const&, repair_hash const&) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_function.h:400 (inlined by) bool absl::container_internal::key_compare_adapter<std::less<repair_hash>, repair_hash>::checked_compare::operator()<repair_hash, repair_hash, 0>(repair_hash const&, repair_hash const&) const at ./abseil/absl/containe (inlined by) absl::container_internal::SearchResult<int, false> absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >::binary_sear (inlined by) _ZNK4absl18container_internal10btree_nodeINS0_10set_paramsI11repair_hashSt4lessIS3_ESaIS3_ELi256ELb0EEEE13binary_searchIS3_NS0_19key_compare_adapterIS5_S3_E15checked_compareEEENS0_12SearchResultIiXsr23btree_is_key_com (inlined by) absl::container_internal::SearchResult<int, false> absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >::lower_bound (inlined by) absl::container_internal::SearchResult<absl::container_internal::btree_iterator<absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash (inlined by) std::pair<absl::container_internal::btree_iterator<absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >, repair_hash (inlined by) std::pair<absl::container_internal::btree_iterator<absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >, repair_hash (inlined by) operator() at ./repair/row_level.cc:896 (inlined by) 
seastar::future<void> seastar::futurize<void>::invoke<repair_meta::working_row_hashes()::{lambda(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >&)#1}::operator()(absl::btree_set<repa (inlined by) auto seastar::futurize_invoke<repair_meta::working_row_hashes()::{lambda(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >&)#1}::operator()(absl::btree_set<repair_hash, std::less<repai{build/release/scylla} 0x37ac70f: seastar::internal::do_for_each_state<std::_List_iterator<repair_row>, repair_meta::working_row_hashes()::{lambda(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >&) {build/release/scylla} 0x4c7ee64: seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2356 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2769 {build/release/scylla} 0x4c80247: seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2938 {build/release/scylla} 0x4c7f49c: seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2821 {build/release/scylla} 0x4c264d8: seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:265 {build/release/scylla} 0x4c259b1: seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:156 {build/release/scylla} 0xf5c16f: scylla_main(int, char) at ./main.cc:535 {build/release/scylla} 0xf5999a: std::function<int (int, char)>::operator()(int, char**) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:590 (inlined by) main at ./main.cc:1575 {build/release/scylla} 0x27b74: ?? ??:0 {build/release/scylla} 0xf5892d: _start at ??:? 
``` Found with scylla --blocked-reactor-notify-ms 1 Refs #10665	2022-06-07 16:04:50 +08:00
Asias He	45bcacf672	repair: Avoid stall in apply_rows_on_master_in_thread Fix the following stall during repair: ``` Reactor stalled for 3 ms on shard 0. Backtrace: [Backtrace #20] {build/release/scylla} 0x4c6deb2: void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:772 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:791 {build/release/scylla} 0x4c6cb10: seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1366 {build/release/scylla} 0x4c6ddc0: seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1108 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1125 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1349 {build/release/scylla} 0x7f75551bfa1f: ?? ??:0 {build/release/scylla} 0x11293e9: std::default_delete<bytes_ostream::chunk>::operator()(bytes_ostream::chunk) const at database.cc:? 
(inlined by) std::default_delete<bytes_ostream::chunk>::operator()(bytes_ostream::chunk) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85 {build/release/scylla} 0x37b18e6: ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361 (inlined by) ~bytes_ostream at ././bytes_ostream.hh:26 (inlined by) ~frozen_mutation_fragment at ././frozen_mutation.hh:265 (inlined by) std::_Optional_payload_base<frozen_mutation_fragment>::_M_destroy() at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:260 (inlined by) std::_Optional_payload_base<frozen_mutation_fragment>::_M_reset() at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:280 (inlined by) ~_Optional_payload at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:401 (inlined by) ~_Optional_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:472 (inlined by) ~repair_row at ././repair/row.hh:24 (inlined by) void std::destroy_at<repair_row>(repair_row) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_construct.h:88 (inlined by) void std::allocator_traits<std::allocator<std::_List_node<repair_row> > >::destroy<repair_row>(std::allocator<std::_List_node<repair_row> >&, repair_row) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:537 (inlined by) std::__cxx11::_List_base<repair_row, std::allocator<repair_row> >::_M_clear() at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/list.tcc:77 (inlined by) ~_List_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_list.h:499 (inlined by) repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, 
seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int) at ./repair/row_level.cc:1273 {build/release/scylla} 0x37ad9dc: repair_meta::get_row_diff_source_op(seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int, seastar::rpc::sink<repair_hash_with_cmd>&, seastar::rpc::source<repair_row_on_wire_with_cmd>&) at ./repair/row_level.cc:1617 {build/release/scylla} 0x37a2982: repair_meta::get_row_diff_with_rpc_stream(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >, seastar::bool_class<needs_all_rows_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int) at ./repair/row_level.cc:1683 ``` Found with scylla --blocked-reactor-notify-ms 1 Refs #10665	2022-06-07 16:04:50 +08:00
Takuya ASADA	c82da0ea8e	install-dependencies.sh: add scylla-api-client PIP package Add scylla-api-client PIP package, and install into scylla-python3 package. See scylladb/scylla-machine-image#340 Closes #10728 [avi: regenerate frozen toolchain] Closes #10732	2022-06-07 09:43:50 +03:00
Takuya ASADA	ad2344a864	scylla_coredump_setup: support new format of Storage field The Storage field of "coredumpctl info" changed in systemd v248: it now appends "(present)" to the end of the line when the coredump file is available. Fixes #10669 Closes #10714	2022-06-07 02:21:32 +03:00
Avi Kivity	e0670f0bb5	Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652. Closes #10612 * github.com:scylladb/scylla: memtable: Add counters for tombstone compaction memtable, cache: Eagerly compact data with tombstones memtable: Subtract from flushed memory when cleaning mvcc: Introduce apply_resume to hold state for partition version merging test: mutation: Compare against compacted mutations compacting_reader: Drop irrelevant tombstones mutation_partition: Extract deletable_row::compact_and_expire() mvcc: Apply mutations in memtable with preemption enabled test: memtable: Make failed_flush_prevents_writes() immune to background merging	2022-06-07 02:17:09 +03:00
Tomasz Grabiec	0bc45f9666	memtable: Add counters for tombstone compaction	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	beadd248e3	memtable, cache: Eagerly compact data with tombstones When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	9135d1fd1f	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	989ef88e26	mvcc: Introduce apply_resume to hold state for partition version merging Partition version merging is preemptable. It may stop in the middle and be resumed later. Currently, all state is kept inside the versions themselves, in the form of elements in the source version which are yet to be moved. This will change once we add compaction (tombstones with rows) into the merging algorithm. There, state cannot be encoded purely within versions. Consider applying a partition tombstone over a large number of rows. This patch introduces the apply_resume object to hold the state necessary to ensure forward progress in case of preemption. No change in behavior yet.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	374234cf76	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce the exact mutations written, only those which are equivalent after applying compaction.	2022-06-06 19:25:40 +02:00
Tomasz Grabiec	604e720706	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-06 19:23:37 +02:00
Tomasz Grabiec	080c403d0b	mutation_partition: Extract deletable_row::compact_and_expire()	2022-06-06 19:23:37 +02:00
Tomasz Grabiec	0e3c4fc641	mvcc: Apply mutations in memtable with preemption enabled Prerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be deferred, we need to involve snapshots to kick-off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-06 19:23:37 +02:00
Tomasz Grabiec	0e78ad50ea	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries.	2022-06-06 19:23:37 +02:00
Alejo Sanchez	98061c8960	test.py: shutdown connection manually To prevent async scheduling issues of reconnection after tests are done, manually close the connection after fixture ends. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-03 12:09:18 +02:00
Alejo Sanchez	17afcff228	test.py: fix port type passed to Cassandra driver Port is expected to be int, not str. Using a str causes errors for exception formatting. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-03 12:09:06 +02:00
Botond Dénes	605ee74c39	Merge 'sstables: save Scylla version & build id in metadata' from Michael Livshin To provide a reasonably-definitive answer to "what exact version of Scylla wrote this?". Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10712 * github.com:scylladb/scylla: docs: document recently-added Scylla sstable metadata sections sstables: save Scylla version & build id in metadata scylla_sstable: generalize metadata visitor for disk_string build_id: cache the value	2022-06-03 07:49:51 +03:00
Botond Dénes	49215fcff7	Merge 'Remove `flat_mutation_reader` (v1)' from Michael Livshin - Introduce a simpler substitute for `flat_mutation_reader`-resulting-from-a-downgrade that is adequate for the remaining uses but is _not_ a full-fledged reader (does not redirect all logic to an `::impl`, does not buffer, does not really have `::peek()`), so hopefully carries a smaller performance overhead. The name `mutation_fragment_v1_stream` is kind of a mouthful but it's the best I have - (not tests) Use the above instead of `downgrade_to_v1()` - Plug it in as another option in `mutation_source`, in and out - (tests) Substitute deliberate uses of `downgrade_to_v1()` with `mutation_fragment_v1_stream()` - (tests) Replace all the previously-overlooked occurrences of `mutation_source::make_reader()` with `mutation_source::make_reader_v2()`, or with `mutation_source::make_fragment_v1_stream()` where deliberate or still required (see below) - (tests) This series still leaves some tests with `mutation_fragment_v1_stream` (i.e. at v1) where not called for by the test logic per se, because another missing piece of work is figuring out how to properly feed `mutation_fragment_v2` (i.e. range tombstone changes) to `mutation_partition`. While that is not done (and I think it's better to punt on it in this PR), we have to produce `mutation_fragment` instances in tests that `apply()` them to `mutation_partition`, thus we still use downgraded readers in those tests - Remove the `flat_mutation_reader` class and things downstream of it Fixes #10586 Closes #10654 * github.com:scylladb/scylla: fix "ninja dev-headers" flat_mutation_reader ist tot tests: downgrade_to_v1() -> mutation_fragment_v1_stream() tests: flat_reader_assertions: refactor out match_compacted_mutation() tests: ms.make_reader() -> ms.make_fragment_v1_stream() repair/row_level: mutation_fragment_v1_stream() instead of downgrade_to_v1() stream_transfer_task: mutation_fragment_v1_stream() instead of downgrade_to_v1() sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1() mutation_source: add ::make_fragment_v1_stream() introduce mutation_fragment_v1_stream tests: ms.make_reader() -> ms.make_reader_v2() tests: remove test_downgrade_to_v1_clear_buffer() mutation_source_test: fix indentation tests: remove some redundant calls to downgrade_to_v1() tests: remove some to-become-pointless ms.make_reader()-using tests tests: remove some to-become-pointless reader downgrade tests	2022-06-03 07:26:29 +03:00
Michael Livshin	9a541c7c58	docs: document recently-added Scylla sstable metadata sections Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-02 19:40:52 +03:00
Kamil Braun	72f629c2b6	test: cdc_enable_disable_test: remove non-determinism The test sometimes fails because the order of rows in the SELECT results depends on how stream IDs for the different partition keys get generated. In some runs the stream ID for pk=1 may go before the stream ID for pk=4, in some runs the other way. The fix is to use the same partition key but different clustering keys for the different rows. Refs: #10601 Closes #10718	2022-06-02 19:40:07 +03:00
Botond Dénes	0a25a2bff3	sstables: validate_checksums(): more readable checksum mismatch messages Replace: Compressed chunk checksum mismatch at chunk {}, offset {}, for chunk of size {}: expected={}, actual={} With: Compressed chunk checksum mismatch at offset {}, for chunk #{} of size {}: expected={}, actual={} This is a follow-up for #10693. Also bring the uncompressed chunk checksum check messages up to date with the compressed one (which #10693 forgot to do). Another change included is merging the advancement of the chunk index with the iteration over the chunks, so we don't maintain two counters (one in the iterator and an explicit one). Closes #10715	2022-06-02 19:38:39 +03:00
Anna Stuchlik	a309c2a1b6	conf: update the description of the seeds parameter in scylla.yaml Closes #10719	2022-06-02 18:45:11 +03:00
Avi Kivity	bfa8a8efb7	cql3: column_condition: deinline constructor It will be easier to mangle it later out of the header.	2022-06-02 13:11:05 +03:00
Avi Kivity	4f7cbbb54c	cql3: column_condition: rename `column` member Prefix with underscore as a data member.	2022-06-02 13:09:54 +03:00
Michael Livshin	fc1b957367	sstables: save Scylla version & build id in metadata To provide a reasonably-definitive answer to "what exact version of Scylla wrote this?". Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-02 11:21:05 +03:00
Michael Livshin	b60bc8bb8a	scylla_sstable: generalize metadata visitor for disk_string Some metadata fields have interesting types, and some are just strings. There can be more than one string field, which the visitor would not be able to distinguish from one another by type alone, so no reason to make `scylla_metadata::sstable_origin` special. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-02 11:21:05 +03:00
Michael Livshin	80c9455413	build_id: cache the value The CPU cost of iterating over the relevant ELF structures is probably negligible (despite the amount of code involved), but there is no need to keep the containing page mapped in RAM when it doesn't have to be. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-02 11:21:05 +03:00
Avi Kivity	f5062f4b5a	Merge 'Use generation_type for SSTable ancestors' from Raphael "Raph" Carvalho To avoid a discrepancy about underlying generation type once something other than integer is allowed for the sstable generation. Also simplifies one generic writer interface for sealing sstable statistics. Closes #10703 * github.com:scylladb/scylla: sstables: Use generation_type for compaction ancestors sstables: Make compaction ancestors optional when sealing statistics	2022-06-01 19:55:08 +03:00
Avi Kivity	7debf6780c	cql3: expr: drop prepare_binop_lhs() It is now just a thin wrapper around try_prepare_expression(), so replace it with that.	2022-06-01 18:58:14 +03:00
Avi Kivity	76e0dc66e5	cql3: expr: move implementation of prepare_binop_lhs() to try_prepare_expression() This unifies the left-hand-side and right-hand-side of expression preparation. The contents of the visitor in prepare_binop_lhs() is moved to the visitor in try_prepare_expression(). This usually replaces an on_internal_error() branch. An exception is tuple_constructor, which is valid in both the left-hand-side and right-hand-side (e.g. WHERE (x, y) IN (?, ?, ?)). We previously enhanced this case to support not having a column_specification, so we just delete the branch from prepare_binop_lhs.	2022-06-01 18:58:14 +03:00
Avi Kivity	046abc4323	cql3: expr: use recursive descent when preparing subscripts When encountering a subscript as the left-hand-side of a binary operator, we assume the subscripted value is a column and process it directly. As a step towards de-specializing the left-hand-side of binary operators, use recursive descent into prepare_binop_lhs() instead. This requires generating a column_specification for arbitrary expressions, so we add a column_specification_of() function for that. Currently it will return a good representation for columns (the only input allowed by the grammar) and a bad representation (the text representation of the expression) for other expressions. We'll have to improve that when we relax the grammar.	2022-06-01 18:58:12 +03:00
Avi Kivity	747a1dd244	cql3: expr: allow prepare of tuple_constructor with no receiver Currently the only expression form that can appear on both the left hand side of an expression and the right hand side is a tuple constructor, so consequently it must support both modes of type processing - either deriving the type from the expression, or imposing a type on the expression. As an example, in WHERE (A, B) = (:a, :b) the first tuple derives its type from the column types, while the second tuple has the type of the first tuple imposed on it. So, we adjust tuple_constructor_prepare_nontuple to support both forms. This means allowing the receiver not to be present, and calculating the tuple type if that is the case.	2022-06-01 18:48:55 +03:00
Avi Kivity	b1c8fd8fa5	cql3: expr: drop no longer used printable_relation parameter from prepare_binop_lhs() Inching ever closer to unifying the two expression preparation variants.	2022-06-01 18:48:03 +03:00
Avi Kivity	4e0a089f3e	cql3: expr: print only column name when failing to resolve column resolve_column() is part of the prepare stage, and tries to resolve a column name in a query against the table's columns. If it fails, it prints the containing binary_expression as context. However, that's unnecessary - the unresolved column name is sufficient context. So print that. The motivation is to unify preparation of binary_operator left-hand-side and right-hand-side - prepare_expression() doesn't have the extra parameter and it wouldn't make sense to add it, as expressions might not be children of binary_operators.	2022-06-01 18:48:03 +03:00
Avi Kivity	9e213d979f	cql3: expr: pass schema to prepare_expression Currently prepare_expression is never used where a schema is needed - it is called for the right-hand-side of binary operators (where we don't accept columns) or for attributes like WRITETIME or TTL. But when we unify expression preparation it will need to handle columns too, and these need the schema to look up the column. So pass the schema as a parameter. It is optional (a pointer) since not all contexts will have a schema (for example CREATE AGGREGATE).	2022-06-01 18:48:03 +03:00
Avi Kivity	9a81285206	cql3: expr: prepare_binary_operator: drop unused argument ctx This brings the calling convention closer to prepare_expression so we can unify them.	2022-06-01 18:48:03 +03:00
Avi Kivity	9deabdfbf4	cql3: expr: stub type inference for prepare_expression In CQL (and SQL) types flow in different directions in expression components. In an expression A[:x] = :y The type of A is known, the type of :x is derived from the type of A, and the type of :y is derived from the type of A[:x]. Currently prepare_expression() only supports the second mode - an expression's type is dictated by its caller via the column_specification parameter. But this means it can only be used to evaluate the right-hand-side of binary expressions, since the left-hand-side uses the first mode, where the type is derived from the column, not imposed by the caller. To support both modes, make the column_specification parameter optional (it is already a pointer so just accept null) and also make the returned expression optional, to indicate failure to infer the type if the column_specification was not given. This patch only arranges for the new calling convention (as a new try_prepare_expression call), it does not actually implement anything.	2022-06-01 18:48:03 +03:00
Avi Kivity	10aa6ddca3	cql3: expr: introduce type_of() to fetch the type of an expression For most types, we just return the type field. A few expressions have other methods to access the type, and some expressions cannot survive prepare and so calling type_of() on them is illegal.	2022-06-01 18:47:58 +03:00
Avi Kivity	43a3c94532	cql3: expr: keep type information in casts Currently, preparing a cast drops the cast completely (as the types are verified to be binary compatible). This means we lose the casted-to type. Since we wish to keep type information, keep the cast in the prepared expression tree (and therefore the casted-to type). Once we do that, we must extend evaluate() to support cast expressions.	2022-06-01 18:46:55 +03:00
Avi Kivity	0a4a8c6b92	cql3: expr: add type field to subscript, field_selection, and null expressions Almost all expressions either already have a type field or have an O(1) way of reaching the type (for example, column_value can access the type via its column_definition). Add a type field to the few expression types that don't already have it. Since prepare_expr() doesn't yet generate these expressions, we don't have any place to populate it, so it remains null.	2022-06-01 18:45:56 +03:00
Avi Kivity	d984ea1b7a	cql3: expr: cast: use data_type instead of cql3_type for the prepared form A cast expression naturally includes a data type indicating what type we are casting into. Right now the prepared form uses cql3_type. Change it to data_type which is what other expressions use to reduce friction. Since cql3_type is a thin wrapper around data_type, the change is minimal. The change propagates to selectable::with_cast, but again it is minimal.	2022-06-01 12:19:53 +03:00
Michael Livshin	632b4e5a9a	fix "ninja dev-headers" Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	029508b77c	flat_mutation_reader ist tot Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	2a91323051	tests: downgrade_to_v1() -> mutation_fragment_v1_stream() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	eabe568d1c	tests: flat_reader_assertions: refactor out match_compacted_mutation() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	a08ee649fc	tests: ms.make_reader() -> ms.make_fragment_v1_stream() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	7a11a22cd6	repair/row_level: mutation_fragment_v1_stream() instead of downgrade_to_v1() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	8305ac26ca	stream_transfer_task: mutation_fragment_v1_stream() instead of downgrade_to_v1() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	00bee4e0b3	sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	00b2e7b2c5	mutation_source: add ::make_fragment_v1_stream() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	1b98692c8c	introduce mutation_fragment_v1_stream At this point, none of the remaining uses of `flat_mutation_reader` (all of which are results of calling `downgrade_to_v1()` anyway) actually need a full-featured flat mutation reader with its own separate buffer etc. `mutation_fragment_v1_stream` can only be constructed by wrapping a `flat_mutation_reader_v2`, contains enough functionality for the remaining consumers of `mutation_fragment_v1` sources and unit tests and no more, and does not buffer. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	d137b32994	tests: ms.make_reader() -> ms.make_reader_v2() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	1a9e0ed73d	tests: remove test_downgrade_to_v1_clear_buffer() The projected limited replacement of downgraded v1 mutation reader will not do its own buffering, so this test will be pointless. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	66ceb32612	mutation_source_test: fix indentation Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	b9ada78ec2	tests: remove some redundant calls to downgrade_to_v1() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	63a61ccaad	tests: remove some to-become-pointless ms.make_reader()-using tests mutation_source are going to be created only from v2 readers and the ::make_reader() method family is scheduled for removal. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	b288cc4f9f	tests: remove some to-become-pointless reader downgrade tests Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Raphael S. Carvalho	2a7eb16c02	sstables: Use generation_type for compaction ancestors Let's also use generation_type for compaction ancestors, so once we support something other than integer for SSTable generation, we won't have discrepancy about what the generation type is. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-05-31 15:28:02 -03:00
Raphael S. Carvalho	d36604703f	sstables: Make compaction ancestors optional when sealing statistics Compaction ancestors are only available in versions older than mx, therefore we can make them optional in seal_statistics(). The motivation is that mx writer will no longer call sstable::compaction_ancestors() whose return type will soon be changed to type generation_type, so the returned value can be something other than an integer, e.g. uuid. We could kill compaction_ancestors in seal_statistics interface, but given that most generic write functions still work for older versions, if there were still a writer for them, I decided to not do it now. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-05-31 15:26:03 -03:00
Calle Wilund	adda43edc7	CDC - do not remove log table on CDC disable Fixes #10489 Killing the CDC log table on CDC disable is unhelpful in many ways, partly because it can cause random exceptions on nodes trying to do a CDC-enabled write at the same time as log table is dropped, but also because it makes it impossible to collect data generated before CDC was turned off, but which is not yet consumed. Since data should be TTL:ed anyway, retaining the table should not really add any overhead beyond the compaction to eventually clear it. And if the user did set TTL=0 (disabled), then they are already responsible for clearing out the data. This also has the nice feature of meshing with the alternator streams semantics. Closes #10601	2022-05-31 19:07:07 +03:00
Konstantin Osipov	94a192a7aa	Revert "test.py: temporarily disable raft" This reverts commit `26128a222b`. The issue the commit depends on is fixed, so enable raft back. Closes #10694	2022-05-31 14:39:26 +03:00
Avi Kivity	41b098f54e	Update tools/jmx submodule (jackson dependency update) * tools/jmx 53f7f55...fe351e8 (1): > Update jackson dependency	2022-05-31 13:46:46 +03:00
Mikołaj Sielużycki	bc18e97473	sstable_writer: Fix mutation order violation The change - adds a test which exposes a problem of a peculiar setup of tombstones that trigger a mutation fragment stream validation exception - fixes the problem Applying tombstones in the order: range_tombstone_change pos(ck1), after_all_prefixed, tombstone_timestamp=1 range_tombstone_change pos(ck2), before_all_prefixed, tombstone=NONE range_tombstone_change pos(NONE), after_all_prefixed, tombstone=NONE Leads to swapping the order of mutations when written and read from disk via sstable writer. This is caused by conversion of range_tombstone_change (in memory representation) to range tombstone marker (on disk representation) and back. When this mutation stream is written to disk, the range tombstone marker's type is calculated based on the relationship between range_tombstone_changes. The RTC series as above produces markers (start, end, start). When the last marker is loaded from disk, its kind gets incorrectly loaded as before_all_prefixed instead of after_all_prefixed. This leads to incorrect order of mutations. The solution is to skip writing a new range_tombstone_change with empty tombstone if the last range_tombstone_change already has empty tombstone. This is redundant information and can be safely removed, while the logic of encoding RTCs as markers doesn't handle such redundancy well. Closes #10643	2022-05-31 13:39:48 +03:00
Piotr Sarna	7169e021e5	Merge 'cql3: support list subscripts in WHERE clause' from Avi Kivity I noticed that `column_condition` (used in LWT `IF` clause) supports lists. As part of the Grand Expression Unification we'll need to migrate that to expressions, so we'll need to support list subscripts. Use the opportunity to relax the normal filtering to allow filtering on list subscripts: `WHERE my_list[:index] = :value`. Closes #10645 * github.com:scylladb/scylla: test: cql-pytest: add test for list subscript filtering doc: document list subscripts usable in WHERE clause cql3: expr: drop restrictions on list subscripts cql3: expr: prepare_expr: support subscripted lists cql3: expressions: reindent get_value() cql3: expression: evaluate() support subscripting lists	2022-05-31 09:28:52 +02:00
Avi Kivity	4b53af0bd5	treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime of the function object is less ambiguous, and so it is safer. Replace all eligible occurrences (i.e. caller is a coroutine). One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra attention since there was a handle_exception() continuation attached. It is converted to a try/catch. Closes #10699	2022-05-31 09:06:24 +03:00
Botond Dénes	02608bec9d	Update tools/java submodule * tools/java a4573759a2...d4133b54c9 (1): > removeNode: Remove other alias for --ignore-dead-nodes	2022-05-31 07:56:54 +03:00
Botond Dénes	660921eb22	Merge 'Two improvements to configure.py' from Nadav Har'El This two-patch series makes two improvements to configure.py: The first patch fixes, yet again, issue #4706 where interrupting ninja's rebuild of build.ninja can leave it without any build.ninja at all. The patch uses a different approach from the previous pull-request #10671 that aimed to solve the same problem. The second patch makes the output of configure.py more reproducible, not resulting in a different random order every time. This is useful especially when debugging configure.py and wanting to check if anything changed in its output. Closes #10696 * github.com:scylladb/scylla: configure.py: make build.ninja the same every time configure.py: don't delete build.ninja when rebuild is interrupted	2022-05-31 06:35:16 +03:00
Avi Kivity	248cdf0e34	test: cql-pytest: add test for list subscript filtering Test match and mismatch, as well as out of bound cases.	2022-05-30 20:47:47 +03:00
Nadav Har'El	e85bd37c6e	Update seastar submodule * seastar 96bb3a1b8...2be9677d6 (37): > Merge 'stream_range_as_array: always close output stream' from Benny Halevy Fixes #10592 > net/api: add "server_socket::is_listening()" > src/net/proxy: remove unused variable > coroutine: parallel_for_each: relax constraints > native-stack: do not use 0 as ip address if !_dhcp > coroutine: fix a typo in comment > std-coroutine: include for LLVM-14 > tutorial: use non-variadic version of when_all_succeed() > scripts: Fix build.sh to use new --c++-standard config option > core/thread: initialize work::pr and work::th explicitly > util/log-impl: remove "const" qualifier in return type > map_reduce: remove redundant move() in return statement > util: mark unused parameter with [[maybe_unused]] > drop unused parameters > build: use "20" for the default CMAKE_CXX_STANDARD > build: make CMAKE_CXX_STANDARD a string > utils: log: don't crash on allocation failure while extending log buffer > tests: unix_domain_test: fix thread/future confusion in client_round() > compat: do not use std::source_location if it is broken > build: use CMAKE_CXX_STANDARD instead of Seastar_CXX_DIALECT > Merge 'Add hello-world demo from tutorial' from Pavel > rpc_tester: Put client/server sides into correct sched groups > reactor_backend: Use _r reference, not engine() method > future.hh: #include std-compat.hh for SEASTAR_COROUTINES_ENABLED > Merge "Add more CPU-hog facilities to RPC-tester" from Pavel E > Merge "io: Enlighten queued_request" from Pavel E > Correct swapped AIO detection/setup calls > sharded: De-duplicate map-reduce overloads > file: don't trample on xfs flags when setting xfs size hint > Merge "Per-class IO bandwidth limits" from Pavel E > Merge 'sstring: fix format and optimize the performance of sstring::find().' from Jianyong Chen > reactor_backend: Mark reactor_backend_aio::poll() private > scripts/build.sh: Mind if not running on a terminal > test, rpc: Don't work with large buffers > test, futures: Don't expect ready future to resolve immediately > source_location compatibility: Fix an unused private field error when treat warning as errors > file: Remove try-catch around noexcept calls	2022-05-30 17:46:32 +03:00
Pavel Emelyanov	7f2837824e	system_keyspace: Save coroutine's captured variable on stack Currently it works, but the newer version of seastar's map_reduce() is compiled in a way to trigger use-after-free on accessing captured value. tests: unit(dev), unit.alternator(debug on v1) Fixes #10689 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220523095409.6078-1-xemul@scylladb.com>	2022-05-30 17:46:32 +03:00
Botond Dénes	3a943b23fb	sstables: validate_checksums(): add chunk index to error message When logging a failed checksum on a compressed chunk. Currently, only the offset is logged, but the index of the chunk whose checksum failed to validate is also interesting. Closes #10693	2022-05-30 17:11:28 +03:00
Nadav Har'El	84e1fa0513	configure.py: make build.ninja the same every time In several places, configure.py uses unsorted sets which results in its output being in different order every time - both a different order of targets, and a different order in dependencies of each target. This is both strange, and annoying when trying to debug configure.py and trying to understand when, if at all, its output changes. So in this patch, we use "sorted(...)" in the right places that are needed to guarantee a fixed order. This fixed order is alphabetical, but that's not the goal of this patch - the goal is to ensure a fixed order. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-30 16:20:37 +03:00
Nadav Har'El	8db9e62de9	configure.py: don't delete build.ninja when rebuild is interrupted In commit `9cc9facbea`, I fixed issue #4706. That issue is about what happens when interrupting a rebuild of build.ninja (which happens automatically when you run "ninja" after configure.py changed). We don't want to leave behind a half-built build.ninja, or leave it deleted. The solution in that commit was for configure.py to build a temporary file (build.ninja.tmp), and only as the very last step rename it build.ninja. Unfortunately, since that time, we added more last steps after what used to be that very last step :-( If this new code running after the rename takes a noticeable amount of time, and if the user is unlucky enough to interrupt it during that time, ninja will see a modified output file (build.ninja) and a failed rule, and will delete the output file! The solution is to move the rename out of configure.py. Instead, we add a "--out=filename" option to configure.py which allows it to write directly to a different file name, not build.ninja. When rebuilding build.ninja, the rule will now call configure.py with "--out=build.ninja.new" and then rename it back to build.ninja. Any failure or interrupt at any stage of configure.py will leave build.ninja untouched, so ninja will not delete it - it will just delete the temporary build.ninja.new. Fixes #4706 (again) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-30 16:17:41 +03:00
Kamil Braun	78f81171ba	Merge 'raft: test non-voters in `randomized_nemesis_test`' from Kamil Braun We modify the `reconfigure` and `modify_config` APIs to take a vector of <server_id, bool> pairs (instead of just a vector of server_ids), where the bool indicates whether the server is a voter in the modified config. The `reconfiguration` operation would previously shuffle the set of servers and split it into two parts: members and non-members. Now it partitions it into three parts: voters, non-voters, and non-members. The PR also includes fixes for some liveness problems stumbled upon during testing. Closes #10640 * github.com:scylladb/scylla: test: raft: randomized_nemesis_test: include non-voters during reconfigurations raft: server: if `add_entry` with `wait_type::applied` successfully returns, ensure `state_machine::apply` is called for this entry raft: tracker: fix the definition of `voters()` raft: when printing `raft::server_address`, include `can_vote`	2022-05-30 15:06:35 +02:00
Raphael S. Carvalho	0307cdd2bf	compaction: Fix incremental compaction logging The message only dumps the last sealed fragment, but it should dump all the output fragments replacing the exhausted input ones. Let's print origin of output fragments, so we can differentiate between files with compaction and garbage-collection origin. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220524232232.119520-1-raphaelsc@scylladb.com>	2022-05-30 15:58:14 +03:00
Botond Dénes	d71c865344	Merge "Fix snitching on Azure" from Pavel Emelyanov " Azure snitch tries to replicate dc/rack info from all shards to all other shards. This may lead to use-after-free when shard A gets "this" from shard B, starts copying its _dc field and the shard A destructs its _dc from under B because it's receiving one from shard C. Also polish replication code a little bit while at it. " * 'br-azure-snitch-serialize' of https://github.com/xemul/scylla: snitch: Use invoke_on_others() to replicate snitch: Merge set_my_dc and set_my_rack into one azure_snitch: Do nothing on non-io-cpu	2022-05-30 15:35:37 +03:00
Benny Halevy	32e79840ca	tools: scylla-sstable: terminate error message with newline Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220523080747.2492640-1-bhalevy@scylladb.com>	2022-05-30 14:47:28 +03:00
Kamil Braun	6fc82be832	service: storage_service: remove get() call not in thread Regression introduced by code movement in `89163a3be4`. Fixes #10679. Closes #10680	2022-05-30 13:43:02 +03:00
Avi Kivity	8136f7bc4b	doc: document list subscripts usable in WHERE clause	2022-05-30 13:29:49 +03:00
Avi Kivity	f9b3c6ddbd	cql3: expr: drop restrictions on list subscripts Restriction validation forbids lists (somewhat oddly, it talks about indexes; validation should make a soft check about indexes (since it can fall back to filtering) and a hard check about supported filtering expressions), and enforces a map in another place. Remove the first restriction and relax the second to allow lists as well as maps as subscript operands. Some validation messages are adjusted to reflect that lists are supported.	2022-05-30 13:29:49 +03:00
Avi Kivity	35e0474410	cql3: expr: prepare_expr: support subscripted lists Infer the type of a list index as int32_type. The error message when a non-subscriptable type is provided is changed, so the corresponding test is changed too.	2022-05-30 13:29:49 +03:00
Avi Kivity	8d667e374b	cql3: expressions: reindent get_value() Whitespace-only change.	2022-05-30 13:29:49 +03:00
Avi Kivity	05388f7a2a	cql3: expression: evaluate() support subscripting lists We already support subscripting maps (for filtering WHERE m[3] = 6), so adding list subscript support is easy. Most of the code is shared. Differences are: - internal list representation is a vector of values, not of key/values - key type is int32_type, not defined by map - need to check index bounds	2022-05-30 13:29:49 +03:00
Piotr Sarna	5d59c841d0	Merge 'alternator: add Describe operations even if a feature... is not supported' from Nadav Har'El This small series implements the DescribeTimeToLive and DescribeContinuousBackups operations in Alternator. Even if the corresponding features aren't implemented, it can help applications that we implement just the Describe operation that can say that this feature is in fact currently disabled. Fixes #10660 Closes #10670 * github.com:scylladb/scylla: alternator: remove dead code alternator: implement DescribeContinuousBackups operation alternator: allow DescribeTimeToLive even without TTL enabled	2022-05-30 09:26:13 +02:00
Nadav Har'El	63132431e8	test/cql-pytest: reproducers for three scan bugs This patch contains five tests which reproduce three old bugs in Scylla's handling of multi-column restrictions like (c1,c2)<(1,2). These old bugs are: Refs #64 (yes, a two-digit issue!) Refs #4244 Refs #6200 The three github issues are closely intertwined, exposing the same or similar bugs in our internal implementation, and I suspect that eventually most of them could be fixed together. In writing these tests, I carefully read all three issues and the various failure scenarios described in them, tried to distill and simplify the scenarios, and also consider various other broken variants of the scenarios. The resulting tests are heavily commented, explaining the motivation of each test and exactly which of the above bugs it reproduces. All five tests included in this patch pass on Cassandra and currently fail on Scylla, so are marked "xfail". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10675	2022-05-30 09:25:00 +02:00
Kamil Braun	4c3678e2a0	gms: gossiper: fix `direct_fd_pinger::_generation_number` initialization It's an `int64_t` that needs to be explicitly initialized, otherwise the value is undefined. This is probably the cause of #10639, although I'm not sure - I couldn't reproduce it (the bug is dependent on how the binary is compiled, so that's probably it). We'll see if it reproduces with this fix, and if it will, close the issue. Closes #10681	2022-05-29 13:08:09 +03:00
Avi Kivity	6bcbe3157a	Merge "Turn table::config::sstables_manager* into table::sstables_manager&" from Pavel E " The table keeps references on sstables_ and compaction_ managers (among other things), but the latter sits as a pointer on table's config while the former -- as on-table direct reference. This set unifies both by turning sstables manager on-config pointer into on-table reference. branch: https://github.com/xemul/scylla/tree/br-table-vs-sstables-manager tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/574/ " * 'br-table-vs-sstables-manager' of https://github.com/xemul/scylla: tests: Remove sstables_manager& from column_family_test_config() table: Move sstables_manager from config onto table itself table, db, tests: Pass sstables_manager& into table constructor	2022-05-29 13:02:50 +03:00
Kamil Braun	ef7643d504	service: raft: raft_group0: don't call `_abort_source.request_abort()` `raft_group0` does not own the source and is not responsible for calling `request_abort`. The source comes from top-level `stop_signal` (see main.cc) and that's where it's aborted. Fixes #10668. Closes #10678	2022-05-27 16:37:07 +02:00
Nadav Har'El	52362de3df	test/cql-pytest: tests for assigning an empty string to non-string In issues #7944 and #10625 it was noticed that by assigning an empty string to a non-string type (int, date, etc.) using INSERT or INSERT JSON, some combinations of the above can create "empty" values while they should produce a clear error. The tests added in this patch explore the different combinations of types and insert modes, and reproduce several buggy cases in Scylla (resulting in xfail'ing tests) as well as Cassandra. We feared that there might be a way using those buggy statements to create a partition with an empty key - something which used to kill older versions of Scylla. But the tests show that this is not possible - while a user can use the buggy statements to create an empty value, Scylla refuses it when it is used as a single-column partition key. Refs #10625 Refs #7944 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10628	2022-05-27 16:37:01 +02:00
Avi Kivity	c83393e819	messaging: do isolate default tenants In `10dd08c9` ("messaging_service: supply and interpret rpc isolation_cookies", 4.2), we added a mechanism to perform rpc calls in remote scheduling groups based on the connection identity (rather than the verb), so that connection processing itself can run in the correct group (not just verb processing), and so that one verb can run in different groups according to need. In `16d8cdadc` ("messaging_service: introduce the tenant concept", 4.2), we changed the way isolation cookies are sent: scheduling_group messaging_service::scheduling_group_for_verb(messaging_verb verb) const { return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group; @@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge if (must_compress) { opts.compressor_factory = &compressor_factory; } opts.tcp_nodelay = must_tcp_nodelay; opts.reuseaddr = true; - opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie; + // We send cookies only for non-default statement tenant clients. + if (idx > 3) { + opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie; + } This effectively disables the mechanism for the default tenant. As a result some verbs will be executed in whatever group the messaging service listener was started in. This used to be the main group, but in `554ab03` ("main: Run init_server and join_cluster inside maintenance scheduling group", 4.5), this was changed to the maintenance group. As a result normal read/writes now compete with maintenance operations, raising their latency significantly. Fix by sending the isolation cookie for all connections. With this, a 2-node cassandra-stress load has 99th percentile increase by just 3ms during repair, compared to 10ms+ before. Fixes #9505. Closes #10673	2022-05-27 16:36:57 +02:00
Michał Radwański	906cee7052	treewide: remove unqualified calls to std::move clang 15 emits such a warning: cql3/statements/raw/parsed_statement.cc:46:16: error: unqualified call to 'std::move' [-Werror,-Wunqualified-std-cast-call] , warnings(move(warnings)) ^ std:: cql3/statements/raw/parsed_statement.cc:52:101: error: unqualified call to 'std::move' [-Werror,-Wunqualified-std-cast-call] : prepared_statement(statement_, ctx.get_variable_specifications(), partition_key_bind_indices, move(warnings)) ^ std:: Closes #10656	2022-05-27 16:36:49 +02:00
Tomasz Grabiec	e9fbc0b6c5	Merge 'test.py: add cluster and new approval tests' from Konstantin Osipov The purpose of this series is to introduce infrastructure for managed scylla processes into test.py, switch some existing suites to use test.py managed processes and introduce cluster tests. All of this is expected to make possible to test Raft topology changes and schema changes using an easy to use and fast tool such as test.py. In general this will allow testing Scylla clusters from within the development test harness. Branch URL: kostja/test.py.v5 Closes #10406 * github.com:scylladb/scylla: test: disable topology/test_null test.py: disable cdc_with_lwt_test it's flaky in debug mode test.py: workaround for a python bug test: cleanup (drop keyspace) in two cql tests to support --repeat test.py: respect --verbose even if output is a tty test: remove tools/cql_repl test.py: switch cql/ suite to pytest/tabular output test: remove a flaky test case test.py: implement CQL approval tests over pytest test.py: implement cql_repl test.py: add topology suite test.py: add common utility functions to test/pylib test.py: switch cql-pytest and rest_api suites to PythonTestSuite test.py: introduce PythonTest and PythonTestSuite test.py: use artifact registry test.py: temporarily disable raft test.py: (pylib) add Scylla Server and Artifact Registry test.py: (pylib) add Host Registry to track used server hosts test.py: (pylib) add a pool of scylla servers (or clusters)	2022-05-27 16:36:08 +02:00
Pavel Emelyanov	3dab0bfc8d	tests: Remove sstables_manager& from column_family_test_config() It's unused arg in there after last patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-27 16:37:21 +03:00
Pavel Emelyanov	490bf65e11	table: Move sstables_manager from config onto table itself The manager reference is already available in constructor and thus can be copied to on-table member. The code that chooses the manager (user/system one) should be moved from make_column_family_config() into add_column_family() method. Once this happens, the get_sstables_manager() should be fixed to return the reference from its new location. While at it -- mark the method in question noexcept and add its mutable overload. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-27 16:37:21 +03:00
Pavel Emelyanov	50e6810536	table, db, tests: Pass sstables_manager& into table constructor In core code there's only one place that constructs table -- in database.cc -- and this place currently has the sstables_manager pointer sitting on table config (although it's a pointer, it's always non-null). All the tests always use the manager from one of _env's out there. For now the new constructor arg is unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-27 16:27:44 +03:00
Kamil Braun	2dafd99e3b	test: raft: randomized_nemesis_test: include non-voters during reconfigurations We modify the `reconfigure` and `modify_config` APIs to take a vector of <server_id, bool> pairs (instead of just a vector of server_ids), where the bool indicates whether the server is a voter in the modified config. The `reconfiguration` operation would previously shuffle the set of servers and split it into two parts: members and non-members. Now it partitions it into three parts: voters, non-voters, and non-members.	2022-05-27 12:06:18 +02:00
Kamil Braun	d128f65354	raft: server: if `add_entry` with `wait_type::applied` successfully returns, ensure `state_machine::apply` is called for this entry Previously it could happen that `add_entry` returned successfully but `state_machine::apply` was never called by the server for this entry, even though `wait_type::applied` was used, if the server loaded a snapshot that contained this entry in just the right moment. Some clients may find this behavior surprising, even though we may argue that it's not technically incorrect. For example, the nemesis test assumed that if `add_entry` returned successfully (with `wait_type::applied`), the local state machine applied the entry; the test uses `apply` to obtain an output - the result of the command - from the state machine. It's not a problem to give a stronger guarantee, so we do it in this commit. In the scenario where a snapshot causes Raft to skip over the entry, `add_entry` will finish exceptionally with `commit_status_unknown`.	2022-05-27 12:06:18 +02:00
Kamil Braun	f31f73b1e8	raft: tracker: fix the definition of `voters()` The previous implementation was weird, and it's not even clear if the C++ standard defined what the result would be (because it used `std::unordered_set::insert(iterator, iterator)`, where the iterators pointed to a sequence of elements with elements that already had equivalent elements in the set; cppreference does not specify which elements end up in the set in this case). In any case, in testing, the resulting set did not give the desired result: if the configuration was joint, and a server was a voter in the previous config but a non-voter in the current one, it would not be a member of this set. This would cause the server to not vote for itself when it became a candidate, which could lead to cluster unavailability. The new definition is simple: a server belongs to `voters()` iff it is a voter in current or previous configuration. This fixes the problem described above. Fixes #10618.	2022-05-27 12:06:18 +02:00
Kamil Braun	6e5d1f4784	raft: when printing `raft::server_address`, include `can_vote` Makes debugging easier when configurations include non-voters.	2022-05-27 12:06:18 +02:00
Nadav Har'El	e363febeb1	test/cql-pytest: reproducer for bug in index+filter+limit This patch adds reproducing tests for wrong handling of LIMIT in a query which uses a secondary index and filtering, described in issue #10649. In that case, Scylla incorrectly limits the number of rows found in the index before the filtering, while it should limit the number of rows after the filtering. The tests in this patch (which xfail on Scylla, and pass on Cassandra) go beyond the minimum required to reproduce this bug. It turns out that there are different sub-cases of this problem that go through different code paths, namely whether the base table has clustering keys or just partition keys, and whether the overall LIMITed result spans more than one page. So these tests attempt to also cover all these sub-cases. Without all these test sub-cases, an incomplete and incorrect fix of this bug may, by chance, cause the original test to succeed. Refs #10649 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10658	2022-05-26 16:46:45 +02:00
Tomasz Grabiec	edd447ed32	Merge 'raft: test `read_barrier`s in `randomized_nemesis_test`' from Kamil Braun Introduce a new operation, `raft_read`, which calls `read_barrier` on a server, reads the state of the server's state machine, and returns that state. Extend the generator in `basic_generator_test` to generate `raft_read`s. Only do it if forwarding is enabled (although it may make sense to test read barriers in non-forwarding scenario as well - we may think about it and do it in a follow-up). Check the consistency of the read results by comparing them with the model and using the result to extend the model with any newly observed elements. The patchset includes some fixes for correctness (#10578) and liveness (handling aborts correctly). Closes #10561 * github.com:scylladb/scylla: test: raft: randomized_nemesis_test: check consistency of reads test: raft: randomized_nemesis_test: perform linearizable reads using read_barriers test: raft: randomized_nemesis_test: add flags for disabling nemeses raft: server: in `abort()`, abort read barriers before waiting for rpc abort raft: server: handle aborts correctly in `read_barrier` raft: fsm: don't advance commit index further than match_idx during read_quorum	2022-05-26 16:46:35 +02:00
Nadav Har'El	62b6179c88	alternator: remove dead code Remove the function make_keyspace_name() that was never used. We could have used this function, but we didn't, and it had an inconvenient API. If we later want to de-duplicate the several copies of "executor::KEYSPACE_NAME_PREFIX + table_name" we have in the code, we can do it with a better API. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-26 15:21:14 +03:00
Nadav Har'El	d0ca09a925	alternator: implement DescribeContinuousBackups operation Although we don't yet support the DynamoDB API's backup features (see issue #5063), we can already implement the DescribeContinuousBackups operation. It should just say that continuous backups, and point-in-time restores, are disabled. This will be useful for client code which tries to inquire about continuous backups, even if not planning to use them in practice (e.g., see issue #10660). Refs #5063 Refs #10660 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-26 15:13:50 +03:00
Konstantin Osipov	7b3f9bb5fa	test: disable topology/test_null Scylla has a bug that only fires once in a hundred runs in debug mode when a schema change parallel to a topology change leads to a lost keyspace and internal error. Disable the tests until Raft is enabled for schema.	2022-05-26 14:09:58 +03:00
Nadav Har'El	8ecf1e306f	alternator: allow DescribeTimeToLive even without TTL enabled We still consider the TTL support in Alternator to be experimental, so we don't want to allow a user to enable TTL on a table without turning on a "--experimental-features" flag. However, there is no reason not to allow the DescribeTimeToLive call when this experimental flag is off - this call would simply reply with the truth - that the TTL feature is disabled for the table! This is important for client code (such as the Terraform module described in issue #10660) which uses DescribeTimeToLive for information, even when it never intends to actually enable TTL. The patch is trivial - we simply remove the flag check in DescribeTimeToLive, the code works just as before. After this patch, the following test now works on Scylla without experimental flags turned on: test/alternator/run test_ttl.py::test_describe_ttl_without_ttl Refs #10660 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-26 10:55:36 +03:00
Konstantin Osipov	5e67e48f8b	test.py: disable cdc_with_lwt_test it's flaky in debug mode The test is flaky in debug mode, see issue #10661 for details.	2022-05-26 09:23:13 +03:00
Konstantin Osipov	ad01840117	test.py: workaround for a python bug Workaround for a Python3 bug which prevents a correct exception printout when asyncio is used with logging on.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	180cc0fc4d	test: cleanup (drop keyspace) in two cql tests to support --repeat	2022-05-25 20:26:42 +03:00
Konstantin Osipov	dd633cdb21	test.py: respect --verbose even if output is a tty If the output is not a tty, verbose is set automatically. If the output is a tty, one has to request --verbose. However, a part of test.py verbosity was ignoring --verbose and looking only at the terminal type.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	9512e076af	test: remove tools/cql_repl Remove tools/cql_repl from the source, build targets and use in test.py. Superseded by ApprovalTest and test/pylib/cql_repl/cql_repl.py.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	b82c8447d7	test.py: switch cql/ suite to pytest/tabular output	2022-05-25 20:26:42 +03:00
Konstantin Osipov	c43736c53b	test: remove a flaky test case When we append an entry to a list with the same user-defined timestamp, the behaviour is actually undefined. If the append is processed by the same coordinator as the one that accepted the existing entry, then it gets the same timeuuid as the list key, and replaces (potentially) the existing list value. Otherwise it gets a timeuuid which may be either larger or smaller than the existing key's timeuuid, and then turns into either an append or a prepend. The part of the timestamp responsible for the result is the shard id's spoof node address implemented in scope of fixing Scylla's timeuuid uniqueness. When the test was implemented all spoof node ids were 0 on all shards and all coordinators. Later the difference in behaviour was dormant because cql_repl would always execute the append on the same shard. We could fix Scylla to use a zero spoof node address in case a user timestamp is supplied, but the purpose of this is unclear, it may actually be to the contrary of the user's intent.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	405517bc1b	test.py: implement CQL approval tests over pytest Before this patch, approval tests (test/cql/*) were using a C++ application called cql_repl, which is a seastar app running Scylla, reading commands from the standard input and producing results in JSON format in the standard output. The rationale for this was to avoid running a standalone Scylla which could leak more resources such as open sockets. Now that other suites already start and stop Scylla servers, it makes more sense to run CQL commands in approval tests against an existing running server. It saves us from building one more binary and allows better formatting of the output. Specifically, we would like to see Scylla output in tabular format in approval tests, which is difficult to do when C++ formatting libraries are used.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	c119087719	test.py: implement cql_repl Implement a pytest which would run CQL commands against a scylla server and pretty print server output. Will be used in existing Approval tests in subsequent patches.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	5b9262f567	test.py: add topology suite	2022-05-25 20:26:42 +03:00
Konstantin Osipov	26fa9336d1	test.py: add common utility functions to test/pylib	2022-05-25 20:26:42 +03:00
Konstantin Osipov	1955f9168a	test.py: switch cql-pytest and rest_api suites to PythonTestSuite Manage scylla servers for rest_api and cql-pytest suites using PythonTestSuite. The pool size determines the max number of servers test.py would run concurrently per suite. For tiny suites (rest_api) the cost of starting the servers outweighs the cost of running tests so keep it at a minimum. cql-pytest has dozens of tests, so run them in 4 parallel tracks.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	bc719209ee	test.py: introduce PythonTest and PythonTestSuite PythonTest and PythonTestSuite allow using test.py-managed scylla servers (or clusters) to run pytest tests.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	1a74828834	test.py: use artifact registry Track running tests in the suite. Cleanup after each suite (after all tests in the suite end). Cleanup all artifacts before exit. Don't drop server logs if there is at least one failed test.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	26128a222b	test.py: temporarily disable raft Raft boot sporadically hangs in the master due to issue #10355.	2022-05-25 20:26:42 +03:00
Konstantin Osipov	7c1c83320c	test.py: (pylib) add Scylla Server and Artifact Registry Allow starting clusters of Scylla servers. Chain up the next server start to the end of the previous one, and set the next server's seed to the previous server. As a workaround for a race between token dissemination through gossip and streaming, change schema version to force a gossip round and make sure all tokens end up at the joining node in time. Make sure scylla start is not race prone. auth::standard_role_manager creates "cassandra" role in an async loop auth::do_after_system_ready(), which retries role creation with an exponential back-off. In other words, even after CQL port is up, Scylla may still be initializing. This race condition could lead to spurious errors during cluster bootstrap or during a test under CI. When the role is ready, queries begin to work, so rely on this "side effect". To start or stop servers, use a new class, ScyllaCluster, which encapsulates multiple servers united into a cluster. In it, validate that a test case cleans up after itself. Additionally, swallow startup errors and throw them when the test is actually used.	2022-05-25 20:26:37 +03:00
Kamil Braun	6268c63739	test: raft: randomized_nemesis_test: check consistency of reads The test would perform `read_barrier`s but not check the correctness of the reads: whether the state observed by a read is consistent with the model and recent enough (in short, check linearizability). This commit adds the correctness checks.	2022-05-25 15:00:19 +02:00
Kamil Braun	6b2b400143	test: raft: randomized_nemesis_test: perform linearizable reads using read_barriers Introduce a new operation, `raft_read`, which calls `read_barrier` on a server, reads the state of the server's state machine, and returns that state. Extend the generator in `basic_generator_test` to generate `raft_read`s. Only do it if forwarding is enabled (although it may make sense to test read barriers in non-forwarding scenario as well - we may think about it and do it in a follow-up). For now, we don't check the consistency of the results of the reads. They do return the observed state, but we don't compare it yet with the model. For now we simply issue the reads concurrently with other operations to introduce some more chaos to the cluster and check liveness and consistency of existing operations.	2022-05-25 15:00:19 +02:00
Kamil Braun	4ea5807862	test: raft: randomized_nemesis_test: add flags for disabling nemeses Makes it easier to debug stuff.	2022-05-25 15:00:16 +02:00
Kamil Braun	c8237d405e	raft: server: in `abort()`, abort read barriers before waiting for rpc abort `rpc::abort` may need to wait until all read barriers finish, so abort read barrier before waiting for `rpc::abort` to finish to avoid a deadlock on shutdown. `rpc::abort` is still called before the read barriers are aborted, only waited for after. Calling it first prevents new read barriers from being started by `rpc` (see `rpc::abort` comment). Also prevent new read barriers from being started after abort starts directly on a leader by checking the `_aborted` flag at the beginning of `execute_read_barrier`. Finally, use the opportunity to remove some compiler-dependent code.	2022-05-25 14:56:32 +02:00
Kamil Braun	86c5036353	raft: server: handle aborts correctly in `read_barrier` The `wait_for_apply` function, called from `read_barrier`, didn't handle aborts. Fix that.	2022-05-25 14:56:32 +02:00
Kamil Braun	1eb849c3d7	raft: fsm: don't advance commit index further than match_idx during read_quorum It's not safe to advance the commit index further than match_idx since beyond that point the follower's log may be outdated. Fixes #10578.	2022-05-25 14:56:32 +02:00
Konstantin Osipov	1b5472f0be	test.py: (pylib) add Host Registry to track used server hosts	2022-05-25 14:59:01 +03:00
Konstantin Osipov	2781c4fc66	test.py: (pylib) add a pool of scylla servers (or clusters)	2022-05-25 14:59:01 +03:00
Avi Kivity	19316dee94	Merge 'auto-scale promoted index' from Benny Halevy Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB). When the promoted index (serialized) size gets to this threshold, it's halved by merging each two adjacent blocks into one and doubling the desired_block_size. Fixes #4217 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10646 * github.com:scylladb/scylla: sstables: mx: add pi_auto_scale_events metric sstables: mx/writer: auto-scale promoted index	2022-05-25 10:27:19 +03:00
Tomasz Grabiec	f87274f66a	sstable: partition_index_cache: Fix abort on bad_alloc during page loading When entry loading fails and there is another request blocked on the same page, attempt to erase the failed entry will abort because that would violate entry_ptr guarantees, which is supposed to keep the entry alive. The fix in `92727ac36c` was incomplete. It only helped for the case of a single loader. This patch makes a more general approach by relaxing the assert. The assert manifested like this: scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed. Fixes #10617 Closes #10653	2022-05-25 09:27:04 +03:00
Avi Kivity	e99b99e537	docs: unconfuse doc test checker wrt gdbinit file The docs tests dislike the gdbinit link because it refers out of the source tree. Unconfuse the tests by removing the link. It's sad, but the file is more easily used by referring to it rather than viewing it, so give a hint about that too. Closes #10650	2022-05-24 20:20:30 +03:00
Gleb Natapov	083b47cecb	gossiper: replace ad-hoc guard with defer() msg_proc_guard is a guard that makes sure _msg_processing is always decreased. We can use regular defer() to achieve the same. Message-Id: <YoZTQPbTMWAdCObs@scylladb.com>	2022-05-24 19:20:25 +03:00
Pavel Emelyanov	ed23e83207	Merge 'Coroutinize distributed loader' from Benny Halevy Before touching any of this code for https://github.com/scylladb/scylla/issues/9559, that requires a change when loading sstables from the staging subdirectory, simplify it using coroutines. Closes #10609 * github.com:scylladb/scylla: replica: distributed_loader: reindent populate_keyspace replica: distributed_loader: coroutinize populate_keyspace replica: distributed_loader: reindent handle_sstables_pending_delete replica: distributed_loader: coroutinize handle_sstables_pending_delete replica: distributed_loader: reindent cleanup_column_family_temp_sst_dirs replica: distributed_loader: coroutinize cleanup_column_family_temp_sst_dirs replica: distributed_loader: reindent make_sstables_available replica: distributed_loader: coroutinize make_sstables_available sstable_directory: parallel_for_each_restricted: keep func alive across calls replica: distributed_loader: reindent reshape replica: distributed_loader: coroutinize reshape replica: distributed_loader: coroutinize reshard replica: distributed_loader: reindent run_resharding_jobs replica: distributed_loader: coroutinize run_resharding_jobs replica: distributed_loader: reindent distribute_reshard_jobs replica: distributed_loader: coroutinize distribute_reshard_jobs replica: distributed_loader: reindent collect_all_shared_sstables replica: distributed_loader: coroutinize collect_all_shared_sstables replica: distributed_loader: reindent process_sstable_dir replica: distributed_loader: coroutinize process_sstable_dir	2022-05-24 18:01:13 +03:00
Kamil Braun	e8f9bca288	Merge 'raft: test `modify_config` API in `randomized_nemesis_test`' from Kamil Braun Extend the reconfiguration nemesis to send `modify_config` requests as well as `reconfigure` requests. It chooses one or the other with probability 1/2. Fix a bunch of problems that surfaced during testing. Closes #10544 * github.com:scylladb/scylla: test: raft: randomized_nemesis_test: send `modify_config` requests in reconfiguration nemsesis test: raft: randomized_nemesis_test: fix `rpc` reply ID generation test: raft: randomized_nemesis_test: during bouncing call, allow a leader to reroute to itself test: raft: randomized_nemesis_test: handle timed_out_error from modify_config service: raft: rpc: don't call `execute...` functions after `abort()` raft: server: fix bad_variant_access in `modify_config`	2022-05-24 16:47:31 +02:00
Benny Halevy	33bad72fd2	sstables: mx: add pi_auto_scale_events metric Counts the number of promoted index auto-scale events. A large number of those, relative to `partition_writes`, indicates that `column_index_size_in_kb` should be increased. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-24 13:32:39 +03:00
Benny Halevy	6677028212	sstables: mx/writer: auto-scale promoted index Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB). When the promoted index (serialized) size gets to this threshold, it's halved by merging each two adjacent blocks into one and doubling the desired_block_size. Fixes #4217 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-24 13:32:35 +03:00
Kamil Braun	700a1fdd20	test: raft: randomized_nemesis_test: send `modify_config` requests in reconfiguration nemsesis Extend the reconfiguration nemesis to send `modify_config` requests as well as `reconfigure` requests. It chooses one or the other with probability 1/2.	2022-05-24 11:39:08 +02:00
Kamil Braun	2222f095b3	test: raft: randomized_nemesis_test: fix `rpc` reply ID generation When `rpc` wants to perform a two-way RPC call it sends a message containing a `reply_id`. The other side will send the `reply_id` back when answering, so the original side can match the response to the promise corresponding to the future being waited on by the RPC caller. Previously each instance of `rpc` generated reply IDs independently as increasing integers starting from 0. The network delivers messages based on Raft server IDs. A response message may thus be delivered not to the original instance which invoked the RPC, but to a new instance which uses the same Raft server ID (after we simulated a server crash/stop and restart, creating a new server with the same ID that reuses the previous instance's `persistence` instance but has a new `rpc`). The new instance could have started a new RPC call using the same `reply_id` as one currently in flight that was started by the previous instance. The new instance could then receive and handle a response that was intended for the previous instance, leading to weird bugs. Fix this by replacing the local reply ID counters by a global counter so that every two-way RPC call gets a unique reply ID.	2022-05-24 11:39:08 +02:00
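The fix in `2222f095b3` can be illustrated with a small sketch. It is written in Python rather than the test's C++, and the `Rpc` class, its methods, and the `pending` dictionary are hypothetical names chosen for illustration; only the idea of one global counter shared by all instances comes from the commit message.

```python
import itertools

# One counter shared by every Rpc instance, so a restarted server that
# reuses a Raft server ID can never hand out an ID that is still in flight.
_global_reply_ids = itertools.count()


class Rpc:
    def __init__(self, server_id):
        self.server_id = server_id
        self.pending = {}  # reply_id -> response payload (None while waiting)

    def start_call(self):
        """Allocate a globally unique reply ID for a two-way call."""
        reply_id = next(_global_reply_ids)
        self.pending[reply_id] = None
        return reply_id

    def handle_reply(self, reply_id, payload):
        """Accept a reply only if this instance started the matching call."""
        if reply_id not in self.pending:
            return False  # reply intended for a previous instance: drop it
        self.pending[reply_id] = payload
        return True
```

With per-instance counters both instances would start at 0, so a late reply to the old instance's call could be matched against the new instance's call with the same ID.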
Kamil Braun	b9807f07e6	test: raft: randomized_nemesis_test: during bouncing call, allow a leader to reroute to itself A server executing a `modify_config` call, even if it initially was a leader and accepted the request, may end up throwing a `not_a_leader` error, rerouting the caller to a new leader - but this new leader may be that same server. This happens because `execute_modify_config` translates certain errors that it considers transient (such as `conf_change_in_progress`) into `not_a_leader{last_known_leader}`, in attempt to notify the caller that they should retry the request; but when this translation happens, the `last_known_leader` may be that same server (it could have even lost leadership and then regained it back while the request was being handled). This is not strictly an error, and it should be safe for the client to retry the request by sending it to the same server. The nemesis test assumed that a server never returns `not_a_leader{itself}`; this commit drops the assumption. An alternative solution would be to extend the error types that are now translated to `not_a_leader` so they include information about the last known leader. This way the client does not lose information about the original error and still gets a potential contact point for retry.	2022-05-24 11:36:51 +02:00
Kamil Braun	b33bc7a5d6	test: raft: randomized_nemesis_test: handle timed_out_error from modify_config May be propagated from `rpc::send_modify_config` to the caller of `modify_config`.	2022-05-24 11:36:51 +02:00
Kamil Braun	4767b163ef	service: raft: rpc: don't call `execute...` functions after `abort()` The functions are called from RPC when a follower forwards a request to a leader (`add_entry`, `modify_config`, `read_barrier`). The call may be attempted during shutdown. The Raft shutdown code cleans up data structures created by those requests. Make sure that they are not updated concurrently with shutdown. This can lead to problems such as using the server object after it was aborted, or even destroyed. After this change, the RPC implementation may wait for an `execute_modify_config` call to finish before finishing abort. That call in turn may be stuck on `wait_for_entry`. Thus the waiter may prevent RPC from aborting. Fix this by moving the wait on the future returned from `_rpc->abort()` in `server::abort()` until after waiters were destroyed.	2022-05-24 11:36:51 +02:00
Kamil Braun	5e06d0ad6f	raft: server: fix bad_variant_access in `modify_config` `modify_config` would call `execute_modify_config` or `_rpc->send_modify_config`, which returned a reply of type `add_entry_reply`. This is a variant of 3 options: `entry_id`, `not_a_leader`, or `commit_status_unknown`. The code would check for the `entry_id` option and otherwise assume that it was `not_a_leader`. During nemesis testing however, the reply was sometimes `commit_status_unknown`, which caused a `bad_variant_access` exception during `std::get` call. Fix this. There is a similar piece of code in `add_entry`, but there it should be impossible to obtain `commit_status_unknown` even though the types don't enforce it. Make it more explicit with a comment and an assertion.	2022-05-24 11:36:51 +02:00
Nadav Har'El	dc5c9321fe	test/cql-pytest: have new_test_table() recycle table names Scylla has a long-standing bug (issue #7620) where having many tombstones in the schema table significantly slows down further schema operations. Many cql-pytest tests use new_test_table() to create a temporary test table with a specific schema. Before this patch, each temporary table was created with a random name, and deleted after the test. When running many tests on the same Scylla server, this results in a lot of tombstones in the schema tables, and really slow schema operations. For example, look at how much time it takes to run the same test file N times: $ test/cql-pytest/run --count N test_filtering.py N=25 - 16 seconds (total time for the N repetitions) N=50 - 41 seconds N=100 - 122 seconds Notice how progressively slower each repetition is becoming - the total test time should have been linear in N, but it isn't! In this patch, we keep a cache of already-deleted table names (not the tables, just their names!) so as to reuse the same name when we can instead of inventing a new random name. With this patch, the performance improvement after some repetitions is amazing (compare to the table above): N=25 - 14 seconds N=50 - 29 seconds N=100 - 46 seconds Note how the testing time is now more-or-less linear in the number of repetitions, as expected. The table-name recycling trick is the same trick I already used in the past for the translated Cassandra tests (test/cql-pytest/cassandra_tests). The problem was even more obvious there because those tests create a lot of different tables. But the same problem also exists in cql-pytest in general, so let's solve it here too. Refs #7620 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10635	2022-05-24 11:32:25 +03:00
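The name-recycling idea in `dc5c9321fe` can be sketched roughly as below. This is not the actual cql-pytest helper: the `execute` callable (a stand-in for a CQL session's execute method), the `_free_names` cache, and `_unique_name` are assumptions made for illustration.

```python
import random
import string
from contextlib import contextmanager

# Names of tables that were already dropped and can be reused.
_free_names = []


def _unique_name():
    # Hypothetical random-name generator.
    return "tbl_" + "".join(random.choices(string.ascii_lowercase, k=12))


@contextmanager
def new_test_table(execute, keyspace, schema):
    """Create a temporary table, preferring a previously used name."""
    name = _free_names.pop() if _free_names else _unique_name()
    table = f"{keyspace}.{name}"
    execute(f"CREATE TABLE {table} ({schema})")
    try:
        yield table
    finally:
        execute(f"DROP TABLE {table}")
        _free_names.append(name)  # keep only the name, not the table
```

Because a dropped name is reused by the next test, the number of distinct names written to the schema tables stays bounded by the maximum number of tables alive at once rather than growing with the number of tests.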
Asias He	1f8b529e08	range_streamer: Disable restream logic Consider: - n1 and n2 in the cluster - n3 bootstraps to join - n1 does not hear gossip update from n3 due to network issue - n1 removes n3 from gossip and pending node list - stream between n1 and n3 fails - n1 and n3 network issue is fixed - n3 retries the stream with n1 - n3 finishes the stream with n1 - n3 advertises normal to join the cluster The problem is that n1 will not treat n3 as a pending node, so writes will not be routed to n3 once n1 removes n3. Another problem arises when n1 gets the normal gossip status update from n3: the gossip listener will fail because n1 has removed n3, so n1 cannot find the host id for n3. This will cause n1 to abort. To fix, disable the retry logic in range_streamer so that once a stream with an existing node fails, the bootstrap fails. The downside is that we lose the ability to restream after a temporary network issue, but since we have repair-based node operations, we can use them to resume previously failed node operations. Fixes: #9805 Closes #9806	2022-05-24 11:24:25 +03:00
Avi Kivity	21728dff6f	Merge 'cql: Remove support for null inside lists of IN values' from Jan Ciołek Currently we support queries like: ```cql SELECT * FROM ks.tab WHERE p IN (1, 2, null, 4); ``` Nothing can be equal to null so this is equivalent to: ```cql SELECT * FROM ks.tab WHERE p IN (1, 2, 4); ``` Cassandra doesn't support it at all. ```cql > SELECT * FROM ks.tab WHERE p IN (1, 2, null, 4) Error: DbError(Invalid, "Invalid null value in condition for column p") > SELECT * FROM ks.tab WHERE p IN (1, 2, ?, 4) # ? is NULL Error: DbError(Invalid, "Invalid null value in condition for column p") > SELECT * FROM ks.tab WHERE p IN ? # ? is (1, 2, null, 4) Error: DbError(Invalid, "Invalid null value in condition for column p") ``` It makes little sense to send a null inside list of IN values and supporting it is a bit cumbersome. Supporting it causes trouble because internally the values are represented as a list, not a tuple, and lists can't contain nulls. Because of that code requires exceptions because in this single case there can be a null inside of a collection. This PR starts treating a llist of IN values the same as any other list and as result nulls are forbidden inside them. In case of a null the message is the same as any other collection: ``` null is not supported inside collections ``` I'm not entirely happy about it - someone could be confused if they received this message after a query that didn't involve any collections. The problem with making a prettier error message is that once again we would have to give `evaluate` additional information that it's now evaluating a list of IN values. And we would end up back with `evaluate_IN_list` I think we could consider adding some kind of generic context to evaluate. The context would contain the whole expression and a mark on the part that we are currently evaluating. Then in case of error we could use this context and use it to create more helpful error messages, e.g. 
point to the part of the expression where a problem occurred. But that's outside of the scope of this PR. Fixes #10579 Closes #10620 * github.com:scylladb/scylla: cql: Add test for null in IN list cql: Forbid null in lists of IN values	2022-05-24 09:15:13 +03:00
Jan Ciolek	ff3205cf19	cql: Add test for null in IN list Added tests that check for error when null or unset appears in a list of IN values. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-05-24 00:17:54 +02:00
Jan Ciolek	f9b1fc0b69	cql: Forbid null in lists of IN values We used to allow nulls in lists of IN values, i.e. a query like this would be valid: SELECT * FROM tab WHERE pk IN (1, null, 2); This is an old feature that isn't really used and is already forbidden in Cassandra. Additionally the current implementation doesn't allow for nulls inside the list if it's sent as a bound value. So something like: SELECT * FROM tab WHERE pk IN ?; would throw an error if ? was (1, null, 2). This is inconsistent. Allowing it made writing code cumbersome because this was the only case where having a null inside of a collection was allowed. Because of it there needed to be separate code paths to handle regular lists and lists of NULL values. Forbidding it makes the code nicer and consistent at the cost of a feature that isn't really important. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-05-24 00:17:41 +02:00
Avi Kivity	e65b3ed50a	Merge 'Allow trigger off strategy compaction early for node operations' from Asias He This patch set adds two commits that allow off-strategy compaction to trigger early for node operations. 1) repair: Repair table by table internally This patch changes the way a repair job walks through tables and ranges if multiple tables and ranges are requested by users. Before: ``` for range in ranges for table in tables repair(range, table) ``` After: ``` for table in tables for range in ranges repair(range, table) ``` The motivation for this change is to allow off-strategy compaction to trigger early, as soon as a table is finished. This reduces the number of temporary sstables on disk. For example, if there are 50 tables and 256 ranges to repair, each range will generate one sstable. Before this change, there will be 50 * 256 sstables on disk before off-strategy compaction triggers. After this change, once a table is finished, off-strategy compaction can compact the 256 sstables. As a result, this would reduce the number of sstables by 50X. This is very useful for repair based node operations since multiple ranges and tables can be requested in a single repair job. Refs: #10462 2) repair: Trigger off strategy compaction after all ranges of a table is repaired When the repair reason is not repair, which means the repair reason is node operations (bootstrap, replace and so on), a single repair job contains all the ranges of a table that need to be repaired. To trigger off strategy compaction early and reduce the number of temporary sstable files on disk, we can trigger the compaction as soon as a table is finished. Refs: #10462 Closes #10551 * github.com:scylladb/scylla: repair: Trigger off strategy compaction after all ranges of a table is repaired repair: Repair table by table internally	2022-05-23 18:58:21 +03:00
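The effect of swapping the loop order in `e65b3ed50a` can be shown with a short sketch. The function and callback names are hypothetical; `on_table_done` stands in for whatever triggers off-strategy compaction in the real repair code.

```python
def repair_table_by_table(tables, ranges, repair, on_table_done):
    """Walk tables in the outer loop so each table finishes as a unit.

    `repair(range, table)` repairs one range of one table.
    `on_table_done(table)` is called as soon as every range of a table has
    been repaired, which is the earliest point at which its temporary
    sstables could be compacted.
    """
    for table in tables:
        for rng in ranges:
            repair(rng, table)
        on_table_done(table)
```

With the old range-major order no table is complete until the last range has been processed, so a hook like `on_table_done` could only fire at the very end of the job.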
Avi Kivity	3fadae74e7	Merge "Keep cluster-join code in one place" from Pavel E " There are several issues with it - it's scattered between main() and storage_service methods - yet another incarnation of it also sits in the cql-test-env - the prepare_to_join() and join_token_ring() names are lying to readers, as sometimes node joins the ring in prepare- stage - storage service has to carry several private fields to keep the state between prepare- and join- parts - some storage service dependencies are only needed to satisfy joining, but since they cannot start early enough, they are pushed to storage service uninitialized "in the hope" that it won't use them until join This patch puts joining steps in one place and enlightens storage service not to carry unneeded dependencies/state onboard. And eliminates one more usage of global proxy instance while at it. branch: https://github.com/xemul/scylla/tree/br-merge-init-server-and-join-cluster tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/466/ refs: #2795 " * 'br-merge-init-server-and-join-cluster' of https://github.com/xemul/scylla: storage_service: Remove global proxy call storage_service: Remove sys_dist_ks from storage_service dependencies storage_service: Remove cdc_gen_service from storage_service dependencies storage_service: Make _cdc_gen_id local variable storage_service: Make _bootstrap_tokens local variable storage_service: Merge prepare- and join- private members storage_service: Move some code up the file storage_service: Coroutinize join_token_ring storage_service: Fix indentation after previous patch storage_service: Execute its .bootstrap() into async() storage_service: Dont assume async context in mark_existing_views_as_built storage_service: Merge init-server and join-cluster main, storage_service: Move wait for gossip to settle main, storage_service: Move passive announce subscription main, storage_service: Move early group0 join call	2022-05-23 17:33:02 +03:00
Piotr Dulikowski	ead7bdd6f8	storage_proxy: remove unused overload of query_mutations_locally An overload of storage_proxy::query_mutations_locally was declared in `a35136533d` which takes a vector of partition ranges as an argument, but it was never defined. This commit removes the unused overload declaration. Closes #10610	2022-05-23 16:20:51 +03:00
Avi Kivity	75e001fc6a	cql3: grammar: unify production for bind variables Since `9b49d27a8` ("cql3: expr: Remove shape_type from bind_variable"), bind variables no longer remember their context (e.g. whether they are in a scalar or vector comparison, or in an IN or other relation). Exploit that by merging all of the productions that generate bind variables (that are now exactly equal) into a single marker production. Closes #10624	2022-05-23 17:41:16 +03:00
Pavel Emelyanov	d755fdc1f4	storage_service: Remove global proxy call Storage service needs it to calculate schema version on join. The proxy at this point can be passed as an argument to the joining helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	bc051387c5	storage_service: Remove sys_dist_ks from storage_service dependencies The service in question is only needed join_cluster-time, no need to keep it in the dependencies list. This also solves the dependency trouble -- the distributed keyspace is sharded::start-ed after it's passed to storage_service initialization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	5a97ba7121	storage_service: Remove cdc_gen_service from storage_service dependencies This service is only needed at join time, so it's better to pass it as an argument to join_cluster(). This solves the current reversed dependency issue -- the cdc_gen_svc is now started after it's passed to storage service initialization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	0c40b69411	storage_service: Make _cdc_gen_id local variable Same as with _bootstrap_tokens -- this variable is only needed throughout a single function invocation, so it doesn't have to be a class member. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	80bd317292	storage_service: Make _bootstrap_tokens local variable Now it's a member of storage_service, but only to carry the set of tokens between two subsequent calls. Now that all the joining happens in one function, the set can become a local variable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	3f6d3ea601	storage_service: Merge prepare- and join- private members These two are the real code that does preparation and joining. They are called in async() context by public storage_service methods that had been merged recently, so this patch merges the internals. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	7ac73bb87f	storage_service: Move some code up the file No logic change, this is to keep join_token_ring next to prepare_to_join so that the patch merging them becomes clean and small. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	f478c7f29c	storage_service: Coroutinize join_token_ring Next patch will merge this method with prepare_to_join() which is already coroutinized. To make it happen -- coroutinize it in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	81e7de076e	storage_service: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	3a16d3ee95	storage_service: Execute its .bootstrap() into async() Next patches will coroutinize join_cluster(), so the .bootstrap() method should return a future. It's worth coroutinizing it as well, but that's a huge change, so for now -- keep it in its own explicit async(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	4a6bf57e8f	storage_service: Dont assume async context in mark_existing_views_as_built Next patches will coroutinize join_cluster(), this is a preparation to that change. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	282cc070bc	storage_service: Merge init-server and join-cluster Now they always follow one another both in main and cql-test-env. Also, despite the name, init-server does join the cluster when it's just a normal node restarting, so join-cluster is called when the cluster is already joined. This merge makes the function name match what it really does. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	b2b86b0c83	main, storage_service: Move wait for gossip to settle And make cql-test-env configure to skip it not to slow down tests in vain. Another side effect is that cql-test-env would trigger features enabling at this point, but that's OK, they are enabled anyway. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	b842167bcc	main, storage_service: Move passive announce subscription Storage service already has a vector of random subscription scope holders, this becomes yet another one. This partially reverts `e4f35e2139`, which is a half-step backwards, but so far I have no better idea where to track that scope guard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Pavel Emelyanov	89163a3be4	main, storage_service: Move early group0 join call It happens right after the prepare to join, moving it at the end of the latter call doesn't change the code logic. A side effect -- this removes a silly join_group0() one-line helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-23 12:55:30 +03:00
Nadav Har'El	02fb6f33fb	test/*/run: improve signal handling in test runners When a user runs a script and presses control-C, a SIGINT (signal 2) gets sent to every process in the script's "process group". By default, every subprocess started by a script joins the parent's process group. Our test/*/run test-runner scripts typically start two processes: scylla and pytest. If we keep them in the same process group, a control-C would kill them in a random order and that is ugly - if Scylla is killed before pytest, we'll see a few test failures before pytest is finally killed. So the existing code put Scylla in its own process group, and killed it on exit after killing pytest. But there were a few inconsistencies in our implementation, leading to some annoying behaviors: 1. Doing "kill -2" to the runner's process (not a control-C which sends a signal to the process group) caused scylla and pytest to be killed on exit. So far so good. But, we should kill their entire process groups, not just the one process. This is important when pytest starts its own subprocesses (as happens in cql-pytest/test_tools.py), otherwise they just remain running. We need to call killpg() instead of kill(), but also we forgot to start a new process group for the pytest run - so this patch fixes it. 2. Our exit handler - which kills the subprocesses - only gets called on signals which Python catches, and this is only SIGINT. Killing the test runner with SIGTERM or SIGHUP before this patch caused the subprocesses to be left running. In this patch we also catch SIGTERM and SIGHUP, so our exit handler is also run in that case. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10629	2022-05-23 12:24:19 +03:00
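The two fixes in `02fb6f33fb` (kill whole process groups, and run the exit handler on SIGTERM and SIGHUP too) can be sketched with standard-library calls only. This is an assumption-laden illustration for POSIX systems, not the runner's actual code: `start_in_own_group`, `kill_children`, and `install_handlers` are hypothetical names.

```python
import atexit
import os
import signal
import subprocess
import sys

children = []


def start_in_own_group(argv):
    """Start a subprocess as leader of a new session and process group."""
    proc = subprocess.Popen(argv, start_new_session=True)
    children.append(proc)
    return proc


def kill_children():
    """Kill each child's entire process group, not just the child itself."""
    for proc in children:
        try:
            # The child leads its own group, so its pid is also the group id.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group is already gone
        proc.wait()
    children.clear()


def install_handlers():
    """Make sure kill_children() also runs on SIGTERM and SIGHUP."""
    atexit.register(kill_children)
    for sig in (signal.SIGTERM, signal.SIGHUP):
        # Turning the signal into a normal exit lets atexit handlers run.
        signal.signal(sig, lambda signum, frame: sys.exit(1))
```

Python already converts SIGINT into `KeyboardInterrupt`, so only the other two signals need an explicit handler here.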
Piotr Sarna	84833e495e	Merge 'Additional cql-pytest tests related to WHERE expressions' from Nadav Har'El This small series includes a few more CQL tests in the cql-pytest framework. The main patch is a translation of a unit test from Cassandra that checks the behavior of restrictions (WHERE expressions, filtering) in different cases. It turns out that Cassandra didn't implement some cases - for example filtering on unfrozen UDTs - but Scylla does implement them. So in the translated test, the checks-that-these-features-generate-an-error from Cassandra are commented out, and this series also includes separate tests for these Scylla-unique features to check that they actually work correctly and not just that they exist. Closes #10611 * github.com:scylladb/scylla: cql-pytest: translate Cassandra's tests for relations test/cql-pytest: add test for filtering UDTs test/cql-pytest: tests for IN restrictions and filtering test/cql-pytest: test more cases of overlapping restrictions	2022-05-23 11:10:24 +02:00
Avi Kivity	6a3494442e	Update abseil submodule * abseil f70eadad...9e408e05 (109): > Cord: workaround a GCC 12.1 bug that triggers a spurious warning > Change workaround for MSVC bug regarding compile-time initialization to trigger from MSC_VER 1910 to 1930. 1929 is the last _MSC_VER for Visual Studio 2019. > Don't default to the unscaled cycle clock on any Apple targets. > Use SSE instructions for prefetch when __builtin_prefetch is unavailable > Replace direct uses of __builtin_prefetch from SwissTable with the wrapper functions. > Cast away an unused variable to play nice with -Wunused-but-set-variable. > Use NullSafeStringView for const char* args to absl::StrCat, treating null pointers as "" Fixes #1167 > raw_logging: Extract the inlined no-hook-registered behavior for LogPrefixHook to a default implementation. > absl: fix use-after-free in Mutex/CondVar > absl: fix live-lock in CondVar > Add a stress test for base_internal::ThreadIdentity reuse. > Improve compiler errors for mismatched ParsedFormat inputs. > Internal change > Fix an msan warning in cord_ringbuffer_test > Fix spelling error "charachter" > Document that Consume(Prefix\|Suffix)() don't modify the input on failure > Fixes for C++20 support when not using std::optional. > raw_logging: Document that AbortHook's buffers live for as long as the process remains alive. > raw_logging: Rename SafeWriteToStderr to indicate what about it is safe (answer: it's async-signal-safe). > Correct the comment about the probe sequence. It's (i/2 + i)/2 not (i/2 - i)/2. > Improve analysis of the number of extra `==` operations, which was overly complicated, slightly incorrect. > In btree, move rightmost_ into the CompressedTuple instead of root_. > raw_logging: Rename LogPrefixHook to reflect the other half of it's job (filtering by severity). 
> Don't construct/destroy object twice > Rename function_ref_benchmark.cc into more generic function_type_benchmark.cc, add missing includes > Fixed typo in `try_emplace` comment. > Fix a typo in a comment. > Adds ABSL_CONST_INIT to initializing declarations where it is missing > Automated visibility attribute cleanup. > Fix typo in absl/time/time.h > Fix typo: "a the condition" -> "a condition". > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Fix build with uclibc-ng (#1145) > Export of internal Abseil changes > Export of internal Abseil changes > Replace the implementation of the Mix function in arm64 back to 128bit multiplication (#1094) > Support for QNX (#1147) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Exclude unsupported x64 intrinsics from ARM64EC (#1135) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add NetBSD support (#1121) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Some trivial OpenBSD-related fixes (#1113) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add support of loongarch64 (#1110) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Disable ABSL_INTERNAL_ENABLE_FORMAT_CHECKER 
under VsCode/Intellisense (#1097) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > macos: support Apple Universal 2 builds (#1086) > cmake: make `random_mocking_bit_gen` library public. (#1084) > cmake: use target aliases from local Google Test checkout. (#1083) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > cmake: add ABSL_BUILD_TESTING option (#1057) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Fix googletest URL in CMakeLists.txt (#1062) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Fix Randen and PCG on Big Endian platforms (#1031) > Export of internal Abseil changes Closes #10630	2022-05-22 23:46:33 +03:00
Avi Kivity	4afe2a24b9	cql3: grammar: make 'relation' return an expression rather than append to a vector The 'relation' production is self-contained, except for its interface to the rest of the grammar where it appends to a vector of expressions (that happens to represent a conjunction of relations). Make it stand- alone by returning an expression, and move the responsibility for appending to an expression vector to the whereClause production (later we can make it build a conjunction expression rather than a vector of expressions, paving the way for more boolean operators). Closes #10623	2022-05-22 22:16:12 +03:00
Avi Kivity	a6b554409b	storage_proxy: stop using deprecated std::not1 and std::bind1st in cas(), get_paxos_participants(), and create_write_response_handler_helper() Use equivalent std::not_fn and std::bind_front instead. Closes #10622	2022-05-22 22:13:12 +03:00
Nadav Har'El	ac393a62a1	cql-pytest: translate Cassandra's tests for relations This is a translation of Cassandra's CQL unit test source file validation/operations/SelectSingleColumnRelationTest.java into our cql-pytest framework. This test file includes 23 tests for various types of SELECT operations which involve relations, a.k.a expressions (i.e., WHERE). All 23 tests pass on Cassandra. 3 of the tests fail on Scylla reproducing 2 already known Scylla issues and three minor previously-unknown issues: Previously known issues: Refs #2962: Collection column indexing Refs #10358: Comparison with UNSET_VALUE should produce an error Three new (and minor) issue: Refs #10577: Is max-clustering-key-restrictions-per-query too low? Refs #10631: Invalid IN restriction is reported as a '=' restriction Refs #10632: Column name printed in a strange way in error message NOTE: Scylla supports some expressions which Cassandra does not. In some cases the Cassandra unit test had checks that certain constructs are not allowed, and I had to comment out such checks when the expression does work in Scylla. But of course, in such cases, it is not enough to comment out a check - we also need to verify that Scylla's unique behavior is the correct one. For that, we will have separate cql-pytest test for those features - they won't be in the translated Cassandra unit tests (of course). For example, in this test I had to comment out a check that filtering on non-frozen UDTs is not allowed. In a separate patch which I'm sending in parallel, I added a new test - test_filter_UDT_restriction_nonfrozen - which will verify that what Scylla does in that case is the correct behavior. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-22 20:49:04 +03:00
Nadav Har'El	8af2c4aced	test/cql-pytest: add test for filtering UDTs Both Scylla and Cassandra support filtering on frozen UDTs, which are compared using lexicographical order. This patch adds a test to verify that the behavior here is the same - and indeed it is. For non-frozen UDTs, Cassandra does not allow filtering on them (this was decided in CASSANDRA-13247), but Scylla does. So we also add a test on how non-frozen UDTs work - that passes on Scylla (and of course not in Cassandra). The two tests here - for frozen and non-frozen UDTs - are identical (they just call the same function) - to ensure these two cases work the same. This is important because we can't judge the correctness of the non-frozen test by comparison to Cassandra - because it can't run there. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-22 20:20:15 +03:00
Nadav Har'El	1c355d773a	test/cql-pytest: tests for IN restrictions and filtering Cassandra only allows IN restrictions on primary key columns. With filtering (ALLOW FILTERING), this limitation makes little sense, and it turns out that Scylla does not have this limitation. So this patch adds a test that we support such queries correctly (we can't compare to Cassandra because it doesn't implement this). Another test checks IN restrictions on an indexed column (without ALLOW FILTERING). We could have implemented this - the indexed column behaves like a partition key - but we didn't, so this test xfails. It also fails on Cassandra because as mentioned above, Cassandra didn't implement IN except for primary key columns. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-22 20:20:15 +03:00
Nadav Har'El	bd71e6d962	test/cql-pytest: test more cases of overlapping restrictions As noted by test_filtering.py::test_multiple_restrictions_on_same_column Cassandra WHERE does not allow specifying two restrictions on the same column, but Scylla does allow it, and this test verifies that the results are correct (conflicting restrictions would lead to no results, but overlapping restrictions can return some results). In this patch we add yet another example of multiple restrictions on the same column that was seen in a Cassandra unit test - this time one of the restrictions involves an IN. This patch helps confirm that the expression evaluation is done correctly (and, again, differently from Cassandra - Cassandra results in an error in this case). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-22 20:20:15 +03:00
Alejo Sanchez	3904d3b96e	install-dependencies.sh: add python3-pytest-asyncio Add package pytest-asyncio for async pytest support. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #10616 [avi: regenerate frozen toolchain]	2022-05-22 17:46:56 +03:00
Pavel Emelyanov	1199c6e5da	snitch: Use invoke_on_others() to replicate The replication happens on all shards but current one. There's a special helper in seastar for such cases Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-20 18:16:22 +03:00
Pavel Emelyanov	5ec87285f8	snitch: Merge set_my_dc and set_my_rack into one These two are always used in pair. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-20 18:16:19 +03:00
Pavel Emelyanov	c6d0bc87d0	azure_snitch: Do nothing on non-io-cpu All snitch drivers are supposed to snitch info on some shard and replicate the dc/rack info across others. All but the azure one really do so. The azure one gets dc/rack on all shards, which is excessive but not terrible, but when all shards start to replicate their data to all the others, this may lead to use-after-frees. fixes: #10494 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-20 18:15:57 +03:00
Benny Halevy	b3e2204fe6	replica: distributed_loader: reindent populate_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:16:41 +03:00
Benny Halevy	a3c1dc8cee	replica: distributed_loader: coroutinize populate_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:15:06 +03:00
Benny Halevy	5b038affae	replica: distributed_loader: reindent handle_sstables_pending_delete Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	b8260c9983	replica: distributed_loader: coroutinize handle_sstables_pending_delete Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	48122d3006	replica: distributed_loader: reindent cleanup_column_family_temp_sst_dirs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	8ba10dba2d	replica: distributed_loader: coroutinize cleanup_column_family_temp_sst_dirs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	5f4d20267d	replica: distributed_loader: reindent make_sstables_available Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	b3ebbf35e2	replica: distributed_loader: coroutinize make_sstables_available Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	868cea21e0	sstable_directory: parallel_for_each_restricted: keep func alive across calls Without that there's use-after-free when called from distributed_loader::make_sstables_available where func is turned into a coroutine and the shared_sstable parameter is not explicitly copied and captured for the continuation of sst->move_to_new_dir. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	b13f44ca61	replica: distributed_loader: reindent reshape Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	a1e663f225	replica: distributed_loader: coroutinize reshape Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	cf0d0a18a0	replica: distributed_loader: coroutinize reshard Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	e1ba285d52	replica: distributed_loader: reindent run_resharding_jobs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	29e51ed0cd	replica: distributed_loader: coroutinize run_resharding_jobs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	b65d55cbbf	replica: distributed_loader: reindent distribute_reshard_jobs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	3baa4d4946	replica: distributed_loader: coroutinize distribute_reshard_jobs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	ba1eb7ab9c	replica: distributed_loader: reindent collect_all_shared_sstables Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	84d528cd84	replica: distributed_loader: coroutinize collect_all_shared_sstables Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:53 +03:00
Benny Halevy	8080e98309	replica: distributed_loader: reindent process_sstable_dir Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:51 +03:00
Benny Halevy	33179c8647	replica: distributed_loader: coroutinize process_sstable_dir Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-20 17:07:00 +03:00
Botond Dénes	f873806c7c	Merge 'Fix use-after-free when queue reader is destroyed before its handle' from Benny Halevy The handle must not point at this reader implementation after it's destroyed. This fixes use-after-free when the queue_reader_v2 is destroyed first as repair_writer_impl::_queue_reader, before repair_writer_impl::_mq is destroyed. The issue was introduced in `39205917a8` in the definition of `repair_writer_impl`. Fixes #10528 While at it, fix also an ignored exceptional future seen in the test: `repair_additional_test.py::TestRepairAdditional::test_repair_kill_3` Closes #10591 * github.com:scylladb/scylla: mutation_readers: queue_reader_v2: detach from handle when destroyed messaging_service: do_make_sink_source: handle failed source future	2022-05-19 21:51:41 +03:00
Raphael S. Carvalho	b120cacdd1	compaction_manager: Allow off-strategy to proceed in parallel to in-strategy compactions Off-strategy works on maintenance sstable set using maintenance scheduling group, whereas "in-strategy" works on main sstable set and uses compaction group. Today, it can happen that off-strategy has to wait for an "in-strategy" maintenance compaction, e.g. cleanup, to complete before getting a chance to run. But that's not desired behavior as off-strategy uses maintenance group, and its candidates don't add to the backlog that influences "in-strategy" bandwidth. Therefore, "in-strategy" and off-strategy should be decoupled, with off-strategy having its own semaphore for guaranteeing serialization across tables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10595	2022-05-19 17:37:11 +03:00
Takuya ASADA	b6003989f9	scylla_setup: stop using sudo -u, use user/group parameter on subprocess module To run scylla-housekeeping we currently use "sudo -u scylla <cmd>" to switch to the scylla user, but it fails in some environments. Since recent versions of Python 3 support switching users in the subprocess module, let's use the native Python way and drop sudo. Fixes #10483 Closes #10538	2022-05-19 17:21:35 +03:00
Avi Kivity	fcf292b5d1	Merge 'Add range deserializer' from Benny Halevy Currently, the idl-generated deserialization code, e.g. mutation_partition_view::rows() deserializes and returns a complete utils::chunked_vector<deletable_row_view> . And that could be arbitrarily long. To consume it gently, we don't need the whole vector in advance, but rather we can consume it one element at a time (and in a nested way for cells in a row in the future). Use `range_deserializer` to consume range tombstones and rows one item at a time. We may consider in the future also gently iterating over cells in a row and then dipping into collection cells that might also contain a large number of items. Fixes #10558 Closes #10566 * github.com:scylladb/scylla: ser: use vector_deserializer by default for all idl vectors mutation_partition_view: do_accept_gently: use the range based deserializers idl-compiler: generate *_range methods using vector_deserializer serializer_impl: add vector_deserializer test: frozen_mutation_test: test_writing_and_reading_gently: log detailed error	2022-05-19 17:21:35 +03:00
Avi Kivity	5285ccbb12	Merge 'Add prune ghost rows statement' from Piotr Sarna This series is split from another, bigger RFC series which provides manual remedies to deal with inconsistencies between the base table and its views. This part deals with ghost rows by providing a statement which fetches view rows from a given range, then reads its corresponding rows from the base table (cl=ALL), and finally removes rows which were not present in the base table at all, qualifying them as ghost rows. Motivations for introducing such a statement: * in case of detected inconsistencies, it can be used to fix materialized views without recreating them from scratch, which can take days and generates lots of throughput * a tool which periodically scrubs a materialized view can be easily created on top of this statement, especially that it's possible to remove ghost rows from a user-defined view token range; This series comes with a unit test. The reason for digging up this series is because it's still possible to end up with ghost rows in certain rather improbable scenarios, and we lack a way of fixing them without rebuilding the whole view. For instance, in case of a failed synchronous update to a local view, the user will be notified that the query failed, but a ghost row can be created nonetheless. The pruning statement introduced in this series would allow healing the failure locally, without rebuilding the whole view. Tests: unit(dev) Closes #10426 * github.com:scylladb/scylla: docs: add a paragraph on PRUNE MATERIALIZED VIEW statement service,test: add a test case for error during pruning tests: add ghost row deletion test case cql3: enable ghost row deletion via CQL cql3: add a statement for deleting ghost rows cql3: convert is_json statement parameter to enum pager: add ghost row deleting pager db,view: add delete ghost rows visitor	2022-05-19 17:21:35 +03:00
Gleb Natapov	c2ef390a52	service: raft: move group0 write path into a separate file Writing into the group0 raft group on a client side involves locking the state machine, choosing a state id and checking for its presence after the operation completes. The code that does it resides now in the migration manager since currently it is the only user of group0. In the near future we will have more clients for group0 and they all will have to have the same logic, so the patch moves it to a separate class raft_group0_client that any future user of group0 can use to write into it. Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>	2022-05-19 17:21:35 +03:00
Avi Kivity	78eccd8763	Merge "Remove sstable_format_slector::sync()" from Pavel E " There's an explicit barrier in main that waits for the sstable format selector to finish selecting it by the time the node starts to join a cluster. (Actually -- not quite, when restarting a normal node it joins the cluster in prepare_to_join()). This explicit barrier is not needed, the sync point already exists in the way features are enabled, the format-selector just needs to use it. branch: https://github.com/xemul/scylla/tree/br-format-selector-sync tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/351/ refs: #2795 " * 'br-format-selector-sync' of https://github.com/xemul/scylla: format-selector: Remove .sync() point format-selector: Coroutinize maybe_select_format() format-selector: Coroutinize simple methods	2022-05-19 17:21:35 +03:00
Avi Kivity	08ed4d7405	Merge 'scylla-gdb.py: add commands to dump sstables summary and index-cache' from Botond Dénes This series adds two commands: * scylla sstable-summary * scylla sstable-index-cache The former dumps the content of the sstable summary. This component is kept in memory in its entirety, so this can be easily done. The latter command dumps the content of the sstable index cache. This contains all the index-pages that are currently cached. The promoted index is not dumped yet and there is no indication of whether a given entry is in the LRU or not, but this already allows seeing which pages are in the cache and which aren't. Closes #10546 * github.com:scylladb/scylla: scylla-gdb.py: add scylla sstable-index-cache command scylla-gdb.py: add scylla sstable-summary command test/scylla-gdb: add sstable fixture scylla-gdb.py: make chunked_vector a proper container wrapper" scylla-gdb.py: make small_vector a proper container wrapper" scylla-gdb.py: add sstring container wrapper scylla-gdb.py: add chunked_managed_vector container wrapper scylla-gdb.py: add managed_vector container wrapper scylla-gdb.py: std_variant: add workaround for clang template bug scylla-gdb.py: add bplus_tree container wrapper	2022-05-19 17:21:35 +03:00
Pavel Emelyanov	3e53a0965c	scylla-gdb: Handle new seastar/fair_queue layout Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220519094452.14538-1-xemul@scylladb.com>	2022-05-19 13:29:03 +03:00
Benny Halevy	9c5feb2781	mutation_readers: queue_reader_v2: detach from handle when destroyed The handle must not point at this reader implementation after it's destroyed. This fixes use-after-free when the queue_reader_v2 is destroyed first as repair_writer_impl::_queue_reader, before repair_writer_impl::_mq is destroyed. The issue was introduced in `39205917a8` in the definition of `repair_writer_impl`. Fixes #10528 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-19 11:48:03 +03:00
Benny Halevy	1308b45c58	messaging_service: do_make_sink_source: handle failed source future I've stumbled upon this with version `a2901a376d` in debug mode when testing repair_additional_test.py::TestRepairAdditional::test_repair_kill_3: WARN 2022-05-17 07:26:12,581 [shard 0] seastar - Exceptional future ignored: seastar::rpc::closed_error (connection is closed), backtrace: 0x137c33d0 0x1ad14d0d 0x1ad149cd 0x1ad16fc3 0x1ad17e52 0x19d8a809 0x19d8ab6a 0x139165a9 0x17be0d21 0x17bdcfb0 0x17bf3611 0x17bf39f0 0x17bf3c62 0x17bf3958 0x17bf57d8 0x17bf5468 0x19efe44e 0x19f04ac6 0x19f09732 0x19f072a1 0x19cca281 0x19cc7de5 0x13859cbf 0x13d309d6 0x13d3090b 0x13d30775 0x1391364d 0x13858521 /lib64/libc.so.6+0x27b74 0x137774ad Decode: ``` seastar::report_failed_future(seastar::future_state_base::any&&) at //./seastar/src/core/future.cc:218 seastar::future_state_base::any::check_failure() at //./seastar/include/seastar/core/future.hh:573 seastar::future_state<seastar::rpc::source<repair_row_on_wire_with_cmd> >::clear() at ././seastar/include/seastar/core/future.hh:615 ~future_state at ././seastar/include/seastar/core/future.hh:620 (inlined by) ~future at ././seastar/include/seastar/core/future.hh:1343 ~ at ./message/messaging_service.cc:841 ``` Looks like if sink.close() fails after source.failed() then source gets abandoned. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-19 11:47:38 +03:00
Avi Kivity	5c481973a3	sstables/processing_result_generator.hh: refine check for coroutine standard We have a check for whether we can use standard coroutines (in namespace std) or the technical specification (in std::experimental), but it doesn't work since Clang doesn't report the correct standard version. Use a compiler-version-specific check, inspired by Seastar's check. This allows building with clang 14. Closes #10603	2022-05-19 11:31:40 +03:00
Nadav Har'El	0040e9e7f4	Merge 'cql: Add proper validation for null and unset inside collections send as bound values' from Jan Ciołek Let's say we have a query like: ```cql INSERT INTO ks.t (list_column) VALUES (?); ``` And the driver sends a list with null inside as the bound value, something like `[1, 2, null, 4]`. In such a case we should throw `invalid_request_exception` because `nulls` are not allowed inside collections. Currently when a query like this gets executed Scylla throws an ugly marshalling error. This is because the validation code reads the size of the next element, interprets it as an unsigned integer and tries to read this much. In case of a `null` element the size is `-1`, which when converted to unsigned `size_t` gives 18446744073709551615 and it fails to read this much. This PR adds proper validation checks to make the error message better. I also added some tests. I originally tried to write them in python, but the python driver really doesn't like sending invalid values. Trying to send `[1, None, 2]` results in a list with an empty value instead of null. Trying to send `[1, UNSET_VALUE, 2]` fails before the query even leaves the driver. Fixes #10580 Closes #10599 * github.com:scylladb/scylla: cql3: Add tests for null and unset inside collections cql3: Add null and unset checks in collection validation	2022-05-19 11:25:24 +03:00
Piotr Sarna	e54a4ebdcb	docs: add a paragraph on PRUNE MATERIALIZED VIEW statement	2022-05-19 10:16:04 +02:00
Piotr Sarna	b8a36ff253	service,test: add a test case for error during pruning The test case checks that errors which occur during materialized view pruning are properly propagated back to the user.	2022-05-19 10:16:04 +02:00
Piotr Sarna	995468520e	tests: add ghost row deletion test case The test checks whether manually injected ghost rows are properly deleted by the ghost row delete statement, and that non-ghost regular rows are left intact.	2022-05-19 10:16:03 +02:00
Piotr Sarna	be2ef862bd	cql3: enable ghost row deletion via CQL This commit allows accepting a CQL request to clear ghost rows from a given view partition range. Currently its syntax is a purposely convoluted mix of existing keywords, which makes sure that the statement is never issued by mistake. Example runs: -- try deleting all ghost rows, effectively performs a paged full scan PRUNE MATERIALIZED VIEW my_mv; -- try deleting ghost rows from a single view partition PRUNE MATERIALIZED VIEW my_mv WHERE mv_pk = 3; -- try deleting ghost rows from a token range (effective full scans) PRUNE MATERIALIZED VIEW my_mv WHERE TOKEN(mv_pk) > 7 AND TOKEN(mv_pk) < 42	2022-05-19 10:11:50 +02:00
Piotr Sarna	ec0a3bbbd4	cql3: add a statement for deleting ghost rows In order to expose the API for deleting ghost rows from a view, a CQL statement is created. It is loosely based on select_statement, as its first step is to select view table rows.	2022-05-19 10:11:50 +02:00
Piotr Sarna	d74e25be67	cql3: convert is_json statement parameter to enum Right now is_json is used to decide if the statement needs to be treated in a special way. For two types (regular statement and JSON statement), a boolean is enough, but this series extends it for two more types, so the flag is converted to an enum.	2022-05-19 10:11:50 +02:00
Piotr Sarna	2c6e1a5409	pager: add ghost row deleting pager The pager is based on ghost row deleting visitor - it simply traverses each fetched page with it in order to delete ghost rows from the view table.	2022-05-19 10:11:50 +02:00
Piotr Sarna	c3a9658535	db,view: add delete ghost rows visitor The visitor is used to traverse view rows, and if it detects a ghost row it qualifies it for deletion. Qualification is based on a base table read with cl=ALL: if the corresponding row is not present in the base table, it is considered a ghost.	2022-05-19 10:11:50 +02:00
cvybhu	7adc572ec6	cql3: Add tests for null and unset inside collections Add a bunch of tests that test what happens when there is a null or unset value inside collections. They are not allowed so every such attempt should end with invalid_request_exception with proper message. I had to write a new function for collection serialization. I tried to use data_value and its methods, but it's impossible to create a data_value that represents an unset value. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-19 00:15:17 +02:00
Benny Halevy	4e78b4de1b	ser: use vector_deserializer by default for all idl vectors Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-18 19:24:18 +03:00
Benny Halevy	ea89a9aa56	mutation_partition_view: do_accept_gently: use the range based deserializers Currently use the range_deserializer for range tombstones and rows. We may consider in the future also gently iterating over cells in a row and then dipping into collection cells that might also contain a large number of items. Fixes #10558 Test: frozen_mutation_test(dev, debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-18 19:24:18 +03:00
Benny Halevy	29632e739d	idl-compiler: generate _range methods using vector_deserializer Generate code for _range methods that return a vector_deserializer rather than constructing the complete vector of views. This would be useful for streamed mutation unfreezing in the following patch. Later, we should just use vector_deserializer for all vectors. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-18 19:24:18 +03:00
Benny Halevy	5b902d9fd6	serializer_impl: add vector_deserializer To be used for streaming through a serialized vector, deserializing the items as we go when dereferencing or incrementing the iterator. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-18 19:10:13 +03:00
Avi Kivity	e6fe9cc683	Merge 'Reapply: "disable_auto_compaction: stop ongoing compactions"' from Eliran Sinvani Reapply: "disable_auto_compaction: stop ongoing compactions" This is a reapplication of a former commit `4affa801a5` which was reverted by `8e8dc2c930`. This commit is a fixed version of the original, where a call to the compaction_manager constructor (`compaction_manager()`) was accidentally issued instead of a call to retrieve a compaction manager reference (`get_compaction_manager()`). We don't use this function because it doesn't exist anymore - it existed at the time the patch was written but was removed in `9066224cf4` later on; instead, we just use the private table member _compaction_manager which refs the compaction manager. The explanation for the bad effect is probably that a `this` pointer capture down the call chain resulted in a use-after-free which had an unknown effect on the system (memory corruption at startup). Test: unit (dev,debug) write performance test as the one used to find the bug. A screenshot of the performance test can be found at https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381 Fixes https://github.com/scylladb/scylla/issues/9313 Refs https://github.com/scylladb/scylla/issues/10146 For completeness, the original commit message was: The api call disables new regular compaction jobs from starting but it doesn't wait for ongoing compactions to stop and so it's much less useful. Returning after stopping regular compaction jobs and waiting for them to stop guarantees that no regular compaction jobs are running when nodetool disableautocompaction returns successfully. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #10597 * github.com:scylladb/scylla: compaction_manager: Make invoking the empty constructor more explicit Reapply: "disable_auto_compaction: stop ongoing compactions"	2022-05-18 18:33:12 +03:00
Eliran Sinvani	c5e5692a01	compaction_manager: Make invoking the empty constructor more explicit The compaction manager's empty constructor is supposed to be invoked only in a testing environment; however, it is easy to invoke it by mistake from production code. Here we add a more verbose constructor and make the default constructor private. The verbose constructor needs to be invoked with a tag, for_testing_tag, which ensures that this constructor is invoked only when intended. The unit tests were changed according to this new paradigm. Tests: unit (dev) Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-05-18 14:57:10 +03:00
Eliran Sinvani	c138981286	Reapply: "disable_auto_compaction: stop ongoing compactions" This is a reapplication of a former commit `4affa801a5` which was reverted by `8e8dc2c930`. This commit is a fixed version of the original, where a call to the compaction_manager constructor (`compaction_manager()`) was accidentally issued instead of a call to retrieve a compaction manager reference (`get_compaction_manager()`). We don't use this function because it doesn't exist anymore - it existed at the time the patch was written but was removed in `9066224cf4` later on; instead, we just use the private table member _compaction_manager which refs the compaction manager. The explanation for the bad effect is probably that a `this` pointer capture down the call chain resulted in a use-after-free which had an unknown effect on the system (memory corruption at startup). Test: unit (dev,debug) write performance test as the one used to find the bug. A screenshot of the performance test can be found at https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381 Fixes #9313 Refs #10146 For completeness, the original commit message was: The api call disables new regular compaction jobs from starting but it doesn't wait for ongoing compactions to stop and so it's much less useful. Returning after stopping regular compaction jobs and waiting for them to stop guarantees that no regular compaction jobs are running when nodetool disableautocompaction returns successfully. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-05-18 14:57:10 +03:00
cvybhu	345e89756b	cql3: Add null and unset checks in collection validation Validating a collection should ensure that there are no null or unset values inside the collection. The validation already fails in case of such values, but it does so in an ugly way. Length of null and unset value is negative but is cast to unsigned size_t. Then it tries to read a really large value and fails with marshalling error. The new checks are a better way to handle this. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-18 11:05:14 +02:00
Benny Halevy	9e1c76ea9e	test: frozen_mutation_test: test_writing_and_reading_gently: log detailed error The nested exception is also interesting and boost doesn't print it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-18 08:21:54 +03:00
Botond Dénes	6fd1322bf3	test/boost/mutation_test: test_query_digest: use the same now everywhere This test started failing sporadically of late. This failure is seen quite often in CI tests but is very hard to reproduce locally. The problem seems to be timing related, as the same seeds that fail in CI don't fail locally. This patch is a speculative fix. The test has a single time-related component: `gc_clock::now()`. This is invoked in 4 different places during a single iteration, giving ample opportunity for off-by-one errors to appear. Although there is no solid proof for this being the problem, this is a good candidate. This patch replaces all those different invocations with a single one per test: this value is then propagated to all places that need it. Fixes: #10554 Marking the patch as a fix for the issue; if the problem re-surfaces after this patch we'll re-open it. Closes #10589	2022-05-17 16:25:04 +03:00
Takuya ASADA	883b97d8b2	dist/common/scripts: generate debug log when exception occurred Using the traceback_with_variables module, generate a more detailed traceback with variables into the debug log. This will help fix bugs that are hard to reproduce. Closes #10472 [avi: regenerate frozen toolchain]	2022-05-17 13:18:27 +03:00
Raphael S. Carvalho	ca322fb7c2	compaction_manager: Quickly abort maintenance compaction waiting for its turn Today, aborting a maintenance compaction like major, which is waiting for its turn to run, can take lots of time because compaction manager will only be able to bail out the task once it gets the "permit" from the serialization mechanism, i.e. semaphore. Meaning that the command that started the task will only complete after all this time waiting for the "permit". To allow a pending maintenance compaction to be quickly aborted, we can use the abortable variant of get_units(). So when user submits an abortion request, get_units() will be able to return earlier through the abort exception. Refs #10485. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10581	2022-05-17 13:14:51 +03:00
Avi Kivity	a8f9ed56fb	Merge 'cql3: Replace relation class with expression' from Jan Ciołek Before this change the parser used to output instances of the `relation` class, which were later converted to `restriction`. `relation` took care of initial processing such as preparing and some validation checks. This PR aims to remove the `relation` class and perform its functionality using only `expression`. This is a step towards removing the legacy classes and converting all AST analysis to work on `expressions`. Closes #10409 * github.com:scylladb/scylla: cql3: Remove scalar from bind_variable_scalar_prepare_expression cql3: expr: Remove shape_type from bind_variable cql3: Remove prepare_expression_multi_column cql3: Remove relation class cql3: Add more tests for expr::printer cql3: Make parser output expression for relations cql3: expr: add printer for expression cql3: expr: expr::to_restriction: Handle token relations cql3: expr: expr::to_restriction: Handle multi column relations cql3: expr: Add expr::to_restriction for single column relations cql3: expr: Add prepare_binary_operator cql3: expr: Change how prepare_expression handles bind_variable clq3: expr: Add columns to expr::token struct cql3: expr: Modify list_prepare_expression to handle lists of IN values cql3: expr: Add expr::as_if for non-const expressions	2022-05-17 12:51:54 +03:00
Takuya ASADA	00ce34c29b	scylla_prepare: describe error more correctly Currently our error message in scylla_prepare says "Exception occurred while creating perftune.yaml", even when perftune.yaml has already been generated and the error occurred after that. To describe the error more correctly, add another error message for failures after perftune.yaml is generated. see scylladb/scylla-enterprise#2201 Closes #10575	2022-05-16 20:05:58 +03:00
cvybhu	21453ac9a4	cql3: Remove scalar from bind_variable_scalar_prepare_expression There is now only one function to prepare bind_variable, so we can remove 'scalar' from its name. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	9b49d27a8d	cql3: expr: Remove shape_type from bind_variable shape_type was used in prepare_expression to differentiate between a few cases and create the correct receivers. This was used by the relation class. Now creating the correct receiver has been delegated to the caller of prepare_expression and all bind_variables can be handled in the same simple way. shape_type is not needed anymore. Not having it is better because it simplifies things. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	c0fc82d4be	cql3: Remove prepare_expression_multi_column This function was used by multi_column_relation.hh, but now it isn't needed anymore. The only way to prepare a bind_variable is now the standard prepare_expression. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	d85f680df3	cql3: Remove relation class Functionality of the relation class has been replaced by expr::to_restriction. Relation and all classes deriving from it can now be removed. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	575c4bd76b	cql3: Add more tests for expr::printer Now that the parser outputs expressions it's much easier to check whether the expression printer works correctly. We can prepare a bunch of strings which will be parsed and then printed back to string. Then we can compare those strings. It's much easier than creating expressions to print manually. The only downside is that this tests only the unprepared version of an expression, so instead of column_value there will be unresolved identifier, instead of constant untyped_constant, etc. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	51cdbdeacb	cql3: Make parser output expression for relations Parser used to output the where clause as a vector of relations, but now we can change it to a vector of expressions. Cql.g needs to be modified to output expressions instead of relations. The WHERE clause is kept in a few places in the code that need to be changed to vector<expression>. Finally relation->to_restriction is replaced by expr::to_restriction and the expressions are converted to restrictions where required. The relation class isn't used anywhere now and can be removed. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
Michał Sala	f6bdc4d694	cql3: expr: add printer for expression expression::printer is used to print CQL expressions in a pretty way that allows them to be parsed back to the same representation. There are a bunch of things that need to be changed when compared to the current implementation of operator<<(expression) to output something parsable. Column names should be printed without 'unresolved_identifier()' and sometimes they need to be quoted to preserve case sensitivity. I needed to write new code for printing constant values because the current one did debug printing (e.g. a set was printed as '1; 2; 3'). A list of IN values should be printed inside () instead of [], but because it is internally represented as a list it is by default printed with []. To fix this a temporary tuple_constructor is created and printed. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	c4f846dbc8	cql3: expr: expr::to_restriction: Handle token relations Implement converting token relations to expressions. The code is mostly taken from functions in token_relation.hh, because we are replicating functionality of the functions called token_relation::new_XX_restrictions. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	5fc5012f9b	cql3: expr: expr::to_restriction: Handle multi column relations Implement converting multi column relations to expressions. The code is mostly taken from functions in multi_column_relation.hh, because we are replicating functionality of the functions called multi_column_relation::new_XX_restriction. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:58 +02:00
cvybhu	89950e02b5	cql3: expr: Add expr::to_restriction for single column relations Add a function that will be used to convert expressions received from the parser to restrictions. Currently parser creates relations with expressions inside and then those relations are converted to restrictions. Once this function is implemented we will be able to skip creating relations altogether and convert straight from expression to restriction. This will allow us to remove the relation class. Further functionality will be implemented in the following commits. This commit implements converting single column relations to expressions. The code is mostly taken from functions in single_column_relation.hh, because we are replicating functionality of the functions called single_column_relation::new_XX_restriction. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:17:57 +02:00
cvybhu	3e5e5c4a17	cql3: expr: Add prepare_binary_operator Add a function that prepares a binary_operator received from the parser. It resolves columns on the LHS, calculates type of LHS, and prepares RHS with the correct type. It will be used by expr::to_restriction. Some basic type checks are performed, but more thorough checks will be required in expr::to_restriction to fully validate a relation. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:15:37 +02:00
cvybhu	5dee55d433	cql3: expr: Change how prepare_expression handles bind_variable The situation with preparing bind_variable is a bit strange: there are four shapes of bind variables and receiver behaviour is not in line with other types. To prepare a bind_variable for a list of IN values for an int column the current code requires us to pass a receiver of type int. This is counterintuitive: to prepare a string we pass a receiver with string type, so to prepare list<int> we should pass a receiver of type list<int>, not just int. This commit changes the behaviour in two ways: - Shape of bind_variable doesn't matter anymore - The bind_variable gets the receiver passed to prepare_expression, no more list<receiver> magic. Other variants of bind_variable_x_prepare_expression are not removed yet because they are needed by prepare_expression_multi_column. They will be removed later, along with bind_variable::shape_type. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:15:36 +02:00
cvybhu	be6e741b6c	cql3: expr: Add columns to expr::token struct The expr::token struct is created when something like token(p1, p2) occurs in the WHERE clause. Currently expr::token doesn't keep columns passed as arguments to the token function. They weren't needed because token() validation was done inside token_relation. Now that we want to use only expressions we need to have columns inside the token struct and validate that those are the correct columns. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:03:11 +02:00
cvybhu	b99aae7d41	cql3: expr: Modify list_prepare_expression to handle lists of IN values The standard CQL list type doesn't allow for nulls inside the collection. However lists of IN values are the exception where bind nulls are allowed, for example in restrictions like: p IN (1, 2, null) To be able to use list_prepare_expression with lists of IN values a flag is added to specify whether nulls should be allowed. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:03:11 +02:00
cvybhu	2b5818697a	cql3: expr: Add expr::as_if for non-const expressions expr::as_if is our wrapper for std::get_if. There was a version for const expression, but there wasn't one for mutable expression. Add the mutable version, it will be needed in the following commits. Signed-off-by: cvybhu <jan.ciolek@scylladb.com>	2022-05-16 18:03:11 +02:00
Pavel Emelyanov	f81f1c7ef7	format-selector: Remove .sync() point The feature listener callbacks are waited upon to finish in the middle of the cluster joining process. In particular -- before actually joining the cluster the format should have been selected. For that there's a .sync() method that locks the semaphore thus making sure that any update is finished and it's called right after the wait_for_gossip_to_settle() finishes. However, features are enabled inside the wait_for_gossip_to_settle() in a seastar::async() context that's also waited upon to finish. This waiting makes it possible for any feature listener to .get() any of its futures that should be resolved until gossip is settled. That said, the format selection barrier can be moved -- instead of waiting on the semaphore, the respective part of the selection code can be .get()-ed (it all runs in async context). One thing to care about -- the remainder should continue running with the gate held. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-16 14:14:14 +03:00
Pavel Emelyanov	7fee50f1e3	format-selector: Coroutinize maybe_select_format() This method is run when a feature is enabled. It's a bit trickier than the others, also there are two methods actually, that are merged into one by this patch. By and large most of the care is about the _sel gate and _sem semaphore. The gate protects the whole selection code from the selector being freed from underneath it on stop. The semaphore is only needed to keep two different format selections from each other -- each update the system keyspace, local variable and replica::database instance on all shards. In the end there's a gossiper update, but it happens outside of the semaphore. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-16 14:13:59 +03:00
Pavel Emelyanov	93df88aac4	format-selector: Coroutinize simple methods These are all just straightforward uses of co_await around the code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-16 14:13:59 +03:00
Nadav Har'El	a2b9a17927	test/cql-pytest: add test for default clustering order of SELECT Our documentation for SELECT, https://docs.scylladb.com/getting-started/dml/#ordering-results says that: "The ORDER BY clause lets you select the order of the returned results. It takes as argument a list of column names along with the order for the column (ASC for ascendant and DESC for descendant, omitting the order being equivalent to ASC)." The test in this patch confirms that the last emphasized line is not accurate - The default order for SELECT is the default order of the table being read - NOT always ascending order. If the table was created with descending WITH CLUSTERING ORDER BY, then a SELECT not specifying an ORDER BY will get this descending order by default. The test passes on both Scylla and Cassandra, demonstrating that this behavior is expected and correct - regardless of what our docs say. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220515115030.775813-1-nyh@scylladb.com>	2022-05-16 11:52:02 +02:00
Avi Kivity	45f75ef595	service/memory_limiter.hh: correct license When split from service/storage_service.hh (`4ca2ae13`) it accidentally changed license. Change it back (since it does not contain Apache derived code, constrain it to AGPL-3.0-or-later). Closes #10572	2022-05-16 10:01:06 +03:00
Benny Halevy	8a6f8c622d	sstables: writer: pass bytes_ostream by reference The bytes_ostream param is passed by value from `write_promoted_index` (since `0d8463aba5`) causing an unneeded copy. This can lead to OOM if the promoted index is extremely large. Pass the bytes_ostream by reference instead to prevent this copy. Fixes #10569 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10570	2022-05-15 17:53:38 +03:00
Avi Kivity	528ab5a502	treewide: change metric calls from make_derive to make_counter make_derive was recently deprecated in favor of make_counter, so make the change throughout the codebase. Closes #10564	2022-05-14 12:53:55 +02:00
Piotr Sarna	768b5f3f29	utils: mark loading_cache::shrink as noexcept Current code already assumes (correctly), that shrink() does not throw, otherwise we risk leaking memory allocated in get_ptr(): ``` ts_value_lru_entry* new_lru_entry = Alloc().template allocate_object<ts_value_lru_entry>(); // Remove the least recently used items if map is too big. shrink(); ``` Let's be explicit and mark shrink() and a few helper methods that it uses as noexcept. Ultimately they are all noexcept anyway, because polymorphic allocator's deallocation routines don't throw, and neither do boost intrusive list iterators. Closes #10565	2022-05-13 18:28:58 +03:00
Nadav Har'El	c51a41a885	test/cql-pytest: test for multiple restrictions on same column It turns out that there is a difference between how Scylla and Cassandra handle multiple restrictions on the same column - for example "WHERE c = 0 and c >0". Cassandra treats all such cases as invalid queries, whereas Scylla allows them. This test demonstrates this difference (it is marked "scylla_only" because it's a Scylla-only feature), and also verifies that the results of such queries on Scylla are correct - i.e., if the two restrictions conflict the result is empty, and if the two restrictions overlap, the result can be non-empty. The test passes, verifying that although Scylla differs from Cassandra on this, its behavior is correct. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220512165107.644932-1-nyh@scylladb.com>	2022-05-13 08:43:57 +03:00
Avi Kivity	0dd5f02022	build: enable ABSL_PROPAGATE_CXX_STD Recently Abseil started to ask to enable ABSL_PROPAGATE_CXX_STD, warning that it will do so itself in the future. Do so, and specify that we use C++20 to avoid inconsistencies. Closes #10563	2022-05-13 07:12:03 +02:00
Avi Kivity	5937b1fa23	treewide: remove empty comments in top-of-files After `fcb8d040` ("treewide: use Software Package Data Exchange (SPDX) license identifiers"), many dual-licensed files were left with empty comments on top. Remove them to avoid visual noise. Closes #10562	2022-05-13 07:11:58 +02:00
Botond Dénes	1f0d3d57eb	Merge 'Convert (almost) all uses of flat_mutation_reader_assertions to v2' from Michael Livshin "Almost" because 2 uses of the v1 asserter remain (as they are deliberate). Closes #10518 * github.com:scylladb/scylla: tests: remove obsolete utility functions tests: less trivial flat_reader_assertions{,_v2} conversions tests: trivial flat_reader_assertions{,_v2} conversions flat_mutation_reader_assertions_v2: improve range tombstone support	2022-05-13 08:04:20 +03:00
Eliran Sinvani	8e8dc2c930	Revert "table: disable_auto_compaction: stop ongoing compactions" This reverts commit `4affa801a5`. In issue #10146 a write throughput drop of ~50% was reported; after bisecting, it was found that the change that caused it was adding some code to table::disable_auto_compaction which stops ongoing compactions and returns a future that resolves once all the compaction tasks for a table, if any, were terminated. It turns out that this function is used only at startup (and in REST api calls which are not used in the test) in the distributed loader just before resharding and loading of the sstable data. It is then re-enabled after the resharding and loading is done. For a still unknown reason, adding the extra logic of stopping ongoing compactions made the write throughput drop to 50%. Strangely enough this extra logic should (still unvalidated) not have any side effects since no compactions for a table are supposed to be running prior to loading it. This regains the performance but also undoes a change which eventually should get in once we find the actual culprit. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #10559 Reopens #9313.	2022-05-12 18:51:25 +03:00
Benny Halevy	333f6c5ec9	mutation_partition_view: do_accept_gently: keep clustering_row key on stack We're hitting a unit test failure as in https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/1010/artifact/testlog/aarch64_dev/frozen_mutation_test.test_writing_and_reading_gently.918.log ``` unknown location(0): fatal error: in "test_writing_and_reading_gently": std::_Nested_exception<std::runtime_error>: frozen_mutation::unfreeze_gently(): failed unfreezing mutation pk{00801806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000} of ks.cf ``` on aarch64 clang 12.0.1, in release and dev modes (but not debug). This turned out to be a miscompilation in `position_in_partition_view::for_key(cr.key())` that returns a position_in_partition_view of the clustering_key_prefix rvalue that cr.key() returns. The latter is lost on aarch64 in release mode. Keeping the key on the stack allows to safely pass a view to it. Fixes #10555 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-12 16:26:07 +03:00
Piotr Sarna	eb6f4cc839	Merge 'dependencies: add rust' from Wojciech Mitros The main reason for adding rust dependency to scylla is the wasmtime library, which is written in rust. Although there exist c++ bindings, they don't expose all of its features, so we want to do that ourselves using rust's cxx. The patch also includes an example rust source to be used in c++, and its example use in tests/boost/rust_test. The usage of wasmtime has been slightly modified to avoid duplicate symbol errors, but as a result of adding a Rust dependency, it is going to be removed from `configure.py` completely anyway Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #10341 * github.com:scylladb/scylla: docs: document rust tests: add rust example	2022-05-12 15:24:58 +02:00
Michael Livshin	00ed4ac74c	batchlog_manager: warn when a batch fails to replay Only for reasons other than "no such KS", i.e. when the failure is presumed transient and the batch in question is not deleted from batchlog and will be retried in the future. (Would info be more appropriate here than warning?) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10556	2022-05-12 13:34:03 +03:00
Asias He	7a38b806be	repair: Trigger off strategy compaction after all ranges of a table are repaired When the repair reason is not repair, which means the repair reason is node operations (bootstrap, replace and so on), a single repair job contains all the ranges of a table that need to be repaired. To trigger off strategy compaction early and reduce the number of temporary sstable files on disk, we can trigger the compaction as soon as a table is finished. Refs: #10462	2022-05-12 10:46:11 +08:00
Asias He	3dc9a81d02	repair: Repair table by table internally This patch changes the way a repair job walks through tables and ranges if multiple tables and ranges are requested by users. Before: ``` for range in ranges for table in tables repair(range, table) ``` After: ``` for table in tables for range in ranges repair(range, table) ``` The motivation for this change is to allow off-strategy compaction to trigger early, as soon as a table is finished. This allows to reduce the number of temporary sstables on disk. For example, if there are 50 tables and 256 ranges to repair, each range will generate one sstable. Before this change, there will be 50 * 256 sstables on disk before off-strategy compaction triggers. After this change, once a table is finished, off-strategy compaction can compact the 256 sstables. As a result, this would reduce the number of sstables by 50X. This is very useful for repair based node operations since multiple ranges and tables can be requested in a single repair job. Refs: #10462	2022-05-12 10:46:11 +08:00
Wojciech Mitros	cb5d054a67	docs: document rust Using Rust in Scylla is not intuitive, the doc explains the entire process of adding new Rust source files to Scylla. What happens during compilation is also explained. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-05-11 16:49:31 +02:00
Wojciech Mitros	4ad012cb6a	tests: add rust example The patch includes an example rust source to be used in c++, and its example use in tests/boost/rust_test. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-05-11 16:49:31 +02:00
Botond Dénes	c3eddab976	scylla-gdb.py: add scylla sstable-index-cache command Print currently cached index pages for the given sstable. Example output: (gdb) scylla sstable-index-cache $sst [0]: [ { key: 63617373616e647261, token: 356242581507269238, position: 0 } ] Total: 1 page(s) (1 loaded, 0 loading)	2022-05-11 16:17:40 +03:00
Botond Dénes	2f3f07881b	scylla-gdb.py: add scylla sstable-summary command Print content of sstable summary. Example output: (gdb) scylla sstable-summary $sst header: {min_index_interval = 128, size = 1, memory_size = 21, sampling_level = 128, size_at_full_sampling = 0} first_key: 63617373616e647261 last_key: 63617373616e647261 [0]: { token: 356242581507269238, key: 63617373616e647261, position: 0}	2022-05-11 16:16:24 +03:00
Botond Dénes	7af107b39d	test/scylla-gdb: add sstable fixture	2022-05-11 16:16:05 +03:00
Benny Halevy	4a5842787e	memtable_list: clear_and_add: let caller clear the old memtables As a follow up on `b8263e550a`, make clear_and_add synchronous yet again, and just return the swapped list of memtables so that the caller (table::clear) can clear them gently. Refs https://github.com/scylladb/scylla/pull/10424#discussion_r867455056 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10540	2022-05-11 14:46:30 +02:00
Botond Dénes	eeca9f24e8	Merge 'Docs: improve debugging.md' from Benny Halevy This series update debugging.md with: - add an example .gdbinit file - update recommendation for finding the relocatable packages using a build-id on http://backtrace.scylladb.com/ Closes #10492 * github.com:scylladb/scylla: docs: debugging.md: update instructions regarding backtrace.scylladb.com docs: debugging.md: add a sample gdbinit file	2022-05-11 14:46:30 +02:00
Nadav Har'El	b2450886d7	Merge 'Debugging.md relocatable package updates' from Botond Dénes Drop the section about non-relocatable packages. They are not a thing anymore. Also tweaked the instructions for launching the toolchain container. Closes #10539 * github.com:scylladb/scylla: docs/debugging.md: adjust instructions for using the toolchain docs/debugging.md: drop section about handling binaries from non-relocatable packages	2022-05-11 14:46:30 +02:00
Takuya ASADA	a9dfe5a8f4	scylla_sysconfig_setup: handle >=32CPUs correctly Seems like `59adf05` has a bug, the regex pattern only handles first 32CPUs cpuset pattern, and ignores rest. We should extend regex pattern to handle all CPUs. Fixes #10523 Closes #10524	2022-05-11 14:46:30 +02:00
Nadav Har'El	043b1c7f89	Update seastar submodule. Unfortunately, also requires two changes to Scylla itself to make it still compile - see below * seastar 5e863627...96bb3a1b (18): > install-dependencies: add rocky as a supported distro > circleci: relax docker limits to allow running with new toolchain > core: memory: Add memory::free_memory() also in Debug mode > build: bump up zlib to 1.2.12 > cmake: add FindValgrind.cmake > Merge 'seastar-addr2line: support sct syslogs' from Benny Halevy > rpc: lower log level for 'failed to connect' errors > scripts: Build validation > perftune.py: remove rx_queue_count from mode condition. > memory: add attributes to memalign for compatibility with glibc 2.35 > condition-variable: Fix timeout "when" potentially not killing timer > Merge "tests: perf: measure coroutines performance" from Benny > Merge: Refine COUNTER metrics > Revert "Merge: Refine COUNTER metrics" > reactor: document intentional bitwise-on-bool op in smp_pollfn::poll() > Merge: Refine COUNTER metrics > SLES: additionally check irqbalance.service under /usr/lib > rpc_tester: job_cpu: mark virtual methods override Changes to Scylla also included in this merge: 1. api: Don't export DERIVEs (Pavel Emelyanov) Newer seastar doesn't have DERIVE metrics, but does have REAL_COUNTER one. Teach the collectd getter the change. (for the record: I don't understand how this endpoint works at all, there's a HISTOGRAM metrics out there that would be attempted to get exposed with the v.ui() call which's totally wrong) 2. test: use linux_perf_events.{cc,hh} from Seastar Seastar now has linux_perf_events.{cc,hh}. Remove Scylla's version of the same files and use Seastar's. Without this change, Scylla fails to compile when some source files end up including both versions and seeing double definitions. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-11 14:46:30 +02:00
Piotr Sarna	209c2f5d99	sstables: define generation_type for sstables No functional changes intended - this series is quite verbose, but after it's in, it should be considerably easier to change the type of SSTable generations to something else - e.g. a string or timeUUID. Closes #10533	2022-05-11 14:46:30 +02:00
Beni Peled	3abe4a2696	Adjust scripts/pull_github_pr.sh to check tests status Closes #10263 Closes #10264	2022-05-11 14:46:30 +02:00
Botond Dénes	7501a075bd	sstables/index_reader: push down eof() check to advance_to(index_bound&, dht::ring_position_view) Commit `e8f3d7dd13` added eof() checks to public partition-level advance_to() methods, to ensure we do not attempt to re-read the last page of the index when at eof(). It was noted however that this check would be safer in advance_to(index_bound&, dht::ring_position_view) because that is the method that all these higher-level methods end up calling. Placing the check there would guarantee safety for all such operations. This patch does exactly that: it pushes down the check to said method. One change needed for this to work is to check eof on the bound that is currently advanced, instead of unconditionally checking the lower bound. Closes #10531	2022-05-11 14:46:30 +02:00
Asias He	77b1db475c	locator: Do not enforce public ip address for broadcast_rpc_address Reported by Felipe Cardeneti: - Create a 2-node Scylla cluster w/ Ec2MultiRegionSnitch - Check system.peers table Scylla (uses public address) ``` cqlsh> select peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers; peer \| data_center \| host_id \| preferred_ip \| rack \| rpc_address \| schema_version ---------------+-------------+--------------------------------------+---------------+------+---------------+-------------------------------------- 18.216.98.219 \| us-east-2 \| d9443741-a12e-4bbb-91ce-9931cece589c \| 172.31.43.122 \| 2c \| 18.216.98.219 \| 95c3fca5-c463-3aba-98c6-1c0b3fac5b58 (1 rows) ``` Cassandra (uses local address): ``` cqlsh> SELECT peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers; peer \| data_center \| host_id \| preferred_ip \| rack \| rpc_address \| schema_version ---------------+-------------+--------------------------------------+---------------+------------+---------------+-------------------------------------- 52.15.104.255 \| us-east-2 \| 42c0b717-775f-4998-a420-0388fe8b4e70 \| 172.31.42.126 \| us-east-2c \| 172.31.42.126 \| 2207c2a9-f598-3971-986b-2926e09e239d (1 rows) ``` Config diff: ``` cassandra.yaml:rpc_address: 0.0.0.0 cassandra.yaml:broadcast_rpc_address: 172.31.42.126 /etc/scylla/scylla.yaml:broadcast_rpc_address: 172.31.42.126 /etc/scylla/scylla.yaml:rpc_address: 0.0.0.0 ``` After this patch, if broadcast_rpc_address is unset, Ec2MultiRegionSnitch will use the public ip address to set broadcast_rpc_address. If broadcast_rpc_address is set, Ec2MultiRegionSnitch will not modify it. Fixes #10236 Closes #10519	2022-05-11 14:46:30 +02:00
Tomasz Grabiec	f703e8ded5	Merge 'New failure detector for Raft' from Kamil Braun We introduce a new service that performs failure detection by periodically pinging endpoints. The set of pinged endpoints can be dynamically extended and shrunk. To learn about liveness of endpoints, user of the service registers a listener and chooses a threshold - a duration of time which has to pass since the last successful ping in order to mark an endpoint as dead. When an endpoint responds it's immediately marked as alive. Endpoints are identified using abstract integer identifiers. The method of performing a ping is a dependency of the service provided by the user through the `pinger` interface. The implementation of `pinger` is responsible for translating the abstract endpoint IDs to 'real' addresses. For example, production implementation may map endpoint IDs to IP addresses and use TCP/IP to perform the ping, while a test/simulation implementation may use a simulated network that also operates on abstract identifiers. Similarly, the method of measuring time is a dependency provided by the user using the `clock` interface. The service operates on abstract time intervals and timepoints. So, for example, in a production implementation time can be measured using a stopwatch, while in test/simulation we can use a logical clock. The service distributes work across different shards. When an endpoint is added to the set of detected endpoints, the service will choose a shard with the smallest number of workers and create a worker that is responsible for periodically pinging this endpoint on that shard and sending notifications to listeners. We modify the randomized nemesis test to use the new service. The service is sharded, but for simplicity of implementation in the test we implement rpcs and sleeps by routing the requests to shard 0, where logical timers and network live. rpcs are using the existing simulated network and clock using the existing logical timers. 
We also integrate the service with production code. There, `pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo message requires the node's gossip generation number. We handle this by embedding the pinger implementation inside `gossiper`, and making `gossiper` update the generation number (cached inside the pinger class) periodically. Production `clock` is a simple implementation which uses `std::chrono::steady_clock` and `seastar::sleep_until` underneath. Translating `steady_clock` durations to `direct_fd::clock` durations happens by taking the number of ticks. We connect the group 0 raft server rpc implementation to the new service, so that when servers are added or removed from the group 0 configuration, corresponding endpoints are added to or removed from the direct failure detector service. Thus the set of detected endpoints will be equal to the group 0 configuration. On each shard, we register a listener for the service. The listener maintains a set of live addresses; on mark_alive it adds a server to the set and on mark_dead it removes it. This set is then used to implement the `raft::failure_detector` interface, consisting of `is_alive()` function, which simply checks set membership. --- v6: - remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). 
The diff is easy to read: `f617aeca62..d4b225437c` v5: - renamed `rpc` to `pinger` - replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates` - replaced `unsigned` with `shard_id` - fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior) - improve `_num_workers` assertions - signal `_alive_start_changed` only when `_alive_start` indeed changed - renamed `{_marked}_alive_start` to `{_marked}_alive_start_index` v4: - rearrange ping_fiber(). Remove the loop at the end of the big `while` which was timing out listeners (after the sleep). Instead: - rely on the loop before the sleep for timing out listeners - before calling ping(), check if there is a timed out listener, if so abandon the ping, immediately proceed to the timing-out-listeners loop, and then immediately proceed to the next iteration of the big `while` (without sleeping) - inline send_mark_dead() and send_mark_alive(); each was used in exactly one place after the rearrangement - when marking alive, instead of repeatedly doing `--_alive_start` and signalling the condition variable, just do `_alive_start = 0` and signal the condition variable once - fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was `_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`. Indeed, we want to wait for the stopping code (`destroy_worker()`) to set `_alive_start = _fd._listeners.size()` before `notify_fiber()` finishes so `notify_fiber()` can send the final `mark_dead` notifications for this endpoint. 
There was a race before where `notify_fiber()` could finish before it sent those notifications (because it finished as soon as it noticed `_as.abort_requested()`) - fix some waits in the unit test; they depended on particular ordering of tasks by the Scylla reactor, the test could sometimes hang in debug mode which randomizes task order - fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an exceptional discarded future in some cases v3: - fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard - invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers - add a unit test (second patch) v2: - rename `direct_fd` namespace to `direct_failure_detector` - move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh} - cleaned license comments - removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead: - _listeners is now a boost::container::flat_multimap (previously it was std::multimap) - _alive_start is no longer an iterator to _listeners, but an index (size_t) - _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed - ping_fiber() signals _alive_start_changed when it changes _alive_start - notify_fiber() waits on _alive_start_changed. 
When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start - replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still) - _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`) - `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes - same for `failure_detector::impl::remove_endpoint` - `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen - comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers) - remove unnecessary `if (_as.abort_requested())` in `ping_fiber()` - in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately. 
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start." - `register_listener` now takes a `listener&`, not `listener` - at `register_listener` comment why we allow different thresholds (second to last paragraph) - at `register_listener` mention that listeners can be registered on any shard (last paragraph) - add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`. - replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution: - the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once - when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures. - when the add/remove operation fails in update_endpoint_fiber(), we sleep for `10*ping_period` before retrying.
Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc - commented that `clock::sleep_until` should signalize aborts using `sleep_aborted` - `clock::now()` is `noexcept` - `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item - in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item - randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification) - message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called) - rebase Closes #10437 * github.com:scylladb/scylla: service: raft: remove `raft_gossip_failure_detector` service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness service: raft: add/remove direct failure detector endpoints on group 0 configuration changes main: start direct failure detector service messaging_service: abortable version of `send_gossip_echo` message: abortable version of `send_message` test: raft: randomized_nemesis_test: remove old failure_detector test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector` test: raft: randomized_nemesis_test: ping all shards on each tick test: unit test for new failure detector service direct_failure_detector: introduce new failure detector service	2022-05-11 14:46:27 +02:00
Botond Dénes	a4ed7186c0	scylla-gdb.py: make chunked_vector a proper container wrapper Add missing __len__() and __iter__() methods.	2022-05-11 15:11:33 +03:00
Botond Dénes	2f785ee8f0	scylla-gdb.py: make small_vector a proper container wrapper Add missing __len__() and __iter__() methods.	2022-05-11 15:11:12 +03:00
Botond Dénes	f66077da4b	scylla-gdb.py: add sstring container wrapper	2022-05-11 15:10:49 +03:00
Botond Dénes	afa6cfb42a	scylla-gdb.py: add chunked_managed_vector container wrapper	2022-05-11 15:10:39 +03:00
Botond Dénes	8048b15c57	scylla-gdb.py: add managed_vector container wrapper	2022-05-11 15:10:17 +03:00
Botond Dénes	2b40ed663e	scylla-gdb.py: std_variant: add workaround for clang template bug Clang and GDB don't see eye to eye on the template arguments of std::variants. Executables generated by clang are known to yield 0 template arguments when one queries them via the GDB python API. This patch adds a workaround to the std_variant wrapper, allowing a caller who knows the type to brute-force getting the variant member with the known type.	2022-05-11 15:10:11 +03:00
Botond Dénes	f975841702	scylla-gdb.py: add bplus_tree container wrapper	2022-05-11 15:10:03 +03:00
Benny Halevy	4a40e34577	docs: debugging.md: update instructions regarding backtrace.scylladb.com Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-11 10:23:09 +03:00
Benny Halevy	97b002e13e	docs: debugging.md: add a sample gdbinit file This gdbinit contains recommended settings commonly useful for debugging scylla core dumps. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-11 10:23:08 +03:00
Botond Dénes	f725779957	docs/debugging.md: adjust instructions for using the toolchain The attached volume doesn't need to be relabeled anymore (`:z` not needed at the end of the volume attach instructions). This also allows dropping the `sudo` from the invocation.	2022-05-11 08:25:29 +03:00
Botond Dénes	264db30ca5	docs/debugging.md: drop section about handling binaries from non-relocatable packages All our releases ship with relocatable packages now. This section is obsolete (thankfully).	2022-05-11 08:20:14 +03:00
Michael Livshin	1e690e6773	tests: remove obsolete utility functions Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-10 22:10:40 +03:00
Michael Livshin	864882253a	tests: less trivial flat_reader_assertions{,_v2} conversions Dealing with the handful of tests that check range tombstones in interesting ways and need more than search-and-replace. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-10 22:10:40 +03:00
Michael Livshin	3cc2343775	tests: trivial flat_reader_assertions{,_v2} conversions (Which entails temporary cut-and-pasting some utility functions) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-10 22:10:40 +03:00
Michael Livshin	51cf84e8c9	flat_mutation_reader_assertions_v2: improve range tombstone support * Track the active range tombstone. * Add `may_produce_tombstones()`. * Flesh out `produces_row_with_key()`. * Add more trace logs. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-10 22:10:40 +03:00
Nadav Har'El	2c39c4c284	Merge 'Handle errors during snapshot' from Benny Halevy This series refactors `table::snapshot` and moves the responsibility to flush the table before taking the snapshot to the caller. `flush_on_all` and `snapshot_on_all` helpers are added to replica::database (by making it a peering_sharded_service) and upper layers, including api and snapshot-ctl, now call them instead of calling cf.snapshot directly. With that, errors are handled in table::snapshot and propagated back to the callers. Failure to allocate the `snapshot_manager` object is fatal, similar to failure to allocate a continuation, since we can't coordinate across the shards without it. Test: unit(dev), rest_api(debug) Fixes #10500 Closes #10513 * github.com:scylladb/scylla: table: snapshot: handle errors table: snapshot: get rid of skip_flush param database: truncate: skip flush when taking snapshot test: rest_api: storage_service: verify_snapshot_details: add truncate database: snapshot_on_all: flush before snapshot if needed table: make snapshot method private database: add snapshot_on_all snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot api: storage_service: increase visibility of snapshot ops in the log api: storage_service: coroutinize take_snapshot and del_snapshot api: storage_service: take_snapshot: improve api help messages test: rest_api: storage_service: add test_storage_service_snapshot database: add flush_on_all variants test: rest_api: add test_storage_service_flush	2022-05-10 10:52:10 +03:00
Benny Halevy	1d39d803af	table: snapshot: handle errors Turn table::snapshot into a coroutine, catch exceptions, and return them to the caller. Make sure that coordination across shards would not break even if any of the shards hits an error, by always signaling semaphores other shards wait on. All errors except for failing to allocate the snapshot_manager objects are caught and propagated back. Failing to allocate the snapshot_manager is fatal, similar to failing to allocate a continuation, since we can't coordinate across the shards without it, so we abort if that fails. Fixes #10500 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	9e69089306	table: snapshot: get rid of skip_flush param Now that all callers flush on their own before calling table::snapshot. Refs #10500 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	31881273a1	database: truncate: skip flush when taking snapshot database::truncate already flushes the table on auto_snapshot so there is never a reason to flush it again in table::snapshot. Note that cf.can_flush() is false only if memtables are empty, so there is nothing to flush, or there is no seal_immediate_fn, and then table::snapshot wouldn't be able to flush either. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	fc79787863	test: rest_api: storage_service: verify_snapshot_details: add truncate Truncate the test table and verify that the 'live' snapshot size is now non-zero. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	46c950fb31	database: snapshot_on_all: flush before snapshot if needed flush_on_all shards before taking the snapshot if !skip_flush so we can get rid of flushing in table::snapshot. Refs #10500 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	33bd52921e	table: make snapshot method private Only callable by database. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	e1d58d4422	database: add snapshot_on_all And move the logic from snapshot-ctl down to the replica::database layer. A following patch will move the flush phase from the replica::table::snapshot layer out to the caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:45:14 +03:00
Benny Halevy	aa127a2dbb	snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema Detecting a secondary index by checking for a dot in the table name is wrong as tables generated by Alternator may contain a dot in their name. Instead detect both materialized views and secondary indexes using the schema()->is_view() method. Fixes #10526 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:44:52 +03:00
Benny Halevy	1fbcdbd2e8	snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot There is no functional change in this patch. Only refactoring of the code. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:16:39 +03:00
Benny Halevy	01b1e54e22	api: storage_service: increase visibility of snapshot ops in the log snapshot operations over the api are rare but they contain significant state on disk in the form of sstables hard-linked to the snapshot directories. Also, we've seen snapshot operations hang in the field, requiring a core dump to analyse the issue, while there were no records in the log indicating when previous snapshot operations were last executed. This change promotes logging to info level when take_snapshot and del_snapshot start, and logs errors if they fail. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:15:46 +03:00
Benny Halevy	b9d972d029	api: storage_service: coroutinize take_snapshot and del_snapshot Before making any further changes in them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:02:52 +03:00
Benny Halevy	10b86ee5bd	api: storage_service: take_snapshot: improve api help messages Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 10:02:47 +03:00
Benny Halevy	e95ecbbea6	test: rest_api: storage_service: add test_storage_service_snapshot Test the snapshot operations via the rest api. Added test/rest_api/rest_util.py with new_test_snapshot that creates a new test snapshot and automagically deletes it when the `with` block is exited, similar to new_test_keyspace and new_test_table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 09:56:44 +03:00
Benny Halevy	5b4eb44795	database: add flush_on_all variants Used by the api layer. Will be used in a later patch to flush on all shards before taking a snapshot. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 09:56:44 +03:00
Benny Halevy	05c7f4b832	test: rest_api: add test_storage_service_flush Add a basic rest_api test for keyspace_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-10 09:56:44 +03:00
Nadav Har'El	1c6163d51f	Merge 'cql3: expr: allow bind markers in collection literals' from Michał Sala Allowing bind markers in collection literals is a change which causes minor differences in behavior between Scylla and Cassandra. Despite such an undesirable effect, I think allowing them is a good idea because it makes [refactoring work made by cvybhu](https://github.com/scylladb/scylla/pull/10409) easier - `469d03f8c2`. Also, making Scylla accept a superset of valid Cassandra cql expressions does not make us less compatible (maybe apart from test suite compatibility). Closes #10457 * github.com:scylladb/scylla: test/boost: cql_query_test: allow bound variables in test_list_of_tuples_with_bound_var test/boost: cql_query_test: test bound variables in collection literals cql3: expr: do not allow unset values inside collections cql3: expr: prepare_expr: allow bind markers in collection literals	2022-05-09 19:15:22 +03:00
Botond Dénes	fd27fbfe64	Merge "Add user types carrier helper" from Pavel Emelyanov " There's a cql_type_parser::parse() method that needs to get user types for a keyspace by its name. For this it uses the global storage proxy instance as a place to get database from. This set introduces an abstract user_types_storage helper object that's responsible for providing the user types to the caller. This helper, in turn, is provided to the parse() method by the database itself or by the schema_ctxt object that needs parse() to unfreeze schemas and doesn't have database at those times. This removes one more get_storage_proxy() call. " * 'br-user-types-storage' of https://github.com/xemul/scylla: cql_type_parser: Require user_types_storage& in parse() schame_tables: Add db/ctxt args here and there user_types: Carry storage on database and schema_ctxt data_dictionary: Introduce user types storage	2022-05-09 17:38:52 +03:00
Nadav Har'El	ca700bf417	scripts/pull_github_pr.sh: clean up after failed cherry-pick When pull_github_pr.sh uses git cherry-pick to merge a single-patch pull request, this cherry-pick can fail. A typical example is trying to merge a patch that has actually already been merged in the past, so cherry-pick reports that the patch, after conflict resolution, is empty. When cherry-pick fails, it leaves the working directory in an annoying mid-cherry-pick state, and today the user needs to manually call "git cherry-pick --abort" to return to the normal state. The script should do it automatically - so this is what we do in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-05-09 17:23:34 +03:00
Pavel Emelyanov	598ce8111d	repair: Handle discarded stopping future When repair_meta stops it does so in the background and reports back a shared future into whose shared promise peer it resolves that background activity. There's a shorter way to forward a future result into another, even shared, promise. And this method doesn't need to discard a future. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/253 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-09 17:23:12 +03:00
Pavel Emelyanov	3b4af86ad9	proxy (and suddenly redis): Don't check latency_counter.is_start() The lcs at those places are explicitly start()ed beforehand. The is_start() check is necessary when using the latency_counter with a histogram that may or may not start the counter (this is the case in several class table methods). tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-09 17:20:41 +03:00
Raphael S. Carvalho	48e3117ebc	compaction: move propagate_replacement() into private namespace propagate_replacement() is an internal function that shouldn't be in the public interface. No one besides a unit test for incremental compaction needs it. In the future, I want to revisit the incremental compaction unit test to stop using it and only rely on public interfaces. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220506171647.81063-1-raphaelsc@scylladb.com>	2022-05-09 16:49:50 +03:00
Kamil Braun	0f7a1179c8	service: raft: remove `raft_gossip_failure_detector` It's no longer used, having been replaced by the direct_failure_detector listener.	2022-05-09 15:31:19 +02:00
Kamil Braun	295aec2633	service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness On each shard, we register a listener for the new direct failure detector service. The listener maintains a set of live addresses; on mark_alive it adds a server to the set and on mark_dead it removes it. This set is then used to implement the `raft::failure_detector` interface, consisting of the `is_alive()` function, which simply checks set membership. There is some complexity in between, because we need to translate direct_failure_detector endpoint_ids to inet_addresses and raft::server_ids to inet_addresses, but all building blocks are already there.	2022-05-09 15:31:19 +02:00
Kamil Braun	7e4bb68061	service: raft: add/remove direct failure detector endpoints on group 0 configuration changes We connect the group 0 raft server rpc implementation to the new direct failure detector service, so that when servers are added to or removed from the group 0 configuration, corresponding endpoints are added to or removed from the direct failure detector service. Thus the set of detected endpoints will be equal to the group 0 configuration. This causes the failure detector service to start pinging endpoints, but no listeners are registered yet. The following commit changes that.	2022-05-09 15:31:19 +02:00
Kamil Braun	38f65e5a2e	main: start direct failure detector service We add the new direct failure detector to the list of services started in the Scylla process. To start the service, we need an implementation of `pinger` and `clock`. `pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo message requires the node's gossip generation number. We handle this by embedding the pinger implementation inside `gossiper`, and making `gossiper` update the generation number (cached inside the pinger class) periodically. `clock` is a simple implementation which uses `std::chrono::steady_clock` and `seastar::sleep_until` underneath. Translating `steady_clock` durations to `direct_failure_detector::clock` durations happens by taking the number of ticks. The service is currently not used, just initialized; no endpoints are added and no listeners are registered yet, but the following commits change that.	2022-05-09 13:14:42 +02:00
Kamil Braun	9551256e81	messaging_service: abortable version of `send_gossip_echo` Use the new `send_message_abortable` function to implement an abortable version of `send_gossip_echo`. These echo messages will be used for direct failure detection.	2022-05-09 13:14:41 +02:00
Kamil Braun	f2548fc3fa	message: abortable version of `send_message` I want to be able to timeout `send_message`, but not through the existing `send_message_timeout` API which forces me to use a particular clock/duration/timepoint type. Introduce a more general `send_message_abortable` API which gets an `abort_source&`, subscribes to it, and uses the `rpc::cancellable` interface to cancel the RPC on abort. The function is 90% copy-pasta from `send_message{_timeout}`, only the abort part is new.	2022-05-09 13:14:41 +02:00
Kamil Braun	c15f3a9698	test: raft: randomized_nemesis_test: remove old failure_detector No longer used. Split from the previous commit for a better diff.	2022-05-09 13:14:41 +02:00
Kamil Braun	915d329f1f	test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector` Until now the nemesis test used its own failure detector implementation which used one-way heartbeats. Switch it to use the new direct failure detection service, which will also be used in production code. Integrating it does require some work however as we need to implement the `pinger` and `clock` interfaces for the failure detector. The service is sharded, but for simplicity of implementation we implement rpcs and sleeps by routing the requests to shard 0, where logical timers and network live.	2022-05-09 13:14:41 +02:00
Kamil Braun	e5fc0681d9	test: raft: randomized_nemesis_test: ping all shards on each tick Right now the test is running entirely on shard 0, but we want to introduce a sharded service to the test. The initial naive attempt of doing that failed because the test would time out (reach the tick limit) before any work distributed to other shards could even start. The solution in this commit solves that by synchronizing the shards on each tick. When the test is run with smp=1, the behavior is as before.	2022-05-09 13:14:41 +02:00
Kamil Braun	e4f85cf425	test: unit test for new failure detector service	2022-05-09 13:14:41 +02:00
Kamil Braun	666e5a414d	direct_failure_detector: introduce new failure detector service The new service performs failure detection by periodically pinging endpoints. The set of pinged endpoints can be dynamically extended and shrunk. To learn about liveness of endpoints, the user of the service registers a listener and chooses a threshold - a duration of time which has to pass since the last successful ping in order to mark an endpoint as dead. When an endpoint responds it's immediately marked as alive. Endpoints are identified using abstract integer identifiers. The method of performing a ping is a dependency of the service provided by the user through the `pinger` interface. The implementation of `pinger` is responsible for translating the abstract endpoint IDs to 'real' addresses. For example, a production implementation may map endpoint IDs to IP addresses and use TCP/IP to perform the ping, while a test/simulation implementation may use a simulated network that also operates on abstract identifiers. Similarly, the method of measuring time is a dependency provided by the user using the `clock` interface. The service operates on abstract time intervals and timepoints. So, for example, in a production implementation time can be measured using a stopwatch, while in test/simulation we can use a logical clock. The service distributes work across different shards. When an endpoint is added to the set of detected endpoints, the service will choose a shard with the smallest number of workers and create a worker that is responsible for periodically pinging this endpoint on that shard and sending notifications to listeners. Endpoints can be added or removed only through the shard 0 instance of the service and shard 0 is responsible for coordinating the endpoint workers. Listeners can be registered on any shard.	2022-05-09 13:14:40 +02:00
David Garcia	3e0f81180e	docs: disable link checker Closes #10434	2022-05-09 12:45:28 +02:00
Avi Kivity	81af9342f1	Merge "Simplify gossiper state map API" from Pavel E " There's an endpoint->state map member of the gossiper class. First ugly thing about it is that the member is public. Next, there's a whole bunch of helpers around that map that export various bits of information from it. All of those helpers reshard to shard-0 to read from the state map, ignoring the fact that the map is replicated on all shards internally. Also, some of those helpers effectively duplicate each other for no real gain. Finally, most of them are specific to api/ code, and open-coding them often makes api/ handlers shorter and simpler. This set removes the unused, api-only or trivial state map accessors and marks the state map itself private (underscore prefix included). tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/233/ " * 'br-gossiper-sanitize-api-2' of https://github.com/xemul/scylla: gossiper: Add underscores to new private members code: Indentation fix after previous patch gossiper, code: Relax get_up/down/all_counters() helpers api: Fix indentation after previous patch gossiper, api: Remove get_arrival_samples() gossiper, api: Remove get/set phi convict threshold helpers gossiper, api: Move get_simple_states() into API code gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload gossiper, api: Remove get_endpoint_state() helpers gossiper: Make state and locks maps private gossiper: Remove dead code	2022-05-08 22:56:23 +03:00
Avi Kivity	94f677b790	Merge 'sstables/index_reader: short-circuit fast-forward-to when at EOF' from Botond Dénes Attempting to call advance_to() on the index, after it is positioned at EOF, can result in an assert failure, because the operation results in an attempt to move backwards in the index-file (to read the last index page, which was already read). This only happens if the index cache entry belonging to the last index page is evicted, otherwise the advance operation just looks up said entry and returns it. To prevent this, we add an early return conditioned on eof() to all the partition-level advance-to methods. A regression unit test reproducing the above-described crash is also added. Fixes: #10403 Closes #10491 * github.com:scylladb/scylla: sstables/index_reader: short-circuit fast-forward-to when at EOF test/lib/random_schema: add a simpler overload for fixed partition count	2022-05-08 14:17:40 +03:00
Juliusz Stasiewicz	603dd72f9e	CQL: Replace assert by exception on invalid auth opcode One user observed this assertion fail, but it's an extremely rare event. The root cause - interlacing of processing STARTUP and OPTIONS messages - is still there, but now it's harmless enough to leave it as is. Fixes #10487 Closes #10503	2022-05-08 11:33:58 +03:00
Michał Chojnowski	fb1a9e97c9	cql3: restrictions: statement_restrictions: pass arguments to std::bind_front by reference Fix an accidental copy of query_options in range_or_slice_eq_null. Closes #10511	2022-05-08 11:32:53 +03:00
Avi Kivity	1ecb87b7a8	Merge 'Harden table truncate' from Benny Halevy This series fixes a few issues on the table truncate path: - "memtable_list: safely futurize clear_and_add" - reinstates an async version of table::clear_and_add, just safe against #10421 - a unit test reproducing #10421 was added to make sure the new version is indeed safe. - "table: clear: serialize with ongoing flush" fixes #10423 - a unit test reproducing #10423 was added Fixes #10281 Fixes #10423 Test: unit(dev), database_test. test_truncate_without_snapshot_during_{writes,flushes} (debug) Closes #10424 * github.com:scylladb/scylla: test: database_test: add test_truncate_without_snapshot_during_writes memtable_list: safely futurize clear_and_add table: clear: serialize with ongoing flush	2022-05-08 11:30:21 +03:00
Avi Kivity	287c01ab4d	Merge ' sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()' from Michał Chojnowski primitive_consumer::read_bytes() destroys and creates a vector for every value it reads. This happens for every cell. We can save a bit of work by reusing the vector. Closes #10512 * github.com:scylladb/scylla: sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes() utils: fragmented_temporary_buffer: add release()	2022-05-08 11:26:31 +03:00
Raphael S. Carvalho	8e99d3912e	compaction: LCS: don't write to disengaged optional on compaction completion Dtest triggers the problem by: 1) creating table with LCS 2) disabling regular compaction 3) writing a few sstables 4) running maintenance compaction, e.g. cleanup Once the maintenance compaction completes, disengaged optional _last_compacted_keys triggers an exception in notify_completion(). _last_compacted_keys is used by regular compaction for its round-robin file picking policy. It stores the last compacted key for each level, meaning it's irrelevant for any other compaction type. Regular compaction is responsible for initializing it when it runs for the first time to pick files. But with it disabled, notify_completion() will find it uninitialized, therefore resulting in bad_optional_access. To fix this, the procedure is skipped if _last_compacted_keys is disengaged. Regular compaction, once re-enabled, will be able to fill _last_compacted_keys by looking at metadata of the files. compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy] now passes. Fixes #10378. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10508	2022-05-08 11:23:13 +03:00
Raphael S. Carvalho	5682393693	compaction: Fix use-after-move when retrying maintenance compaction SSTable was moved into descriptor, so on failure, it couldn't be used without resulting in a segfault. Fix it by not moving sst, and changing signature to make it explicit we don't want to move the content. Fixes #10505. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10506	2022-05-08 11:16:55 +03:00
Michał Chojnowski	ddc535a4a2	sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes() read_bytes destroys and creates a vector for every value it reads. This happens for every cell. We can save a bit of work by reusing the vector.	2022-05-07 13:04:16 +02:00
Michał Chojnowski	8cfbe9c9c1	utils: fragmented_temporary_buffer: add release() Add a release() method to fragmented_temporary_buffer. This method releases the underlying vector to allow for its reuse.	2022-05-07 13:04:16 +02:00
Pavel Emelyanov	9d364f19dc	gossiper: Add underscores to new private members The state map and guarding locks were moved to private and now should have a _ prefix Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 11:32:03 +03:00
Pavel Emelyanov	334d3434e7	code: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	5ac28a29d3	gossiper, code: Relax get_up/down/all_counters() helpers These helpers count elements in the endpoint state map. It makes sense to keep them in gossiper API, but it's worth removing the wrappers that do invoke_on(0). This makes code shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	5f53799ffb	api: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	0ef33b71ba	gossiper, api: Remove get_arrival_samples() It's empty too, but the API-side conversion probably has some value for the future, so keep it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	37d392c772	gossiper, api: Remove get/set phi convict threshold helpers These are empty anyway. API caller can place return stubs itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	ad786d6b4d	gossiper, api: Move get_simple_states() into API code The API method in question just tries to scan the state map. There's no need in doing invoke_on(0) and in a separate helper method in gossiper, the creation of the json return value can happen in the API handler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	49dd6b5371	gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload The method helps update endpoint state in handle_major_state_change by returning a copy of an endpoint state that's kept while the map's entry is being replaced with the new state. It can be replaced with shorter code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	f278d84cfe	gossiper, api: Remove get_endpoint_state() helpers There are two of them -- one to do invoke_on(0) the other one to get the needed data. The former one is not needed -- the scanned endpoint state map is replicated across shards and is the same everywhere. The latter is not needed, because there's only one user of it -- the API -- which can work with the existing gossiper API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	0aea43a245	gossiper: Make state and locks maps private Locks are not needed outside gossiper, state map is sometimes read from, but there's a const getter for such cases. Both methods now deserve the underbar prefix, but it doesn't come with this short patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Pavel Emelyanov	690b21aa4d	gossiper: Remove dead code Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-06 10:34:48 +03:00
Botond Dénes	9623589c77	Merge 'Futurize data_read_resolver::resolve and to_data_query_result' from Benny Halevy This series futurizes two synchronous functions used for data reconciliation: `data_read_resolver::resolve` and `to_data_query_result` and does so by introducing lower-level asynchronous infrastructure: `mutation_partition_view::accept_gently`, `frozen_mutation::unfreeze_gently` and `frozen_mutation::consume_gently`, and `mutation::consume_gently`. This trades some cycles on this cold path to prevent known reactor stalls. Fixes #2361 Fixes #10038 Closes #10482 * github.com:scylladb/scylla: mutation: add consume_gently frozen_mutation: add consume_gently query: coroutinize to_data_query_result frozen_mutation: add unfreeze_gently mutation_partition_view: add accept_gently methods storage_proxy: futurize data_read_resolver::resolve	2022-05-06 10:23:02 +03:00
Botond Dénes	e8f3d7dd13	sstables/index_reader: short-circuit fast-forward-to when at EOF Attempting to call advance_to() on the index, after it is positioned at EOF, can result in an assert failure, because the operation results in an attempt to move backwards in the index-file (to read the last index page, which was already read). This only happens if the index cache entry belonging to the last index page is evicted, otherwise the advance operation just looks-up said entry and returns it. To prevent this, we add an early return conditioned on eof() to all the partition-level advance-to methods. A regression unit test reproducing the above described crash is also added.	2022-05-05 14:42:37 +03:00
Botond Dénes	98f3d516a2	test/lib/random_schema: add a simpler overload for fixed partition count Some tests want to generate a fixed amount of random partitions, make their life easier.	2022-05-05 14:33:37 +03:00
Piotr Sarna	eeec502aee	Merge 'gms: feature_service: reduce boilerplate to add a cluster feature' from Avi Kivity Currently, adding a cluster feature requires editing several files and repeating the new feature name several times. This series reduces the boilerplate to a single line (for non-experimental features), and perhaps three for experimental features. Closes #10488 * github.com:scylladb/scylla: gms: feature_service: remove variable/helper function duplication gms: feature: make `operator bool` implicit gms: feature_service: remove feature variable duplication in enable() gms: feature_service: remove feature variable declaration/definition duplication gms: features: de-quadruplicate active feature names gms: features: de-quadruplicate deprecated feature names gms: feature_service: avoid duplicating feature names when listing known features	2022-05-05 12:43:15 +02:00
Benny Halevy	ca1b616092	mutation: add consume_gently Allow yielding when consuming a mutation, and use in to_data_query_result. Fixes #10038 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-05 13:32:25 +03:00
Benny Halevy	09fb2c983a	frozen_mutation: add consume_gently Allow yielding when consuming a frozen_mutation, and use in to_data_query_result. Refs #10038 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-05 13:32:25 +03:00
Benny Halevy	c9612855c7	query: coroutinize to_data_query_result Reduce stalls by maybe yielding in-between partitions, and by awaiting unfreeze_gently where possible. Refs #10038 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-05 13:32:25 +03:00
Benny Halevy	e12454f175	frozen_mutation: add unfreeze_gently And use in data_read_resolver::resolve Fixes #2361 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-05 13:32:25 +03:00
Benny Halevy	4963eb73b5	mutation_partition_view: add accept_gently methods Allow yielding when consuming mutation_partition_view. To be used in later patches by a new unfreeze_gently function and frozen_mutation::consume. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-05 13:32:25 +03:00
Benny Halevy	f02c25f2c3	storage_proxy: futurize data_read_resolver::resolve Allow yielding in data_read_resolver::resolve to prevent reactor stalls. TODO: unfreeze_gently, to prevent stalls due to large partitions. Refs #2361 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-05 13:32:25 +03:00
Pavel Emelyanov	0f698910e8	cql_type_parser: Require user_types_storage& in parse() Right now to get user types the method in question gets global proxy instance to get database from it and then peek a keyspace, its metadata and, finally, the user types. There's also a safety check for proxy not being initialized, which happens in tests. Instead of messing with the proxy, the parse() method now accepts the user_types_storage reference from which it gets the types. All the callers already have the needed storage at hand -- in most of the cases it's one shared between the database and schema_ctxt. In case of tests it's a dummy storage, in case of schema-loader it's its local one. The get_column_mapping() is special -- it doesn't expect any user-types to be parsed and passes "" keyspace into it, nor does it have db/ctxt to get types storage from, so it can safely use the dummy one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-05 13:11:18 +03:00
Pavel Emelyanov	44f38d4de2	schema_tables: Add db/ctxt args here and there This is to have them in places that call cql_type_parser::parse. Pure churn reduction for the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-05 13:11:18 +03:00
Pavel Emelyanov	2104d90dd0	user_types: Carry storage on database and schema_ctxt The user types storage is needed in cql_type_parser::parse which is in turn called with either replica::database or schema_ctxt at hand. To facilitate the former case replica::database has its own user types storage created in database constructor. The latter case is a bit trickier. In many cases the ctxt is created as a temporary object and the database is available at those places. Also the ctxt object lives on the schema_registry instance which doesn't have database nearby. However, that ctxt lifetime is the same as the registry instance one and when it's created there's a database at hand (it's the database constructor that calls schema_registry.init() passing "this" into it). Thus, the solution is to make database's user types storage be a shared pointer that's shared between database itself and all the ctxts out there including the one that lives on schema_registry instance. When database goes away it .deactivate()s its user types storage so that any ctxts that may share it stay on the safe side and don't use database after free. This part will go away when the schema_registry will be deglobalized. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-05 13:06:04 +03:00
Pavel Emelyanov	860dbab474	data_dictionary: Introduce user types storage The interface in question will be used by cql type parser to get user types. There are already three possible implementations of it: - dummy, when no user types are in use (e.g. tests) - schema-loader one, which gets user types from keyspaces that are collected on its implementation of the database - replica::database one, which does the same, but uses the real database instance and that will be shared between schema_ctxts Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-05 09:44:26 +03:00
Avi Kivity	19ab3edd77	gms: feature_service: remove variable/helper function duplication Each feature has a private variable and a public accessor. Since the accessor effectively makes the variable public, avoid the intermediary and make the variable public directly. To ease mechanical translation, the variable name is chosen as the function name (without the cluster_supports_ prefix). References throughout the codebase are adjusted.	2022-05-04 18:59:56 +03:00
Avi Kivity	435b46cd52	gms: feature: make `operator bool` implicit Features are usually used as booleans, so allowing them to implicitly decay to bool is not a mistake. In fact a bunch of helper functions exist to cast feature variables to bool. Prepare to reduce this boilerplate by allowing automatic conversion to bool.	2022-05-04 18:58:24 +03:00
Avi Kivity	81ad595f61	gms: feature_service: remove feature variable duplication in enable() We have a list of all feature variables in enable(), but the list is also available programmatically in _registered_features, so use that instead.	2022-05-04 18:44:28 +03:00
Avi Kivity	f0f4759163	gms: feature_service: remove feature variable declaration/definition duplication Feature variables are both declared and defined. Make that happen in one place, reducing boilerplate.	2022-05-04 18:24:56 +03:00
Avi Kivity	0f95258577	gms: features: de-quadruplicate active feature names Active feature names are present four or five times in the code: a declaration in feature.hh, a definition and initialization (two copies) in feature_service.cc, a use in feature_service.cc, and a possible reference in feature_service.cc if the feature is conditionally enabled. Switch to just one copy or two, using the "foo"sv operator (and "foo"s) to generate a string_view (string) as before. Note that a few features had different external and C++ names; we preserve the external name. This patch does cause literal strings to be present in two places, making them vulnerable to misspellings. But since feature names are immutable, there is little risk that one will change without the other.	2022-05-04 18:12:53 +03:00
Avi Kivity	980b109adb	gms: features: de-quadruplicate deprecated feature names Deprecated features are unused, but are present four times in the code: a declaration in feature.hh, a definition and initialization (two copies) in feature_service.cc, and a use in feature_service.cc. Switch to just one copy, using the "foo"sv operator to generate a string_view as before. Note that a few features had different external and C++ names; we preserve the external name.	2022-05-04 17:54:05 +03:00
Avi Kivity	ebe5ce2870	gms: feature_service: avoid duplicating feature names when listing known features We already have the registered features in a data structure, collect them from there instead of repeating.	2022-05-04 16:19:42 +03:00
Calle Wilund	78350a7e1b	cdc: Ensure columns removed from log table are registered as dropped If we are redefining the log table, we need to ensure any dropped columns are registered in "dropped_columns" table, otherwise clients will not be able to read data older than now. Includes unit test. Should probably be backported to all CDC enabled versions. Fixes #10473 Closes #10474	2022-05-04 14:19:39 +02:00
Michał Radwański	29e09a3292	db/config: command line arguments logger_stdout_timestamps and logger_ostream_type are no longer ignored Closes #10452	2022-05-04 14:40:52 +03:00
Pavel Emelyanov	063d26bc9e	system_keyspace/config: Swallow string->value cast exception When updating an updateable value via CQL the new value comes as a string that's then boost::lexical_cast-ed to the desired value. If the cast throws the respective exception is printed in logs which is very likely uncalled for. fixes: #10394 tests: manual Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220503142942.8145-1-xemul@scylladb.com>	2022-05-04 08:35:12 +03:00
Pavel Emelyanov	b26a3da584	gossiper: Coroutinize wait_for_gossip_to_settle() Looks notably shorter this way tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220422093000.24407-1-xemul@scylladb.com>	2022-05-03 15:58:04 +03:00
Botond Dénes	4440d4b41a	Merge "De-globalize gossiper" from Pavel Emelyanov " - Alternator gets gossiper for its proxy dependency - Forward service method that takes global gossiper can re-use proxy method (forward -> proxy reference is already there) - Table code is patched to require gossiper argument - Snitch gets a dependency reference on snitch_ptr and some extra care for snitch driver vs snitch-ptr interaction and gossip test - Cql test env should carry gossiper reference on-board - Few places can re-use the existing local gossiper reference - Scylla-gdb needs to get gossiper from debug namespace and needs _not_ to get feature service from gossiper " * 'br-gossiper-deglobal-2' of https://github.com/xemul/scylla: code: De-globalize gossiper scylla-gdb, main: Get feature service without gossiper help test: Use cql-test-env gossiper cql test env: Keep gossiper reference on board code: Use gossiper reference where possible snitch: Use local gossiper in drivers snitch: Keep gossiper reference test: Remove snitch from manual gossip test gossiper: Use container() instead of the global pointer main, cql_test_env: Start snitch later snitch: Move snitch_base::get_endpoint_info() forward service: Re-use proxy's helper with duplicated code table: Don't use global gossiper alternator: Don't use global gossiper	2022-05-03 15:56:07 +03:00
Nadav Har'El	6fb762630b	cql-pytest: translate Cassandra's tests for SELECT operations This is a translation of Cassandra's CQL unit test source file validation/operations/SelectTest.java into our cql-pytest framework. This large test file includes 78 tests for various types of SELECT operations. Four additional tests require UDF in Java syntax, and were skipped. All 78 tests pass on Cassandra. 25 of the tests fail on Scylla reproducing 3 already known Scylla issues and 8 previously-unknown issues: Previously known issues: Refs #2962: Collection column indexing Refs #4244: Add support for mixing token, multi- and single-column restrictions Refs #8627: Cleanly reject updates with indexed values where value > 64k Newly-discovered issues: Refs #10354: SELECT DISTINCT should allow filter on static columns, not just partition keys Refs #10357: Spurious static row returned from query with filtering, despite not matching filter Refs #10358: Comparison with UNSET_VALUE should produce an error Refs #10359: "CONTAINS NULL" and "CONTAINS KEY NULL" restrictions should match nothing Refs #10361: Null or UNSET_VALUE subscript should generate an invalid request error Refs #10366: Enforce Key-length limits during SELECT Refs #10443: SELECT with IN and ORDER BY orders rows per partition instead of for the entire response Refs #10448: The CQL token() function should validate its parameters Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10449	2022-05-03 11:45:05 +03:00
Pavel Emelyanov	e80adbade3	code: De-globalize gossiper No code uses global gossiper instance, it can be removed. The main and cql-test-env code now have their own real local instances. This change also requires adding the debug:: pointer and fixing the scylla-gdb.py to find the correct global location. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	89ee15b05b	scylla-gdb, main: Get feature service without gossiper help This is needed not to mess with removed global gossiper in the next patch. Other than this, it's better to access services by their own debug:: pointers, not via under-the-hood dependency chains. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	b25fc29801	test: Use cql-test-env gossiper There's yet another -test-env -- the alternator- one -- which needs gossiper. It now uses global reference, but can grab gossiper reference from the cql-test-env which participates in initialization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	b0544ba7bd	cql test env: Keep gossiper reference on board The reference is already available at the env initialization, but it's not kept on the env instance itself. Will be used by the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	4bea0b7491	code: Use gossiper reference where possible Some places in the code have function-local gossiper reference but continue to use global instance. Re-use the local reference (it's going to become sharded<> instance soon). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	e502047c74	snitch: Use local gossiper in drivers Each driver has a pointer to this shard snitch_ptr which, in turn, has the reference on gossiper. This lets drivers stop using the global gossiper instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	38c77d0d85	snitch: Keep gossiper reference The reference is put on the snitch_ptr because this is the sharded<> thing and because gossiper reference is the same for different snitch drivers. Also, getting gossiper from snitch_ptr by driver will look simpler than getting it from any base class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	52fc4d6b22	test: Remove snitch from manual gossip test It's not in use out there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	7a0ca3fedc	gossiper: Use container() instead of the global pointer Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:40 +03:00
Pavel Emelyanov	2d32c47d0d	main, cql_test_env: Start snitch later Snitch depends on gossiper and system keyspace, so it needs to be started after those two do. fixes #10402 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:57:32 +03:00
Pavel Emelyanov	f85e12ffa5	snitch: Move snitch_base::get_endpoint_info() This method is only needed by production_snitch_base inheritants Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:34:52 +03:00
Pavel Emelyanov	282a1880a5	forward service: Re-use proxy's helper with duplicated code The get_live_endpoints matches the same method on the proxy side. Since the forward service carries proxy reference, it can use its method (which needs to be made public for that sake). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:34:51 +03:00
Pavel Emelyanov	11c99fc41b	table: Don't use global gossiper The table::get_hit_rate needs gossiper to get hitrates state from. There's no way to carry gossiper reference on the table itself, so it's up to the callers of that method to provide it. Fortunately, there's only one caller -- the proxy -- but the call chain to carry the reference is not very short ... oh, well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:33:08 +03:00
Pavel Emelyanov	7a5c2cdbe6	alternator: Don't use global gossiper There's proxy at hand which can provide local gossiper reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:33:07 +03:00
Avi Kivity	a2901a376d	Merge 'Coroutinize some `storage_service` member functions' from Pavel Solodovnikov These trivial changes are mostly intended to reduce the use of `seastar::async`. Closes #10416 * github.com:scylladb/scylla: service: storage_service: coroutinize `start_gossiping()` service: storage_service: coroutinize `node_ops_cmd_heartbeat_updater()` service: storage_service: coroutinize `node_ops_abort_thread()` service: storage_service: coroutinize `node_ops_abort()` service: storage_service: coroutinize `node_ops_done()` service: storage_service: coroutinize `node_ops_update_heartbeat()` service: storage_service: coroutinize `force_remove_completion()` service: storage_service: coroutinize `start_leaving()` service: storage_service: coroutinize `start_sys_dist_ks()` service: storage_service: coroutinize `prepare_to_join()` service: storage_service: coroutinize `removenode_add_ranges()` service: storage_service: coroutinize `unbootstrap()` service: storage_service: coroutinize `get_changed_ranges_for_leaving()`	2022-05-02 12:59:36 +03:00
Botond Dénes	53c66fe24a	Merge "Make LCS reshape and major more efficient by picking the ideal output level" from Raphael S. Carvalho " Today, both operations are picking the highest level as the ideal level for placing the output, but the size of input should be used instead. The formula for calculating the ideal level is: ceil(log base(fan_out) of (total_input_size / max_fragment_size)) where fan_out = 10 by default, total_input_size = total size of input data and max_fragment_size = maximum size for fragment (160M by default) such that 20 fragments will be placed at level 2, as level 1 capacity is 10 fragments only. By placing the output in the incorrect level, tons of backlog will be generated for LCS because it will either have to promote or demote fragments until the levels are properly balanced. " * 'optimize_lcs_major_and_reshape/v2' of https://github.com/raphaelsc/scylla: compaction: LCS: avoid needless work post major compaction completion compaction: LCS: avoid needless work post reshape completion compaction: LCS: extract calculation of ideal level for input compaction: LCS: Fix off-by-one in formula used to calculate ideal level	2022-05-02 10:16:09 +03:00
Pavel Solodovnikov	47834313d8	repair: avoid infinite recursion on stringifying unknown node_ops_cmd Cast the cmd representation to underlying type and avoid infinite recursion in the `operator <<(node_ops_cmd)`. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20220430180701.1012190-1-pa.solodovnikov@scylladb.com>	2022-05-02 10:08:34 +03:00
Avi Kivity	5169ce40ef	Merge 'loading_cache: force minimum size of unprivileged ' from Piotr Grabowski This series enforces a minimum size of the unprivileged section when performing `shrink()` operation. When the cache is shrunk, we still drop entries first from unprivileged section (as before this commit), however, if this section is already small (smaller than `max_size / 2`), we will drop entries from the privileged section. This is necessary, as before this change the unprivileged section could be starved. For example if the cache could store at most 50 entries and there are 49 entries in privileged section, after adding 5 entries (that would go to unprivileged section) 4 of them would get evicted and only the 5th one would stay. This caused problems with BATCH statements where all prepared statements in the batch have to stay in cache at the same time for the batch to correctly execute. To correctly check if the unprivileged section might get too small after dropping an entry, `_current_size` variable, which tracked the overall size of cache, is changed to two variables: `_unprivileged_section_size` and `_privileged_section_size`, tracking section sizes separately. New tests are added to check this new behavior and bookkeeping of the section sizes. A test is added, that sets up a CQL environment with a very small prepared statement cache, reproduces issue in #10440 and stresses the cache. Fixes #10440. Closes #10456 * github.com:scylladb/scylla: loading_cache_test: test prepared stmts cache loading_cache: force minimum size of unprivileged loading_cache: extract dropping entries to lambdas loading_cache: separately track size of sections loading_cache: fix typo in 'privileged'	2022-05-01 19:36:35 +03:00
Avi Kivity	325eb9b4d2	Merge 'compaction: get rid of reader v1' from Benny Halevy This series gets rid of the remaining usage of flat_mutation_reader v1 in compaction Test: sstable_compaction_test Closes #10454 * github.com:scylladb/scylla: compaction: sanitize headers from flat_mutation_reader v1 flat_mutation_reader: get rid of class filter compaction: cleanup_compaction: make_partition_filter: return flat_mutation_reader_v2::filter	2022-05-01 19:29:10 +03:00
Avi Kivity	ab72dbc93e	Merge 'Make internal statements caching more coherent' from Eliran Sinvani There were some doubts about which internal prepared statements are cached and which aren't. In addition some queries that should have been cached (IMO), weren't. This PR adds some verbosity to the caching enabling parameter as well as adding caching to some queries. As a followup I would suggest to have internal queries as a compile time strings that have a compile time hash, this will make the cache lookup not be dependent on the query textual length as it is today, this makes sense given that the queries are static even today. Closes #10465 Fixes #10335. * github.com:scylladb/scylla: internal queries: add caching to some queries query_processor: remove default internal query caching behavior query_processor: make execute_internal caching parameter more verbose	2022-05-01 19:27:06 +03:00
Eliran Sinvani	a16b4e407d	internal queries: add caching to some queries Some of the internal queries didn't have caching enabled even though there are chances of the query executing in large bursts or relatively often, example of the former is `default_authorized::authorize` and for the latter is `system_distributed_keyspace::get_service_levels`. Fixes #10335 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-05-01 13:30:02 +03:00
Pavel Solodovnikov	1031a9fa09	service: storage_service: coroutinize `start_gossiping()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-05-01 12:12:30 +03:00
Pavel Solodovnikov	4af27ca653	service: storage_service: coroutinize `node_ops_cmd_heartbeat_updater()` Also, pass `node_ops_cmd` by value to get rid of lifetime issues when converting to coroutine. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-05-01 12:07:36 +03:00
Eliran Sinvani	e0c7178e75	query_processor: remove default internal query caching behavior When executing internal queries, it is important that the developer will decide if to cache the query internally or not since internal queries are cached indefinitely. Also important is that the programmer will be aware if caching is going to happen or not. The code contained two "groups" of `query_processor::execute_internal`, one group has caching by default and the other doesn't. Here we add overloads to eliminate default values for caching behaviour, forcing an explicit parameter for the caching values. All the call sites were changed to reflect the original caching default that was there. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-05-01 08:33:55 +03:00
Eliran Sinvani	38b7ebf526	query_processor: make execute_internal caching parameter more verbose `execute_internal` has a parameter to indicate if caching a prepared statement is needed for a specific call. However this parameter was a boolean so it was easy to miss its meaning in the various call sites. This replaces the parameter type with a more verbose one so it is clear from the call site what decision was made.	2022-05-01 08:33:55 +03:00
Avi Kivity	7f1e368e92	Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop. Fixes: #10450 Closes #10451 * github.com:scylladb/scylla: replica/database: drop_column_family(): drop querier cache entries after waiting for ops replica/database: finish coroutinizing drop_column_family() replica/database: make remove(const column_family&) private	2022-04-29 22:06:51 +03:00
Tomasz Grabiec	dbef83af71	Merge 'raft: fix startup hangs' from Kamil Braun Fix hangs on Scylla node startup with Raft enabled that were caused by: - a deadlock when enabling the USES_RAFT feature, - a non-voter server forgetting who the leader is and not being able to forward a `modify_config` entry to become a voter. Read the commit messages for details. Fixes: #10379 Refs: #10355 Closes #10380 * github.com:scylladb/scylla: raft: actively search for a leader if it is not known for a tick duration raft: server: return immediately from `wait_for_leader` if leader is known service: raft: don't support/advertise USES_RAFT feature	2022-04-29 19:47:10 +02:00
Piotr Grabowski	6537dc6126	loading_cache_test: test prepared stmts cache Add a new test that sets up a CQL environment with a very small prepared statements cache. The test reproduces a scenario described in #10440, where a privileged section of prepared statement cache gets large and that could possibly starve the unprivileged section, making it impossible to execute BATCH statements. Additionally, at the end of the test, prepared statements/"simulated batches" with prepared statements are executed a random number of times, stressing the cache. To create a CQL environment with small prepared cache, cql_test_config is extended to allow setting custom memory_config value.	2022-04-29 19:22:55 +02:00
Piotr Grabowski	3f2224a47f	loading_cache: force minimum size of unprivileged This patch enforces a minimum size of unprivileged section when performing shrink() operation. When the cache is shrunk, we still drop entries first from unprivileged section (as before this commit), however if this section is already small (smaller than max_size / 2), we will drop entries from the privileged section. For example if the cache could store at most 50 entries and there are 49 entries in privileged section, after adding 5 entries (that would go to unprivileged section) 4 of them would get evicted and only the 5th one would stay. This caused problems with BATCH statements where all prepared statements in the batch have to stay in cache at the same time for the batch to correctly execute. New tests are added to check this behavior and bookkeeping of section sizes. Fixes #10440.	2022-04-29 19:19:04 +02:00
Piotr Grabowski	06612ddf1c	loading_cache: extract dropping entries to lambdas Extract the logic of dropping an entry from privileged/unprivileged sections to separate named local lambdas.	2022-04-29 19:19:03 +02:00
Piotr Grabowski	bebc4c8147	loading_cache: separately track size of sections This patch splits _current_size variable, which tracked the overall size of cache, to two variables: _unprivileged_section_size and _privileged_section_size. Their sum is equal to the old _current_size, but now you can get the size of each section separately. lru_entry's cache_size() is replaced with owning_section_size() which references in which counter the size of lru_entry is currently stored.	2022-04-29 19:19:03 +02:00
Michał Sala	35e02858b2	test/boost: cql_query_test: allow bound variables in test_list_of_tuples_with_bound_var	2022-04-29 10:46:57 +02:00
Michał Sala	ed377933a1	test/boost: cql_query_test: test bound variables in collection literals	2022-04-29 10:46:51 +02:00
Raphael S. Carvalho	736c96cc6f	compaction: LCS: avoid needless work post major compaction completion That's done by picking the ideal level for the input, such that LCS won't have to either promote or demote data, because the output level is not the best candidate for having the size of the output data. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-28 20:19:28 -03:00
Raphael S. Carvalho	bea551ea14	compaction: LCS: avoid needless work post reshape completion That's done by picking the ideal level for reshape input, such that LCS won't have to either promote or demote data, because the output level is not the best candidate for having the size of the output data. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-28 20:19:28 -03:00
Raphael S. Carvalho	b25e53a845	compaction: LCS: extract calculation of ideal level for input ideal level is calculated as: ceil(log base10 of ((input_size + max_fragment_size - 1) / max_fragment_size)) such that 20 fragments will be placed at level 2, as level 1 capacity is 10 fragments only. The goal of extracting it is that the formula will be useful for major in addition to reshape. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-28 20:19:04 -03:00
Raphael S. Carvalho	e82c02fae6	compaction: LCS: Fix off-by-one in formula used to calculate ideal level To calculate ideal level, we use the formula: log 10 (input_size / max_fragment_size) input_size / max_fragment_size is calculating number of fragments. The problem is that the calculation can miss the last fragment, so the wrong level may be picked if the last fragment would cause the target level to exceed its capacity. To fix it, let's tweak the formula to: log 10 ((input_size + max_fragment_size - 1) / max_fragment_size) such that the actual # of fragments will be calculated. If the wrong level is picked, it can cause unnecessary writeamp, as LCS will later have to promote data into the next level. Problem spotted by Benny Halevy. Fixes #10458. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-28 20:09:04 -03:00
Michał Sala	69bc60ef24	cql3: expr: do not allow unset values inside collections Semantic of unset values inside collections is undefined. Previous behavior of transforming list with unset value into unset value was removed, because I couldn't find a reason for its existence.	2022-04-28 19:33:09 +02:00
Avi Kivity	aee94b7176	Merge "Convert remaining mutation sources to v2" from Botond " After the recent conversion of the row-cache, two v1 mutation sources remained: the memtable and the kl sstable reader. This series converts both to a native v2 implementation. The conversion is shallow: both continue to read and process the underlying (v1) data in v1, the fragments are converted to v2 right before being pushed to the reader's buffer. This conversion is simple, surgical and low-risk. It is also better than the upgrade_to_v2() used previously. Following this, the remaining v1 reader implementations are removed, with the exception of the downgrade_to_v1(), which is the only one left at this point. Removing this requires converting all mutation sinks to accept a v2 stream. upgrade_to_v2() is now not used in any production code. It is still needed to properly test downgrade_to_v1() (which is still used), so we can't remove it yet. Instead it is hidden as a private method of mutation_source. This still allows for the above mentioned testing to continue, while preventing anyone from being tempted to introduce new usage. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191 " * 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla: readers: make upgrade_to_v2() private test/lib/mutation_source_test: remove upgrade_to_v2 tests readers: remove v1 forwardable reader readers: remove v1 empty_reader readers: remove v1 delegating_reader sstables/kl: make reader impl v2 native sstables/kl: return v2 reader from factory methods sstables: move mp_row_consumer_reader_k_l to kl/reader.cc partition_snapshot_reader: convert implementation to native v2 mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()	2022-04-28 20:31:23 +03:00
Michał Sala	4766e25d6e	cql3: expr: prepare_expr: allow bind markers in collection literals It's easier to allow them than not to do so.	2022-04-28 19:31:09 +02:00
Piotr Grabowski	fe9b62bc99	loading_cache: fix typo in 'privileged' Fix typo from 'priviledged' to 'privileged'.	2022-04-28 17:51:26 +02:00
Benny Halevy	78d6f6a519	compaction: sanitize headers from flat_mutation_reader v1 flat_mutation_reader make_scrubbing_reader no longer exists and there is no need to include flat_mutation_reader.hh nor forward declare the class. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-28 17:23:04 +03:00
Benny Halevy	4c35de962f	flat_mutation_reader: get rid of class filter It is no longer in use. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-28 17:19:20 +03:00
Benny Halevy	f634b6d3be	compaction: cleanup_compaction: make_partition_filter: return flat_mutation_reader_v2::filter We filter only on the partition key, so it doesn't matter, but we want to get rid of flat_mutation_reader v1. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-28 17:16:47 +03:00
Nadav Har'El	ad7cc71748	Merge 'sstables: Fix deletion of partial SSTables' from Raphael "Raph" Carvalho If SSTable write fails, it will leave a partial sst which contains a temporary TOC in addition to other components partially written. The temporary TOC content is written upfront, to allow us to delete all partial components using the former content if write fails. After commit `e5fc4b6`, partial sst cannot be deleted because it is incorrectly assuming all files being deleted unconditionally have a TOC, but that's not true for partial files that need to be removed. The consequence of this is that space of partial files cannot be reclaimed, making it worse for Scylla to recover from ENOSPC, which could happen by selecting a set of files for compaction with higher chance of succeeding given the free space. Let's fix this by taking into account temp TOC for partial files. Fixes #10410. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10411 * github.com:scylladb/scylla: sstables: Fix deletion of partial SSTables sstables: Fix fsync_directory() sstables: Rename dirname() to a more descriptive name	2022-04-28 16:35:46 +03:00
Nadav Har'El	01bd858b6b	cql: fix error message that refers to tuple instead of UDT Slice restrictions on the "duration" type are not allowed, and also if we have a collection, tuple or UDT of durations. We made an effort to print helpful messages on the specific case encountered, such as "Slice restrictions are not supported on UDTs containing duration". But the if()s were reversed, meaning that a UDT - which is also a tuple - will be reported as a tuple instead of UDT as we intended (and as Cassandra reports it). The wrong message was reproduced in the unit test translated from Cassandra, select_test.py::testFilteringOnUdtContainingDurations Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220428071807.1769157-1-nyh@scylladb.com>	2022-04-28 14:16:58 +03:00
Botond Dénes	178c271bf4	readers: make upgrade_to_v2() private The only user is the tests of downgrade_to_v1(), which uses it through mutation source. To avoid any new users popping up, we make it a private method of the latter. In the process the pass-through optimization is dropped, it is not needed for tests anyway.	2022-04-28 14:12:24 +03:00
Botond Dénes	272da51f80	test/lib/mutation_source_test: remove upgrade_to_v2 tests We don't have any upgrade_to_v2() left in production code, so no need to keep testing it. Removing it from this test paves the way for removing it for good (not in this series).	2022-04-28 14:12:24 +03:00
Botond Dénes	7420fb9411	readers: remove v1 forwardable reader No users.	2022-04-28 14:12:24 +03:00
Botond Dénes	f527956cdb	readers: remove v1 empty_reader The only user is row level repair: it is replaced with downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has lots of downgrade_to_v1() calls, we will deal with these later all at once. Another use is the empty mutation source, this is trivially converted to use the v2 variant.	2022-04-28 14:12:24 +03:00
Botond Dénes	ea37e9c04e	readers: remove v1 delegating_reader The only user is a test, which is hereby converted to use the v2 delegating reader.	2022-04-28 14:12:24 +03:00
Botond Dénes	70d019116f	sstables/kl: make reader impl v2 native The conversion is shallow: the meat of the logic remains v1, fragments are converted to v2 right before being pushed into the buffer. This approach is simple, surgical and is still better than a full upgrade_to_v2().	2022-04-28 14:12:24 +03:00
Botond Dénes	a22b02c801	sstables/kl: return v2 reader from factory methods This just moves the upgrade_to_v2() calls to the other side of said factory methods, preparing the ground for converting the kl reader impl to a native v2 one.	2022-04-28 14:12:24 +03:00
Botond Dénes	4b222e7f37	sstables: move mp_row_consumer_reader_k_l to kl/reader.cc Its only user is in said file, so that is a better place for it.	2022-04-28 14:12:24 +03:00
Botond Dénes	4f77e74bd4	partition_snapshot_reader: convert implementation to native v2 The underlying mutation representation is still v1, so the implementation still has to do conversion. This happens right above the lsa reader component.	2022-04-28 14:12:12 +03:00
Botond Dénes	9c7455825b	mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()	2022-04-28 14:11:51 +03:00
Botond Dénes	024ceec61e	replica/database: drop_column_family(): drop querier cache entries after waiting for ops Reads (part of operations) running concurrent to `drop_column_family()` can create querier cache entries while we wait for them to finish in `await_pending_ops()`. Move the cache entry eviction to after this, to ensure such entries are also cleaned up before destroying the table object. This moves the `_querier_cache.evict_all_for_table()` from `database::remove()` to `database::drop_column_family()`. With that the former doesn't have to return `future<>` anymore. While at it (changing the signature) also rename `column_family` -> `table`. Also add a regression unit test.	2022-04-28 13:40:13 +03:00
Botond Dénes	4c17da9996	replica/database: finish coroutinizing drop_column_family() Said method was already coroutinized, but only halfway, possibly because of the difficulty in expressing `finally()` with coroutines. We now have `coroutines::as_future()` which makes this easier, so finish the job.	2022-04-28 13:40:13 +03:00
Botond Dénes	9b7550f845	replica/database: make remove(const column_family&) private It has no external users. And it shouldn't have either, tables should be removed via drop_column_family().	2022-04-28 13:40:08 +03:00
Avi Kivity	de0ee13f45	schema_tables: forward-declare user_function and user_aggerates These bring in wasm.hh (though they really shouldn't) and make everyone suffer. Forward declare instead and add missing includes where needed. Closes #10444	2022-04-28 07:22:02 +03:00
Botond Dénes	2c08468fcb	Merge 'Make headers self-contained' from Avi Kivity Minor fixlets to make `ninja dev-headers` pass. Closes #10445 * github.com:scylladb/scylla: readers/from_mutations_v2.hh: make self-contained data_dictionary/storage_options.hh: make self-contained	2022-04-28 07:20:10 +03:00
Avi Kivity	a9812166cd	replica, partition_snapshot_reader, keys: replace boost::any with std::any Reduce #include load by standardizing on std::any. In keys.cc, we just drop the unneeded include. One instance of boost::any remains in config_file, due to a tie-in with other boost components. Closes #10441	2022-04-28 07:18:53 +03:00
Avi Kivity	3a81cb7cc3	readers/from_mutations_v2.hh: make self-contained Due to an inline function, we need the definition of flat_mutation_reader_v2.hh, so include it.	2022-04-27 15:55:16 +03:00
Avi Kivity	28406c2c56	data_dictionary/storage_options.hh: make self-contained Add "seastarx.hh" so sstring works (rather than seastar::sstring).	2022-04-27 15:54:32 +03:00
Avi Kivity	333fdcb3f5	Update tools/java submodule (fix NodeProbe: Malformed IPv6 address at index) * tools/java 9bc83b7a32...a4573759a2 (1): > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index Fixes #10442.	2022-04-27 14:51:47 +03:00
Benny Halevy	e88871f4ec	replica: database: move shard_of implementation to mutation layer We don't need the database to determine the shard of the mutation, only its schema. So move the implementation to the respective definitions of mutation and frozen_mutation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10430	2022-04-27 14:40:24 +03:00
Nadav Har'El	f6ce7891a5	test/alternator: add test for key length limits DynamoDB limits partition-key length to 2048 bytes and sort-key length to 1024 bytes. Alternator currently has no such limits officially, but if a user tries a key length of over 64 KB, the result will be an "internal server error" as Alternator runs into Scylla's low-level key length limit of 64 KB. In this patch we add (mostly xfailing) tests confirming all the above observations. The tests include extensive comments on what they are testing and why. Some of these tests (specifically, the ones checking what happens above 64 KB) should pass once Alternator is fixed. Other tests - requiring that the limits be exactly what they are in DynamoDB - may either not pass or change in the future, depending on what we decide the limits should be in Alternator. Refs #10347 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10438	2022-04-26 18:09:19 +02:00
Raphael S. Carvalho	791403e4bb	sstables: Fix deletion of partial SSTables If SSTable write fails, it will leave a partial sst which contains a temporary TOC in addition to other components partially written. The temporary TOC content is written upfront, to allow us to delete all partial components using the former content if write fails. After commit `e5fc4b6`, partial sst cannot be deleted because deletion procedure is incorrectly assuming all SSTs being deleted unconditionally have TOC, but partial SSTs only have TMP TOC instead. That happens because parent_path() requires all path components to exist due to its usage of fs::path::canonical. The consequence of this is that space of partial files cannot be reclaimed, making it worse for Scylla to recover from ENOSPC, which could happen by selecting a set of files for compaction with higher chance of succeeding given the free space. This is fixed by only calling parent_path() on TMP TOC, which is guaranteed to exist prior to calling fsync_directory(). Fixes #10410. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 11:00:27 -03:00
Raphael S. Carvalho	0be44b1035	sstables: Fix fsync_directory() fsync_directory() is broken because it's unconditionally performing fsync on parent directory, not on the directory that it was called with. To fix, let's remove wrong parent_path() usage. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 11:00:27 -03:00
Raphael S. Carvalho	ca8f5dcdb7	sstables: Rename dirname() to a more descriptive name dirname() is confusing because if it's called on a directory, parent path is retrieved. By renaming it to parent_path(), it's clearer what the function will do exactly. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 11:00:27 -03:00
Avi Kivity	582802825a	treewide: use system-#include (angle brackets) for seastar Seastar is an external library from Scylla's point of view so we should use the angle bracket #include style. Most of the source follows this, this patch fixes a few stragglers. Also fix cases of #include which reached out to seastar's directory tree directly, via #include "seastar/include/seastar/..." to just refer to <seastar/...>. Closes #10433	2022-04-26 14:46:42 +03:00
Takuya ASADA	48b6aec16a	scripts: use "out()" function for all capture_output subprocesses On `acaf0bb` we applied out() just for perftune.py because we had issue #10390 with this script. But the issue can happen with other commands too, let's apply it to all commands which uses capture_output. related #10390 Closes #10414	2022-04-26 13:56:52 +03:00
Benny Halevy	01f41630a5	compaction: time_window_compaction_strategy: reset estimated_remaining_tasks when running out of candidates _estimated_remaining_tasks gets updated via get_next_non_expired_sstables -> get_compaction_candidates, but otherwise if we return earlier from get_sstables_for_compaction, it does not get updated and may go out of sync. Refs #10418 (to be closed when the fix reaches branch-4.6) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10419	2022-04-26 11:26:48 +03:00
Benny Halevy	055141fc2e	multishard_mutation_query: do_query: stop ctx if lookup_readers fails lookup_readers might fail after populating some readers and those better be closed before returning the exception. Fixes #10351 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10425	2022-04-26 11:11:52 +03:00
Botond Dénes	bf1b6ced3c	Merge "Make storage_service::bootstrap less if-y" from Pavel Emelyanov " The method in question performs node bootstrap in several different modes (regular, replacing, rnbo) and several subsequent if-else branches just duplicate each-other. This set merges them making the code easier to read. " * 'br-less-branchy-bootstrap' of https://github.com/xemul/scylla: storage_service: Remove pointless check in replace-bootstrap storage_service: Generalize wait for range setup storage_service: Merge common if-else branches in bootstrap storage_service: Move tables bootstrap-ON upwards	2022-04-26 10:58:30 +03:00
Raphael S. Carvalho	d79fb9a12f	docs: Update compaction controller doc The doc is being updated to reflect the changes in the commit `d8833de3bb` ("Redefine Compaction Backlog to tame compaction aggressiveness"). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 10:50:45 +03:00
Benny Halevy	c2f0d75d96	test: database_test: add test_truncate_without_snapshot_during_writes Reproduces https://github.com/scylladb/scylla/issues/10421 with `2325c566d9` (memtable_list: futurize clear_and_add) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-26 07:25:30 +03:00
Benny Halevy	b8263e550a	memtable_list: safely futurize clear_and_add Following `a4be927e23` that reverted `2325c566d9` due to #10421, this patch reintroduces an async version of memtable_list::clear_and_add that calls clear_gently safely after replacing the _memtables vector with a new one so that writes and flushes can continue in the foreground while the old memtables are cleared. Fixes #10281 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-26 07:25:28 +03:00
Benny Halevy	aae532a96b	table: clear: serialize with ongoing flush Get all flush permits to serialize with any ongoing flushes and preventing further flushes during table::clear, in particular calling discard_completed_segments for every table and clearing the memtables in clear_and_add. Fixes #10423 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-25 18:57:07 +03:00
Gleb Natapov	7f26a8eef5	raft: actively search for a leader if it is not known for a tick duration For a follower to forward requests to a leader the leader must be known. But there may be a situation where a follower does not learn about a leader for a while. This may happen when a node becomes a follower while its log is up-to-date and there are no new entries submitted to raft. In such case the leader will send nothing to the follower and the only way to learn about the current leader is to get a message from it. Until a new entry is added to the raft's log a follower that does not know who the leader is will not be able to add entries. Kind of a deadlock. Note that the problem is specific to our implementation where failure detection is done by an outside module. In vanilla raft a leader sends messages to all followers periodically, so essentially it is never idle. The patch solves this by broadcasting specially crafted append reject to all nodes in the cluster on a tick in case a leader is not known. The leader responds to this message with an empty append request which will cause the node to learn about the leader. For optimisation purposes the patch sends the broadcast only in case there is actually an operation that waits for leader to be known. Fixes #10379	2022-04-25 14:51:22 +02:00
Kamil Braun	5308a7d7a3	raft: server: return immediately from `wait_for_leader` if leader is known `wait_for_leader` may be called when leader is known. There's nothing to wait for in this case.	2022-04-25 12:59:55 +02:00
Benny Halevy	db676e9e4a	replica: database: apply: make sure the schema is synced or throw internal error Currently an exception is thrown in the apply stage when the schema is not synced, but it is too late since returning an error doesn't pinpoint which code path was using an unsync'ed schema so move the check earlier, before _apply_stage is called. We need to make sure the schema is synced earlier when the mutation is applied so call on_internal_error to generate a backtrace in testing and still throw an error in production. Typically storage_proxy::mutate_locally implicitly ensures the schema is synced by making a global_schema_ptr for it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220424110057.3957597-1-bhalevy@scylladb.com>	2022-04-25 12:18:47 +02:00
Pavel Solodovnikov	654e6726d1	service: storage_service: coroutinize `node_ops_abort_thread()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:11:20 +03:00
Pavel Solodovnikov	b27c989e62	service: storage_service: coroutinize `node_ops_abort()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:11:14 +03:00
Pavel Solodovnikov	f7e84c6138	service: storage_service: coroutinize `node_ops_done()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:11:08 +03:00
Pavel Solodovnikov	6936dbea49	service: storage_service: coroutinize `node_ops_update_heartbeat()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:11:04 +03:00
Pavel Solodovnikov	1c03d01927	service: storage_service: coroutinize `force_remove_completion()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:10:58 +03:00
Pavel Solodovnikov	fc1dfb0ae1	service: storage_service: coroutinize `start_leaving()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:10:54 +03:00
Pavel Solodovnikov	0a3a7534d6	service: storage_service: coroutinize `start_sys_dist_ks()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:10:49 +03:00
Pavel Solodovnikov	15ea74e41f	service: storage_service: coroutinize `prepare_to_join()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:10:43 +03:00
Pavel Solodovnikov	c739fad5d6	service: storage_service: coroutinize `removenode_add_ranges()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:10:05 +03:00
Pavel Solodovnikov	e392fdda96	service: storage_service: coroutinize `unbootstrap()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:09:56 +03:00
Pavel Solodovnikov	8fa7f47a74	service: storage_service: coroutinize `get_changed_ranges_for_leaving()` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-25 09:09:04 +03:00
Benny Halevy	bcd35af7cf	replica: table: generate_and_propagate_view_updates: pass mutation to make_flat_mutation_reader_from_mutations_v2 With `f5ef687acd` we can consume the single mutation directly, so there's no need to pass it as a vector of size 1. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220424103826.3930895-1-bhalevy@scylladb.com>	2022-04-24 22:19:19 +03:00
Avi Kivity	728479a6ea	Merge 'Fix map subscript crashes when map or subscript is null' from Nadav Har'El In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues #10361, #10399, #10401. The existing test `test_null.py::test_map_subscript_null` reproduced all these bugs sporadically. In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both `m[NULL]` and `NULL[2]` to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., `NULL < 2` is also defined as resulting in NULL. However, this decision differs from Cassandra, where `m[NULL]` is considered an error but `NULL[2]` is allowed. We believe that making `m[NULL]` be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like `m[a]`, where the column `a` can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan. Fixes #10361 Fixes #10399 Fixes #10401 Closes #10420 * github.com:scylladb/scylla: expressions: don't dereference invalid map subscript in filter expressions: fix invalid dereference in map subscript evaluation test/cql-pytest: improve tests for map subscripts and nulls	2022-04-24 21:16:10 +03:00
Avi Kivity	a4be927e23	Revert "memtable_list: futurize clear_and_add" This reverts commit `2325c566d9`. It causes a use-after-free of a memtable. Fixes #10421.	2022-04-24 21:09:48 +03:00
Asias He	953af38281	streaming: Allow drop table during streaming Currently, if a table is dropped during streaming, the streaming would fail with no_such_column_family error. Since the table is dropped anyway, it makes more sense to ignore the streaming result of the dropped table, whether it is successful or failed. This allows users to drop tables during node operations, e.g., bootstrap or decommission a node. This is especially useful for the cloud users where it is hard to coordinate between a node operation by admin and user cql change. This patch also fixes a possible use-after-free issue by not passing the table reference object around. Fixes #10395 Closes #10396	2022-04-24 17:43:20 +03:00
Tzach Livyatan	607ccf0393	Update doc project name to scylla dev Closes #10342	2022-04-24 17:40:54 +03:00
Nadav Har'El	fbb2a41246	expressions: don't dereference invalid map subscript in filter If we have the filter expression "WHERE m[?] = 2", the existing code simply assumed that the subscript is an object of the right type. However, while it should indeed be the right type (we already have code that verifies that), there are two more options: It can also be a NULL, or an UNSET_VALUE. Either of these cases causes the existing code to dereference a non-object as an object, leading to bizarre errors (as in issue #10361) or even crashes (as in issue #10399). Cassandra returns an invalid request error in these cases: "Unsupported unset map key for column m" or "Unsupported null map key for column m". We decided to do things differently: * For NULL, we consider m[NULL] to result in NULL - instead of an error. This behavior is more consistent with other expressions that contain null - for example NULL[2] and NULL<2 both result in NULL as well. Moreover, if in the future we allow more complex expressions, such as m[a] (where a is a column), we can find the subscript to be null for some rows and non-null for other rows - and throwing an "invalid query" in the middle of the filtering doesn't make sense. * For UNSET_VALUE, we do consider this an error like Cassandra, and use the same error message as Cassandra. However, the current implementation checks for this error only when the expression is evaluated - not before. It means that if the scan is empty before the filtering, the error will not be reported and we'll silently return an empty result set. We currently consider this ok, but we can also change this in the future by binding the expression only once (today we do it on every evaluation) and validating it once after this binding. Fixes #10361 Fixes #10399 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-04-24 16:05:34 +03:00
Nadav Har'El	808a93d29b	expressions: fix invalid dereference in map subscript evaluation When we have a filter such as "WHERE m[2] = 3" (where m is a map column), if a row had a null value for m, our expression evaluation code incorrectly dereferenced an unset optional and continued processing the result of this dereference, which resulted in undefined behavior - sometimes we were lucky enough to get "marshaling error" but other times Scylla crashed. The fix is trivial - just check before dereferencing the optional value of the map. We return null in that case, which means that we consider the result of null[2] to be null. I think this is a reasonable approach and fits our overall approach of making null dominate expressions (e.g., the value of "null < 2" is also null). The test test_filtering.py::test_filtering_null_map_with_subscript, which used to frequently fail with marshaling errors or crashes, now passes every time so its "xfail" mark is removed. Fixes #10417 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-04-24 14:58:56 +03:00
Nadav Har'El	189b8845fe	test/cql-pytest: improve tests for map subscripts and nulls The test test_null.py::test_map_subscript_null turned out to reproduce multiple bugs related to using map subscripts in filtering expressions. One was issue #10361 (m[null] resulted in a bizarre error) or #10399 (m[null] resulted in a crash), and a different issue was #10401 (m[2] resulted in a bizarre error or a crash if m itself was null). Moreover, the same test uncovered different bugs depending how it was run - alone or with other tests - because it was using a shared table. In this patch we introduce two separate tests in test_filtering.py which are designed to reproduce these separate bugs instead of mixing them into one test. The new tests also cover a few more corners which the previous test (which focused on nulls) missed - such as UNSET_VALUE. The two new tests (and the old test_map_subscript_null) pass on Cassandra so still assume that the Cassandra behavior - that m[null] should be an error - is the correct behavior. We may want to change the desired behavior (e.g., to decide that m[null] be null, not an error), and change the tests accordingly later - but for now the tests follow Cassandra's behavior exactly, and pass on Cassandra and fail on Scylla (so are marked xfail). The bugs reproduced by these tests involve randomness or reading uninitialized memory, so these tests sometimes pass, sometimes fail, and sometimes even crash (as reported in #10399 and #10401). So to reproduce these bugs run the tests multiple times. For example: test/cql-pytest/run --count 100 --runxfail test_filtering.py::test_filtering_null_map_with_subscript Refs #10361 Refs #10399 Refs #10401 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-04-24 13:26:26 +03:00
Avi Kivity	8624718983	Merge "row_cache: update reader implementations to v2" from Botond " cache_flat_mutation_reader gets a native v2 implementation. The underlying mutation representation is not changed: range deletions are still stored as v1 range_tombstones in mutation_partition. These are converted to range tombstone changes during reading. This allows for separating the change of a native v2 reader implementation and a native v2 in-memory storage format, enabling the two to be done at separate times and incrementally. This means there is still conversion going on when reading from cache and when populating, but when reading from underlying, the stream can now be passed through as-is without conversions. Also, any future v2 related changes to the in-memory storage will now be limited to the cache reader implementation itself. In the process, the non-forwarding reader, whose only user is the cache, is also converted to v2. " Performance results reported by Botond: " build/release/test/perf/perf_simple_query -c1 -m2G --flush -- duration=20 BEFORE median 130421.76 tps ( 71.1 allocs/op, 12.1 tasks/op, 47462 insns/op) median absolute deviation: 319.64 maximum: 131028.33 minimum: 127502.55 AFTER median 133297.41 tps ( 64.1 allocs/op, 12.2 tasks/op, 45406 insns/op) median absolute deviation: 2964.24 maximum: 137581.56 minimum: 123739.4 Getting rid of those upgrade/downgrade was good for allocs and ops. Curiously there is a 0.1 rise in number of tasks though. " * 'row-cache-readers-v2/v1' of https://github.com/denesb/scylla: row_cache: update reader implementations to v2 range_tombstone_change_generator: flush(): add end_of_range readers/nonforwardable: convert to v2 read_context: fix indentation read_context: coroutinize move_to_next_partition() row_cache: cache_entry::read(): return v2 reader row_cache: return v2 readers from make_reader*() readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/	2022-04-23 19:10:43 +03:00
Botond Dénes	5e97fb9fc4	row_cache: update reader implementations to v2 cache_flat_mutation_reader gets a native v2 implementation. The underlying mutation representation is not changed: range deletions are still stored as v1 range_tombstones in mutation_partition. These are converted to range tombstone changes during reading. This allows for separating the change of a native v2 reader implementation and a native v2 in-memory storage format, enabling the two to be done at separate times and incrementally.	2022-04-21 14:57:04 +03:00
Botond Dénes	5cc5fd4d23	range_tombstone_change_generator: flush(): add end_of_range Allowing to flush all range tombstone changes, including those that have a position equal to the passed in upper bound, when finishing off a read-range, e.g. a clustering range from a slice.	2022-04-21 14:37:10 +03:00
Botond Dénes	7626beb729	readers/nonforwardable: convert to v2 It has a single user, the row cache, which for now has to upgrade/downgrade around the nonforwardable reader, but this will go away in the next patches when the row cache readers are converted to v2 proper.	2022-04-21 14:34:00 +03:00
Botond Dénes	b061acb668	Merge 'Remove queue reader v1' from Mikołaj Sielużycki The patchset embeds the mutation_fragment upgrading logic from v1 to v2 into the mutation_fragment_queue. This way the mutation fragments coming to the mutation_fragment_queue can be v1, but the underlying query_reader receives mutation_fragment_v2, eliminating the last usage of query_reader (v1). The last commit removes query_reader, query_reader_handle and associated factory functions. tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test) Closes #10371 * github.com:scylladb/scylla: readers: Remove queue_reader v1 and associated code. repair: Make mutation_fragment_queue internally upgrade fragments to v2 repair: Make mutation_fragment_queue::impl a seastar::shared_ptr	2022-04-21 12:34:48 +03:00
Mikołaj Sielużycki	f74fd0dd80	readers: Remove queue_reader v1 and associated code.	2022-04-20 17:56:34 +02:00
Mikołaj Sielużycki	339b60e5b0	repair: Make mutation_fragment_queue internally upgrade fragments to v2	2022-04-20 17:55:58 +02:00
Mikołaj Sielużycki	eeb2b458de	repair: Make mutation_fragment_queue::impl a seastar::shared_ptr It makes mutation_fragment_queue copyable and makes the pointer to pending mutation fragments in the next commit stable. This allows moving the mutation_fragment_queue without breaking the underlying upgrading_consumer.	2022-04-20 17:51:58 +02:00
Botond Dénes	46481264e9	read_context: fix indentation Broken by the previous patch (patches actually -- it was half-indent on half-indent before that).	2022-04-20 10:59:09 +03:00
Botond Dénes	28f90728a3	read_context: coroutinize move_to_next_partition() Makes the code more readable and the impending v2 transition less noisy.	2022-04-20 10:59:09 +03:00
Botond Dénes	2a0d7e8a1d	row_cache: cache_entry::read(): return v2 reader Push the conversion down one level. Soon we will make cache flat mutation reader a v2 reader, this keeps the related noise separate.	2022-04-20 10:59:09 +03:00
Botond Dénes	0b035c9099	row_cache: return v2 readers from make_reader*() And adjust callers. The factory functions just sprinkle upgrade_to_v2() on returned readers for now. One test in row_cache_test.cc had to be disabled, because the upgrade to v2 wrapper we now have over cache readers doesn't allow it to directly control the reader's buffer size and so the test fails. There is a FIXME left in the test code and the test will be re-enabled once a native v2 reader implementation allows us to get rid of the upgrade wrapper.	2022-04-20 10:59:09 +03:00
Botond Dénes	c3c71b3aa5	readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/ The argument type (v1 or v2 reader) is enough to disambiguate and overloading the v1 method makes a transition to v2 more seamless.	2022-04-20 10:59:09 +03:00
Nadav Har'El	cc40685c28	test/cql-pytest: add test for filtering with IN restriction It turns out that Cassandra does not allow IN restrictions together with filtering, except, curiously, when the restriction is on a clustering key. There is no real reason for this limitation - the error message even says it is not yet supported. Scylla, on the other hand, does support this case. Of course it's not enough that we support it - we need to support it correctly... But we don't have a full regression test that this support is correct - in filtering_test.cc we test it with clustering and regular columns - but not partition key columns. So this patch adds a simple cql-pytest test that this sort of filtering works in Scylla correctly for partition, clustering and regular columns (and also confirms that these cases don't work, yet, on Cassandra). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220420075553.1008062-1-nyh@scylladb.com>	2022-04-20 09:56:22 +02:00
Konstantin Osipov	a3b790b413	test.py: add a dependency on python3-aiohttp and tabulate Satisfy the build system requirements. [avi: regenerate frozen toolchain]	2022-04-19 18:22:50 +03:00
Konstantin Osipov	097fbc7c5d	.gitignore: ignore mypy_cache, the python lint cache	2022-04-19 16:48:47 +03:00
Pavel Emelyanov	41392a59bb	storage_service: Remove pointless check in replace-bootstrap The method in question is called in the branch where the replace address is checked to be present, no need in extra explicit check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-19 13:27:52 +03:00
Pavel Emelyanov	49481b1a21	storage_service: Generalize wait for range setup Both the if is_replacing()/else branches call the gossiper waiting method as their first step. This can be done once. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-19 13:27:52 +03:00
Pavel Emelyanov	d213e6ffd1	storage_service: Merge common if-else branches in bootstrap There are three modes in there -- bootstrap, b.s. with RBNO and b.s. for replacing. All three are checked two times in a row, but can be done once. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-19 13:27:52 +03:00
Pavel Emelyanov	b0df3a32b4	storage_service: Move tables bootstrap-ON upwards This call just places a boolean flag on all. It won't hurt if it lasts while the node is performing pre-bootstrap checks, but it allows making the whole method less branchy. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-19 13:27:52 +03:00
Avi Kivity	469bca5369	storage_proxy: coroutinize mutate_locally (vector overload) The do_with() means we have an unconditional allocation, so we can justify the coroutine's allocation (replacing it). Meanwhile, coroutine::parallel_for_each() reduces an allocation if mutate_locally() blocks. Closes #10387	2022-04-19 10:59:16 +03:00
Botond Dénes	3051fc3cbc	Merge 'Fix some errors and issues found by gcc 12' from Avi Kivity gcc 12 checks some things that clang doesn't, resulting in compile errors. This series fixes some of these issues, but still builds (and tests) with clang. Unfortunately, we still don't have a clean gcc build due to an outstanding bug [1]. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056 Closes #10386 * github.com:scylladb/scylla: build: disable warnings that cause false-positive errors with gcc 12 utils: result_loop: remove invalid and incorrect constraint service: forward_service: avoid using deprecated std::bind1st and std::not1 repair: explicityl ignore tombstone gc update response treewide: abort() after switch in formatters db: view: explicitly ignore unused result compaction: leveled_compaction_strategy: avoid compares between signed and unsigned compaction_manager: compaction_reenabler: disambiguate compaction_state api: avoid function specialization in req_param alternator: ttl: avoid specializing class templates in non-namespace scope alternator: executor: fix signed/unsigned comparison in is_big()	2022-04-19 10:25:38 +03:00
Botond Dénes	4d972b8d31	Merge 'storage_proxy: convert rpc handlers from lambdas to member functions' from Avi Kivity Currently, rpc handlers are all lambdas inside storage_proxy::init_messaging_service(). This means any stack trace refers to storage_proxy::init_messaging_service::lambda#n instead of a meaningful function name, and it makes init_messaging_service() very intimidating. Fix that by moving all such lambdas to regular member functions. The first two patches remove unnecessary captures to make it easy; the final patch converts the lambdas to member functions. Closes #10388 * github.com:scylladb/scylla: storage_proxy: convert rpc handlers from lambdas to member functions storage_proxy: don't capture messaging_service in server callbacks storage_proxy: don't capture migration_manager in server callbacks	2022-04-19 08:20:49 +03:00
Takuya ASADA	acaf0bb88a	scripts: print perftune.py error message when capture_output=True We are currently unable to get any error message from a subprocess when we specify capture_output=True in subprocess.run(). This is because CalledProcessError does not print stdout/stderr when it is raised, and we don't catch the exception; we just let Python produce a traceback. As a result, we only learn the exit status and the failed command, but never see stdout/stderr. This is especially problematic when working on a perftune.py bug, since the script produces a traceback of its own that we are never able to see. To resolve this, add a wrapper function "out()" for capturing output, and print stdout/stderr with the error message inside the function. Fixes #10390 Closes #10391	2022-04-18 14:06:51 +03:00
Avi Kivity	27093d32d1	Merge 'gms: gossiper: coroutinize `apply_state` functions' from Pavel Solodovnikov Mostly trivial conversions to coroutines in the gossiper to facilitate code readability. Closes #10389 * github.com:scylladb/scylla: gms: gossiper: coroutinize `apply_state_locally` gms: gossiper: coroutinize `apply_state_locally_without_listener_notification` gms: gossiper: coroutinize `do_apply_state_locally` gms: gossiper: coroutinize `apply_new_states`	2022-04-18 13:48:07 +03:00
Avi Kivity	7129ddfa67	build: disable warnings that cause false-positive errors with gcc 12 gcc 12 generates some incorrect warnings (that we treat as errors). Silence them so we can build.	2022-04-18 12:27:18 +03:00
Avi Kivity	160bbb00dd	utils: result_loop: remove invalid and incorrect constraint Checking a concept in a requires-expression requires an additional requires keyword. Moreover, the constraint is incorrect (at least all callers pass a T, not a result<T>), so remove it. Found by gcc 12.	2022-04-18 12:27:18 +03:00
Avi Kivity	e55f5fab53	service: forward_service: avoid using deprecated std::bind1st and std::not1 Switch to the newer alternatives std::bind_front and std::not_fn.	2022-04-18 12:27:18 +03:00
Avi Kivity	5da586271f	repair: explicityl ignore tombstone gc update response The response struct is empty and we have nothing to do with it. Cast it to void to avoid a gcc warning.	2022-04-18 12:27:18 +03:00
Avi Kivity	1e1c0226a6	treewide: abort() after switch in formatters It is typical in switch statements to select on an enum type and rely on the compiler to complain if an enum value was missed. But gcc isn't satisfied since the enum could have a value outside the declared list. Call abort() in this impossible situation to pacify it.	2022-04-18 12:27:18 +03:00
Avi Kivity	a1df583dea	db: view: explicitly ignore unused result Otherwise, gcc complains.	2022-04-18 12:27:18 +03:00
Avi Kivity	eb436ac940	compaction: leveled_compaction_strategy: avoid compares between signed and unsigned These can overflow. Here, there is no such risk, but switch to unsigned to avoid the warning.	2022-04-18 12:27:18 +03:00
Avi Kivity	fa7172fcad	compaction_manager: compaction_reenabler: disambiguate compaction_state compaction_state is used both as a type and a function, which gcc does not like. Disambiguate by fully qualifying the type name.	2022-04-18 12:27:18 +03:00
Avi Kivity	de6631656c	api: avoid function specialization in req_param Function specializations are not allowed (you're supposed to use overloads), but clang appears to allow them. Here, we can't use an overload since the type doesn't appear in the parameter list. Use a constraint instead.	2022-04-18 12:27:18 +03:00
Avi Kivity	40beb48176	alternator: ttl: avoid specializing class templates in non-namespace scope The C++ standard disallows class template specialization in non-namespace scopes. Clang apparently allows it as an extension. Fix by not using a template - there are just two specializations and no generic implementation. Use regular classes and std::conditional_t to choose between the two.	2022-04-18 12:27:18 +03:00
Avi Kivity	b5e8e32c01	alternator: executor: fix signed/unsigned comparison in is_big() Signed/unsigned comparisons are subject to C promotion rules. In is_big() in this case the comparison is safe, but gcc warns. Use a cast to silence the warning. The signed/unsigned mix and int/size_t size differences still look bad; it would be good to revisit this code, but that is left for another patch.	2022-04-18 12:23:18 +03:00
Piotr Sarna	fea18943cd	schema_tables: drop leftover change to system_schema.keyspaces Series `59d56a3fd7` introduced an accidental backward incompatible regression by adding a column to system_schema.keyspaces and then not even using it for anything. It's a leftover from the original hackathon implementation and should never have reached master in the first place. Fortunately, the series isn't part of any stable release yet. Fixes #10376 Tests: manual, verifying that the system_schema.keyspaces table no longer contains the extraneous column. Closes #10377	2022-04-18 12:00:43 +03:00
Avi Kivity	36aee57978	storage_proxy: convert rpc handlers from lambdas to member functions Currently, rpc handlers are all lambdas inside storage_proxy::init_messaging_service(). This means any stack trace refers to storage_proxy::init_messaging_service::lambda#n instead of a meaningful function name, and it makes init_messaging_service() very intimidating. Fix that by moving all such lambdas to regular member functions. This is easy now that they don't capture anything except `this`, which we provide during registration via std::bind_front(). A few #includes and forward declarations had to be added to storage_proxy.hh. This is unfortunate, but can only be solved by splitting storage_proxy into a client part and a server part.	2022-04-17 19:03:06 +03:00
Avi Kivity	f7e8109b16	storage_proxy: don't capture messaging_service in server callbacks We'd like to make the server callbacks member functions, rather than lambdas, so we need to eliminate their captures. This patch eliminates 'ms' by referring to the already existing member '_messaging' instead.	2022-04-17 17:55:05 +03:00
Avi Kivity	4cac2eb43e	storage_proxy: don't capture migration_manager in server callbacks We'd like to make the server callbacks member functions, rather than lambdas, so we need to eliminate their captures. This patch eliminates 'mm' by making it a member variable and capturing 'this' instead. In one case 'mm' was used by a handle_write() intermediate lambda so we have to make that non-static and capture it too. uninit_messaging_service() clears the member variable to preserve the same lifetime 'mm' had before, in case that's important.	2022-04-17 17:54:51 +03:00
Avi Kivity	86dfe75268	Update seastar submodule * seastar acf7e3523b...5e86362704 (10): > Merge "Respect taskset-configured cpumask" from Pavel E Ref #9505. > rpc_tester: Run CPU hogs on server side too > std-coroutine: include <coroutine> for LLVM-15 > Revert "Merge "tests: perf: measure coroutines performance" from Benny" > test: perf_tests: remove [[gnu::always_inline]] attribute from coroutine perf tests > Merge "tests: perf: measure coroutines performance" from Benny > Merge "Extend RPC tester" from Pavel E > rpc: Mark connection trivial getters const noexcept > seastar-addr2line: Allow use of llvm-addr2line as the command > file: append_challenged_posix_file: Serialize allocate() to not block concurrent reads or writes	2022-04-17 17:11:31 +03:00
Pavel Solodovnikov	b25c4fee01	gms: gossiper: coroutinize `apply_state_locally` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-17 11:51:18 +03:00
Pavel Solodovnikov	746f1179eb	gms: gossiper: coroutinize `apply_state_locally_without_listener_notification` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-17 11:38:33 +03:00
Pavel Solodovnikov	b7322c3f5d	gms: gossiper: coroutinize `do_apply_state_locally` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-17 11:29:26 +03:00
Pavel Solodovnikov	c48dcf607a	gms: gossiper: coroutinize `apply_new_states` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-17 11:28:42 +03:00
Kamil Braun	b1b22f2c2b	service: raft: don't support/advertise USES_RAFT feature The code would advertise the USES_RAFT feature when the SUPPORTS_RAFT feature was enabled through a listener registered on the SUPPORTS_RAFT feature. This would cause a deadlock: 1. `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)` locks the gossiper (it's called for the first time from sstables format selector). 2. The function calls `on_change` listeners. 3. One of the listeners is the one for SUPPORTS_RAFT. 4. The listener calls `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`. 5. This tries to lock the gossiper. In turn, depending on timing, this could hang the startup procedure, which calls `add_local_application_state` multiple times at various points, trying to take the lock inside gossiper. This prevents us from testing raft / group 0, new schema change procedures that use group 0, etc. For now, simply remove the code that advertises the USES_RAFT feature. Right now the feature has no other effect on the system than just becoming enabled. In fact, it's possible that we don't need this second feature at all (SUPPORTS_RAFT may be enough), but that's work-in-progress. If needed, it will be easy to bring the enabling code back (in a fixed form that doesn't cause a deadlock). We don't remove the feature definitions yet just in case. Refs: #10355	2022-04-15 16:08:25 +02:00
Avi Kivity	4a5082bfc8	main: fix discarded future during prometheus start sequence Probably not triggerable since it will be a while before we recognize a signal to exit. But a FIXME is a FIXME. Closes #10374	2022-04-15 16:40:31 +03:00
Avi Kivity	d90415434e	main: wait for memory_threshold_guard start We start the memory threshold guard (that enables large memory allocation warnings post-boot) but don't wait for it. I can't imagine it can hurt, but it does carry a FIXME label. Closes #10375	2022-04-15 16:37:47 +03:00
Botond Dénes	75786c42cb	Merge 'Add repair unit tests/v1' from Mikołaj Sielużycki This patch series splits up parts of repair pipeline to allow unit testing various bits of code without having to run full dtest suite. The reason why repair pipeline has no unit tests is that by definition repair requires multiple nodes, while unit test environment works only for a single node. However, it is possible to explicitly define interfaces between various parts of the pipeline, inject dependencies and test them individually. This patch series is focused on taking repair_rows_on_wire (frozen mutation representation of changes coming from another node) and flushing them to an sstable. The commits are split into the following parts: - pulling out classes to separate headers so that they can be included (potentially indirectly) from the test, - pulling out repair_meta::to_repair_rows_list and part of repair_meta::flush_rows_in_working_row_buf so that they can be tested, - refactoring repair_writer so that the actual writing logic can be injected as dependency, - creating the unit test. tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test) Closes #10345 * github.com:scylladb/scylla: repair: Add unit test for flushing repair_rows_on_wire to disk. repair: Extract mutation_fragment_queue and repair_writer::impl interfaces. repair: Make parts of repair_writer interface private. repair: Rename inputs to flush_rows. repair: Make repair_meta::flush_rows a free function. repair: Split flush_rows_in_working_row_buf to two functions and make one static. repair: Rename inputs to to_repair_rows_list. repair: Make to_repair_rows_list a free function. repair: Make repair_meta::to_repair_rows_list a static function repair: Fix indentation in repair_writer. repair: Move repair_writer to separate header. repair: Move repair_row to a separate header. repair: Move repair_sync_boundary to a separate header. repair: Move decorated_key_with_hash to separate header. repair: Move row_repair hashing logic to separate class and file.	2022-04-14 18:17:03 +03:00
Kamil Braun	41f5b7e69e	Merge branch 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla into next * 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla: main: allow joining raft group0 before waiting for gossiper to settle service: raft_group0: make `join_group0` re-entrant service: storage_service: add `join_group0` method raft_group_registry: update gossiper state only on shard 0 raft: don't update gossiper state if raft is enabled early or not enabled at all gms: feature_service: add `cluster_uses_raft_mgmt` accessor method db: system_keyspace: add `bootstrap_needed()` method db: system_keyspace: mark getter methods for bootstrap state as "const"	2022-04-14 16:42:20 +02:00
Botond Dénes	737cc798ca	Merge "Add flat_mutation_reader_from_mutation_v2" from Benny Halevy " Optimize consuming from a single partition. This gives us significant improvement with single, small mutations, as shown with perf_mutation_readers, compared to the vector-based flat_mutation_reader_from_mutations_v2. These are expected to be common on the write path, and can be optimized for view building. results from: perf_mutation_readers -c1 --random-seed=840478750 (userspace cpu-frequency governor, 2.2GHz) test iterations median mad min max Before: combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns After: combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns The grand plan is to follow up with make_flat_mutation_reader_from_frozen_mutation_v2 so that we can read directly from either a mutation or frozen_mutation without having to unfreeze it e.g. in table::push_view_replica_updates. Test: unit(dev) Perf: perf_mutation_readers(release) " * tag 'flat_mutation_reader_from_mutation-v3' of https://github.com/bhalevy/scylla: perf: perf_mutation_readers: add one_mutation case test: mutation_query_test: make make_source static mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2 mutation readers: add make_flat_mutation_reader_from_mutation_v2 readers: delete slice_mutation.hh test: flat_mutation_reader_test: mock_consumer: add debug logging test: flat_mutation_reader_test: mock_consumer: make depth counter signed	2022-04-14 17:23:21 +03:00
Botond Dénes	fa75d58cf0	Merge "Make snitch start/stop code look classical" from Pavel Emelyanov " There's a generic way to start-stop services in scylla, that includes 5 "actions" (some are optional and/or implicit though) service_config cfg = ... sharded<service>.start(cfg) service.invoke_on_all(&service::start) service.invoke_on_all(&service::shutdown) service.invoke_on_all(&service::stop) sharded<service>.stop() and most of the services out there conform to that scheme. Not snitch (spoiler: and not tracing), for which there's a couple of helpers that do all that magic behind the scenes, and "configuring" snitch is done with the help of overloaded constructors. The latter is extra complicated with the need to register snitch drivers in class-registry for each constructor overload. Also there's an external shards synchronization on stop. This set brings snitch start/stop code to the described standard: the create/stop helpers are removed, creation accepts the config structure, per-shard start/stop (snitch has no drain for now) happens in the simple invoke-on-all manner. The intended side effect of this change is the ability to add explicit dependencies to snitch (in the future, not in this set). tests: unit(dev) " * 'br-snitch-config' of https://github.com/xemul/scylla: snitch: Remove create_snitch/stop_snitch snitch: Simplify stop (and pause_io) snitch: Move io_is_stopped to property-file driver snitch: Remove init_snitch_obj() snitch: Move instance creation into snitch_ptr constructor snitch: Make config-based construction of all drivers snitch: Declare snitch_ptr peering and rework container() method snitch: Introduce container() method	2022-04-14 16:56:32 +03:00
Pavel Solodovnikov	d4b717afa7	main: allow joining raft group0 before waiting for gossiper to settle A node can join group0 without waiting for gossiper if it is either a fresh node, or an existing node which is already part of some group0 (i.e. has `group0_id` persisted in system tables). In that case the second `join_group0()` call inside the `storage_service::join_token_ring` will be a no-op. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-14 12:20:50 +03:00
Benny Halevy	f5ef687acd	perf: perf_mutation_readers: add one_mutation case Measure performance of the single-mutation reader: make_flat_mutation_reader_from_mutation_v2. Comparable to the `one_row` case that consumes the single mutation using the multi-mutation reader: make_flat_mutation_reader_from_mutations_v2. perf_mutation_readers shows ~20-30% improvement of make_flat_mutation_reader_from_mutation_v2 over the same single mutation given as a single-item vector to make_flat_mutation_reader_from_mutations_v2. test iterations median mad min max Before: combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns After: combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 11:39:05 +03:00
Benny Halevy	a4b69fe7b6	test: mutation_query_test: make make_source static No need for it to be public. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 11:15:19 +03:00
Benny Halevy	ddb5166b82	mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2 Extract the common parts of the single mutation reader and the vector-based variant into mutation_reader_base and reuse from both readers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 11:15:17 +03:00
Benny Halevy	e85241d5b6	mutation readers: add make_flat_mutation_reader_from_mutation_v2 Optimize reading from a single partition. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 11:14:43 +03:00
Benny Halevy	394eb1271d	readers: delete slice_mutation.hh slice_mutations() is currently used only by readers/mutation_readers.cc so there's no need to expose it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 08:41:31 +03:00
Benny Halevy	ee2c7948f3	test: flat_mutation_reader_test: mock_consumer: add debug logging Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 08:41:31 +03:00
Benny Halevy	38cdfca824	test: flat_mutation_reader_test: mock_consumer: make depth counter signed We want to return stop_iteration::yes once we crossed the initial depth threshold. With an unsigned depth counter, it might wrap around and look > 1. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-14 08:41:31 +03:00
Tomasz Grabiec	d293bd9579	Merge "Enable forwarding in raft randomized_nemesis_test" from Kamil The test will now, with probability 1/2, enable forwarding of entries by followers to leaders. This is possible thanks to the new abort_source& APIs which we use to ensure that no operations are running on servers before we destroy them. Some adjustments were required to the server abort procedure in order to prevent rare hangs (see first patch). We also translate some low-level exceptions coming from seastar primitives to high-level Raft API exceptions (second patch). * kbr/nemesis-enable-fd-v1: test: raft: randomized_nemesis_test: enable entry forwarding test: raft: randomized_nemesis_test: increase logging level on some rare operations raft: server: translate abort_requested_exception to raft::request_aborted raft: fsm: when stopping, become follower to reject new requests	2022-04-13 18:40:23 +02:00
Piotr Sarna	61057446f7	Merge 'forward_service: retry failed forwarder call' from Michał Sala This pull request adds support for retrying failed forwarder calls (currently used to parallelize `select count() from ...` queries). Failed-to-forward sub-queries will be executed locally (on a super-coordinator). This local execution is meant as a fallback for a forward_request that could not be sent to its destination coordinator (e.g. due to the gossiper not reacting fast enough). Local execution was chosen as the safest one - it does not require sending data to another coordinator. Due to problems with miscompilations, some parts of the `forward_service` were uncoroutinized. Fixes: #10131 Closes #10329 * github.com:scylladb/scylla: forward_service: uncoroutinize dispatch method forward_service: uncoroutinize retrying_dispatcher forward_service: rety a failed forwarder call forward_service: copy arguments/captured vars to local variables	2022-04-13 09:41:35 +02:00
Nadav Har'El	6cafffe281	test/cql-pytest: reproduce internal server error on null subscript The restriction "WHERE m[NULL] = 2" should result in an invalid request error, but currently results in an ugly internal server error. This test reproduces it, and since the bug is still in the code - is marked as xfail. Refs #10361 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220412134118.829671-1-nyh@scylladb.com>	2022-04-13 08:49:48 +03:00
Nadav Har'El	ae0e1574dc	test/cql-pytest: reproducer for CONTAINS NULL bug This is a reproducer for issue #10359: "CONTAINS NULL" and "CONTAINS KEY NULL" restrictions should not match any set, but currently do match non-empty or all sets. The tests currently fail on Scylla, so are marked xfail. They also fail on Cassandra because Cassandra considers such a request an error, which we consider a mistake (see #4776) - so the tests are marked "cassandra_bug". Refs #10359. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220412130914.823646-1-nyh@scylladb.com>	2022-04-13 08:49:23 +03:00
Nadav Har'El	5d87ead9f1	test/cql-pytest: add more tests comparing against NULL We already have a test showing that WHERE v=NULL ALLOW FILTERING is allowed in Scylla (unlike Cassandra), and matches nothing. Here we add two further tests that confirm that: 1. Not only is v=NULL allowed - v<NULL, v<=NULL, and so on, is also allowed and matches nothing. 2. The ALLOW FILTERING is required in in those requests. Without it, both Scylla and Cassandra generate the same "ALLOW FILTERING is required" error. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220411214503.770413-1-nyh@scylladb.com>	2022-04-13 08:48:55 +03:00
Avi Kivity	987e6533d2	transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3 we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since the client won't understand the v4 errors), but we still send the new error codes. This causes the client to become confused. Fix by updating the error codes. A better fix is to move the error code from the constructor parameter list and hard-code it in the constructor, but that is left for a follow-up after this minimal fix. Fixes #5610. Closes #10362	2022-04-12 19:19:52 +03:00
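
The downgrade described in the commit above can be sketched roughly as follows. This is an illustrative Python model, not Scylla's C++ transport code: the `ErrorResponse` type, the `_V3_DOWNGRADE` table and `downgrade_for_protocol()` are hypothetical names, and only the numeric error codes are taken from the CQL native protocol specification. The point it shows is that the code and the exception kind must be swapped together, so a v3 client never sees a v4-only code.

```python
from dataclasses import dataclass

# Error codes as defined by the CQL native protocol specification.
WRITE_TIMEOUT = 0x1100
READ_TIMEOUT = 0x1200
READ_FAILURE = 0x1300   # introduced in protocol v4
WRITE_FAILURE = 0x1500  # introduced in protocol v4

# Hypothetical table: v4-only error -> the error a v3 client understands.
_V3_DOWNGRADE = {
    READ_FAILURE: (READ_TIMEOUT, "READ_TIMEOUT"),
    WRITE_FAILURE: (WRITE_TIMEOUT, "WRITE_TIMEOUT"),
}


@dataclass(frozen=True)
class ErrorResponse:
    """Illustrative stand-in for an error frame; not a Scylla type."""
    code: int
    name: str


def downgrade_for_protocol(err: ErrorResponse, version: int) -> ErrorResponse:
    """Return an error a client speaking `version` can decode.

    For v4 and later, or for errors that already exist in v3, the error
    is returned unchanged. Otherwise both the name and the code are
    replaced, which is the part the commit message says was missing.
    """
    if version >= 4 or err.code not in _V3_DOWNGRADE:
        return err
    code, name = _V3_DOWNGRADE[err.code]
    return ErrorResponse(code, name)
```

Keeping the code inside the replacement table, rather than passing it separately, is one way to make the mismatch impossible; the commit message mentions a similar follow-up (hard-coding the code in the constructor).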
Avi Kivity	8aec146dec	Merge "Remove qctx from repair" from Pavel E " Repair code keeps its history in system keyspace and uses the qctx global thing to update and query it. This set replaces the qctx with the explicit reference on the system_keyspace object. tests: unit(dev), dtest.repair_test(dev) " * 'br-repair-vs-qctx' of https://github.com/xemul/scylla: repair, system_keyspace: Query repair_history with a helper repair: Update loader code to use system_keyspace entry repair, system_keyspace: Update repair_history with a helper repair: Keep system keyspace reference	2022-04-12 17:08:41 +03:00
Tomasz Grabiec	0c365818c3	utils/chunked_managed_vector: Fix sigsegv during reserve() Fixes the case of make_room() invoked with last_chunk_capacity_deficit but _size not in the last reserved chunk. Found during code review, no user impact. Fixes #10364. Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>	2022-04-12 16:37:11 +03:00
Tomasz Grabiec	01eeb33c6e	utils/chunked_vector: Fix sigsegv during reserve() Fixes the case of make_room() invoked with last_chunk_capacity_deficit but _size not in the last reserved chunk. Found during code review, no known user impact. Fixes #10363. Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>	2022-04-12 16:35:17 +03:00
Avi Kivity	546ee814dd	Merge 'schema_tables, sstables: return instead of throwing' from Piotr Sarna This miniseries rewrites a few unnecessary throws into forwarding the exception directly. It's partially possible thanks to the new `co_await coroutine::return_exception` mechanism which allows returning from a coroutine early, without explicitly calling co_return (`d5843f6e88`). Closes #10360 * github.com:scylladb/scylla: sstables: : remove unnecessary throws schema_tables: remove unnecessary throws	2022-04-12 15:18:14 +03:00
Piotr Sarna	bce2933d99	sstables: : remove unnecessary throws Throws are translated to passing the exceptions directly.	2022-04-12 13:09:54 +02:00
Piotr Sarna	91f130bd9c	schema_tables: remove unnecessary throws Throws are translated to passing the exception directly.	2022-04-12 13:09:27 +02:00
Pavel Emelyanov	05eb9c9416	repair, system_keyspace: Query repair_history with a helper Querying the table is now done with the help of qctx directly. This patch replaces it with a querying helper that calls the consumer function with the entry struct as the argument. After this change repair code can stop including query_context and mess with untyped_result_set. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-12 14:04:21 +03:00
Pavel Emelyanov	59f4aa0934	repair: Update loader code to use system_keyspace entry Patch the history entry loader to use the recently introduced history entry. This is just to reduce the churn in the next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-12 13:59:55 +03:00
Pavel Emelyanov	9940016e05	repair, system_keyspace: Update repair_history with a helper Current code works directly on the qctx which is not nice. Instead, make it use the system keyspace reference. To make it work, the patch adds a helper method and introduces a helper struct for the table entry. This struct will also be used to query the table (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-12 13:57:57 +03:00
Pavel Emelyanov	e501ebd6c2	repair: Keep system keyspace reference Repair updates (and queries on start) the system.repair_history table and thus depends on the system_keyspace object Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-12 13:57:08 +03:00
Avi Kivity	aa7c6dfaa9	Merge 'Commitlog: refactor file handling - simplify file management + make bookkeep safer' from Calle Wilund Adds a "named_file" wrapper type in commitlog, encapsulating file and disk size, the latter being updated automatically on write/truncate/allocate/delete operations. Use this instead of loose vars in segments, and also in recycle/delete lists. Having the data propagate with the objects means we can dispose of re-reading sizes from disk, which in turn means we know what "our" view of the file sizes is when we try to delete/recycle them -> we can bookkeep accurately (from our view point) without having to resort to the rather horrible recalculation of disk footprint. This series also drops non-recycled segment handling, since it is not used anywhere, and just makes things harder. It also adds a parameter to set flush threshold. These two first patches could be broken out into separate PR:s if need be. Closes #10084 * github.com:scylladb/scylla: commitlog: Fold named_file continuations into caller coroutine frame commitlog: Use named named_file objects in delete/dispose/recycle lists commitlog: Use named_file size tracking instead of segment var commitlog: Use named_file in segment commitlog: Add "named_file" file wrapping type commitlog: Make flush threshold a config parameter commitlog: kill non-recycled segment management	2022-04-12 11:28:36 +03:00
Raphael S. Carvalho	f05ae92849	compaction: move compaction::enable_garbage_collected_sstable_writer() into protected namespace Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220411181322.192830-2-raphaelsc@scylladb.com>	2022-04-12 11:21:18 +03:00
Raphael S. Carvalho	3741e7fb6d	compaction: LCS: kill unused bootstrapping code With off-strategy, we no longer need LCS explicitly switching to STCS mode, and even without off-strategy, the dynamic fan-in approach in compaction manager will cause LCS to automatically switch to STCS under heavy write load. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220411181322.192830-1-raphaelsc@scylladb.com>	2022-04-12 11:21:18 +03:00
Mikołaj Sielużycki	b16e12f3a1	repair: Add unit test for flushing repair_rows_on_wire to disk. The unit test executes a simplified repair scenario by: - producing a random stream of mutation_fragments, - converting them to repair_rows_on_wire, - converting them to a list of repair_rows using the conversion logic extracted in previous commits from repair_meta, - flushing the rows to an sstable using the logic extracted in previous commits from repair_meta, - comparing the sstable contents with the originally produced mutation fragments. The test checks only the flushing part and is not concerned with any other piece of the repair pipeline.	2022-04-12 09:22:10 +02:00
Mikołaj Sielużycki	39205917a8	repair: Extract mutation_fragment_queue and repair_writer::impl interfaces.	2022-04-12 09:22:03 +02:00
Mikołaj Sielużycki	a52126d861	repair: Make parts of repair_writer interface private.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	826e0e9d8a	repair: Rename inputs to flush_rows.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	4dd32064a3	repair: Make repair_meta::flush_rows a free function.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	046e8c31db	repair: Split flush_rows_in_working_row_buf to two functions and make one static. It allows pulling out the logic of writing internal representation of repair mutations to disk. This in turn is needed to unit test this functionality without spinning up clusters, which significantly improves developer iteration time.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	ca53a7fcc9	repair: Rename inputs to to_repair_rows_list.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	c7a7680c7d	repair: Make to_repair_rows_list a free function.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	69fc74ffbe	repair: Make repair_meta::to_repair_rows_list a static function It allows pulling out the logic of converting the on-the-wire representation of repair mutations to an internal representation used later for flushing repair mutations to disk. This in turn is needed to unit test the functionality without spinning up clusters, which significantly improves developer iteration time.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	4ba48e5739	repair: Fix indentation in repair_writer.	2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki	3ff738db6b	repair: Move repair_writer to separate header.	2022-04-12 09:20:03 +02:00
Mikołaj Sielużycki	04986e8c8e	repair: Move repair_row to a separate header.	2022-04-12 08:50:34 +02:00
Mikołaj Sielużycki	7b0cbdeac5	repair: Move repair_sync_boundary to a separate header.	2022-04-12 08:50:34 +02:00
Mikołaj Sielużycki	f9c75952ea	repair: Move decorated_key_with_hash to separate header.	2022-04-12 08:50:34 +02:00
Mikołaj Sielużycki	0fa703de3e	repair: Move row_repair hashing logic to separate class and file.	2022-04-12 08:50:34 +02:00
Calle Wilund	0e2a3e02ae	commitlog: Fold named_file continuations into caller coroutine frame Saves a continuation. That matters very little. But... Uses a special awaiter type on returns from the "then(...)"-wrapping named_file methods (which use a then([...update]) to keep internal size counters up-to-date), making the continuation instead a stored func in the returned awaiter, executed on successful resume of the caller co_await.	2022-04-11 16:34:00 +00:00
Calle Wilund	ed8f0df105	commitlog: Use named named_file objects in delete/dispose/recycle lists Changes delete/close queue, as well as deletion queue into one, using named_file objects + marker. Recycle list now also contains said named file type. This removes the need to re-eval file sizes on disk when deleting etc, which in turn means we can dispose of recalculate_footprint on errors, thus making things simpler and safer.	2022-04-11 16:34:00 +00:00
Calle Wilund	cdd4066006	commitlog: Use named_file size tracking instead of segment var I.e. "auto-keep-track" of disk footprint	2022-04-11 16:34:00 +00:00
Calle Wilund	320c49e8d3	commitlog: Use named_file in segment Uses named_file instead of file+string in segments. Does not do anything particularly useful with it.	2022-04-11 16:34:00 +00:00
Calle Wilund	97bf7b1fc8	commitlog: Add "named_file" file wrapping type For keeping track of file, name and size, even across close/rename/delete.	2022-04-11 16:34:00 +00:00
Calle Wilund	7dd7760e8d	commitlog: Make flush threshold a config parameter	2022-04-11 16:34:00 +00:00
Calle Wilund	d478896d46	commitlog: kill non-recycled segment management It has been default for a while now. Makes no sense to not do it. Even hints can use it (even if it makes no difference there)	2022-04-11 16:34:00 +00:00
Raphael S. Carvalho	8427ec056c	gms: gossiper: don't duplicate knowledge of minimum time for gossip to settle Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220409022435.58070-2-raphaelsc@scylladb.com>	2022-04-11 19:19:02 +03:00
cvybhu	5c199cad45	cql3: expr: possible_lhs_values: Handle subscript This commit makes subscript an invalid argument to possible_lhs_values. Previously this function simply ignored subscripts and behaved as if it was called on the subscripted column without a subscript. This behaviour is unexpected and potentially dangerous so it would be better to forbid passing subscript to possible_lhs_values entirely. Trying to handle subscript correctly is impossible without refactoring the whole function. The first argument is a column for which we would like to know the possible values. What are possible values of a subscripted column c where c[0] = 1? All lists that have 1 on 0th position? If we wanted to handle this nicely we would have to change the arguments. Such refactoring is best left until the time when this functionality is actually needed; right now it's hard to predict what interface will be needed then. Signed-off-by: cvybhu <jan.ciolek@scylladb.com> Closes #10228	2022-04-11 19:05:09 +03:00
Gleb Natapov	a3e8ae0979	storage_proxy: fix silencing of remote read errors Filtering remote rpc errors based on exception type did not work because the remote errors were reported as std::runtime_error and all rpc exceptions inherit from it. New rpc propagates remote errors using special type rpc::remote_verb_error now, so we can filter on that instead. Fixes #10339 Message-Id: <YlQYV5G6GksDytGp@scylladb.com>	2022-04-11 18:53:25 +03:00
Botond Dénes	08bcbd25e7	Merge 'toolchain: speed up prepare' from Avi Kivity This series speeds up tools/toolchain/prepare in a few ways: - builds images in parallel - allows running on any arch as host - reduces work in building the image - removes unneeded layers Closes #10348 * github.com:scylladb/scylla: tools: toolchain: prepare: sqush intermediate container layers tools: toolchain: update container image first thing tools: toolchain: prepare: build arch images in parallel tools: toolchain: prepare: aloow running on non-x86	2022-04-11 15:47:10 +03:00
Avi Kivity	fda99de15b	Update seastar submodule * seastar 05cdfc2d30...acf7e3523b (3): > http reply: avoid copying content > rpc: deliver remote verb exceptions as rpc::remote_verb_error instead of std::runtime_error > rpc: drop unneeded code	2022-04-11 15:12:43 +03:00
Pavel Emelyanov	828a951886	snitch: Remove create_snitch/stop_snitch After the previous patches, both create_snitch() and stop_snitch() now look like the classical sharded service start/stop sequence. Finally both helpers can be removed and the rest of the users can just call start/stop on locally obtained sharded references. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:43:25 +03:00
Pavel Emelyanov	20e623f16d	snitch: Simplify stop (and pause_io) Both first stop/pause snitch driver on io-ing shard, then proceed with the rest. This sequence is pretty pointless and here's why. The only non-trivial stop()/pause_io() method out there is in the property-file snitch driver. In it, both methods check if the current shard is the io-ing one, if no -- return back the resolved future, if yes -- go ahead and stop/pause some IO. With this, for all shards but io-ing one there's no point in starting after io-ing one is stopped, they all can start (and finish) in parallel. So what this patch does is just removes the pre-stop/pause kicking of the io-ing shard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:43:23 +03:00
Pavel Emelyanov	2e42578dc8	snitch: Move io_is_stopped to property-file driver This whole engine is only used by that driver, there's no point in it sitting on the base class Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:43:20 +03:00
Pavel Emelyanov	28ecdc66ad	snitch: Remove init_snitch_obj() Now it's just a wrapper around sharded<snitch_ptr>::start() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:43:16 +03:00
Pavel Emelyanov	b3eaae629e	snitch: Move instance creation into snitch_ptr constructor Current API to create snitch is not like other services -- there's a dedicated helper that does sharded<>.start() + invoke_on_all(&start) calls. These helpers complicate de-globalization of snitch and rework of services start-stop sequence, things get simpler if snitch uses the same start-stop API as all the others. The first step towards this change is moving the non-waiting parts of snitch initialization code from init_snitch_obj() into snitch_ptr constructor. A note on this change: after patch #2 the snitch_ptr<->driver linkage connects local objects with each other, not container() of any. This is important, because connecting container() would be impossible inside constructor, as the container pointer is initialized by seastar _after_ the service constructor itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:38:35 +03:00
Pavel Emelyanov	633746b87d	snitch: Make config-based construction of all drivers Currently snitch drivers register themselves in class-registry with all sorts of construction options possible. All those different constructors are in fact "config options". When later snitch will declare its dependencies (gossiper and system keyspace), it will require patching all these registrations, which is very inconvenient. This patch introduces the snitch_config struct and replaces all the snitch constructors with the snitch_driver(snitch_config cfg) one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:38:34 +03:00
Pavel Emelyanov	fa59ccb89d	snitch: Declare snitch_ptr peering and rework container() method This patch makes the snitch base class reference local snitch_ptr, not its sharded<> container and, respectively, makes the base container() method return _backreference->container() instead. The motivation of this change is, again, in the next patch, which will move snitch_ptr<->driver_object linkage into snitch_ptr constructor. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:38:32 +03:00
Pavel Emelyanov	552a08ecd0	snitch: Introduce container() method Some snitch drivers want the peering_sharded_service::container() functionality, but they can't directly use it, because the driver class is in fact the pimplification behind the sharded<snitch_ptr> service. To overcome this there's a _my_distributed pointer on the driver base class that points back to sharded<snitch_ptr> object. This patch replaces the direct _my_distributed usage with the container() method that does it and also asserts that the pointer in question is initialized (some drivers already do it, some don't). Other than making the code more peering_sharded_service-like, this patch allows changing _my_distributed into _backreference that points to this shard's snitch_ptr, see next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 14:38:27 +03:00
Botond Dénes	270aba0f51	Merge "Abort database stopping barriers on exception" by Pavel Emelyanov " The database::shutdown() and ::drain() methods are called inside the invoke_on_all()s synchronizing with each other via the cross-shard _stop_barrier. If either shard throws in between all others may get stuck waiting for the barrier to collect all arrivals. To fix it the throwing shard should wake up others, resolving the wait somehow. The fix is actually patch #4, the first and the second are the abort() method for the barrier itself. Fixes: #10304 tests: unit(dev), manual " * 'br-barrier-exception-2' of https://github.com/xemul/scylla: database: Abort barriers on exception database: Coroutinize close_tables test: Add test for cross_shard_barrier::abort() cross-shard-barrier: Add .abort() method	2022-04-11 13:48:43 +03:00
Pavel Emelyanov	f63f1c3d69	database: Abort barriers on exception The database::shutdown() and ::drain() methods are called inside the container().invoke_on_all() and synchronize with each other via the cross-shard _stop_barrier. If either shard throws in between all others may get stuck waiting for the barrier to collect all arrivals. The fix is to abort the barrier on exception thus making all the shards sitting in shutdown or drain to bail out with exceptions too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-11 13:47:02 +03:00
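
The failure mode and the fix described above can be illustrated with Python's standard `threading.Barrier`, which also offers an `abort()` method. This is only an analogy under stated assumptions: threads stand in for shards, `ShardFailure` and `run_shards()` are hypothetical names, and nothing here reflects the actual cross-shard barrier implementation in Scylla.

```python
import threading


class ShardFailure(Exception):
    """Hypothetical error raised by a shard before reaching the barrier."""


def run_shards(n_shards, failing_shard=None):
    """Run one thread per 'shard'; each must meet the others at a barrier.

    If a shard fails before arriving, it aborts the barrier so that the
    shards already waiting (and those arriving later) bail out with an
    error instead of waiting forever for the missing arrival.
    """
    barrier = threading.Barrier(n_shards)
    outcomes = [None] * n_shards

    def shard(idx):
        try:
            if idx == failing_shard:
                raise ShardFailure(f"shard {idx} failed during shutdown")
            barrier.wait()
            outcomes[idx] = "done"
        except threading.BrokenBarrierError:
            # Another shard aborted the barrier: stop waiting.
            outcomes[idx] = "aborted"
        except ShardFailure:
            # Wake up everyone else before reporting our own failure.
            barrier.abort()
            outcomes[idx] = "failed"

    threads = [threading.Thread(target=shard, args=(i,)) for i in range(n_shards)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

Without the `barrier.abort()` call, the two healthy threads in a three-shard run would block in `wait()` indefinitely, which mirrors the stuck shutdown the commit fixes.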
Piotr Sarna	6d937f26ba	Update seastar submodule * seastar 2a2a1305...05cdfc2d (5): > Revert "core: reactor: fix a typo in `smp_pollfn::poll()`" > core: reactor: fix a typo in `smp_pollfn::poll()` > coroutine/exception: make it work with co_await > perftune.py: arfs: allow toggling on/off and allow auto-detection > coroutine: introduce as_future	2022-04-11 12:18:10 +02:00
Nadav Har'El	d9ec5ed46c	test/cql-pytest: add test for blobAsInt() et al for various blob lengths Recently I added a test that verified that blobAsInt() accepts a zero- byte blob and return an "empty" integer. I was asked by one of the reviewers - what happens if we try to pass a three byte blob to blobAsInt()? Here is a new test that demonstrates that the answer is: Besides the 0-byte blob, blobAsInt() only allows a 4-byte blob. Trying 3 or 5 bytes will result in an invalid query error being returned. The test passes on both Cassandra and Scylla, confirming their behavior is the same. The test checks all fixed-sized integer types - int (4 bytes), bigint (8 bytes), smallint (2 bytes) and tinyint (1 byte). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220411093803.651881-1-nyh@scylladb.com>	2022-04-11 12:44:22 +03:00
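
The length rule this test verifies can be modelled in a few lines. The sketch below is an assumption-laden illustration in Python, not the server's implementation: `blob_as_integer()`, `InvalidRequest` and the `_FORMATS` table are hypothetical names, and returning `None` for the zero-byte blob is just a stand-in for the "empty" value.

```python
import struct

# Big-endian, fixed-size encodings for the CQL integer types named in the commit.
_FORMATS = {
    "tinyint": ">b",   # 1 byte
    "smallint": ">h",  # 2 bytes
    "int": ">i",       # 4 bytes
    "bigint": ">q",    # 8 bytes
}


class InvalidRequest(Exception):
    """Hypothetical stand-in for the invalid-query error."""


def blob_as_integer(type_name, blob):
    """Decode `blob` as the given integer type.

    Only two lengths are accepted: zero bytes (the "empty" value,
    modelled here as None) and exactly the size of the type.
    """
    fmt = _FORMATS[type_name]
    if len(blob) == 0:
        return None
    expected = struct.calcsize(fmt)
    if len(blob) != expected:
        raise InvalidRequest(
            f"{type_name} expects 0 or {expected} bytes, got {len(blob)}"
        )
    return struct.unpack(fmt, blob)[0]
```

For example, `blob_as_integer("int", bytes(4))` decodes to 0, while a three-byte or five-byte blob raises `InvalidRequest`.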
Raphael S. Carvalho	5cc46b3691	compaction: STCS: kill unused avg_size() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220408184419.100827-3-raphaelsc@scylladb.com>	2022-04-11 11:24:07 +03:00
Raphael S. Carvalho	6ab570d115	compaction: STCS: only proceed to trim bucket if interesting In practice, a bucket that needs trimming will be interesting, but this could be made clearer in the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220408184419.100827-2-raphaelsc@scylladb.com>	2022-04-11 11:24:07 +03:00
Raphael S. Carvalho	4f6003d335	compaction: STCS: simplify most_interesting_bucket() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220408184419.100827-1-raphaelsc@scylladb.com>	2022-04-11 11:24:07 +03:00
Nadav Har'El	84143c2ee5	alternator: implement Select option of Query and Scan This patch implements the previously-unimplemented Select option of the Query and Scan operators. The most interesting use case of this option is Select=COUNT which means we should only count the items, without returning their actual content. But there are actually four different Select settings: COUNT, ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES. Five previously-failing tests now pass, and their xfail mark is removed: * test_query.py::test_query_select * test_scan.py::test_scan_select * test_query_filter.py::test_query_filter_and_select_count * test_filter_expression.py::test_filter_expression_and_select_count * test_gsi.py::test_gsi_query_select_1 These tests cover many different cases of successes and errors, including combination of Select and other options. E.g., combining Select=COUNT with filtering requires us to get the parts of the items needed for the filtering function - even if we don't need to return them to the user at the end. Because we do not yet support GSI/LSI projection (issue #5036), the support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need to be in the future, but we can only finish that after #5036 is done. Fixes #5058. The most intrusive part of this patch is a change from attrs_to_get - a map of top-level attributes that a read needs to fetch - to an optional<attrs_to_get>. This change is needed because we also need to support the case that we want to read no attributes (Select=COUNT), and attrs_to_get.empty() used to mean that we want to read all attributes, not no attributes. After this patch, an unset optional<attrs_to_get> means read all attributes, a set but empty attrs_to_get means read no attributes, and a set and non-empty attrs_to_get means read those specific attributes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405113700.9768-2-nyh@scylladb.com>	2022-04-11 10:04:32 +02:00
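
The three-state meaning of optional<attrs_to_get> described above can be sketched as follows. This is a Python illustration only: `project()` is a hypothetical helper, `None` plays the role of an unset optional, and a plain `set` of top-level attribute names stands in for the real attrs_to_get map.

```python
from typing import Optional


def project(item: dict, attrs_to_get: Optional[set]) -> dict:
    """Return the part of `item` selected by `attrs_to_get`.

    None (unset)       -> read all attributes
    empty set          -> read no attributes (the Select=COUNT case)
    non-empty set      -> read only the named attributes
    """
    if attrs_to_get is None:
        return dict(item)
    return {name: value for name, value in item.items() if name in attrs_to_get}
```

With `item = {"p": 1, "a": "x", "b": "y"}`, passing `None` returns the whole item, `set()` returns `{}`, and `{"a"}` returns `{"a": "x"}`. Before the change described in the commit, the empty container had to double as "all attributes", which left no way to express "none".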
Nadav Har'El	9c1ebdceea	alternator: forbid empty AttributesToGet In DynamoDB one can retrieve only a subset of the attributes using the AttributesToGet or ProjectionExpression parameters to read requests. Neither allows an empty list of attributes - if you don't want any attributes, you should use Select=COUNT instead. Currently we correctly refuse an empty ProjectionExpression - and have a test for it: test_projection_expression.py::test_projection_expression_toplevel_syntax However, Alternator is missing the same empty-forbidding logic for AttributesToGet. An empty AttributesToGet is currently allowed, and basically says "retrieve everything", which is sort of unexpected. So this patch adds the missing logic, and the missing test (actually two tests for the same thing - one using GetItem and the other Query). Fixes #10332 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405113700.9768-1-nyh@scylladb.com>	2022-04-11 10:21:02 +03:00
Nadav Har'El	86d01542de	test/alternator: test another example of nested function calls In the existing test we noticed that list_append(if_not_exists(...)) is allowed, but list_append(list_append(...)) is not. I wasn't sure whether if_not_exists(if_not_exists(..)) will be allowed - and this test verifies that it is - it works on both Scylla and DynamoDB, and gives the same results on both. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220407122729.155648-1-nyh@scylladb.com>	2022-04-11 09:56:02 +03:00
Nadav Har'El	3456cbcfcf	test/cql-pytest: split test_null.py into test_null and test_empty We had in test_null.py a mixture of tests for null values and the "null" CQL keyword - and tests for empty values. Null and empty values are not the same thing, and there is no reason to keep the tests for the two things in the same file and further confuse these two distinct concepts. This patch just moves code from test_null.py into a new test_empty.py - there are no functional changes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220407090348.137583-2-nyh@scylladb.com>	2022-04-11 09:54:54 +03:00
Nadav Har'El	cf79d84efa	test/cql-pytest: add regression test for "empty" integer In https://github.com/scylladb/scylla-rust-driver/issues/278 we noted that beyond the concept of a null integer value (which has size -1), there is also an empty integer value (size 0). This patch adds a test that it works as expected. And we see that it does - Scylla stores such a value fine, and the Python driver retrieves it the same as a null (arguably, this is fine - the important point is to see that we don't get a crash or an error). The test passes - I just added it as a regression test for the future. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220407090348.137583-1-nyh@scylladb.com>	2022-04-11 09:54:53 +03:00
Avi Kivity	65720bcfd1	tools: toolchain: prepare: sqush intermediate container layers Without this, the image contains awkward container layers with one file (from the ADD commands). It's not a disaster, just pointless.	2022-04-10 19:00:36 +03:00
Avi Kivity	4bc5f1ba98	tools: toolchain: update container image first thing Otherwise, rpm dependency resolution starts by installing an older version of gcc (to satisfy an older preinstalled libgcc dependency), then updates it. After the change, we install the updated gcc in the first place.	2022-04-10 18:48:07 +03:00
Avi Kivity	69af7a830b	tools: toolchain: prepare: build arch images in parallel To speed up the build, run each arch in parallel, using bash's awkward job control.	2022-04-10 18:45:08 +03:00
Avi Kivity	39ccd744de	tools: toolchain: prepare: aloow running on non-x86 `prepare` builds a multiarch image using qemu emulation. It turns out that aarch64 emulation is slowest (due to emulating pointer authentication) so it makes sense to run it on an aarch64 host. To do that, we need only to adjust the check for qemu installation. Unfortunately, docker arch names and Linux arch names are different, so we have to add an ungainly translation, but otherwise it is a simple loop.	2022-04-10 18:17:00 +03:00
Avi Kivity	59d56a3fd7	Merge 'Add keyspace storage options' from Piotr Sarna This series is part of the shared storage project. The STORAGE option is designed to hold a map of options used for customizing storage for given keyspace. The option is kept in a system_schema.scylla_keyspaces table. This option is guarded with a schema feature, because it's kept in a new schema table: `system_schema.scylla_keyspaces`. Example of the contents of the new table: ```cql cassandra@cqlsh> select * from system_schema.scylla_keyspaces; keyspace_name \| storage_options \| storage_type ---------------+------------------------------------------------+-------------- ksx \| {'bucket': '/tmp/xx', 'endpoint': 'localhost'} \| S3 ``` Native storage options are not kept in the table, as this format doesn't hold any extra options and it would therefore just be a waste of storage. Closes #10144 * github.com:scylladb/scylla: test: regenerate schema_change_test for storage options case test: improve output of schema_change_test regeneration docs: add a paragraph on keyspace storage options test: add test cases for keyspace storage options database,cql3: add STORAGE option to keyspaces db: add keyspace-storage-options experimental feature db,schema_tables: add scylla_keyspaces table db,gms: add SCYLLA_KEYSPACE schema feature db,gms: add KEYSPACE_STORAGE_OPTIONS feature	2022-04-10 17:23:56 +03:00
Avi Kivity	379892142d	Merge 'Coroutinize view_update_builder::build_some' from Benny Halevy Simplify view_update_builder::build_some by turning it into a coroutine, and make view_updates::move_to async (also using a coroutine) so it may yield in-between building the updates, since freezing each mutation can be cpu intensive and preparing many updates synchronously may cause reactor stalls. Test: unit(dev) DTest: materialized_views_test.py(dev) Closes #10344 * github.com:scylladb/scylla: db: view_updates: coroutinize move_to db: view_update_builder: build_some: maybe yield between updates db: view_update_builder: build_some: fixup indentation db: view_update_builder: coroutinize build_some	2022-04-10 16:13:58 +03:00
Raphael S. Carvalho	7b1589cb3d	tests: chunked_managed_vector_test: Test correctness when crossing chunk boundary While reviewing "utils/chunked_managed_vector: Fix corruption in case there is more than one chunk", I was worried that there could be a correctness issue when pop_back() pops off the first element of the last chunk, but it turns out I made an off-by-one error in my theory. Anyway, I wrote a unit test to verify my assumption and I found it worth submitting upstream. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220408133555.12397-2-raphaelsc@scylladb.com>	2022-04-08 16:44:16 +02:00
Raphael S. Carvalho	2c11673246	utils/chunked_managed_vector: expose max_chunk_capacity() That's useful for tests which want to verify correctness when the vector is performing operations across the chunk boundary. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220408133555.12397-1-raphaelsc@scylladb.com>	2022-04-08 16:44:00 +02:00
Benny Halevy	6454c8d67f	db: view_updates: coroutinize move_to And allow yielding in-between freezing each update mutation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-08 11:29:25 +03:00
Benny Halevy	0e570d6ffa	db: view_update_builder: build_some: maybe yield between updates `update.move_to` freezes the mutation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-08 11:22:41 +03:00
Benny Halevy	243ba2e976	db: view_update_builder: build_some: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-08 11:21:42 +03:00
Benny Halevy	3e376155ef	db: view_update_builder: coroutinize build_some Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-08 11:20:35 +03:00
Piotr Sarna	151d8f7c58	test: regenerate schema_change_test for storage options case Keyspace storage options series adds a new schema table: system_schema.scylla_keyspaces. The regenerated cases ensure that this new table is taken into account when the schema feature is available.	2022-04-08 09:17:01 +02:00
Piotr Sarna	4705a5fa42	test: improve output of schema_change_test regeneration Schema change test operates on pre-generated sstables, and sometimes this set of sstables needs to be regenerated. In order to make the regeneration process more ergonomic, the output is now directly copyable as valid C++ representation of UUIDs.	2022-04-08 09:17:01 +02:00
Piotr Sarna	20de52d96c	docs: add a paragraph on keyspace storage options A new CQL extension: allowing to specify keyspace storage options, is now described in our design notes.	2022-04-08 09:17:01 +02:00
Piotr Sarna	97c9729487	test: add test cases for keyspace storage options The test cases check if it's possible to set and/or alter storage options for keyspaces with CQL, and whether the changes are reflected in the schema tables.	2022-04-08 09:17:01 +02:00
Piotr Sarna	58529591a9	database,cql3: add STORAGE option to keyspaces The STORAGE option is designed to hold a map of options used for customizing storage for given keyspace. The option is kept in a system_schema.scylla_keyspaces table. The option is only available if the whole cluster is aware of it - guarded by a cluster feature. Example of the table contents: ``` cassandra@cqlsh> select * from system_schema.scylla_keyspaces; keyspace_name \| storage_options \| storage_type ---------------+------------------------------------------------+-------------- ksx \| {'bucket': '/tmp/xx', 'endpoint': 'localhost'} \| S3 ```	2022-04-08 09:17:01 +02:00
Piotr Sarna	3272b4826f	db: add keyspace-storage-options experimental feature Specifying non-standard keyspace options is experimental, so it's going to be protected by a configuration flag.	2022-04-08 09:17:01 +02:00
Piotr Sarna	7f02b188b7	db,schema_tables: add scylla_keyspaces table The table holds scylla-specific information on keyspaces. The first columns include storage_type and storage_options, which will be used later to store storage information.	2022-04-08 09:17:00 +02:00
Piotr Sarna	120980ac8e	db,gms: add SCYLLA_KEYSPACE schema feature This schema feature will be used to guard the upcoming system_schema.scylla_keyspaces schema table.	2022-04-08 09:17:00 +02:00
Piotr Sarna	567c0d0368	db,gms: add KEYSPACE_STORAGE_OPTIONS feature The feature represents the ability to store storage options in keyspace metadata: represented as a map of options, e.g. storage type, bucket, authentication details, etc.	2022-04-08 09:17:00 +02:00
Tomasz Grabiec	41fe01ecff	utils/chunked_managed_vector: Fix corruption in case there is more than one chunk If reserve() allocates more than one chunk, push_back() should not work with the last chunk. This can result in items being pushed to the wrong chunk, breaking internal invariants. Also, pop_back() should not work with the last chunk. This breaks when there is more than one chunk. Currently, the container is only used in the sstable partition index cache. Manifests by crashes in sstable reader which touch sstables which have partition index pages with more than 1638 partition entries. Introduced in `78e5b9fd85` (4.6.0) Fixes #10290 Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>	2022-04-07 21:26:35 +03:00
Benny Halevy	40ad057b6c	database: delete db_apply_executor forward declaration The class is long gone, since version 3.0. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220407094632.2647967-1-bhalevy@scylladb.com>	2022-04-07 17:11:38 +03:00
Pavel Solodovnikov	293c5f39ee	service: raft_group0: make `join_group0` re-entrant Detect if we have already finished joining group0 before and do nothing in that case. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-07 12:36:40 +03:00
Pavel Solodovnikov	057a12e213	service: storage_service: add `join_group0` method Just delegates work to `service::raft_group0::join_group0()` so that it can be used in `main` to activate raft group0 early in some cases (before waiting for gossiper to settle). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-07 12:36:33 +03:00
Pavel Solodovnikov	0d5e2157e1	raft_group_registry: update gossiper state only on shard 0 Since `gossiper::add_local_application_state` is not safe to call concurrently from multiple shards (which will cause a deadlock inside the method), call this only on shard 0 in `_raft_support_listener`. This fixes sporadic hangs when starting a fresh node in an empty cluster where node hangs during startup. Tests: unit(dev), manual Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-07 12:33:40 +03:00
Pavel Solodovnikov	7903d2afa8	raft: don't update gossiper state if raft is enabled early or not enabled at all There is a listener in the `raft_group_registry`, which makes the gossiper re-publish the supported features app state to the cluster. We don't need to do this in case the `USES_RAFT_CLUSTER_MANAGEMENT` feature is enabled before the usual time, i.e. before the gossiper settles. So, short-circuit the listener logic in that case and do nothing. Also, don't do anything if the raft group registry is not enabled at all; this is just a generic safeguard. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-07 12:31:29 +03:00
Pavel Solodovnikov	ccb59ba6c7	gms: feature_service: add `cluster_uses_raft_mgmt` accessor method Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-04-07 12:30:21 +03:00
Wojciech Mitros	97408078a1	dependencies: add rust The main reason for adding rust dependency to scylla is the wasmtime library, which is written in rust. Although there exist c++ bindings, they don't expose all of its features, so we want to do that ourselves using rust's cxx. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> [avi: update toolchain] [avi: remove example, saving for a follow-on]	2022-04-07 12:26:05 +03:00
Botond Dénes	ad075b27a4	test/lib/mutation_diff: s/colordiff/diff/ Colordiff is problematic when writing the diff into a file for later examination. Use regular diff instead. One can still get syntax highlighting by writing the output into `.diff` file (which most editors will recognize). Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220407080944.324108-1-bdenes@scylladb.com>	2022-04-07 12:07:24 +03:00
Michael Livshin	da7c7fd3dc	delete code of the unused normalizing_reader class Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220406161107.2376568-3-michael.livshin@scylladb.com>	2022-04-07 09:29:41 +03:00
Michael Livshin	d8598d048a	enormous_table_reader: inherit from flat_mutation_reader_v2::impl (completely mechanical change) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220406161107.2376568-2-michael.livshin@scylladb.com>	2022-04-07 09:29:41 +03:00
Michael Livshin	702ad7447a	enormous_table_reader: remove the duplicate _schema field flat_mutation_reader{,_v2}::impl already contains one, which makes for very exciting debugging experience (and no, clang does not mind at all). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220406161107.2376568-1-michael.livshin@scylladb.com>	2022-04-07 09:29:41 +03:00
Pavel Emelyanov	9066224cf4	table: Don't export compaction manager reference There's a public call on replica::table to get back the compaction manager reference. It's not needed, actually. The users of the call are distributed loader which already has database at hand, and a test that creates its own instance of compaction manager for its testing tables and thus also has it available. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220406171351.3050-1-xemul@scylladb.com>	2022-04-07 09:27:45 +03:00
Pavel Emelyanov	2cab2a32b8	database: Coroutinize close_tables To make next patch a bit simpler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-06 18:43:32 +03:00
Pavel Emelyanov	401c0edea2	test: Add test for cross_shard_barrier::abort() The test runs a loop of arrivals each of which can randomly throw before arriving. As a result the test expects all shards to resolve into exception in the same phase. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-06 18:21:59 +03:00
Pavel Emelyanov	8d7a7cbe21	cross-shard-barrier: Add .abort() method The method makes all the .arrive_and_wait()s in the current phase resolve with a barrier_aborted_exception() exceptional future. The barrier turns into a broken state and is not supposed to serve any subsequent arrivals in any reasonable way. The .abort() method is re-entrant in two senses. The first is that more than one shard can abort a barrier, which is pretty natural. The second is that the exception-safety fuses like that imply that if the arrive_and_wait() resolves into exception the caller will try to abort() the barrier as well, even though the phase would be over. This case is also "supported". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-06 18:21:59 +03:00
Botond Dénes	18be2e9faf	Merge "Remove gossiper->snitch kicking" from Pavel Emelyanov " Gossiper calls snitch->gossiper_starting() when being enabled. This generates a dependency loop -- snitch needs gossiper to gossip its states and get DC/RACK, gossiper needs snitch to do this kick. This set removes this notification. The new approach is to kick the snitch to gossip its states in the same places where gossiper is enabled() so that only the snitch->gossiper dependency remains. As a side effect the set ditches a bunch of references to global snitch instance. tests: unit(dev) " * 'br-snitch-gossiper-starting' of https://github.com/xemul/scylla: snitch: Remove gossiper_starting() snitch: Remove gossip_snitch_info() property-file snitch: Re-gossip states with the help of .get_app_states() property-file snitch: Reload state in .start() ec2 multi-region snitch: Register helper in .start() snitch, storage service: Gossip snitch info once snitch: Introduce get_app_states() method property-file snitch: Use _my_distributed to re-shard storage service: Shuffle snitch name gossiping	2022-04-06 17:41:36 +03:00
Piotr Sarna	2683b54402	Merge 'CQL3: Optional FINALFUNC and INITCOND for UDA' from Michał Jadwiszczak Makes final function and initial condition to be optional while creating UDA. No final function means UDA returns final state and default initial condition is `null`. Both items were optional in cql's grammar but they were treated as required in code. Additionally I've added check if state function returns state. Fixes #10324 Closes #10331 * github.com:scylladb/scylla: CQL3: check sfunc return type in UDA cql-pytest: UDA no final_func/initcond tests cql3: allow no final_func and no initcond in UDA	2022-04-06 16:04:47 +02:00
Michael Livshin	a90e02c302	skeleton_reader: inherit from flat_mutation_reader_v2::impl (completely mechanical change) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220406122912.2248111-1-michael.livshin@scylladb.com>	2022-04-06 16:55:54 +03:00
Michael Livshin	6001a0fef1	multi_partition_reader: inherit from flat_mutation_reader_v2::impl (completely mechanical change) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220406122122.2246058-1-michael.livshin@scylladb.com>	2022-04-06 16:55:07 +03:00
Michał Sala	28970389bc	forward_service: uncoroutinize dispatch method Done to mitigate potential miscompilations.	2022-04-06 15:01:31 +02:00
Michał Sala	edc32a7118	forward_service: uncoroutinize retrying_dispatcher Done to mitigate potential miscompilations.	2022-04-06 14:52:59 +02:00
Michał Sala	59ff51c824	forward_service: retry a failed forwarder call Failed-to-forward sub-queries will be executed locally (on a super-coordinator). This local execution is meant as a fallback for forward_requests that could not be sent to their destined coordinator (e.g. due to the gossiper not reacting fast enough). Local execution was chosen as the safest one - it does not require sending data to another coordinator.	2022-04-06 14:44:55 +02:00
Benny Halevy	17358ac2a0	cmake: CMakeLists.txt: rename flat_mutation_reader.cc to readers/mutation_readers.cc It was moved in 31d84a254c00b36dc2576e06ee288e28a13238195. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220406110512.3731011-3-bhalevy@scylladb.com>	2022-04-06 14:10:34 +03:00
Benny Halevy	4b3d0643a8	cmake: CMakeLists.txt: remove conncetion_notifier.cc It was removed in `3aa05f7f03`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220406110512.3731011-2-bhalevy@scylladb.com>	2022-04-06 14:10:33 +03:00
Benny Halevy	8d95e12ecd	cmake: CMakeLists.txt: update source paths Those were moved to subdirectories. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220406110512.3731011-1-bhalevy@scylladb.com>	2022-04-06 14:10:32 +03:00
Avi Kivity	82733aeadb	Merge 'Perf: Add extended template version of timed_perf + use in CL perf' from Calle Wilund Adds sub-template for time_parallel with templated result type + optional per-iteration post-process func. Idea is that Res may be a subtype of perf_result, with additional stats, initiated on init, and post-process function can fix up and apply stats -> we can add stats to result. Then uses this mighty construct to add some IO stats to CL perf. Closes #10334 * github.com:scylladb/scylla: perf_commitlog: Add bytes + bytes written stats perf: Add aio_writes mixin for perf_results test/perf/perf.hh: Make templated version of test routine to allow extended stats	2022-04-06 12:52:53 +03:00
Nadav Har'El	0f3cd6ad18	test/cql-pytest: fix fails_without_raft tests on Cassandra We had a Python typo ("false" instead of "False") which prevented tests with the fails_without_raft marker from running on Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405170337.36321-1-nyh@scylladb.com>	2022-04-06 11:20:25 +03:00
Jadw1	b560286ffe	CQL3: check sfunc return type in UDA The return type of the state function is now checked while creating a UDA. Appropriate test added to cql-pytest.	2022-04-06 09:25:17 +02:00
Jadw1	977d6ac8b0	cql-pytest: UDA no final_func/initcond tests Cql-pytests to check if UDA works properly without final function or initial condition.	2022-04-06 09:25:12 +02:00
Jadw1	c921efd1b3	cql3: allow no final_func and no initcond in UDA Makes the final function and initial condition optional when creating a UDA. No final function means the UDA returns the final state, and the default initial condition is `null`. Fixes: #10324	2022-04-06 09:08:50 +02:00
Kamil Braun	424411ee5f	test: raft: randomized_nemesis_test: enable entry forwarding The test will now, with probability 1/2, enable forwarding of entries by followers to leaders. This is possible thanks to the new abort_source& APIs which we use to ensure that no operations are running on servers before we destroy them.	2022-04-05 19:29:26 +02:00
Nadav Har'El	cfe04e6437	test/cql-pytest: nicer error message if a test can't find nodetool When testing Scylla, cql-pytest does not need an external nodetool command - it uses the REST API instead because it is much faster and there is no need to install anything. However, if cql-pytest is run against Cassandra, the tests do want to use the "nodetool" utility and want to know what it is. The tests use either the NODETOOL environment variable, or if that doesn't exist, look for "nodetool" in the path. If nodetool wasn't found in that way, before this patch, we got an ugly error message with long irrelevant Python backtraces. It wasn't easy to understand that what actually happened was that the user forgot to set the NODETOOL environment variable. This patch cleans up this error handling. Now, if nodetool cannot be found, every test that tries to run nodetool will report just a one- line error message, clearly explaining what went wrong and how to fix it: Error: Can't find nodetool. Please set the NODETOOL environment variable to the path of the nodetool utility. To reiterate, when testing Scylla, nodetool is not needed even after this patch. These errors will not happen even if you don't have the nodetool utility. You only need nodetool if you plan to test Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405171835.43992-1-nyh@scylladb.com>	2022-04-05 20:29:02 +03:00
Kamil Braun	f31c61c7c9	test: raft: randomized_nemesis_test: increase logging level on some rare operations Increase the logging level on the few operations which happen at the end of the test but make debugging a bit easier if the test hangs for some reason.	2022-04-05 19:19:59 +02:00
Kamil Braun	ad3141d3e0	raft: server: translate abort_requested_exception to raft::request_aborted The `wait_for_leader` function would throw a low-level `abort_requested_exception` exception from seastar::shared_promise. Translate it to the high-level raft::request_aborted so we can reduce the number of different exception types which cross the Raft API boundary. Also, add comments on Raft API functions about the exception thrown when requests are aborted.	2022-04-05 19:18:53 +02:00
Kamil Braun	7da586b912	raft: fsm: when stopping, become follower to reject new requests After enabling add_entry forwarding in randomized_nemesis_test, the test would sometimes hang on _rpc->abort() call due to add_entry messages from followers which waited on log_limiter_semaphore on the leader preventing _rpc from finishing the abort; the log_limiter_semaphore would not get unblocked because part of the server was already stopped. Prevent log_limiter_semaphore from being waited on when stopping the server by becoming a follower in fsm::stop.	2022-04-05 19:11:44 +02:00
Calle Wilund	af28fb6d94	perf_commitlog: Add bytes + bytes written stats Used extended perf_result used with aio_writes + aio_write_bytes to include some IO stats for the benchmark.	2022-04-05 13:43:57 +00:00
Calle Wilund	5b60a6cf7c	perf: Add aio_writes mixin for perf_results Can be used with time_parallel_ex. Adds measurements for aio writes/aio written bytes.	2022-04-05 13:42:36 +00:00
Calle Wilund	12ab34a3d9	test/perf/perf.hh: Make templated version of test routine to allow extended stats Adds sub-template for time_parallel with templated result type + optional per-iteration post-process func. Idea is that Res may be a subtype of perf_result, with additional stats, initiated on init, and post-process function can fix up and apply stats -> we can add stats to result.	2022-04-05 13:30:42 +00:00
Avi Kivity	0d5fd526a5	Merge "tools/scylla-sstable alternative schema load method for system tables" from Botond " Examining sstables of system tables is quite a common task. Having to dump the schemas of such tables into a schema.cql is annoying knowing that these schemas are readily available in scylla, as they are hardcoded. This mini-series adds a method to make use of this fact, by adding a new option: `--system-schema`, which takes the name of a system table and looks up its schema. Tests: unit(dev) " * 'scylla-sstable-system-schema/v1' of https://github.com/denesb/scylla: tools/scylla-sstable: add alternative schema load method for system tables tools/schema_loader: add load_system_schema() db/system_distributed_keyspace: add all tables methods tools/scylla-sstable: reorganize main help text	2022-04-05 15:48:29 +03:00
Avi Kivity	6cfc1d6f6a	Update seastar submodule * seastar 798ec50701...2a2a13058e (2): > condition_variable: Add "has_waiters()" accessor + test > Merge "RPC tester" from Pavel E	2022-04-05 13:47:51 +03:00
Gleb Natapov	7bf557332f	storage_service: remove maybe from maybe_start_sys_dist_ks There is nothing "maybe" about it now. Message-Id: <Ykv/bj8MvKh0UU23@scylladb.com>	2022-04-05 12:49:56 +03:00
Benny Halevy	abbf5de68c	frozen_mutation: introduce consume method Allowing to consume the frozen_mutation directly to a stream rather than unfreezing it first and then consuming the unfrozen mutation. Streaming directly from the frozen_mutation saves both cpu and memory, and will make it easier to be made async as a follow-up, to allow yielding, e.g. between rows. This is used today only in to_data_query_result which is invoked on the read-repair path. Refs #10038 Fixes #10021 Test: unit(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>	2022-04-05 10:51:21 +03:00
Nadav Har'El	67e0590bbc	alternator: remove old TODO (with test verifying it) We had an old TODO in the Alternator "Scan" operation code which suggested that we may need to do something to limit the size of pages when a row limit ("Limit") isn't given. But we do already have a built-in limit on page sizes (1 MB), so this TODO isn't needed and can be removed. But I also wanted to make sure we have a test that this limit works: We already had a test that this 1 MB limit works for a single-partition Query (test_query.py::test_query_reverse_longish - tested both forward and reversed queries). In this patch I add a similar test for a whole-table Scan. It turns out that although page size is limited in this case as well, it's not exactly 1 MB... For small tables it can even reach 3 MB. I consider this "good enough" and that we can drop the TODO, but also opened issue #10327 to document this surprising (for me) finding. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220404145240.354198-1-nyh@scylladb.com>	2022-04-05 09:23:23 +03:00
Nadav Har'El	56936d3c16	test/alternator: add reproducers for scan of long string of tombstones This patch adds two xfailing tests for issue #7933. That issue is about what Scan or Query paging does when encountering a very long string of consecutive tombstones (partition or row tombstones). Ideally, in that case the scan could stop on one of these tombstones after already processing too many. But as these two tests demonstrate, the scan can't stop in the middle of a long string of tombstones - and as a result retrieving a single page can take an unbounded amount of time, which is wrong. Currently the tests are marked `@veryslow` (they each take more than a minute) because they each create a huge number of tombstones to demonstrate a huge amount of work for a single page. When we fix issue #7933 and have a much smaller limit on the number of tombstones processed in a single page, we can hopefully make these tests much shorter and remove the `@veryslow` tag. The `@veryslow` tag means that although these tests can be used manually (with `--runveryslow`) they will not yet be run as part of the usual regression tests. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220403070706.250147-1-nyh@scylladb.com>	2022-04-05 09:11:38 +03:00
Raphael S. Carvalho	840500fc4d	compaction: Make cleanup for Leveled strategy bucket-aware Bucket awareness in cleanup was introduced in `a69d98c3d0`. STCS and TWCS already support it, and now LCS will receive it. The goal of bucket awareness is to reduce writeamp in cleanup, therefore reducing operation time. Additionally, garbage collection becomes more efficient as shadowed data can now be potentially compacted with the data that shadows it, assuming they're on the same level. The implementation for LCS is simple. Will reuse the procedure for STCS for returning jobs in level 0. And one job will be returned for each non-empty level > 0. What allows us to do it is our incremental selection approach used in compaction, that sets a limit on memory usage and disk space requirement. Fixes #10097. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220331173417.211257-1-raphaelsc@scylladb.com>	2022-04-05 09:10:21 +03:00
Benny Halevy	2d80057617	range_tombstone_list: insert_from: correct rev.update range_tombstone in not overlapping case 2nd std::move(start) looks like a typo in `fe2fa3f20d`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220404124741.1775076-1-bhalevy@scylladb.com>	2022-04-04 22:26:29 +02:00
Tomasz Grabiec	0a3aba36e6	Merge 'range_tombstone_change_generator: flush: emit closing range_tombstone_change' from Benny Halevy When the highest tombstone is open ended, we must emit a closing range_tombstone_change at position_in_partition::after_all_clustered_rows(). Since all consumers need to do it, implement the logic in the range_tombstone_change_generator itself. It turned out that mutation::consume doesn't do that, hence this series, and 5a09e5234ef4e1ee673bc7fca481defbbb2c0384 in particular, fix the issue. Change 028b2a8cdfdc12721b2be23d175cbc756d2507de exposes the issue by generating a richer set of random range_tombstone that include open-ended range tombstones. Fixes #10316 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10317 * github.com:scylladb/scylla: test: random_mutation_generator: make more interesting range tombstones reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows range_tombstone_change_generator: flush: use tri_compare rather than less range_tombstone_change_generator: flush: return early if empty	2022-04-04 19:07:45 +02:00
Michał Sala	e170961b4d	forward_service: copy arguments/captured vars to local variables Copying captured variables into local variables (that live in a coroutine's frame) is a mitigation of suspected lifetime issues. Arguments of forward_service::dispatch are also copied (to prevent potential undefined behavior or miscompilation triggered by referencing the arguments in a capture list of a lambda that produces a coroutine).	2022-04-04 16:58:08 +02:00
Benny Halevy	b3e2bbe5bd	test: random_mutation_generator: make more interesting range tombstones Include also singular prefix and semi-bounded range tombstones. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-04 17:34:49 +03:00
Piotr Grabowski	63fa5ac915	generic_server.hh: add missing include Add missing include of "<list>" which caused compile errors on GCC: In file included from generic_server.cc:9: generic_server.hh:91:10: error: ‘list’ in namespace ‘std’ does not name a template type 91 \| std::list<gentle_iterator> _gentle_iterators; \| ^~~~ generic_server.hh:19:1: note: ‘std::list’ is defined in header ‘<list>’; did you forget to ‘#include <list>’? 18 \| #include <seastar/net/tls.hh> +++ \|+#include <list> 19 \| Note that there are some GCC compilation problems still left apart from this one. Closes #10328	2022-04-04 17:31:55 +03:00
Lukasz Sojka	5727f196e3	Add big batch logs tests Tests for warning and error lines in logfile when user executes big batch (above preconfigured thresholds in scylla.yaml). Signed-off-by: Lukasz Sojka <lukasz.sojka@scylladb.com> Closes #10232	2022-04-04 17:25:13 +03:00
Takuya ASADA	f95a531407	docker: run scylla as root Previous versions of the Docker image ran scylla as root, but `cb19048` accidentally changed it to the scylla user. To keep compatibility we need to revert this to root. Fixes #10261 Closes #10325	2022-04-04 17:25:13 +03:00
Pavel Emelyanov	9fdb49c86a	Merge 'fix hang on shutdown while ddl query is running and there is no quorum' from Gleb A node that runs DDL query while its cluster does not have a quorum cannot be shutdown since the query is not abortable. The series makes it abortable and also fixes the order in which components are shutdown to avoid the deadlock. * gleb/raft_shutdown_v4 of git@github.com:scylladb/scylla-dev.git: migration_manager: drain migration manager before stopping protocol servers on shutdown migration_manager: pass abort source to raft primitives storage_proxy: relax some read error reporting	2022-04-04 17:25:13 +03:00
Benny Halevy	002be743f6	reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change When flushing range tombstones up to position_in_partition::after_all_clustered_rows(), the range_tombstone_change_generator now emits the closing range_tombstone_change, so there's no need for the upgrading_consumer to do so too. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-04 17:00:53 +03:00
Benny Halevy	cd171f309c	range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows When the highest tombstone is open ended, we must emit a closing range_tombstone_change at position_in_partition::after_all_clustered_rows(). Since all consumers need to do it, implement the logic in the range_tombstone_change_generator itself. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-04 17:00:53 +03:00
Benny Halevy	2c5a6b3894	range_tombstone_change_generator: flush: use tri_compare rather than less less is already using tri_compare internally, and we'll use tri_compare for equality in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-04 17:00:53 +03:00
Benny Halevy	18a80a98b8	range_tombstone_change_generator: flush: return early if empty Optimize the common, empty case. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-04-04 17:00:53 +03:00
Takuya ASADA	41edc045d9	docker: revert scylla-server.conf service name change We changed supervisor service name at `cb19048`, but this breaks compatibility with scylla-operator. To fix the issue we need to revert the service name to previous one. Fixes #10269 Closes #10323	2022-04-03 19:18:18 +03:00
Avi Kivity	538fabc05d	Update tools/java submodule * tools/java ac5d6c840d...9bc83b7a32 (1): > fix metadata printing for ME sstables	2022-04-03 18:33:04 +03:00
Alexey Kartashov	d86c3a8061	dist/docker: fix incorrect locale value Docker build script contains an incorrect locale specification for LC_ALL setting, this commit fixes that. Fixes #10310 Closes #10321	2022-04-03 14:24:54 +03:00
David Garcia	934beb6e20	docs: update theme 1.2.1 Related issue scylladb/sphinx-scylladb-theme#395 ScyllaDB Sphinx Theme 1.2 is now released. We’ve added automatic checks for broken links and introduced numerous UI updates. You can read more about all notable changes here. Closes #10313	2022-04-03 13:45:07 +03:00
Nadav Har'El	0a67c87438	Update seastar submodule * seastar 44389842...798ec507 (4): > CONTRIBUTING: update to note that pull requests are accepted (#1036) > semaphore: improve documentation of timeout and abort errors > condition_variable: fix cv.signal with active "when" wait would switch fiber > abortable_fifo: stop dereferencing null pointers Fixes #10319 with "abortable_fifo: stop dereferencing null pointers".	2022-04-03 13:41:41 +03:00
Benny Halevy	5ca73019dd	shard_reader_v2: do_fill_buffer: reserve buffer space ahead To prevent unneeded reallocations, just reserve the pre-known number of entries before pushing them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220402130847.625085-2-bhalevy@scylladb.com>	2022-04-03 11:28:32 +03:00
Benny Halevy	8ab57aa4ab	shard_reader_v2: do_fill_buffer: maybe yield when copying result Prevent a reactor stall with e.g. large number of range tombstones. Fixes #10314 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220402130847.625085-1-bhalevy@scylladb.com>	2022-04-03 11:05:14 +03:00
Tomasz Grabiec	9c96a37143	Merge "raft: nemesis test: use abort_source for time-outs" from Kamil When a Raft API call such as `add_entry`, `set_configuration` or `modify_config` takes too long, we need to time-out. There was no way to abort these calls previously so we would do that by discarding the futures. Recently the APIs were extended with `abort_source` parameters. Use this. Also improve debuggability if the functions throw an exception type that we don't expect. Previously if they did, a cryptic assert would fail somewhere deep in the generator code, making the problem hard to debug. Also collect some statistics in the test about the number of successful and failed ops. I used it to manually check whether there was a difference in how often operations fail when using the old timeout method and the new timeout method (there doesn't seem to be any). * kbr/nemesis-abort-source: test: raft: randomized_nemesis_test: on timeout, abort calls instead of discarding them raft: server: translate semaphore_aborted to request_aborted test: raft: logical_timer: add abortable version of `sleep_until` test: raft: randomized_nemesis_test: collect statistics on successful and failed ops	2022-04-01 16:25:23 +02:00
Pavel Emelyanov	886a275192	Merge 'replica/table: remove v1 reader factory methods' from Botond Only users are internal and tests. Tests: unit(dev) * replica-table-remove-make-reader-v1/v2 of github.com/denesb/scylla.git replica/table: remove v1 reader factory methods tests: move away from table::make_reader() replica/table: add short make_reader_v2() variant:	2022-04-01 13:57:10 +03:00
Botond Dénes	9338affb8e	replica/table: remove v1 reader factory methods	2022-04-01 13:52:08 +03:00
Botond Dénes	c8ea0715e9	tests: move away from table::make_reader() Use v2 equivalents instead.	2022-04-01 13:39:26 +03:00
Botond Dénes	5aa97ccf0d	replica/table: add short make_reader_v2() variant:	2022-04-01 13:39:26 +03:00
Pavel Emelyanov	05a32328fc	snitch: Remove gossiper_starting() No longer used Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:09 +03:00
Pavel Emelyanov	41332e183a	snitch: Remove gossip_snitch_info() No longer in use Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:09 +03:00
Pavel Emelyanov	38b0ee9822	property-file snitch: Re-gossip states with the help of .get_app_states() This is the last place that still uses gossip_snitch_info(). It can be reworked to use the get_app_states(), then the former helper can be removed. Another motivation for this is to stop using the _gossiper_started boolean from the base class. This, in turn, will allow to remove the whole gossiper_starting() notification altogether. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:09 +03:00
Pavel Emelyanov	6f71baa472	property-file snitch: Reload state in .start() In its .start() helper the property-file driver does everything but register the reconnectable helper (like the ec2 m.r. one from the previous patch did). Similarly to ec2 m.r. snitch this one can also register its helper in .start(), before gossiper_starting() is called. One thing to care about in this driver is that some tests start this snitch without starting gossiper, thus an extra protection against an uninitialized gossiper is needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:09 +03:00
Pavel Emelyanov	2400c87e74	ec2 multi-region snitch: Register helper in .start() This driver registers reconnectable helper in its gossiper_starting() callback. It can be done earlier -- in the snitch .start() one, as gossiper doesn't notify listeners until it's started for real (even its shadow round doesn't kick them). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:05 +03:00
Pavel Emelyanov	f9af6fb430	snitch, storage service: Gossip snitch info once Nowadays snitch states are put into gossiper via .gossiper_starting() call by gossiper. This, in turn, happens in two places -- on node ring join code and on re-enabling gossiper via the API call. The former can be performed by the ring joining code with the help of recently introduced snitch.get_app_states() helper. The latter call is in fact not needed. Re-gossiped are DC, RACK and for some drivers the INTERNAL_IP states that don't change throughout snitch lifetime and are preserved in the gossiper pre-loaded states. Thus, once the snitch states are applied by storage service ring join code, the respective states update can be removed from the snitch gossiper_starting() implementations. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:05 +03:00
Pavel Emelyanov	4853959903	snitch: Introduce get_app_states() method This virtual method returns back the list of app states that snitch drivers need to gossip around. The exact implementation copies the gossip_snitch_info() logic of the respective drivers and is unused. Next patches will make use of it (spoiler: the latter method will be removed after that). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:05 +03:00
Pavel Emelyanov	028bb84b0f	property-file snitch: Use _my_distributed to re-shard The driver in question wants to execute some of its actions on shard 0 and it calls smp::invoke(0, ...) for this. The invoked lambda thus needs to refer to global snitch instance. There's a nicer and shorter way of re-sharding for snitch drivers -- the sharded<snitch_ptr>* _my_distributed field on the base class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:05 +03:00
Pavel Emelyanov	b8e876681d	storage service: Shuffle snitch name gossiping No functional changes, just have the local snitch reference in the ring joining code. This simplifies next patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-04-01 13:16:05 +03:00
Botond Dénes	a325d3434a	Merge "make_slicing_filtering_reader(): return flat mutation reader v2" from Michael Livshin " Tests: unit(dev) " * 'slicing-filtering-v2' of https://github.com/cmm/scylla: make_slicing_filtering_reader(): return flat mutation reader v2 mutation_readers: refactor generic partition slicing logic	2022-04-01 11:08:25 +03:00
Nadav Har'El	758f8f01d7	test/alternator: turn REST API finding into a fixture In test_tracing.py and util.py, we already have three duplicates of code which looks for the Scylla REST API. We'll soon want to add even more uses of this REST API, so it's a good time to add a single fixture, "rest_api", which can be used in all tests that need the Scylla REST API instead of duplicating the same code. A test using the "rest_api" fixture will be skipped if the server isn't Scylla, or its port 10000 is not available or not responsive. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220331195337.64352-1-nyh@scylladb.com>	2022-04-01 10:51:59 +03:00
Raphael S. Carvalho	61c67105d2	compaction_manager: move internal stop functions into private namespace They don't belong to public interface. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220331202255.237688-1-raphaelsc@scylladb.com>	2022-04-01 10:50:27 +03:00
Botond Dénes	e19cf6760a	tools/scylla-sstable: add alternative schema load method for system tables By providing its name via the new `--system-schema` option. The schema will be loaded from the internal hardcoded definition.	2022-04-01 10:12:33 +03:00
Botond Dénes	095bb0d992	tools/schema_loader: add load_system_schema() Allowing to load (or rather lookup) system schemas by name.	2022-04-01 10:10:31 +03:00
Botond Dénes	53b00ecefe	db/system_distributed_keyspace: add all tables methods Add methods to get the schema of all distributed and distributed everywhere tables respectively.	2022-04-01 10:10:31 +03:00
Botond Dénes	be788140ff	tools/scylla-sstable: reorganize main help text Currently the main help is a big wall of text. This makes it hard to quickly jump to the section of interest. This patch reorganizes it into clear sections, each with a title. Sections are now also ordered according to the part they reference in the command-line. This should make it easier for answers to questions regarding a certain topic to be quickly found, without having to read a lot of text.	2022-04-01 10:10:31 +03:00
Michael Livshin	830aa041a8	make_slicing_filtering_reader(): return flat mutation reader v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-03-31 19:59:53 +03:00
Michael Livshin	aac51be0cc	mutation_readers: refactor generic partition slicing logic There are at least 1 actual and 1 potential users for it; this change converts the existing one. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-03-31 19:59:53 +03:00
Avi Kivity	af07519928	Merge "Remove reader from mutations v1" from Botond " First migrate all users to the v2 variant, all of which are tests. However, to be able to properly migrate all tests off it, a v2 variant of the restricted reader is also needed. All restricted reader users are then migrated to the freshly introduced v2 variant and the v1 variant is removed. Users include: * replica::table::make_reader_v2() * streaming_virtual_table::as_mutation_source() * sstables::make_reader() * tests This allows us to get rid of a bunch of conversions on the query path, which was mostly v2 already. With a few tests we did kick the can down the road by wrapping the v2 reader in `downgrade_to_v1()`, but this series is long enough already. Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug) " * 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla: readers: remove now unused v1 reader from mutations test: move away from v1 reader from mutations test/boost/mutation_reader_test: use fragment_scatterer test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh test/boost: mutation_fragment_test: refactor fragment_scatterer readers: remove now unused v1 reversing reader test/boost/flat_mutation_reader_test: convert to v2 frozen_mutation: fragment_and_freeze(): convert to v2 frozen_mutation: coroutinize fragment_and_freeze() readers: migrate away from v1 reversing reader db/virtual_table: use v2 variant of reversing and forwardable readers replica/table: use v2 variant of reversing reader sstables/sstable: remove unused make_crawling_reader_v1() sstables/sstable: remove make_reader_v1() readers: add v2 variant of reversing reader readers/reversing: remove FIXME readers: reader from mutations: use mutation's own schema when slicing	2022-03-31 13:29:11 +03:00
Avi Kivity	5fc093ad42	Merge 'wasm: manage memory using exports from the client' from Wojciech Mitros This patch adds importing the `malloc` and `free` method from the wasm client, and using them for allocating wasm memory for UDF arguments and freeing its result. When the methods are not exported, the old behaviour is used instead. To make that possible, this patch also includes a fix to the usage of pages in wasm memory (methods `size` and `grow`) that were used for allocating memory for arguments until now. (The source codes for the examples didn't work on my machine in their original form, so when updating paging I've also added small unrelated modifications) Tests:unit(dev) Closes #10234 * github.com:scylladb/scylla: wasm: add wasm ABI version 2 wasm: add WASI handling wasm: add documentation wasm: add _scylla_abi export for specifying abi for wasm udfs wasm: update ABI for passing parameters to wasm UDFs wasm: move common code to a separate function wasm: use wasm pages for wasm memory	2022-03-31 12:33:55 +03:00
Wojciech Mitros	56c5459c50	wasm: add null handling for wasm udf As the name suggests, for UDFs defined as RETURNS NULL ON NULL INPUT, we sometimes want to return nulls. However, currently we do not return nulls. Instead, we fail on the null check in init_arg_visitor. Fix by adding null handling before passing arguments, same as in lua. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #10298	2022-03-31 12:27:38 +03:00
Piotr Sarna	c0fd53a9d7	cql3: fix qualifying restrictions with IN for indexing When a query contains IN restriction on its partition key, it's currently not eligible for indexing. It was however erroneously qualified as such, which led to fetching incorrect results. This commit fixes the issue by not allowing such queries to undergo indexing, and comes with a regression test. Fixes #10300 Closes #10302	2022-03-31 11:04:17 +03:00
Nadav Har'El	3eaafbbdf7	test/cql-pytest: mark a test failing on Cassandra with cassandra_bug We have a test for the LIKE restriction with ALLOW FILTERING. Cassandra does not yet support this combination (it only supports LIKE with SASI indexes), so this test fails on Cassandra, suggesting either the test is wrong, or Cassandra is wrong. In this case, Cassandra is wrong - they have an issue requesting this to be fixed - https://issues.apache.org/jira/browse/CASSANDRA-17198, and even an implementation which is being reviewed. So let's mark this test with "cassandra_bug", meaning it is expected to fail (xfail) when running against Cassandra. When CASSANDRA-17198 is fixed, we can remove the cassandra_bug mark. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220330211734.4103691-1-nyh@scylladb.com>	2022-03-31 09:47:44 +02:00
Botond Dénes	7d49afe78b	readers: remove now unused v1 reader from mutations	2022-03-31 10:36:26 +03:00
Botond Dénes	fd69add579	test: move away from v1 reader from mutations Use the v2 variant instead.	2022-03-31 10:36:23 +03:00
Botond Dénes	2e00ff314d	test/boost/mutation_reader_test: use fragment_scatterer Instead of the open-coded equivalent the test currently has.	2022-03-31 10:25:45 +03:00
Botond Dénes	feecc19d5b	test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh We want to use it in test/boost/mutation_reader_test.cc too.	2022-03-31 10:25:45 +03:00
Botond Dénes	226f01162e	test/boost: mutation_fragment_test: refactor fragment_scatterer Instead of taking an output parameter in the constructor, take just the desired number of mutations to build and return the mutation list from `consume_end_of_stream()`.	2022-03-31 10:25:45 +03:00
Botond Dénes	b8f0ab3b98	readers: remove now unused v1 reversing reader	2022-03-31 10:04:45 +03:00
Botond Dénes	56e3c6add6	test/boost/flat_mutation_reader_test: convert to v2	2022-03-31 10:04:29 +03:00
Gleb Natapov	c17a03727c	migration_manager: drain migration manager before stopping protocol servers on shutdown When protocol servers are stopping they wait for all active queries to complete, but DDL queries use migration manager internally, so if they hang there protocol servers will not be able to stop since migration manager is drained afterwards. The patch moves the migration manager draining before protocol servers stoppage. Since after the patch migration manager is drained before messaging service is stopped we need to make sure that no rpc request triggers new migration manager requests. We do it by making sure that any attempt to issue such a request after aborted will return abort_requested_exception.	2022-03-31 10:00:29 +03:00
Gleb Natapov	e52abdca30	migration_manager: pass abort source to raft primitives We want to be able to abort raft operations on migration manager drain. MM already has an abort source that is signaled on drain, so all that is left is to pass it to raft calls.	2022-03-31 10:00:29 +03:00
Gleb Natapov	1409b885a0	storage_proxy: relax some read error reporting Silence request_aborted read error since it is expected to happen during shutdown and report remote rpc errors as warnings instead of errors since if they are indeed severe they should be handled by the rpc client, but OTOH some non-critical errors are expected to happen during shutdown.	2022-03-31 10:00:29 +03:00
Botond Dénes	2e634883d9	frozen_mutation: fragment_and_freeze(): convert to v2	2022-03-31 09:57:48 +03:00
Botond Dénes	7f3986ed1a	frozen_mutation: coroutinize fragment_and_freeze()	2022-03-31 09:57:48 +03:00
Botond Dénes	fc27b6b7ed	readers: migrate away from v1 reversing reader The only internal user is the v1 make reader from mutations, we use a downgrade/upgrade to be able to use the v2 reversing reader there. This is ugly but the v1 reader from mutations is going away soon too, so not a real problem.	2022-03-31 09:57:48 +03:00
Botond Dénes	eb125b98eb	db/virtual_table: use v2 variant of reversing and forwardable readers	2022-03-31 09:57:48 +03:00
Botond Dénes	c10d7bf9f8	replica/table: use v2 variant of reversing reader	2022-03-31 09:57:48 +03:00
Botond Dénes	3b67c25e49	sstables/sstable: remove unused make_crawling_reader_v1()	2022-03-31 09:57:48 +03:00
Botond Dénes	219cb881a4	sstables/sstable: remove make_reader_v1() No external users, only used internally, by make_reader(), who delegates cases currently unsupported by v2 to it. The code needed from make_reader_v1() is inlined into make_reader() and the former is removed.	2022-03-31 09:57:48 +03:00
Botond Dénes	470dc0d013	readers: add v2 variant of reversing reader The v2 format allows for a much simpler reversing mechanism since clustering fragments can simply be reversed as they are read. Fragments are directly pushed in the reader's buffer eliminating a separate move phase. Existing reverse reader unit tests are converted to test the v2 one.	2022-03-31 09:57:48 +03:00
Botond Dénes	06bea6ae6b	readers/reversing: remove FIXME	2022-03-31 09:57:48 +03:00
Botond Dénes	c38a7963b1	readers: reader from mutations: use mutation's own schema when slicing Instead of the schema that is used for the reader. The schema of individual mutations might be different (albeit compatible) and in debug mode this can trigger an assert in mutation partition.	2022-03-31 09:57:48 +03:00
Piotr Sarna	4a3890b79c	cql3: disambiguate an error message for indexes and IN clause A user pointed out a misleading error message produced when an indexed column is queried along with an IN relation on the partition key. The message suggests that such queries are not supported, but they are supported - just without indexing. In particular, with ALLOW FILTERING, such queries are perfectly fine. Closes #10299	2022-03-31 07:04:00 +03:00
Piotr Sarna	85e95a8cc3	cql3: fix misleading error message for service level timeouts The error message incorrectly stated that the timeout value cannot be longer than 24h, but it can - the actual restriction is that the value cannot be expressed in units like days or months, which was done in order to significantly simplify the parsing routines (and the fact that timeouts counted in days are not expected to be common). Fixes #10286 Closes #10294	2022-03-31 07:04:00 +03:00
Raphael S. Carvalho	20a1ef3bee	compaction_backlog_tracker: Raise logging level to error when disabling tracker on exception If exception is caught while updating backlog tracker, the backlog tracker will be disabled for the underlying table, potentially causing compaction to fall behind. That being said, let's raise the log level to error, to give it its due importance and allow tests to detect the problem. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220330151421.49054-1-raphaelsc@scylladb.com>	2022-03-31 07:04:00 +03:00
Wojciech Mitros	8a9d55d3a1	wasm: add wasm ABI version 2 Because the only available version of wasm ABI did not allow freeing any allocated memory, a new version of the ABI is introduced. In this version, the host is required to export _scylla_malloc and _scylla_free methods, which are later used for the memory management. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 20:49:35 +02:00
Wojciech Mitros	5b60cd1eab	wasm: add WASI handling One of the issues that comes with compiling programs to WebAssembly is the lack of a default implementation of a memory allocator. As a result, the only available solutions to the need of memory allocation are growing the wasm memory for each new allocated segment, or implementing one's own memory allocator. To avoid both of these approaches, for many languages, the user may compile a program to a WASI target. By doing so, the compiler adds default implementations of malloc and free methods, and the user can use them for dynamic memory management. This patch enables executing programs compiled with WASI by enabling it in the wasmtime runtime. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 19:44:36 +02:00
Wojciech Mitros	1f81e05d52	wasm: add documentation The ABI of wasm UDFs changed since the last time the documentation was written, so it's being updated in this patch. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 19:44:30 +02:00
Pavel Solodovnikov	3e2a42fcf0	db: system_keyspace: add `bootstrap_needed()` method The method checks that bootstrap state is equal to `NEEDS_BOOTSTRAP`. This will be used later to check if we are in the state of "fresh" start (i.e. starting a node from scratch). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-30 20:41:35 +03:00
Pavel Solodovnikov	c69c47b2bb	db: system_keyspace: mark getter methods for bootstrap state as "const" The `bootstrap_complete()`, `bootstrap_in_progress()`, `was_decommissioned()` and `get_bootstrap_state()` don't modify internal state, so eligible to be marked as `const`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-30 20:40:02 +03:00
Wojciech Mitros	93b082e8d9	wasm: add _scylla_abi export for specifying abi for wasm udfs Different languages may require different ABIs for passing parameters, etc. This patch adds a requirement for all wasm UDFs to export a _scylla_abi symbol, which is a 32-bit integer with a value specifying the ABI version. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 19:37:11 +02:00
Wojciech Mitros	a7ee3ccf52	wasm: update ABI for passing parameters to wasm UDFs WebAssembly uses 32-bit address space, while also having 64-bit integers as its native types. As a result, when passing size of an object in memory and its address, it can be combined into one 64-bit value. As a bonus, if the object is null, we can signal it by passing -1 as its size. This patch implements handling of this new ABI and adjusts examples in test_wasm.py. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 17:13:25 +02:00
Wojciech Mitros	7fd81e6dae	wasm: move common code to a separate function Both init_nullable_arg_visitor and, in case of abstract_type, init_arg_visitor were the same method with one difference. The common part was moved to init_abstract_arg, and the difference remained in the operator() method. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 17:13:22 +02:00
Wojciech Mitros	62761a7cf3	wasm: use wasm pages for wasm memory The memory.grow and memory.size wasm methods return the memory size in pages, and memory.grow takes its argument in the number of pages. A WebAssembly page has a size of 64KiB, so during memory allocation we have to divide our desired size in bytes by page size and round up. Similarly, when reading memory size we need to multiply the result by 64KiB to get the size in bytes. The change affects current naive allocator for arguments when calling wasm UDFs and the examples in wasm_test.py - both commented code and compiled wasm in text representation. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-03-30 17:13:13 +02:00
Avi Kivity	caa4cddebf	Update tools/java submodule * tools/java b1e09c8b8f...ac5d6c840d (1): > Merge 'Sync with Cassandra 3.11.12' from Michael Livshin	2022-03-30 17:03:55 +03:00
Avi Kivity	9cb43f7029	Merge "Split mutation_reader.hh" from Botond " Following up on the recent split of flat_mutation_reader.hh and friends, this series applies the same treatment to mutation_reader.hh. Each reader gets its own header, while definitions are moved into readers/mutation_readers.cc. There are two exceptions to this: the combined and multishard reader families each make up more than 1K SLOC, so these get their own source file, to avoid a SLOC explosion in mutation_readers.cc. This series is almost completely mechanical, moving code and patching inclusion sites. Tests: unit(dev) " * 'mutation-reader-hh-split/v1' of https://github.com/denesb/scylla: readers: merge fmr_logger and mrlog tree: remove now empty mutation_reader.{hh,cc} tree: remove mutation_reader.hh include mutation_reader: move mrlog (mutation reader logger) to readers/ mutation_reader: move compacting reader into readers/ mutation_reader: move queue reader to readers/ mutation_reader: move mutation source into readers/ mutation_reader: move slicing filtering reader into readers/ mutation_reader: move filtering reader into readers/ readers: move multishard reader & friends to reader/multishard.cc mutation_reader: remove unused remote_fill_buffer_result readers: move combined reader into readers/	2022-03-30 16:33:20 +03:00
Botond Dénes	3289df2e74	readers: merge fmr_logger and mrlog By folding the former into the latter. Now that all the readers are nicely co-located in the same folder, there is no point in having two distinct loggers for them.	2022-03-30 15:44:08 +03:00
Botond Dénes	c9e30b9a6c	tree: remove now empty mutation_reader.{hh,cc}	2022-03-30 15:42:51 +03:00
Botond Dénes	b029bd3db7	tree: remove mutation_reader.hh include In most files it was unused. We should move these to the patch which moved out the last interesting reader from mutation_reader.hh (and added the corresponding new header include) but it's probably not worth the effort. Some other files still relied on mutation_reader.hh to provide reader concurrency semaphore and some other misc reader related definitions.	2022-03-30 15:42:51 +03:00
Botond Dénes	20c9e556e1	mutation_reader: move mrlog (mutation reader logger) to readers/	2022-03-30 15:42:51 +03:00
Botond Dénes	b7954138ac	mutation_reader: move compacting reader into readers/	2022-03-30 15:42:51 +03:00
Botond Dénes	11c378a175	mutation_reader: move queue reader to readers/	2022-03-30 15:42:51 +03:00
Botond Dénes	11109f4c45	mutation_reader: move mutation source into readers/	2022-03-30 15:42:51 +03:00
Botond Dénes	7eae66efe0	mutation_reader: move slicing filtering reader into readers/ Only the declaration has to be moved, the definition is already in readers/mutation_readers.cc.	2022-03-30 15:42:51 +03:00
Botond Dénes	f24f2f726a	mutation_reader: move filtering reader into readers/	2022-03-30 15:42:51 +03:00
Botond Dénes	d0ea895671	readers: move multishard reader & friends to reader/multishard.cc Since the multishard reader family weighs more than 1K SLOC, it gets its own .cc file.	2022-03-30 15:42:51 +03:00
Botond Dénes	3505ef8a49	mutation_reader: remove unused remote_fill_buffer_result	2022-03-30 15:42:51 +03:00
Botond Dénes	f8015d9c26	readers: move combined reader into readers/ Since the combined reader family weighs more than 1K SLOC, it gets its own .cc file.	2022-03-30 15:42:51 +03:00
Botond Dénes	0c3d4091a4	Merge "Make TWCS' cleanup bucket aware" from Raphael S. Carvalho " Quoting patch 3/4: "This continues the work in `a69d98c3d0`, by implementing the cleanup method in TWCS to make it bucket aware. Till now, the default impl was used which cleans up one file at a time, starting from the smallest. The cleanup strategy for TWCS is simple. It's simply calling the size tiered cleanup method for each bucket, so there will be one job for each tier in each window. The next strategies to receive this improvement are LCS and ICS (the latter one being only available in enterprise). Refs #10097." Simply put, the goal is to reduce writeamp when performing cleanup on a TWCS table, therefore reducing the operation time. tests: unit(dev). " * 'twcs_cleanup_bucket_aware/v1' of https://github.com/raphaelsc/scylla: tests: sstable_compaction_test: Add test for TWCS' bucket-aware cleanup compaction: TWCS: Implement cleanup method for bucket awareness compaction: TWCS: change get_buckets() signature to work with const qualified functions compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value	2022-03-30 11:45:28 +03:00
Pavel Emelyanov	edd0481b38	Merge 'Scrub compaction: prevent mishandling of range tombstone changes' from Botond With v2 having individual bounds of range tombstone as separate fragments, out-of-order fragments become more difficult to handle, especially in the presence of active range tombstone. Scrub in both SKIP and SEGREGATE mode closes the partition on seeing the first invalid fragment (SEGREGATE re-opens it immediately). If there is an active range tombstone, scrub now also has to take care of closing said tombstone when closing the partition. In a normal stream it could just use the last position-in-partition to create a closing bound. But when out-of-order fragments are on the table this is not possible: the closing bound may be found later in the stream, with a position smaller than that of the current position-in-partition. To prevent extending range tombstone changes like that, Scrub now aborts the compaction on the first invalid fragment seen inside an active range tombstone. Fixing a v2 stream with range tombstone changes is definitely possible, but non-trivial, so we defer it until there is demand for it. This series also makes the mutation fragment stream validator check for open range tombstones on partition-end and adds a comprehensive test-suite for the validator. Fixes: #10168 Tests: unit(dev) * scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git: compaction/compaction: abort scrub when attempting to rectify stream with active tombstone test/boost/mutation_test: add test for mutation_fragment_stream_validator mutation_fragment_stream_validator: validate range tombstone changes	2022-03-30 11:42:52 +03:00
Pavel Emelyanov	ba6d2ecc6f	api: Remove unused argument from set_tables_autocompaction helper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220329093113.5953-1-xemul@scylladb.com>	2022-03-30 11:42:52 +03:00
Raphael S. Carvalho	177a8e8259	compaction_manager: allow sstable to be moved into rewrite_sstable() Caller was already trying to move sstable, but rewrite_sstable() signature was incorrect. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220329022149.250655-1-raphaelsc@scylladb.com>	2022-03-30 11:42:52 +03:00
Kamil Braun	475a408792	test: raft: randomized_nemesis_test: on timeout, abort calls instead of discarding them When a Raft API call such as `add_entry`, `set_configuration` or `modify_config` takes too long, we need to time-out. There was no way to abort these calls previously so we would do that by discarding the futures. Recently the APIs were extended with `abort_source` parameters. Use this. Also improve debuggability if the functions throw an exception type that we don't expect. Previously if they did, a cryptic assert would fail somewhere deep in the generator code, making the problem hard to debug.	2022-03-29 18:25:26 +02:00
Kamil Braun	0f0d75fd66	raft: server: translate semaphore_aborted to request_aborted	2022-03-29 15:10:29 +02:00
Kamil Braun	8d4dd53c25	test: raft: logical_timer: add abortable version of `sleep_until` `sleep_until(time_point tp, abort_source& as)` will sleep until `tp`, or until `as.request_abort()` is called, whichever comes first.	2022-03-29 15:10:29 +02:00
Kamil Braun	6f9dcd784c	test: raft: randomized_nemesis_test: collect statistics on successful and failed ops	2022-03-29 15:10:29 +02:00
Raphael S. Carvalho	a1fd9c1ee8	tests: sstable_compaction_test: Add test for TWCS' bucket-aware cleanup Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-29 09:52:11 -03:00
Raphael S. Carvalho	568bb40127	compaction: TWCS: Implement cleanup method for bucket awareness This continues the work in `a69d98c3d0`, by implementing the cleanup method in TWCS to make it bucket aware. Till now, the default impl was used which cleans up one file at a time, starting from the smallest. The cleanup strategy for TWCS is simple. It's simply calling the size tiered cleanup method for each bucket, so there will be one job for each tier in each window. The next strategies to receive this improvement are LCS and ICS (the latter one being only available in enterprise). Refs #10097. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-29 09:52:06 -03:00
Raphael S. Carvalho	8f4c04c38a	compaction: TWCS: change get_buckets() signature to work with const qualified functions Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-29 09:49:14 -03:00
Raphael S. Carvalho	2a9bfa3e3f	compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value Then caller can decide whether to copy or move candidate set into the function. cleanup_sstables_compaction_task can move candidates as it's no longer needed once it retrieves all descriptors. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-29 09:49:13 -03:00
Botond Dénes	2ae0e0093e	compaction/compaction: abort scrub when attempting to rectify stream with active tombstone	2022-03-29 13:19:05 +03:00
Botond Dénes	316ff9eb86	test/boost/mutation_test: add test for mutation_fragment_stream_validator	2022-03-29 13:19:05 +03:00
Botond Dénes	ef90783007	mutation_fragment_stream_validator: validate range tombstone changes	2022-03-29 13:19:05 +03:00
Botond Dénes	bb4472d5ab	Update seastar submodule * seastar 7a05527b...44389842 (1): > core/circular_buffer: fix iterator comparison for wrapped-around indexes	2022-03-29 09:08:25 +03:00
Takuya ASADA	59adf05951	scylla_sysconfig_setup: avoid parse error on perftune.py --get-cpu-mask Currently, we just pass the entire output of perftune.py when getting the CPU mask from the script, but this may cause a parse error since the script may also print a warning message. To avoid that, we need to extract the CPU mask from the output. Fixes #10082 Closes #10107	2022-03-28 16:31:14 +03:00
Nadav Har'El	024ecd45a2	test/rest_api: add reproducer for REST API JSON encoding bug This patch adds a reproducer for the JSON encoding in issue #9061. The bug was already fixed (it was a Seastar bug, and Seastar was updated in commit `5d4213e1b8`), but I verified that the test fails before that patch - and passes today. It is useful to have such a test for regressions, as well as for testing backports. Unfortunately, the test isn't pretty. The test uses the toppartitions API, which instead of having a "start" and "stop" request has a single synchronous "start for a given duration" request, and we need to run it with some fixed duration (we took 1 second), and in parallel, one request. Refs #9061. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220323180855.3307931-1-nyh@scylladb.com>	2022-03-28 15:28:45 +03:00
Nadav Har'El	7f89c8b3e3	alternator: clean error shutdown in case of TLS misconfiguration The way our boot-time service "controllers" are written, if a controller's start_server() finds an error and throws, it cannot rely on the caller (main.cc) to call stop_server(), and must clean up resources already created (e.g., sharded services) before returning or risk crashes on assertion failures. This patch fixes such a mistake in Alternator's initialization. As noted in issue #10025, if the Alternator TLS configuration is broken - especially the certificate or key files are missing - Scylla would crash on an assertion failure, instead of reporting the error as expected. Before this patch such a misconfiguration will result in the unintelligible: <alternator::server>::~sharded() [Service = alternator::server]: Assertion `_instances.empty()' failed. Aborting on shard 0. After this patch we get the right error message: ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed: std::_Nested_exception<std::runtime_error> (Failed to set up Alternator TLS credentials): std::_Nested_exception<std::runtime_error> (Could not read certificate file conf/scylla.crt): std::filesystem::__cxx11:: filesystem_error (error system:2, filesystem error: open failed: No such file or directory [conf/scylla.crt]) Arguably this error message is a bit ugly, so I opened https://github.com/scylladb/seastar/issues/1029, but at least it says exactly what the error is. Fixes #10025 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>	2022-03-28 15:26:42 +03:00
Avi Kivity	e88dc51894	Update seastar submodule * seastar 3b4e42d74e...7a05527b6f (4): > Revert "core/circular_buffer: fix iterator comparison for wrapped-around indexes" > seastar-cpu-map.sh: switch from pidof to pgrep Fixes #10238 > abort-source: mark `request_abort()` noexcept > core/circular_buffer: fix iterator comparison for wrapped-around indexes	2022-03-28 12:37:10 +03:00
Nadav Har'El	d8c0680585	test/alternator: add regression test for old ALL_NEW bug In commit `964500e47a`, in the middle of a larger series, I fixed a small Alternator bug that I found while working on that series. The bug was that the ReturnValues=ALL_NEW feature moved out the read previous_item, which breaks operations that need previous_item, e.g., an ADD operation. Unfortunately, we never had a regression test for this fixed bug, so in this patch I add one. This bug was re-discovered on an old branch by a user, at which point I noticed that we don't have a test for it - so I want to add it now, even though the bug itself is long gone from Scylla master. I verified that the new test indeed fails on old versions of Scylla before the aforementioned commit, and passes when backporting only that commit. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220327074928.3608576-1-nyh@scylladb.com>	2022-03-28 08:40:28 +02:00
Benny Halevy	2325c566d9	memtable_list: futurize clear_and_add Allow yielding to fix a reactor stall from table::clear. Fixes #10281 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220327141259.213688-1-bhalevy@scylladb.com>	2022-03-27 17:25:43 +03:00
Avi Kivity	3c2271af52	Merge "De-globalize system keyspace local cache" from Pavel E " There's a static global sharded<local_cache> variable in system keyspace that keeps several bits on board that other subsystems need to get from the system keyspace, but want to have in a future<>-less manner. Some time ago the system_keyspace became a classical sharded<> service that references the qctx and the local cache. This set removes the global cache variable and makes its instances be unique_ptr's sitting on the system keyspace instances. The biggest obstacle on this route is the local_host_id that was cached, but at some point was copied onto db::config to simplify getting the value from sstables manager (there's no system keyspace at hand there at all). So the first thing this set does is remove the cached host_id and make all the users get it from the db::config. (There's a BUG with the config copy of host id -- replace node doesn't update it. This set also fixes this place) De-globalizing the cache is the prerequisite for untangling the snitch-messaging-gossiper-system_keyspace knot. Currently cache is initialized too late -- when main calls system_keyspace.start() on all shards -- but before this time messaging should already have access to it to store its preferred IP mappings. 
tests: unit(dev), dtest.simple_boot_shutdown(dev) " * 'br-trade-local-hostid-for-global-cache' of https://github.com/xemul/scylla: system_keyspace: Make set_local_host_id non-static system_keyspace: Make load_local_host_id non-static system_keyspace: Remove global cache instance system_keyspace: Make it peering service system_keyspace,snitch: Make load_dc_rack_info non-static system_keyspace,cdc,storage_service: Make bootstrap manipulations non-static system_keyspace: Coroutinize set_bootstrap_state gossiper: Add system keyspace dependency cdc_generation_service: Add system keyspace dependency system_keyspace: Remove local host id from local cache storage_service: Update config.host_id on replace storage_service: Indentation fix after previous patch storage_service: Coroutinize prepare_replacement_info() system_distributed_keyspace: Indentation fix after previous patch code,system_keyspace: Relax system_keyspace::load_local_host_id() usage code,system_keyspace: Remove system_keyspace::get_local_host_id()	2022-03-27 17:19:24 +03:00
Avi Kivity	f476bd3a80	Merge "tools: cut schema loader free of replica::database" from Botond " By way of having an implementation of `data_dictionary` and using that. The schema loader only needs a database to parse cql3 statements, which are all coordinator-side objects and hence been largely migrated to use data dictionary instead. A few hard-dependencies on replica:: objects were found and resolved: * index::secondary_index_manager * tombstone_gc The former was migrated to use `data_dictionary::table` instead of `replica::table`. This in turn requires disentangling `replica::data_dictionary_impl` from `replica::database`, as currently the former can only really be used by the latter. What all of this achieves us is that we no longer have to instantiate a `replica::database` object in `tools::load_schema()`. We want to use the standard allocator in tools, which means they cannot use LSA memory at all. Database on the other hand creates memtable and row-cache instances so it had to go. Refs: #9882 Tests: unit(dev, schema_loader_test:debug, cql-pytest/test_tools.py:debug) " * 'tools-schema-loader-database-impl/v2' of https://github.com/denesb/scylla: tools/schema_loader: use own data dictionary impl tombstone_gc: switch to using data dictionary index/secondary_index_manager: switch to using data dictionary replica/table: add as_data_dictionary() replica: disentangle data_dictionary_impl from database replica: move data_dictionary_impl into own header	2022-03-27 17:01:05 +03:00
Pavel Emelyanov	baedf1e4de	system_keyspace: Remove unused argument from maybe_write_in_user_memory Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220325152521.25215-1-xemul@scylladb.com>	2022-03-27 16:43:53 +03:00
Nadav Har'El	653f2df28f	alternator: fix JSON escaping of error responses In the DynamoDB API, error responses are in JSON format with specific fields ("__type" and "message" in the x-amz-json-1.0 format currently used). Alternator tried to be clever and build the string representation of this JSON itself, instead of using RapidJSON. But this optimization was a mistake - if the error message contains characters that need escaping (such as double quotes and newlines), they weren't escaped, and the resulting JSON was malformed. When the client library boto3 read this malformed JSON it got confused and considered the entire error response to be a string, which resulted in an ugly error message. The fix is easy - just build the JSON output as usual with RapidJSON instead of trying to optimize using string operations. The patch also includes two tests reproducing this bug and checking its fix. The first test uses boto3 and shows it got confused on the type of error (not understanding that it is a ValidationException). The second test bypasses boto3 and shows exactly where the bug happens - the response is an unparsable JSON. Fixes #10278 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220327132705.3707979-1-nyh@scylladb.com>	2022-03-27 16:32:36 +03:00
Takuya ASADA	bdefea7c82	docker: enable --log-to-stdout which was mistakenly disabled Since our Docker image moved to Ubuntu, we mistakenly copy dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not used in Ubuntu (it should be /etc/default). So /etc/default/scylla-server is just the default configuration of the scylla-server .deb package, and --log-to-stdout is 0, same as in a normal installation. We don't want to keep the duplicated configuration file anyway, so let's drop dist/docker/etc/sysconfig/scylla-server and configure /etc/default/scylla-server in build_docker.sh. Fixes #10270 Closes #10280	2022-03-27 14:50:10 +03:00
Avi Kivity	1feec08c2d	Revert "api: storage_service: force_keyspace_compaction: compact one table at a time" This reverts commit `37dc31c429`. There is no reason to suppose compacting different tables concurrently on different shards reduces space requirements, apart from non-deterministically pausing random shards. However, when data is badly distributed and there are many tables, it will slow down major compaction considerably. Consider a case where there are 100 tables, each with a 2GB large partition on some shard. This extra 200GB will be compacted on just one shard. With a compaction rate of 40 MB/s, this adds more than an hour to the process. With the existing code, these compactions would overlap if the badly distributed data was not all in one shard. It is also counter to tablets, where data is not equally distributed on purpose. Closes #10246	2022-03-25 19:24:50 +03:00
Botond Dénes	a69d98c3d0	Merge "Improve efficiency of cleanup compaction by making it bucket aware" from Raphael S. Carvalho " Cleanup compaction works by rewriting all sstables that need clean up, one at a time. This approach can cause bad write amplification because the output data is being made incrementally available for regular compaction. Cleanup is a long operation on large data sets, and while it's happening, new data can be written to buckets, triggering regular compaction. Cleanup fighting for resources with regular compaction is a known problem. With cleanup adding one file at a time to buckets, regular may require multiple rounds to compact the data in a given bucket B, producing bad writeamp. To fix this problem, cleanup will be made bucket aware. As each compaction strategy has its own definition of bucket, strategies will implement their own method to retrieve cleanup jobs. The method will be implemented such that all files in a bucket B will be cleaned up together, and on completion, they'll be made available for regular at once. For STCS / ICS, a bucket is a size tier. For TWCS, a bucket is a window. For LCS, a bucket is a level. In this way, writeamp problem is fixed as regular won't have to perform multiple rounds to compact the data in a given bucket. Additionally, cleanup will now be able to deduplicate data and will become way more efficient at garbage collecting expired data. The space requirement shouldn't be an issue, as compacting an entire bucket happens during regular compaction anyway. With leveled strategy, compacting an entire level is also not a problem because files in a level L don't overlap and therefore incremental compaction is employed to limit the space requirement. For the time being, only STCS cleanup was made bucket aware. The others will be using a default method, where one file is cleaned up at a time. Making cleanup of other strategies bucket aware is relatively easy now and will be done soon. Refs #10097. 
" * 'cleanup-compaction-revamp/v3' of https://github.com/raphaelsc/scylla: test: sstable_compaction_test: Add test for strategy cleanup method compaction: STCS: Implement cleanup strategy compaction_manager: Wire cleanup task into the strategy cleanup method compaction_strategy: Allow strategies to define their own cleanup strategy compaction: Introduce compaction_descriptor::sstables_size compaction: Move decision of garbage collection from strategy to task type	2022-03-25 16:30:28 +02:00
Raphael S. Carvalho	5312526e5e	test: sstable_compaction_test: Add test for strategy cleanup method Stresses default and STCS implementations of cleanup method Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-25 11:23:29 -03:00
Raphael S. Carvalho	84101dec9e	compaction: STCS: Implement cleanup strategy This implements cleanup strategy for STCS. It will return one descriptor for each size tier. If a given tier has more than max_threshold elements, more than 1 job will be returned for that tier. Token contiguity is preserved by sorting elements of a tier by token. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-25 11:23:29 -03:00
Raphael S. Carvalho	c7826aa910	compaction_manager: Wire cleanup task into the strategy cleanup method As the cleanup process can now be driven by the compaction strategy, let's move cleanup into a new task type that uses the new compaction_strategy::get_cleanup_compaction_jobs(). For the time being, all strategies are using the default method that returns one descriptor for each sstable that needs clean up. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-25 11:23:26 -03:00
Mikołaj Sielużycki	6f1b6da68a	compile: Fix headers so that *-headers targets compile cleanly. Closes #10273	2022-03-25 16:19:26 +02:00
Botond Dénes	9e1a642e1f	tools/schema_loader: use own data dictionary impl And pass it to the cql3 layer when parsing statements. This allows the schema loader to cut itself from replica::database, using a local, much simpler database implementation. This not only makes the code much simpler but also opens up the way to using the standard allocator in tools. The real database uses LSA which is incompatible with the standard allocator (in release builds that is).	2022-03-25 14:40:44 +02:00
Pavel Emelyanov	f326d359de	system_keyspace: Make set_local_host_id non-static The callers are system_keyspace.load_local_host_id and storage service. The former is non-static since previous patch, the latter has its own sys.ks. reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	26a12ac056	system_keyspace: Make load_local_host_id non-static The only caller is main(), the sharded<> sys_ks is started at this point. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	725ab5eea1	system_keyspace: Remove global cache instance No users of this variable left, all the code relies on system_keyspace "this" to get it. Respectively, the cache can be a unique_ptr<> on the system_keyspace instance and the global sharded variable can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	4cb7c48243	system_keyspace: Make it peering service And remove a bunch of (_local)?_cache.invoke_on_all() calls. This is the preparation for removing the global cache instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	021c026482	system_keyspace,snitch: Make load_dc_rack_info non-static It's snitch code that needs it. It now takes messaging service from gossiper, so it can do the same with system keyspace. This change removes one user of the global sys.ks. cache instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	c2cf4e3536	system_keyspace,cdc,storage_service: Make bootstrap manipulations non-static The users of get_/set_bootstrap_state and aux helpers are CDC and storage service. Both have local system_keyspace references and can just use them. This removes some users of global system ks. cache and the qctx thing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	798caf2ef8	system_keyspace: Coroutinize set_bootstrap_state Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Pavel Emelyanov	3da5f6ac30	gossiper: Add system keyspace dependency The gossiper reads peer features from system keyspace. Also the snitch code needs system keyspace, and since now it gets all its dependencies from gossiper (will be fixed some day, but not now), it will do the same for sys.ks.. Thus it's worth having gossiper->system_keyspace explicit dependency. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 15:08:13 +03:00
Botond Dénes	f501bc8d54	tombstone_gc: switch to using data dictionary But only on the surface, the only internal function needing the database (`needs_repair_before_gc()`) still gets a real database because the replication factor cannot be obtained from the data dictionary currently. Although this might not look like an improvement, it is enough to avoid a `real_database()` call for tables that don't have tombstone gc mode set to repair.	2022-03-25 13:17:58 +02:00
Pavel Emelyanov	62417577ab	cdc_generation_service: Add system keyspace dependency The service uses system keyspace to, e.g., manage the generation id, thus it depends on the system_keyspace instance and deserves the explicit reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:39:32 +03:00
Pavel Emelyanov	a9cbdee82e	system_keyspace: Remove local host id from local cache This value is now write-only, all readers had been patched to use db::config copy Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:39:32 +03:00
Pavel Emelyanov	ec8be45259	storage_service: Update config.host_id on replace The config.host_id value is loaded early on start, but when the storage service prepares to join the cluster to replace a node, it will change that value (with the host id of the target). This change only affects the system keyspace, but not the config copy, which is a BUG. fixes: #10243 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:39:32 +03:00
Pavel Emelyanov	3e7a21aa24	storage_service: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:39:29 +03:00
Pavel Emelyanov	b3503aaf6a	storage_service: Coroutinize prepare_replacement_info() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:26:14 +03:00
Pavel Emelyanov	fa4d4beaf1	system_distributed_keyspace: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:25:55 +03:00
Pavel Emelyanov	b8d3048104	code,system_keyspace: Relax system_keyspace::load_local_host_id() usage The method is nowadays called from several places: - API - sys.dist.ks. (to update view building info) - storage service prepare_to_join() - set up in main They all, but the last, can use db::config cached value, because it's loaded earlier than any of them (but the last -- that's the loading part itself). Once patched, the load_local_host_id() can avoid checking the cache for that value -- it will not be there for sure. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:23:30 +03:00
Pavel Emelyanov	965d2a0a4f	code,system_keyspace: Remove system_keyspace::get_local_host_id() The host id is cached on db::config object that's available in all the places that need it. This allows removing the method in question from the system_keyspace and not caring that anyone that needs host_id would have to depend on system_keyspace instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:21:59 +03:00
Botond Dénes	9a44c26d7e	index/secondary_index_manager: switch to using data dictionary Instead of directly using replica::table.	2022-03-25 11:44:31 +02:00
Botond Dénes	eff941d22c	replica/table: add as_data_dictionary() To allow converting table instances to data_dictionary::table.	2022-03-25 11:44:31 +02:00
Botond Dénes	4f2d900c9f	replica: disentangle data_dictionary_impl from database Make it a standalone class, instead of private subclass of database. Unfriend database and instead make wrap/unwrap methods public, so anyone can use them.	2022-03-25 11:44:31 +02:00
Botond Dénes	421d4411f8	replica: move data_dictionary_impl into own header As a first step towards disentangling it from database and allowing it to be used by other classes (like table) too.	2022-03-25 11:44:31 +02:00
Yaron Kaikov	5ef1b49cb8	docs/conf.py: update scylla-4.6 to latest Now that Scylla 4.6 is out, it's the latest release available Closes: https://github.com/scylladb/scylla/issues/10266 Closes #10268	2022-03-24 14:18:10 +02:00
Nadav Har'El	06fdce82aa	test/*/run: clearer error message on wrong SCYLLA setting The test runners cql-pytest/run et al. try to automatically find the last-compiled Scylla executable, but this decision can be overridden by the SCYLLA environment variable. If the user by mistake sets SCYLLA to something which is not a valid path of an executable, the result was a long and obscure Python stack trace. So after this patch, if SCYLLA points to something which is not an executable, a clear error is produced immediately, directing the user to set this variable to a correct executable. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220323164427.3301828-1-nyh@scylladb.com>	2022-03-23 19:01:17 +02:00
Jadw1	1438be5311	CQL3: Bloom filter efficacy test Added CQL pytest to check bloom filter efficacy by inserting `N` rows and reading `M` non-existing keys. Added `bloom_filter_false_positives` method to Python `nodetool` module. Method gets fp number by calling Scylla's API. Fixes: #1055 Closes #10186	2022-03-23 16:51:50 +02:00
Avi Kivity	c89117e5f7	Update seastar submodule * seastar 4e42a6019...3b4e42d74 (20): > condition-variable: Fix broadcast to handle "when(pred)" usage > condition-variable: Adjust when(pred) evaulation to match wait(pred) + tests > coroutine: parallel_for_each: set_callback: destroy future after move > Merge "Calculate max IO lengths as lengths" from Pavel E > future: fix make_exception_future(const exception_ptr&&) > condition-variable: Fix when(<pred>) functions + add test > Merge "semaphore: add abortable get_units" from Benny > condition_variable: restore move-constructability. > coroutine: add a non-preemptive co_await flavor > future: set_callback: consume future input > coroutines: add parallel_for_each > Merge "Add more rl-iosched test cases" from Pavel E > Revert "condition-variable: Fix when(<pred>) functions + add test" > condition-variable: Fix when(<pred>) functions + add test > reactor: fix buildability with SEASTAR_NO_EXCEPTION_HACK > Revert "core: memory: Replace overlooked uses of cpu_mem with get_cpu_mem()" > core: memory: Add a shortcut for stats().free_memory > lw_shared_ptr: rename shared_ptr_no_esft to lw_shared_ptr_no_esft > core: memory: Replace overlooked uses of cpu_mem with get_cpu_mem() > include/seastar/core: define iterator without std::iterator<>	2022-03-23 12:10:16 +02:00
Raphael S. Carvalho	44e9e10414	compaction_strategy: Allow strategies to define their own cleanup strategy Today, all compaction strategies will clean up their files using the incremental approach of one sstable being rewritten at a time. Turns out that's not the best approach performance wise. Let's take STCS for example. As cleanup finishes rewriting one file, the output file is placed into the sstable set. Regular now can compact that file with another that was already there (e.g. produced by flush after cleanup started). Inefficient compactions like this can keep happening as cleanup incrementally places output file into the candidate list for regular. This method will allow strategies to clean up their files in batches. For example, STCS can clean up all files in smallest tiers in single round, allowing the output data to be added at once. So next compaction rounds can be more efficient in terms of writeamp. Another benefit is that deduplication and GC can happen more efficiently. The drawback is the space requirement, as we no longer compact one file at a time. However, the impact is minimized by cleaning up the smallest tier first. With leveled strategy for example, even though 90% of data is in highest level, the space requirement is not a problem because we can apply the incremental compaction on its behalf. The same applies to ICS. With STCS, the requirement is the size of the tier being compacted, but that's already expected by its users anyway. For the time being, all strategies have it unimplemented, so they still use the old behavior where files are rewritten one at a time. This will allow us to incrementally implement the cleanup method for all compaction strategies. Refs #10097. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-23 00:04:03 -03:00
Pavel Emelyanov	b0494b897d	scylla-gdb: Support lw_shared_ptr_no_esft Seastar commit fcb2b901 renamed the struct by adding it the lw_ prefix. tests: unit.gdb(release) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220322151328.23203-1-xemul@scylladb.com>	2022-03-22 17:19:39 +02:00
Calle Wilund	56c383ba8e	test/perf/perf_commitlog: Add a small commitlog throughput test Based on perf_simple_query, just bashes data into CL using normal distribution min/max data chunk size, allowing direct freeing of segments, _but_ delayed by a normal dist as well, to "simulate" secondary delay in data persistence. Needs more stuff. Some baseline measurements on master: --min-flush-delay-in-ms 10 --max-flush-delay-in-ms 200 --commitlog-use-hard-size-limit true --commitlog-total-space-in-mb 10000 --min-data-size 160 --max-data-size 1024 --smp1 median 2065648.59 tps ( 1.1 allocs/op, 0.0 tasks/op, 1482 insns/op) median absolute deviation: 48752.44 maximum: 2161987.06 minimum: 1984267.90 --min-data-size 256 --max-data-size 16384 median 269385.25 tps ( 2.2 allocs/op, 0.7 tasks/op, 3244 insns/op) median absolute deviation: 15719.13 maximum: 323574.43 minimum: 228206.28 --min-data-size 4096 --max-data-size 61440 median 67734.22 tps ( 6.4 allocs/op, 2.9 tasks/op, 9153 insns/op) median absolute deviation: 2070.93 maximum: 82833.17 minimum: 61473.57 --min-data-size 61440 --max-data-size 1843200 median 2281.37 tps ( 79.7 allocs/op, 43.5 tasks/op, 202963 insns/op) median absolute deviation: 128.87 maximum: 3143.84 minimum: 2140.80 --min-data-size 368640 --max-data-size 6144000 median 679.76 tps (225.5 allocs/op, 116.3 tasks/op, 662700 insns/op) median absolute deviation: 39.30 maximum: 1148.95 minimum: 586.86 Actual throughput is obviously meaningless, as it is run on my slow machine, but IPS might be relevant. Note that transaction throughput plummets as we increase median data sizes above ~200k, since we then more or less always end up replacing buffers in every call. Closes #10230	2022-03-22 15:18:25 +02:00
Pavel Emelyanov	cb4fe65a78	scripts: Allow specifying submodule branch to refresh from There's a script to automate fetching submodule changes. However, this script always fetches the remote master branch, which is not always the right one. For example, for the branch-5.0/next-5.0 pair the correct scylla-seastar branch would be the branch-5.0 one, not master. With this change updating a submodule from a custom branch would be like refresh-submodules.sh <submodule>:<branch> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220322093623.15748-1-xemul@scylladb.com>	2022-03-22 15:18:25 +02:00
Pavel Emelyanov	22d131d40a	Revert "scripts: Detect remote branch to fetch submodules from" This reverts commit `87df37792c`. Scylla branches are not mapped to seastar branches 1-1, so getting the upstream scylla branch doesn't point to the correct seastar one.	2022-03-22 15:18:25 +02:00
Avi Kivity	72c6859c25	Merge "readers: get rid of v1 mutation from fragments" from Botond " The only real user is view building, which is converted to v2 and then the v1 version of the mutation from fragments reader is removed. Tests: unit(dev, release) " * 'v2-only-from-fragments-mutations/v1' of https://github.com/denesb/scylla: readers: remove now unused v1 reader from fragments test/boost: flat_mutation_reader_test: remove reader from fragments test replica/table: migrate generate_and_propagate_view_updates() to v2 replica/table: migrate populate_views() to v2 db/view: convert view_update_builder interface to v2 db/view: migrate view_update_builder to v2	2022-03-22 15:18:25 +02:00
Benny Halevy	9871c757e0	HACKING: refer to backtrace.scylladb.com Fixes #10252 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10253	2022-03-22 15:18:25 +02:00
Raphael S. Carvalho	25be958ab9	compaction: Introduce compaction_descriptor::sstables_size This method can be reused in manager, and will be useful for upcoming cleanup task. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-21 12:55:10 -03:00
Raphael S. Carvalho	c25d8f6770	compaction: Move decision of garbage collection from strategy to task type For compaction to be able to purge expired data, like tombstones, a sstable set snapshot is set in the compaction descriptor. That's a decision that belongs to task type. For example, all regular compactions enable GC, whereas scrub doesn't, for safety reasons. The problem is that the decision is being made by every instantiation of compaction_descriptor in the strategies, which is both unnecessary and also adds lots of boilerplate to the code, making it hard to understand and work with. As sstable set snapshot is an implementation detail, a new method is being added to compaction_descriptor to make the intention clearer, making the interface easier to understand. can_purge_tombstones, used previously by rewrite task only, is being reused for communicating GC intention into task::compact_sstables(). The boilerplate was a pain when adding a new strategy method for the ongoing work on cleanup, described by issue #10097. Another benefit is that we'll now only create a set snapshot when compaction will really run. Before, it could happen that the snapshot would be discarded if the compaction attempt had to be postponed, which is a waste of cpu cycles. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-03-21 12:14:04 -03:00
Botond Dénes	dacc3c08a9	readers/upgrading_consumer: workaround for aarch64 miscompilation On aarch64 the `std::move(mf)` seems to be reordered w.r.t. `flush_tombstones()` in certain circumstances. These circumstances are not clear yet, but while further investigation happens, this patch makes the tests pass on aarch64, unclogging the promotion pipeline. Refs: #10248 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220321122209.71685-1-bdenes@scylladb.com>	2022-03-21 15:07:24 +02:00
Avi Kivity	585c0841c3	Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page. For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing. This patch adds a `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used. As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`. Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of the `_entries` `chunked_managed_vector`. A big difference can be seen when we're reading the entire table, frequently skipping a few rows; which we can test using perf_fast_forward: Before: ``` running: small-partition-skips on dataset small-part Testing scanning small partitions with skips. 
Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu -> 1 0 0.899447 4 1000000 1111794 12284 1113248 1096537 975.5 972 124356 1 0 0 0 0 0 0 0 12032202 29103 8967 100.0% -> 1 1 1.805811 4 500000 276884 907 278214 275977 3655.8 3654 135084 2688 0 3161 4548 5935 0 0 0 7225100 140466 27010 75.6% -> 1 8 0.927339 4 111112 119818 357 120465 119461 3654.0 3654 135084 2685 0 2133 4548 6963 0 0 0 1749663 107922 57502 50.2% -> 1 16 0.790630 4 58824 74401 782 74617 73497 3654.0 3654 135084 2695 0 1975 4548 7121 0 0 0 1019189 109349 90832 42.7% -> 1 32 0.717235 4 30304 42251 243 42266 41975 3654.0 3654 135084 2689 0 1871 4548 7225 0 0 0 619876 109199 156751 37.3% -> 1 64 0.681624 4 15385 22571 244 22815 22286 3654.0 3654 135084 2685 0 1870 4548 7226 0 0 0 407671 105798 285688 34.0% -> 1 256 0.630439 4 3892 6173 24 6214 6150 3549.0 3549 135116 2581 0 1313 3927 6505 0 0 0 232541 100803 1022454 29.1% -> 1 1024 0.313303 4 976 3115 219 3126 2766 1956.0 1956 130608 986 0 0 987 1962 0 0 0 81165 41385 1724979 29.1% -> 1 4096 0.083688 4 245 2928 85 3012 2134 738.8 737 17212 492 244 0 247 491 0 0 0 30500 19406 1999263 24.6% -> 64 1 1.509011 4 984616 652491 2746 660930 649745 3673.5 3654 135084 2687 0 4507 4548 4589 0 0 0 11075882 117074 13157 68.9% -> 64 8 1.424147 4 888896 624160 4446 625675 617713 3654.0 3654 135084 2691 0 4248 4548 4848 0 0 0 10019098 117383 13700 66.5% -> 64 16 1.343276 4 800000 595559 5834 605880 589725 3654.0 3654 135084 2698 0 3989 4548 5107 0 0 0 9043830 124022 14206 64.9% -> 64 32 1.249721 4 666688 533469 5056 536638 526212 3654.0 3654 135084 2688 0 3616 4548 5480 0 0 0 7570848 123043 15377 60.9% -> 64 64 1.154549 4 500032 433097 10215 443312 415001 3654.0 3654 135084 2703 0 3161 4548 5935 0 0 0 5718758 110657 17787 53.2% -> 64 256 1.005309 4 200000 198944 1179 199338 
196989 3935.0 3935 137216 2966 0 690 4048 5592 0 0 0 2398359 110510 27855 51.3% -> 64 1024 0.441913 4 58880 133239 8094 135471 120467 2161.0 2161 131820 1190 0 0 1192 1848 0 0 0 725092 45449 33740 59.7% -> 64 4096 0.124826 4 15424 123564 5958 126814 95101 795.5 794 17400 553 240 0 312 482 0 0 0 199943 20869 46621 41.9% ``` After: ``` running: small-partition-skips on dataset small-part Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu -> 1 0 0.917468 4 1000000 1089956 1422 1091378 1073112 975.5 972 124356 1 0 0 0 0 0 0 0 12032761 29721 8972 100.0% -> 1 1 1.311446 4 500000 381259 3212 384470 377238 1087.0 1083 138420 2 0 4445 4548 4651 0 0 0 7096216 55681 20869 100.0% -> 1 8 0.467975 4 111112 237432 1446 239372 235985 1121.2 1119 143124 9 0 4344 4548 4752 0 0 0 1619944 23502 28844 98.7% -> 1 16 0.337085 4 58824 174508 3410 178451 171099 1117.5 1120 143276 11 0 4319 4548 4777 0 0 0 883692 19152 37460 96.8% -> 1 32 0.262798 4 30304 115313 1222 116535 112400 1070.2 1066 135620 166 26 4354 4548 4742 0 0 0 483185 18856 54275 94.9% -> 1 64 0.283954 4 15385 54181 531 56177 53650 2022.5 2040 137036 319 19 4351 4548 4745 0 0 0 292766 32998 102276 84.9% -> 1 256 0.207020 4 3892 18800 575 19105 17520 1315.5 1334 136072 418 24 3703 3927 4115 0 0 0 118400 27427 292146 82.1% -> 1 1024 0.164396 4 976 5937 57 5993 5842 1208.2 1195 135384 568 14 932 987 1030 0 0 0 62999 27554 503559 70.0% -> 1 4096 0.085079 4 245 2880 108 2987 2714 635.8 634 26468 248 246 233 247 258 0 0 0 31264 12872 1546404 37.4% -> 64 1 1.073331 4 984616 917346 7614 923983 909314 1812.2 1824 136792 11 20 4544 4548 4552 0 0 0 10971661 54538 9919 99.6% -> 64 8 1.024389 4 888896 867733 6327 870429 845215 3027.2 3072 138212 31 0 4523 4548 4573 0 0 0 9933078 68059 10050 99.5% 
-> 64 16 0.978754 4 800000 817366 7802 827665 809564 3012.2 3008 139884 39 0 4486 4548 4610 0 0 0 8947041 64050 10302 98.1% -> 64 32 0.837266 4 666688 796267 10312 806579 785370 2275.8 2266 139672 29 0 4465 4548 4631 0 0 0 7458644 50754 10564 97.8% -> 64 64 0.645627 4 500032 774490 4713 779203 768432 1136.8 1137 145428 8 0 4438 4548 4658 0 0 0 5593168 29982 10938 98.4% -> 64 256 0.386192 4 200000 517877 22509 544067 495368 1134.8 1136 145300 109 0 2135 4048 4147 0 0 0 2270291 22840 13682 94.5% -> 64 1024 0.238617 4 58880 246755 55856 305110 190899 1176.0 1118 135324 451 13 625 1192 1223 0 0 0 701262 24418 17323 71.1% -> 64 4096 0.133340 4 15424 115674 14837 117978 99072 974.0 961 27132 366 347 99 312 383 0 0 0 209595 20657 43096 50.4% ``` For single partition reads, the index_reader is modified to behave in practically the same way, as before the change (not reading ahead past the page with the partition). For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in traces below). If we enabled read aheads in that case, we would perform 2 16KB reads. 
Relevant traces: Before: ``` ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] \| 2021-07-23 15:22:25.847362 \| 127.0.0.1 \| 148 \| 127.0.0.1 ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] \| 2021-07-23 15:22:25.900996 \| 127.0.0.1 \| 53782 \| 127.0.0.1 ``` After: ``` ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] \| 2021-07-23 15:19:37.380033 \| 127.0.0.1 \| 149 \| 127.0.0.1 ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] \| 2021-07-23 15:19:37.433662 \| 127.0.0.1 \| 53777 \| 127.0.0.1 ``` Tests: unit(dev) Closes #9063 * github.com:scylladb/scylla: sstables: index_reader: optimize single partition reads sstables: use read-aheads in the index reader sstables: index_reader: remove unused members from index reader context	2022-03-21 13:47:28 +02:00
Nadav Har'El	f76f6dbccb	secondary index: avoid special characters in default index names In CQL, table names are limited to so-called word characters (letters, numbers and underscores), but column names don't have such a limitation. When we create a secondary index, its default name is constructed from the column name - so can contain problematic characters. It can include even the "/" character. The problem is that the index name is then used, like a table name, to create a directory with that name. The test included in this patch demonstrates that before this patch, this can be misused to create subdirectories anywhere in the filesystem, or to crash Scylla when it fails to create a directory (which it considers an unrecoverable I/O error). In this patch we do what Cassandra does - remove all non-word characters from the indexed column name before constructing the default index name. In the included test - which can run on both Scylla and Cassandra - we verify that the constructed index name is the same as in Cassandra, which is useful to know (e.g., because knowing the index name is needed to DROP the index). Also, this patch adds a second line of defense against the security problem described above: It is now an error to create a schema with a slash or null (the two characters not allowed in Unix filenames) in the keyspace or table names. So if the first line of defense (CQL checking the validity of its commands) fails, we'll have that second line of defense. I verified that if I revert the default-index-name fix, the second line of defense kicks in, and the index creation is aborted and cannot create files in the wrong place to crash Scylla. Fixes #3403 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220320162543.3091121-1-nyh@scylladb.com>	2022-03-20 18:33:48 +02:00
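A minimal sketch of the sanitization described in the commit above, written in Python rather than Scylla's C++. `default_index_name` and `check_schema_name` are hypothetical helpers, and the `<table>_<column>_idx` layout is an assumption made for illustration; only the stripping of non-word characters and the rejection of `/` and null come from the commit message.

```python
import re


def default_index_name(table: str, column: str) -> str:
    """Build a default index name from a column name (hypothetical helper).

    Non-word characters (anything other than letters, digits and
    underscore) are removed from the column name first, so the result
    cannot contain "/" or other characters that are unsafe in a
    directory name. The "<table>_<column>_idx" layout is an assumption.
    """
    cleaned = re.sub(r"\W", "", column, flags=re.ASCII)
    return f"{table}_{cleaned}_idx"


def check_schema_name(name: str) -> None:
    """Second line of defense (hypothetical helper): refuse names that
    contain the two characters not allowed in Unix filenames."""
    if "/" in name or "\0" in name:
        raise ValueError(f"invalid character in schema name: {name!r}")
```

With this sketch, a column named `../../etc/x` in table `tbl` yields the index name `tbl_etcx_idx`, and `check_schema_name("ks/evil")` raises `ValueError`.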
Takuya ASADA	59c72d5d60	scylla_prepare: print Traceback with current user-friendly messages In `e1b15ba`, we introduced a user-friendly error message for when an exception occurred while generating perftune.yaml. However, it became difficult to investigate bugs since we dropped the traceback. To resolve this problem, let's print both the traceback and the user-friendly messages. Related #10050 Closes #10140	2022-03-20 16:55:18 +02:00
Michał Chojnowski	f422e18906	cql3: restrictions: statement_restrictions: avoid an unnecessary vector copy A minor optimization. Closes #10231	2022-03-20 15:40:46 +02:00
Tomasz Grabiec	cd5fec8a23	Merge "raft: re-advertise gossiper features when raft feature support changes" from Pavel Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` raft feature. This small series consists of 3 parts to fix the handling of supported features for raft: 1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the `raft_group_registry`. 2. Update `system.local#supported_features` directly in the `feature_service::support()` method. 3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper value in the support callback within `raft_group_registry`. * manmanson/track_supported_set_recalculation_v7: raft: re-advertise gossiper features when raft feature support changes raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft gms: feature_service: update `system.local#supported_features` when feature support changes test: cql_test_env: enable features in a `seastar::thread`	2022-03-18 12:34:17 +01:00
Pavel Solodovnikov	ebc2178ea5	raft: re-advertise gossiper features when raft feature support changes Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-18 09:54:29 +03:00
Pavel Solodovnikov	011942dcce	raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft Move the listener from feature service to the `raft_group_registry`. Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT` feature when the former is enabled. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-18 09:54:25 +03:00
Pavel Solodovnikov	7ea4d44508	gms: feature_service: update `system.local#supported_features` when feature support changes Also, change the signature of `support()` method to return `future<>` since it's now a coroutine. Adjust existing call sites. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-18 09:54:21 +03:00
Pavel Solodovnikov	724ea7aa38	test: cql_test_env: enable features in a `seastar::thread` Each feature can have an associated `when_enabled` callback registered, which is assumed to run in the thread context, so wrap the `enable()` call in a seastar thread. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-18 09:54:15 +03:00
Avi Kivity	aab052c0d5	Merge 'replica/database: truncate: temporarily disable compaction on table and views before flush' from Benny Halevy Flushing the base table triggers view building and corresponding compactions on the view tables. Temporarily disable compaction on both the base table and all its views before flush and snapshot, since those flushed sstables are about to be truncated anyway right after the snapshot is taken. This should make truncate go faster. In the process, this series also embeds `database::truncate_views` into `truncate` and coroutinizes both. Refs #6309 Test: unit(dev) Closes #10203 * github.com:scylladb/scylla: replica/database: truncate: fixup indentation replica/database: truncate: temporarily disable compaction on table and views before flush replica/database: truncate: coroutinize per-view logic replica/database: open-code truncate_view in truncate replica/database: truncate: coroutinize run_with_compaction_disabled lambda replica/database: coroutinize truncate compaction_manager: add disable_compaction method	2022-03-17 17:24:20 +02:00
Pavel Emelyanov	87df37792c	scripts: Detect remote branch to fetch submodules from There's a script to automate fetching submodule changes. However, this script always fetches the remote master branch, which is not always the right one. The correct branch can be detected by checking the current remote-tracking scylla branch, which should coincide with the submodule one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220317085018.11529-1-xemul@scylladb.com>	2022-03-17 12:21:29 +02:00
Avi Kivity	77f330f393	Merge "readers: retire v1 generating reader implementation" from Botond " The generating reader is a reader which converts a functor returning mutation fragments to a mutation reader. We currently have 2 generating reader implementations: one operating with a v1 functor and one with a v2 one. This patch-set converts the v1 functor based one to a v2 reader, by adapting the v1 functor to a v2 functor and reusing the v2 reader implementation. Tests are also added to both variants. Tests: unit(dev) " * 'generating-reader-v2/v1' of https://github.com/denesb/scylla: test/boost: mutation_reader_test: add tests for generating reader test: export squash_mutations() into lib/mutation_source_test.hh readers: add next partition adaptor readers: implement generating_reader from v1 generator via adaptor readers: upgrade_to_v2(): reimplement in terms of upgrading_consumer readers: add upgrading_consumer readers: generating_reader: use noncopyable_function<> readers: merge generating.hh into generating_v2.hh readers/generating.hh: return v2 reader from make_generating_reader()	2022-03-17 12:19:25 +02:00
Botond Dénes	d15999a58e	readers: remove now unused v1 reader from fragments	2022-03-17 11:03:16 +02:00
Botond Dénes	3ea1240fb9	test/boost: flat_mutation_reader_test: remove reader from fragments test	2022-03-17 11:03:16 +02:00
Botond Dénes	e12c543d3f	replica/table: migrate generate_and_propagate_view_updates() to v2	2022-03-17 10:51:25 +02:00
Botond Dénes	4b9219a209	replica/table: migrate populate_views() to v2	2022-03-17 10:51:05 +02:00
Botond Dénes	909be0b9d7	db/view: convert view_update_builder interface to v2 The constructor and the make_ factory method now take v2 readers. Immediate users are patched, with conversions if needed.	2022-03-17 10:50:50 +02:00
Botond Dénes	0740019e4d	db/view: migrate view_update_builder to v2 To avoid noise, the interface is left as v1 and inbound readers are converted in the constructor.	2022-03-17 10:47:55 +02:00
Botond Dénes	c450508954	Merge "Introduce sharded<system_keyspace> instance" from Pavel Emelyanov " Making the system-keyspace into a standard sharded instance will help to fix several dependency knots. First, the global qctx and local-cache both will be moved onto the sys-ks, all their users will be patched to depend on system-keyspace. Now it's not quite so, but we're moving towards this state. Second, snitch instance now sits in the middle of another dependency loop. To untie it, the preferred ip and dc/rack info should be moved onto system keyspace altogether (now it's scattered over several places). The sys-ks thus needs to be a sharded service with some state. This set makes system-keyspace a sharded instance, equips it with all the dependencies it needs and passes it as dependency into storage service, migration manager and API. This helps eliminate a good portion of global qctx/cache usage and prepares the ground for snitch rework. tests: unit(dev) v1: unit(debug), dtest.simple_boot_shutdown(dev) " * 'br-sharded-system-keyspace-instance-2' of https://github.com/xemul/scylla: (25 commits) system_keyspace: Make load_host_ids non-static system_keyspace: Make load_tokens non-static system_keyspace: Make remove_endpoint and update_tokens non-static system_keyspace: Coroutinize update_tokens system_keyspace: Coroutinize remove_endpoint system_keyspace: Make update_cached_values non-static system_keyspace: Coroutinize update_peer_info system_keyspace: Make update_schema_version non-static schema_tables: Add sharded<system_keyspace> argument to update_schema_version_and_announce replica: Push sharded<system_keyspace> down to parse_system_tables api: Carry sharded<system_keyspace> reference along storage_service: Keep sharded<system_keyspace> reference migration_manager: Keep sharded<system_keyspace> reference system_keyspace: Remove temporary qp variable system_keyspace: Make get_preferred_ips non-static system_keyspace: Make cache_truncation_record non-static 
system_keyspace: Make check_health non-static system_keyspace: Make build_bootstrap_info non-static system_keyspace: Make build_dc_rack_info non-static system_keyspace: Make setup_version non-static ...	2022-03-17 08:16:29 +02:00
Botond Dénes	9f95042c2b	test/boost: mutation_reader_test: add tests for generating reader	2022-03-17 08:08:01 +02:00
Botond Dénes	4243cd395d	test: export squash_mutations() into lib/mutation_source_test.hh This method used to be a static one in boost/flat_mutation_reader_test.cc. Turns out it is useful for other tests based on the mutation source test suite, so move it into the header of the latter to make it accessible.	2022-03-17 08:08:01 +02:00
Botond Dénes	9e3d8cb06f	readers: add next partition adaptor Provides a wrapper with a `next_partition()` implementation for readers that can't have one. Mainly for testing purposes.	2022-03-17 08:08:01 +02:00
Botond Dénes	3594f836fc	readers: implement generating_reader from v1 generator via adaptor Adaptor converts the `noncopyable_function<future<mutation_fragment_opt>>` to the v2 equivalent, so we can have a single generating reader implementation. The adaptor uses the upgrading_consumer reusable upgrade component to implement the actual upgrade.	2022-03-17 08:08:01 +02:00
Botond Dénes	47b806393b	readers: upgrade_to_v2(): reimplement in terms of upgrading_consumer Use the reusable upgrading_consumer introduced in the previous patch as the v2 upgrade implementation.	2022-03-17 08:08:01 +02:00
Botond Dénes	ffeeb83edf	readers: add upgrading_consumer Upgrading a v1 stream to a v2 one is a common task that currently requires duplicating the upgrade logic in all components that want to do this. This patch extracts the upgrade logic from `upgrade_to_v2()` into a reusable component to promote code reuse.	2022-03-17 08:08:01 +02:00
Botond Dénes	fcf15fda94	readers: generating_reader: use noncopyable_function<> std::function<> requires the functor it wraps to be copyable, which is an unnecessarily strict requirement. To relax this, we use noncopyable_function<> instead. Since the former seems to lack some disambiguation magic of the latter, we add `_v1` and `_v2` postfixes to manually disambiguate.	2022-03-17 06:53:44 +02:00
Botond Dénes	35bbd54946	readers: merge generating.hh into generating_v2.hh Both variants return a v2 reader and we are going to keep both for a time to come.	2022-03-17 06:52:28 +02:00
Botond Dénes	7844ff9912	readers/generating.hh: return v2 reader from make_generating_reader() For now, the (v1) reader is just upgraded to v2 behind the scenes.	2022-03-17 06:51:20 +02:00
Gleb Natapov	a1604aa388	raft: make raft requests abortable This patch adds the ability to pass an abort_source to the raft request APIs (add_entry, modify_config) to make them abortable. A request issuer does not always want to wait for a request to complete, for instance because a client disconnected or because it is no longer interested in waiting due to a timeout. After this patch, it can abort waiting for such requests through an abort source. Note that aborting a request only aborts the wait for it to complete; it does not mean that the request will not eventually be executed. Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>	2022-03-16 18:38:01 +01:00
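The distinction drawn in the commit above (aborting the wait, not the request) can be sketched with Python's `asyncio`. This is only an analogy under stated assumptions: `wait_abortable` and `demo` are hypothetical names, an `asyncio.Event` stands in for the abort source, and nothing here reflects the actual raft API.

```python
import asyncio


async def wait_abortable(request: asyncio.Future, abort: asyncio.Event) -> bool:
    """Wait for `request` unless `abort` fires first (hypothetical helper).

    Returns True if the request completed while we were waiting.
    The request itself is never cancelled: only the wait is abandoned.
    """
    abort_waiter = asyncio.ensure_future(abort.wait())
    done, _ = await asyncio.wait(
        {request, abort_waiter}, return_when=asyncio.FIRST_COMPLETED
    )
    abort_waiter.cancel()  # no-op if the abort waiter already finished
    return request in done


async def demo():
    applied = []

    async def add_entry():
        # Stand-in for a request that takes a while to commit.
        await asyncio.sleep(0.05)
        applied.append("entry")

    request = asyncio.ensure_future(add_entry())
    abort = asyncio.Event()
    abort.set()  # the issuer gives up immediately
    completed = await wait_abortable(request, abort)
    await request  # the request still runs to completion afterwards
    return completed, applied
```

Running `asyncio.run(demo())` shows the wait returning before the request finishes, while the entry is still applied later.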
Benny Halevy	a1d0f089c8	replica: distributed_database: populate_column_family: trigger offstrategy compaction only for the base directory In https://github.com/scylladb/scylla/issues/10218 we see off-strategy compaction happening on a table during the initial phases of `distributed_loader::populate_column_family`. It is caused by triggering off-strategy compaction too early, when sstables are populated from the staging directory in `a144d30162`. We need to trigger off-strategy compaction only for the base table directory, never for the staging or quarantine dirs. Fixes #10218 Test: unit(dev) DTest: materialized_views_test.py::TestInterruptBuildProcess Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220316152812.3344634-1-bhalevy@scylladb.com>	2022-03-16 18:57:00 +02:00
Botond Dénes	0ea6dddc00	test/boost/mutation_reader_test: remove unused puppet_reader All users use puppet_reader_v2 now. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220316135525.211753-1-bdenes@scylladb.com>	2022-03-16 18:57:00 +02:00
Botond Dénes	afc824a109	test/boost/flat_mutation_reader_test: test_flat_mutation_reader_consume_single_partition: make ckrange conditionally inclusive Depending on the bound weight of the position of the last fragment we expect to read. Currently the range is unconditionally exclusive, which might lead to an artificial difference between the read and expected data, due to a fragment being possibly omitted. Fixes #10229. Tests: unit(boost/flat_mutation_reader_test:test_flat_mutation_reader_consume_single_partition) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220304133515.74586-1-bdenes@scylladb.com>	2022-03-16 18:57:00 +02:00
Avi Kivity	975b0c0b03	Merge "tools/scylla-sstable: add validate-checksums and decompress" from Botond " This patchset adds two new operations to scylla-sstable: * validate-checksums - helps identify whether an sstable is intact, by checking the digest and the per-chunk checksums against the data on disk. * decompress - helps when one wants to manually examine the content of a compressed sstable. Refs: #497 Tests: unit(dev) " * 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla: tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/ tools/scylla-sstable: add decompress operation tools/scylla-sstables: add validate-checksums operation sstables/sstable: add validate_checksums() sstables/sstable: add raw_stream option to data_stream() sstables/sstable: make data_stream() and data_read() public utils/exceptions: add maybe_rethrow_exception()	2022-03-16 18:56:48 +02:00
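The idea behind validate-checksums (comparing a whole-file digest and per-chunk checksums against the data on disk) can be sketched as follows. This is a simplified illustration, not the sstable format: `compute_checksums` and `validate_checksums` are hypothetical helpers, and the use of CRC32 and a caller-chosen chunk size are assumptions.

```python
import zlib


def compute_checksums(data: bytes, chunk_size: int):
    """Return (per-chunk checksums, whole-data digest) for `data`.

    CRC32 and the fixed chunk size are assumptions for illustration.
    """
    per_chunk = [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
    digest = zlib.crc32(data)
    return per_chunk, digest


def validate_checksums(data: bytes, chunk_size: int,
                       expected_chunks, expected_digest) -> bool:
    """Recompute both kinds of checksum and compare with the stored ones."""
    per_chunk, digest = compute_checksums(data, chunk_size)
    return per_chunk == expected_chunks and digest == expected_digest
```

A caller would store the output of `compute_checksums` when writing the data and later pass it to `validate_checksums`; any mismatch indicates that the data read back differs from what was written.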
Avi Kivity	c4a992564b	Merge "Assorted fixes for Fedora 36 build" from Pavel S " This mini-series contains a few trivial fixes to be able to build scylla on Fedora 36 Pre-Release, which will soon enter "Beta" state. It's mostly fixes due to some changes to external dependencies, e.g. boost.outcome and libfmt. Tests: unit(dev) " * 'fc36_build_fixes_v1' of https://github.com/ManManson/scylla: schema: fix build issues with libstdc++ 12 treewide: fix compilation issues with fmtlib 8.1.0+ utils/result.hh: add missing header includes for boost.outcome	2022-03-16 18:56:02 +02:00
Pavel Emelyanov	e8ba395fea	system_keyspace: Make load_host_ids non-static Same as previous patch -- just use the reference from storage service Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	28204cd83d	system_keyspace: Make load_tokens non-static Called from storage service that has system-keyspace instances Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	8f977814bc	system_keyspace: Make remove_endpoint and update_tokens non-static Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	3f0f94b081	system_keyspace: Coroutinize update_tokens While at it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	7b2f142e2d	system_keyspace: Coroutinize remove_endpoint Not to capture 'this' all over the method in next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	7d0d5642c0	system_keyspace: Make update_cached_values non-static The update_table() helper template too. And the update_peer_info as well. It can stop using global qctx and cache after that Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	5a1f7193b0	system_keyspace: Coroutinize update_peer_info Not to carry 'this' over captures in next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	c15359165d	system_keyspace: Make update_schema_version non-static It's called from two places -- .setup() and schema_tables code. Both have the instance hanging around, so the method can be de-marked static and set free from global qctx Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	b80d5f8900	schema_tables: Add sharded<system_keyspace> argument to update_schema_version_and_announce All its (indirect) callers had been patched to have it, now it's possible to have the argument in it. Next patch will make use of it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	009c449cc3	replica: Push sharded<system_keyspace> down to parse_system_tables The method needs to call merge_schema() that will need system keyspace instance at hand. The parse_s._t. method is boot-time one, pushing the main-local instance through it is fine Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	bd4beeeebe	api: Carry sharded<system_keyspace> reference along There's an API call to recalculate schema version that needs the system_keyspace instance at hand Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	f18a80852e	storage_service: Keep sharded<system_keyspace> reference Storage service uses system keyspace on boot heavily Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	42e733bdf7	migration_manager: Keep sharded<system_keyspace> reference The main target here is system_keyspace::update_schema_version() which is now static, but needs to have system_keyspace at "this". Migration manager is one of the places that calls that method indirectly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	835f39e0ba	system_keyspace: Remove temporary qp variable After the previous patches it's possible to clean local stack of .setup() a little bit Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	ece4448ea9	system_keyspace: Make get_preferred_ips non-static And mark the method private while at it, because it is such Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	a095e8d92d	system_keyspace: Make cache_truncation_record non-static This one is a bit trickier than its four predecessors. The qctx's qp().execute_cql() is replaced with qp().execute_internal() for symmetry with the rest. Without data args it's the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	f54473427d	system_keyspace: Make check_health non-static Yet another same step. Drop static keyword and patch out globals. Get config.cluster_name from _db while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	5944d5f663	system_keyspace: Make build_bootstrap_info non-static The same -- drop static and forget global qctx and cache Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	08174eb868	system_keyspace: Make build_dc_rack_info non-static Same here -- remove static and patch out global qctx and cache. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	66beaad1e5	system_keyspace: Make setup_version non-static Just remove static mark and stop using global qctx. Grab config from _db instead of argument while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	00a345c4d8	system_keyspace: Copy execute_internal() from query context Before patching system_keyspace methods to use query processor from its instance, the respective call is needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 14:24:40 +03:00
Pavel Emelyanov	f4c185c30e	system_keyspace: Make setup method non-static It's called only on start and actively uses both qctx and local cache. Next patches will fix the whole setup code to stop using global qctx/cache. For now setup invocation is left in its place, but it must really happen in start() method. More patching is needed to make it work. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 13:57:59 +03:00
Pavel Emelyanov	b761e558b5	system_keyspace: Keep local_cache reference on board For now it's a reference, but all users of the cache will be eventually switched into using system_keyspace. In cql-test-env cache starting happens earlier than it was before, but that's OK, it just initializes empty instances. In main cache starts at the same time as before patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 13:57:59 +03:00
Pavel Emelyanov	1bcb6c13a5	system_keyspace: Move minimal_setup into start Start happens at exactly the same place. One thing to take care of is that it happens on all shards. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 13:57:59 +03:00
Pavel Emelyanov	7ef69b8189	system_keyspace: Make sharded object The db::system_keyspace was made a class some time ago, time to create a standard sharded<> object out of it. It needs query processor and database. None of those dependencies is started early enough, so the object for now starts in two steps -- early instances creation and late start. The instances will carry qctx and local_cache on board and all the services that need those two will depend on system-keyspace. Its start happens at exactly the same place where system_keyspace::setup happens thus any service that will use system_keyspace will be on the same safe side as it is now. In the further future the system_keyspace will be equipped with its own query processor backed by local replica database instance, instead of the whole storage proxy as it is now. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-16 13:57:59 +03:00
Pavel Solodovnikov	9a5aae654f	schema: fix build issues with libstdc++ 12 Switch from using `std::map::insert` to `std::map::emplace` in the `get_sharder()` function, since we are constructing a temporary value anyway. Also, use `std::make_pair` instead of initializer list because for some reason Clang 13 w/ libstdc++ 12 argues about not being able to find a suitable overload. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-16 12:34:22 +03:00
Pavel Solodovnikov	95c8d65949	treewide: fix compilation issues with fmtlib 8.1.0+ Due to `fd62fba985` scoped enums are not automatically converted to integers anymore, this is the intended behavior, according to the fmtlib devs. A bit nicer solution would be to use `std::to_underlying` instead of a direct `static_cast`, but it's not available until C++23 and some compilers are still missing the support for it. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-16 12:31:50 +03:00
Pavel Solodovnikov	95dc534d0c	utils/result.hh: add missing header includes for boost.outcome Looks like internal boost.outcome headers don't include some of needed dependencies, so do that manually in our headers. For some reason it worked before, but started to fail when building on Fedora 36 setup. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-03-16 12:28:47 +03:00
Raphael S. Carvalho	0cc717ee86	compaction_manager: Retrieve and register files in rewrite_sstables() atomically The atomicity was lost in commit `a2a5e530f0`. Registration of compacting SSTables now happens in rewrite_sstables_compaction_task ctor, but that's risky because a regular compaction could pick those same files if run_with_compaction_disabled() defers after the callback passed to it returns, and before run__w__c__d() caller has a chance to run. The deferring point is very much possible, because submit() (submits a regular job) is called when run__w__c__d() reenables compaction internally. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220315182857.121479-1-raphaelsc@scylladb.com>	2022-03-16 09:58:16 +02:00
Raphael S. Carvalho	58e520ab1d	compaction: Move run_off_strategy_compaction() into compaction manager Compaction manager is calling back the table to run off-strategy compaction, but the logic clearly belongs to manager which should perform the operation independently and only call table to update its state with the result. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220315174504.107926-2-raphaelsc@scylladb.com>	2022-03-16 09:55:52 +02:00
Raphael S. Carvalho	1bae803a8b	table: Add maintenance_sstable_set() Let's expose maintenance set, to allow the implementation of off-strategy compaction to be moved into the compaction manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220315174504.107926-1-raphaelsc@scylladb.com>	2022-03-16 09:55:51 +02:00
Jan Ciolek	2e7009f427	cql3: expr: is_supported_by: Return false for subscripted values is_supported_by checks whether a given restriction can be supported by some index. Currently when a subscripted value, e.g `m[1]` is encountered, we ignore the fact that there is a subscript and ask whether an index can support the `m` itself. This looks like unintentional behaviour leftover from the times when column_value had a sub field, which could be easily forgotten about. Scylla doesn't support indexes on collection elements at all, so simply returning false there seems like a good idea. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes #10227	2022-03-15 20:19:33 +02:00
Mikołaj Sielużycki	a8cb7bf677	readers: Make result_collector use queue reader handle v2. Transitively modifies streaming_virtual_table::as_mutation_source, along with the tests. Closes #10223	2022-03-15 17:02:28 +02:00
Botond Dénes	15a44805ba	tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/ It more accurately describes what the role of this flag actually is.	2022-03-15 14:52:20 +02:00
Botond Dénes	d10700a83d	tools/scylla-sstable: add decompress operation Useful when one wants to manually check the content of a compressed sstable.	2022-03-15 14:52:20 +02:00
Botond Dénes	488d145dc8	tools/scylla-sstables: add validate-checksums operation Useful for determining whether sstables have been corrupted by factors outside of scylla, e.g. the I/O subsystem.	2022-03-15 14:52:20 +02:00
Botond Dénes	ddf9dee9d8	sstables/sstable: add validate_checksums() Sstables have two kinds of checksums: per-chunk checksums and full-checksum (digest) calculated over the entire content of Data.db. The full-checksum (digest) is stored in Digest.crc (component_type::Digest). When compression is used, the per-chunk checksum is stored directly inside Data.db, after each compressed chunk. These are validated on read, when decompressing the respective chunks. When no compression is used, the per-chunk checksum is stored separately in CRC.db (component_type::CRC). Chunk size is defined and stored in said component as well. In both compressed and uncompressed sstables, checksums are calculated on the data that is actually written to disk, so in case of compressed data, on the compressed data. This method validates both the full checksum and the per-chunk checksum for the entire Data.db.	2022-03-15 14:52:15 +02:00
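The two-level scheme described in the `validate_checksums()` commit above (per-chunk checksums plus one digest over the whole data) can be sketched as follows. This is a minimal illustration under stated assumptions, not ScyllaDB's implementation: `crc32`, `checksums`, `make_checksums` and the free function `validate_checksums` are hypothetical names, the data lives in memory rather than in Data.db, and a bitwise CRC-32 stands in for whichever checksum the sstable format actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320); a stand-in checksum for this sketch.
inline uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right and apply the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical in-memory model: per-chunk checksums plus one digest over all bytes.
struct checksums {
    size_t chunk_size;               // must be non-zero
    std::vector<uint32_t> per_chunk; // one entry per chunk, last chunk may be short
    uint32_t digest;                 // checksum over the entire content
};

inline checksums make_checksums(const std::vector<uint8_t>& data, size_t chunk_size) {
    checksums c{chunk_size, {}, crc32(data.data(), data.size())};
    for (size_t off = 0; off < data.size(); off += chunk_size) {
        const size_t len = std::min(chunk_size, data.size() - off);
        c.per_chunk.push_back(crc32(data.data() + off, len));
    }
    return c;
}

// Recomputes both levels from the bytes as stored and compares them with the expected values.
inline bool validate_checksums(const std::vector<uint8_t>& data, const checksums& expected) {
    const checksums actual = make_checksums(data, expected.chunk_size);
    return actual.digest == expected.digest && actual.per_chunk == expected.per_chunk;
}
```

Keeping both levels lets a validator report which chunk is damaged while the digest still covers the file as a whole.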
Botond Dénes	bf335c9e7a	sstables/sstable: add raw_stream option to data_stream() Optionally provide access to the underlying data as-is, without decompression.	2022-03-15 14:47:27 +02:00
Botond Dénes	9bc80b42cd	sstables/sstable: make data_stream() and data_read() public	2022-03-15 14:42:45 +02:00
Botond Dénes	7a75862570	utils/exceptions: add maybe_rethrow_exception() Helps with the common coroutine exception-handling idiom: std::exception_ptr ex; try { ... } catch (...) { ex = std::current_exception(); } // release resource(s) maybe_rethrow_exception(std::move(ex)); return result;	2022-03-15 14:42:45 +02:00
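The idiom quoted in the `maybe_rethrow_exception()` commit above can be written out as a small standalone sketch. The helper below is a hypothetical stand-in written for illustration (the real one lives in utils/exceptions and may differ), and `run_with_cleanup` with its `released` flag is an invented example of a function that must release a resource on both the success and the failure path.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in: rethrow only if an exception was actually captured.
inline void maybe_rethrow_exception(std::exception_ptr ex) {
    if (ex) {
        std::rethrow_exception(std::move(ex));
    }
}

// Invented example: compute a result, always run the cleanup, then surface any error.
inline int run_with_cleanup(bool fail, bool& released) {
    std::exception_ptr ex;
    int result = 0;
    try {
        if (fail) {
            throw std::runtime_error("operation failed");
        }
        result = 42;
    } catch (...) {
        ex = std::current_exception();
    }
    released = true; // release resource(s); in a coroutine this step could be awaited
    maybe_rethrow_exception(std::move(ex));
    return result;
}
```

The point of capturing into `std::exception_ptr` is that the cleanup runs outside the `catch` block, which matters in coroutines because `co_await` is not allowed inside a handler.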
Botond Dénes	61028ad718	evictable_reader: avoid preemption pitfall around waiting for readmission Permits have to wait for re-admission after having been evicted. This happens via `reader_permit::maybe_wait_readmission()`. The user of this method -- the evictable reader -- uses it to re-wait admission when the underlying reader was evicted. There is one tricky scenario however, when the underlying reader is created for the first time. When the evictable reader is part of a multishard query stack, the created reader might in fact be a resumed, saved one. These readers are kept in an inactive state until actually resumed. The evictable reader shares its permit with the to-be-resumed reader so it can check whether it has been evicted while saved and needs to wait readmission before being resumed. In this flow it is critical that there is no preemption point between this check and actually resuming the reader, because if there is, the reader might end up actually recreated, without having waited for readmission first. To help avoid this situation, the existing `maybe_wait_readmission()` is split into two methods: * `bool reader_permit::needs_readmission()` * `future<> reader_permit::wait_for_readmission()` The evictable reader can now ensure there is no preemption point between `needs_readmission()` and resuming the reader. Fixes: #10187 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>	2022-03-15 14:37:22 +02:00
Avi Kivity	39504dea61	Merge "Convert result builders to v2" from Botond " Namely the query result writer and the reconcilable result builder, used for building results for regular queries and mutation queries (used in read repair) respectively. With this, there are no users left for the v1 output of the compactor, so we remove that, making the compactor v2 all-the-way (and simpler). This means that for regular queries, a downgrade phase is eliminated completely, as regular queries don't store range tombstone in their result, so no need to convert them. Tests: unit(dev, release, debug) " * 'result-builders-v2/v1' of https://github.com/denesb/scylla: reconcilable_result_builder: remove v1 support query_result_builder: remove v1 support mutation_compactor: drop v1 related code-paths mutation_compactor: drop v1 support altogether from the API tree: migrate to the v2 consumer APIs test/boost/mutation_test: remove v1 specific test code querier: switch to v2 compactor output reconcilable_result_builder: add v2 support query_result_writer: add v2 support query_result_builder: make consume(range_tombstone) noop	2022-03-15 14:32:58 +02:00
Benny Halevy	70e1fdb0c8	replica/database: truncate: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 14:02:35 +02:00
Benny Halevy	5ca45b5c32	replica/database: truncate: temporarily disable compaction on table and views before flush Flushing the base table triggers view building and corresponding compactions on the view tables. Temporarily disable compaction on both the base table and all its views before flush and snapshot since those flushed sstables are about to be truncated anyway right after the snapshot is taken. This should make truncate go faster. Refs #6309 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 14:02:29 +02:00
Mikołaj Sielużycki	7ce0d380d4	readers: Update tests to use make_queue_reader_v2. Closes #10220	2022-03-15 13:56:50 +02:00
Benny Halevy	20fcf05586	replica/database: truncate: coroutinize per-view logic Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 12:51:20 +02:00
Benny Halevy	c6d72a1814	replica/database: open-code truncate_view in truncate truncate-views is called only internally from database::truncate. Next step will be to disable compactions on the base table and view before flush and snapshot. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 12:47:46 +02:00
Benny Halevy	613e65e5d0	replica/database: truncate: coroutinize run_with_compaction_disabled lambda Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 12:47:13 +02:00
Benny Halevy	0318f48110	replica/database: coroutinize truncate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 12:46:41 +02:00
Piotr Dulikowski	5d7b2c6515	utils/result_try: prevent exceptions from being caught multiple times The `result_try` and `result_futurize_try` are supposed to handle both failed results and exceptions in a way similar to a try..catch block. In order to catch exceptions, the metaprogramming machinery invokes the fallible code inside a stack of try..catch blocks, each one of them handling one exception. This is done instead of creating a single try..catch block, as to my knowledge it is not possible to create a try..catch block with the number of "catch" clauses depending on a variadic template parameter pack. Unfortunately, a "try" with multiple "catches" is not functionally equivalent to a "try block stack". Consider the following code: try { try { return execute_try_block(); } catch (const derived_exception&) { // 1 } } catch (const base_exception&) { // 2 } If `execute_try_block` throws `derived_exception` and the (1) catch handler rethrows this exception, it will also be handled in (2), which is not the same behavior as if the try..catch stack was "flat". This causes wrong behavior in `result_try` and `result_futurize_try`. The following snippet has the same, wrong behavior as the previous one: return utils::result_try([&] { return execute_try_block(); }, utils::result_catch<derived_exception>([&] (const auto&& ex) { // 1 }), utils::result_catch<base_exception>([&] (const auto&& ex) { // 2 }); This commit fixes the problem by adding a boolean flag which is set just before a catch handler is executed. If another catch handler is accidentally matched due to exception rethrow, the catch handler is skipped and exception is automatically rethrown. Tests: unit(dev, debug) Fixes: #10211 Closes #10216	2022-03-15 11:42:42 +02:00
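The difference between a flat try..catch and a stack of nested try blocks, and the flag-based fix described in the `result_try` commit above, can be reproduced with plain exceptions. This is only a sketch of the idea: the exception types and the three functions are invented for illustration, and the actual `utils::result_try` machinery is template-based and not shown here.

```cpp
#include <cassert>
#include <exception>
#include <string>

struct base_exception : std::exception {};
struct derived_exception : base_exception {};

// Nested stack: the rethrow from handler 1 is caught again by handler 2.
inline std::string nested_stack() {
    std::string trace;
    try {
        try {
            throw derived_exception();
        } catch (const derived_exception&) {
            trace += "1";
            throw;
        }
    } catch (const base_exception&) {
        trace += "2";
    }
    return trace;
}

// Flat block: a rethrow from handler 1 leaves the whole try..catch.
inline std::string flat_block() {
    std::string trace;
    try {
        try {
            throw derived_exception();
        } catch (const derived_exception&) {
            trace += "1";
            throw;
        } catch (const base_exception&) {
            trace += "2";
        }
    } catch (...) {
        trace += "!"; // the exception escaped, as intended
    }
    return trace;
}

// Nested stack with a flag set just before a handler runs; later handlers rethrow instead.
inline std::string guarded_stack() {
    std::string trace;
    bool handler_ran = false;
    try {
        try {
            try {
                throw derived_exception();
            } catch (const derived_exception&) {
                handler_ran = true;
                trace += "1";
                throw;
            }
        } catch (const base_exception&) {
            if (handler_ran) {
                throw; // skip this handler: one already ran for this exception
            }
            trace += "2";
        }
    } catch (...) {
        trace += "!";
    }
    return trace;
}
```

Here `nested_stack()` yields `"12"`, while `flat_block()` and `guarded_stack()` both yield `"1!"`, which is the equivalence the fix is after.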
Benny Halevy	e5538cf52e	test: mutation_write_test: test_timestamp_based_splitting_mutation_writer: no need to downgrade reader to v1 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220315083425.2786228-2-bhalevy@scylladb.com>	2022-03-15 11:41:11 +02:00
Benny Halevy	90edddd7e3	everywhere: use make_flat_mutation_reader_from_mutations_v2 Rather than upgrade_to_v2(make_flat_mutation_reader_from_mutations) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220315083425.2786228-1-bhalevy@scylladb.com>	2022-03-15 11:41:10 +02:00
Benny Halevy	297a37f640	compaction_manager: add disable_compaction method Returns a RAII class compaction_reenabler that conditionally reenables compaction for the given table when destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-15 11:00:49 +02:00
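A RAII guard of the kind the `disable_compaction` commit above describes might look roughly like the sketch below. Everything here is an assumption made for illustration: `table_state`, its `disable_count` counter, and the free function `disable_compaction` are hypothetical, and "conditionally reenables" is interpreted as "reenables once the last outstanding guard is destroyed", which may not match the real `compaction_reenabler`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical per-table state: compaction is enabled while no guard is alive.
struct table_state {
    int disable_count = 0;
    bool compaction_enabled() const { return disable_count == 0; }
};

// Move-only guard: disables compaction on construction, undoes that on destruction.
class compaction_reenabler {
    table_state* _table;
public:
    explicit compaction_reenabler(table_state& t) : _table(&t) { ++_table->disable_count; }
    compaction_reenabler(compaction_reenabler&& o) noexcept
        : _table(std::exchange(o._table, nullptr)) {}
    compaction_reenabler(const compaction_reenabler&) = delete;
    compaction_reenabler& operator=(const compaction_reenabler&) = delete;
    compaction_reenabler& operator=(compaction_reenabler&&) = delete;
    ~compaction_reenabler() {
        if (_table) { // a moved-from guard owns nothing and does nothing
            --_table->disable_count;
        }
    }
};

inline compaction_reenabler disable_compaction(table_state& t) {
    return compaction_reenabler(t);
}
```

Tying the reenable step to a destructor means early returns and exceptions cannot leave compaction disabled by accident.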
Nadav Har'El	189ff5414f	test/cql-pytest: implement test_tools.py without run-script cooperation In commit `afab1a97c6`, we added test_tools.py - tests for the various tools embedded in the Scylla executable. These tests need to know where the Scylla executable is, and also where its sstables are stored. For this, the commit added two test parameters - "--scylla-path" and "--workdir" - with which the "run" script communicated this knowledge to the test. However, that implementation meant that these tests only work if the test was run via the test/cql-pytest/run script - they won't work if the user ran Scylla/pytest manually, or through some other script not passing these options. This patch drops the "--scylla-path" and "--workdir" parameters, and instead the test figures out this information on its own: 1. To find the Scylla executable, we begin by looking (using the local_process_id(cql) function from the previous patch) for a local process which listens to our CQL connection, and then find the executable's path using /proc. 2. To find the Scylla data directory (which is what we really need, not workdir which is just a shortcut to set all directories!), we retrieve this configuration from the system.config table through CQL. I tested that test_tools.py now works not only through test/cql-pytest/run but also if I run Scylla manually and then run "pytest test_tools.py" without any extra parameters. Fixes #10209 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220314151125.2737815-2-nyh@scylladb.com>	2022-03-14 20:25:22 +02:00
Nadav Har'El	8ed0909cc3	test/cql-pytest: add mechanism and example of testing Scylla log messages Generally, cql-pytest tests do not, and should not rely on looking up messages in the Scylla log file: Relying on such messages makes it impossible to run the same test against Cassandra or even a remotely-installed Scylla, and the tests tend to break when logging (which is not considered part of our API) changes. Moreover, usually what our dtests achieve by looking at the log - e.g., figuring out when some event has happened - can be achieved through official CQL APIs, and this is what normal users do anyway (users don't normally dig through the log to figure out when their operation completed). However, sometimes we do want to write a test to confirm that during a certain operation, a certain log message gets written to Scylla's log. A desire to do this was raised by @fruch and @soyacz, so in this patch I provide a mechanism to do this, and a trivial example - which checks that a "Creating ..." message appears on the log whenever a table is created, and "Dropping ..." when the table is deleted. As is explained in detail in comments in the patch, Scylla's log file is found automatically, without relying on Scylla's runner (such as the script test/cql-pytest/run) communicating to the test where the log file is. If the log file can't be found - e.g., we're testing a remote Scylla, or if this isn't Scylla, the tests are skipped. I would like all logfile-testing tests to be in the same file, test_logs.py. As I explained above, I think it is a mistake for general tests to check the log file just because they can. I think that the only tests that should use the log file are tests deliberately written to check what gets logged - and those can be collected in the same file. As part of this patch, we add the utility function local_process_id(cql) to find (if we can) the local process which listens to the connection "cql". This utility function will later be useful in more places - for example test_tools.py needs to find Scylla's executable. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220314151125.2737815-1-nyh@scylladb.com>	2022-03-14 20:25:20 +02:00
Lukasz Sojka	c65f1c3b47	test/cql-pytest: add warnings test cql client should return warnings when batch exceeds certain size. This test verifies if response contains them. Test covers issue: https://github.com/scylladb/scylla/issues/10196 Signed-off-by: Lukasz Sojka <lukasz.sojka@scylladb.com> Closes #10197	2022-03-14 19:49:06 +02:00
Benny Halevy	37dc31c429	api: storage_service: force_keyspace_compaction: compact one table at a time To make major compaction more resilient to low disk space conditions, `342bfbd65a` sorted the tables based on their live disk space used. However, each shard still makes progress at its own pace. This change serializes major compaction between tables so we still compact in parallel on all shards, but one (distributed) table at a time. As a follow-up, we can consider serializing even at the single shard level when disk space is critically low, so we can't even risk parallel compaction across all shards. Refs scylladb/scylla-dtest#2653 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220313153814.2203660-1-bhalevy@scylladb.com>	2022-03-14 15:39:23 +02:00
Raphael S. Carvalho	1a2332a0ba	compaction: Move release_exhausted out of the compaction descriptor With compact_sstables() now living in compaction_manager::task, release_exhausted no longer has to live inside compaction_descriptor, which is a good direction because implementation detail is being removed from the interface. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220311023410.250149-2-raphaelsc@scylladb.com>	2022-03-14 15:39:23 +02:00
Raphael S. Carvalho	fce9d869b4	compaction: Move table::compact_sstables() into compaction manager Table submits compaction request into manager, which in turn calls back table to run the compaction when the time has come, i.e.: table -> compaction manager -> table -> execute compaction But manager should not rely on table to run compaction, as compaction execution procedure sits one layer below the manager and should be accessed directly by it, i.e: table -> compaction manager -> execute compaction This makes code easier to understand and update_compaction_history() can now be noop for unit tests using table_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220311023410.250149-1-raphaelsc@scylladb.com>	2022-03-14 15:39:23 +02:00
Botond Dénes	964d9e033d	Merge "raft_group_registry: drain_on_shutdown" from Benny Halevy " This series hardens raft_group_registry::stop_servers and uses it to drain_on_shutdown, called before the database is stopped in cql_test_env. (Not needed for main). raft_group_registry deferred_stop is introduced right after the service is started to make sure it's properly stopped even if there's an exception at any point while starting. Test: unit(dev) " * tag 'raft_group_registry-drain_on_shutdown-v1' of https://github.com/bhalevy/scylla: cql_test_env: raft_group_registry::drain_on_shutdown before stopping the database raft_group_registry: harden stop_servers raft_group_registry: delete unused _shutdown_gate	2022-03-14 14:22:46 +02:00
Avi Kivity	e7fb71020b	Merge 'replica: Optimize empty_flat_reader out of the hot path' from Michał Chojnowski When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader. Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly. This patch series replaces hot path uses of empty_flat_reader with an empty optional. Performance effects: `perf_simple_query --smp 1` TPS: 138k -> 168k allocs/op: 80.2 -> 71.1 insns/op: 49.9k -> 45.1k `perf_simple_query --smp 1 --enable-cache=1 --flush` TPS: 125k -> 150k allocs/op: 79.2 -> 71.1 insns/op: 51.7k -> 47.2k For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread. Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache and memtables and needs to be merged) are unaffected. Closes #10204 * github.com:scylladb/scylla: replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader() row_cache: Add row_cache::make_reader_opt() replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader() memtable: Add memtable::make_flat_reader_opt() [avi: adjust #include for readers/ split]	2022-03-14 14:07:00 +02:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting optimizes both iterative compilation times, as touching rarely used readers doesn't recompile large chunks of codebase. Total compilation times are also improved, as the size of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes: real 29m14.051s user 168m39.071s sys 5m13.443s Without changes: real 30m36.203s user 175m43.354s sys 5m26.376s Closes #10194	2022-03-14 13:20:25 +02:00
Benny Halevy	26b1be0b8f	test: lib: random_mutation_generator: accept optional random seed Provide an easy way to instrument a particular test case to use a given random number seed (that's currently already printed to the test log). Refs #5349 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210907114537.3464004-1-bhalevy@scylladb.com>	2022-03-14 13:09:36 +02:00
Michał Chojnowski	83efb508d6	replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader() The former is significantly cheaper when there is nothing to be read.	2022-03-14 12:02:49 +01:00
Michał Chojnowski	6c6519a909	row_cache: Add row_cache::make_reader_opt()	2022-03-14 12:02:49 +01:00
Michał Chojnowski	f211ef9d71	replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader() The former is significantly cheaper when there is nothing to be read.	2022-03-14 12:02:49 +01:00
Michał Chojnowski	218f2b6e98	memtable: Add memtable::make_flat_reader_opt() When there is nothing to read, make_flat_reader() returns an empty (no-op) reader. But it turns out that constructing, combining and destroying that empty reader is quite costly. As an optimization, add an alternative version which returns an empty optional instead.	2022-03-14 12:02:49 +01:00
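The "return an empty optional instead of an empty reader" idea from the `make_flat_reader_opt()` commits above can be illustrated with a toy model. The `reader` struct, the free function `make_reader_opt` and `readers_to_merge` are hypothetical simplifications written for this sketch. The real readers are asynchronous mutation readers, not vectors of integers.

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Toy reader: just holds the rows it would produce.
struct reader {
    std::vector<int> rows;
};

// Returns no reader at all when there is nothing to read, instead of a no-op reader.
inline std::optional<reader> make_reader_opt(std::vector<int> rows) {
    if (rows.empty()) {
        return std::nullopt;
    }
    return reader{std::move(rows)};
}

// The caller only builds and merges readers for sources that actually have data.
inline std::vector<reader> readers_to_merge(const std::vector<std::vector<int>>& sources) {
    std::vector<reader> readers;
    for (const auto& rows : sources) {
        if (auto r = make_reader_opt(rows)) {
            readers.push_back(std::move(*r));
        }
    }
    return readers;
}
```

With one source left after filtering, a caller could skip the merging step entirely, which is presumably where the savings on the single-source read path come from.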
Benny Halevy	8481852c91	cql_test_env: raft_group_registry::drain_on_shutdown before stopping the database We're currently stopping raft_gr before shutting the database down, but we fail to do that if anything goes wrong before that, e.g. if distributed_loader::init_non_system_keyspaces fails. This change splits drain_on_shutdown out of stop() to stop the raft groups before the database is stopped and does the rest in a deferred_stop placed right after the raft_gr registry is started. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-14 11:49:44 +02:00
Benny Halevy	ac307d6a62	raft_group_registry: harden stop_servers stop_servers should never fail since it's called on the shutdown path. Use a local gate in stop_servers() to wait on all background raft group server aborts. Also, handle theoretical exceptions from server::abort() to guarantee success. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-14 11:49:44 +02:00
Benny Halevy	ab30feb71d	raft_group_registry: delete unused _shutdown_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-14 11:49:44 +02:00
Piotr Dulikowski	2415a1d169	abstract_read_resolver: bring back cancelling timeout timer on read failure Recent PR #10092 (propagating read timeouts on coordinator without throwing) accidentally removed a line which cancelled `abstract_read_resolver`'s `_timeout` timer after a read failure. Because of that, it might happen that after a read failure the timer is triggered and the `_done_promise` is set twice which triggers an assert in seastar. This commit brings back the line which cancels the timeout timer. Fixes: #10193 Closes #10206	2022-03-14 09:43:32 +01:00
Nadav Har'El	383aa326cc	cql-pytest: translate Cassandra's tests for BATCH operations This is a translation of Cassandra's CQL unit test source file validation/operations/BatchTest.java into our cql-pytest framework. This test file includes 13 tests for various types of BATCH operations. All tests pass on Scylla - no known or new bugs were reproduced. Two of the tests involve very slow testing of TTLs, so after verifying they work I marked them "skip" for now (we can always turn them on later, perhaps after reducing the length or number of the sleeps). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220313121634.2611423-1-nyh@scylladb.com>	2022-03-14 09:43:02 +01:00
Piotr Sarna	83ec505fab	cql3: add tracing indexed aggregate queries Commit `1c99ed6ced` added tracing logs about the index chosen for the query, but aggregate queries have a separate code path, which wasn't taken into account. After this patch, tracing for aggregate queries also includes this additional information. Closes #10195	2022-03-11 15:27:03 +02:00
Raphael S. Carvalho	67a7b7a3f4	compaction: rename interrupt() to a descriptive name interrupt() makes it sound like it's interrupting the compaction, but it's actually called on interrupt, to handle the interrupt scenario. Let's rename it to on_interrupt(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220311000128.189840-1-raphaelsc@scylladb.com>	2022-03-11 10:16:34 +02:00
Botond Dénes	0632114a9b	reconcilable_result_builder: remove v1 support Amounts to making the range tombstone consume() overload private. It is still used internally to consume the downgraded (from v2) range tombstones.	2022-03-11 09:24:46 +02:00
Botond Dénes	21584262be	query_result_builder: remove v1 support Amounts to dropping (the noop) range tombstone consume() overload.	2022-03-11 09:24:17 +02:00
Botond Dénes	279682056d	mutation_compactor: drop v1 related code-paths	2022-03-11 09:24:05 +02:00
Botond Dénes	924ff6a503	mutation_compactor: drop v1 support altogether from the API Fully mechanical change. Drop all v1 types, template types. Internal code is left unchanged, will be made v2 only in the next patch.	2022-03-11 09:24:05 +02:00
Botond Dénes	87ac2e9ab0	tree: migrate to the v2 consumer APIs	2022-03-11 09:24:05 +02:00
Botond Dénes	eacdfb2cb7	test/boost/mutation_test: remove v1 specific test code From test_compactor_range_tombstone_spanning_many_pages, preparing for the retirement of the v1 output of the compactor.	2022-03-11 09:24:05 +02:00
Botond Dénes	0b5217052d	querier: switch to v2 compactor output The change is mostly mechanical: update all compactor instances to the _v2 variant and update all call-sites, of which there are not that many. As a consequence of this patch, queries -- both single-partition and range-scans -- now do the v2->v1 conversion in the consumers, instead of in the compactor.	2022-03-11 09:24:05 +02:00
Botond Dénes	4629f7d7b5	reconcilable_result_builder: add v2 support Add a `consume()` overload for range tombstone changes and convert them internally to range tombstones, as the underlying reconcilable result is still v1.	2022-03-11 09:24:05 +02:00
Botond Dénes	728c14549f	query_result_writer: add v2 support Add a consume() overload which takes a range tombstone change and drops it just like the existing range tombstone overload does: query results don't care about range tombstones.	2022-03-11 09:22:14 +02:00
Botond Dénes	d61f934c50	query_result_builder: make consume(range_tombstone) noop The downstream consumer (mutation_querier) already ignores range tombstones, so no point forwarding them to it. This makes adding v2 support easier too as range tombstone changes can be similarly dropped.	2022-03-11 08:39:12 +02:00
Michał Sala	c8413631af	forward_service: change implicit lambda capture list to explicit one Changing the capture list of a lambda in forward_service::execute_on_this_shard from [&] to an explicit one enables greater readability and prevents potential bugs. Closes #10191	2022-03-10 17:30:06 +02:00
Botond Dénes	e9ba8ad43a	Merge "Configure gossiper the "classical" way" from Pavel Emelyanov The services' configuration should be performed with the help of service-specific config that's filled by the service creator. This is not the case for gossiper that grabs the db::config and keeps reference on it throughout its lifetime. This set brings the gossiper configuration to the described form by putting the needed config bits onto gossip_config (that already exists and is partially used for gossiper configuration). And two live-updateable options need extra care. tests: unit(dev), dtest.simple_boot_shutdown(dev) * 'br-gossiper-no-db-config' of https://github.com/xemul/scylla: gossiper: Remove db::config reference from gossiper gossiper: Keep live-updateable options on gossiper gossiper: Keep immutable options on gossip_config	2022-03-10 16:35:41 +02:00
Botond Dénes	ab440e1a07	mutation_writer: drop now unused v1 variants of bucket_writer feed_writer() Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220302145945.189607-2-bdenes@scylladb.com>	2022-03-10 15:20:07 +02:00
Botond Dénes	108d921fc9	mutation_writer: partition_based_splitting_writer: convert implementation to v2 Although its API was long converted to v2, its implementation stayed v1 because the memtable and mutation API were still v1. Now that the memtable flush returns a v2 reader we can have a second look at converting this. While the mutation API still uses v1, this can easily be worked around by going through `mutation_rebuilder_v2`. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220302145945.189607-1-bdenes@scylladb.com>	2022-03-10 15:20:07 +02:00
Botond Dénes	7e0b51ff23	Merge 'Overhaul compaction_manager::task' from Benny Halevy The series overhauls the compaction_manager::task design and implementation by properly layering the functionality between the compaction_manager that deals with generic task execution, and the per-task business logic that is defined in a set of classes derived from the generic task class. While at it, the series introduces `task::state` and a set of helper functions to manage it to prevent leaks in the statistics, fixing #9974. Two more stats counters were exposed: `completed_tasks` and a new `postponed_tasks`. Test: sstable_compaction_test Dtest: compaction_test.py compaction_additional_test.py Fixes #9974 Closes #10122 * github.com:scylladb/scylla: compaction_manager: use coroutine::switch_to compaction_manager::task: drop _compaction_running compaction_manager: move per-type logic to derived task compaction_manager: task: add state enum compaction_manager: task: add maybe_retry compaction_manager: reevaluate_postponed_compactions: mark as noexcept compaction_manager: define derived task types compaction_manager: register_metrics: expose postponed_compactions compaction_manager: register_metrics: expose failed_compactions compaction_manager: register_metrics: expose _stats.completed_tasks compaction: add documentation for compaction_type to string conversions compaction: expose to_string(compaction_type) compaction_manager: task: standardize task description in log messages compaction_manager: refactor can_proceed compaction_manager: pass compaction_manager& to task ctor compaction_manager: use shared_ptr<task> rather than lw_shared_ptr compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once compaction_manager: use compaction_state::lock only to synchronize major and regular compaction	2022-03-10 13:33:56 +02:00
Benny Halevy	5e1fda7e1d	compaction_manager: use coroutine::switch_to Saving an allocation for running the functor as a task in the switched-to scheduling group. Also, switch to the desired scheduling group at the beginning of the task so that the higher level logic, like getting the list of sstables to compact will be performed under the desired scheduling group, not only the compaction code itself. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:20:01 +02:00
Benny Halevy	8c66916652	compaction_manager::task: drop _compaction_running Replace the _compaction_running boolean member by calculating _state == state::active now that setup_new_compaction switches state to `active` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:20:01 +02:00
Benny Halevy	a2a5e530f0	compaction_manager: move per-type logic to derived task Move the business logic into the task specific classes. Separating initialization during task construction, from the compaction_done task, moved into a do_run() method, and in some cases moving a lambda function that was called per table (as in rewrite_sstables) into a private method of the derived class. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:20:01 +02:00
Benny Halevy	2e6ce43a97	compaction_manager: task: add state enum Add an enum class representing the task state machine and a switch_state function to transition between the states and update the corresponding compaction_manager stats counters. Refs #9974 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:19:59 +02:00
Mikołaj Sielużycki	5920349357	row_cache: Make row_cache reader from sstables compacting. Reading data from sstables without compacting first puts unnecessary pressure on the cache. The mutation streams need to be resolved anyway before passing to subsequent consumers, so it's better to do it as close to the source as possible. Fixes: #3568 Closes #10188	2022-03-10 11:40:10 +02:00
Benny Halevy	9c59d66b7e	compaction_manager: task: add maybe_retry Replacing and combining compaction_manager methods: maybe_stop_on_error and put_task_to_sleep. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 11:35:37 +02:00
Benny Halevy	ee32be3aa5	compaction_manager: reevaluate_postponed_compactions: mark as noexcept To simplify error handling in following patches that will coroutinize task logic. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 11:35:37 +02:00
Benny Halevy	72162ed653	compaction_manager: define derived task types Turn task into a class, defining a clear hierarchy of private, protected, and public methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 11:35:35 +02:00
Avi Kivity	2f967f84d4	Merge "Migrate sstable writer to v2" from Botond " This patch-set converts the sstable writer to v2, then prepares the ground for users actually being able to use the v2 variant. Finally it converts all users to do so and then decommissions the v1 variant. For users to be able to use the v2 writer API, we first have to add a v2 output to the compactor first, as some users write to sstables via the compactor. Tests: unit(dev, release) " * 'sstable-writer-v2/v2' of https://github.com/denesb/scylla: sstables/sstable: remove now unused v1 write_components() variant mutation_compactor: remove now unused compact_for_compaction test/boost/mutation_test: migrate to compact_for_mutation_v2 streaming: migrate to v2 variant of sstable writer API memtable-sstable: migrate to v2 variant of sstable writer API test: migrate to the v2 variant of the sstable writer API sstables/sstable: expose v2 variant of write_components() sstables: convert mx writer to v2 sstables/metadata_collector: use position_in_partition for min/max keys test/boost/mutation_test: test_compactor_range_tombstone_spanning_many_pages extend to check v2 output too mutation_reader: convert compacting reader v2 mutation_compactor: add v2 output mutation_compactor: make _last_clustering_pos track last input range_tombstone_change: add set_tombstone() test/lib/mutation_source_test: log name of each run_mutation_source()	2022-03-10 09:45:57 +02:00
Botond Dénes	2e0610e459	sstables/sstable: remove now unused v1 write_components() variant Supplanted by the v2 variant.	2022-03-10 09:16:33 +02:00
Botond Dénes	4e97477281	mutation_compactor: remove now unused compact_for_compaction	2022-03-10 09:16:33 +02:00
Botond Dénes	32e9809e9c	test/boost/mutation_test: migrate to compact_for_mutation_v2	2022-03-10 09:16:33 +02:00
Botond Dénes	06e6bb6ec9	streaming: migrate to v2 variant of sstable writer API	2022-03-10 09:16:33 +02:00
Botond Dénes	d8fec08468	memtable-sstable: migrate to v2 variant of sstable writer API	2022-03-10 09:16:33 +02:00
Botond Dénes	959483a2dc	test: migrate to the v2 variant of the sstable writer API	2022-03-10 09:16:33 +02:00
Benny Halevy	37694422dc	compaction_manager: register_metrics: expose postponed_compactions Provide a metric counting the number of tables with postponed compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	089d4442d8	compaction_manager: register_metrics: expose failed_compactions Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	8081f951d0	compaction_manager: register_metrics: expose _stats.completed_tasks Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	ffc314d506	compaction: add documentation for compaction_type to string conversions Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	28a74a2e90	compaction: expose to_string(compaction_type) To be used in the next patch to generate a string description from the compaction_type. In theory, we could use compaction_name() but the latter returns the compaction type in all-upper case and that is very different from what we print to the log today. The all-upper strings are used for the api layer, e.g. to stop tasks of a particular compaction type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	20a8609392	compaction_manager: task: standardize task description in log messages Define task::describe and use it via operator<< to print the task metadata to the log in a standard way. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	59863b317f	compaction_manager: refactor can_proceed Move the task-internal parts of can_proceed to a respective compaction_manager::task method, preparing for turning it into a class with a proper hierarchy of access to private members. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	33b2731a4a	compaction_manager: pass compaction_manager& to task ctor And use it to get the compaction state of the table to compact. It will be used in a later patch to manage the task state from task methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	20067b1050	compaction_manager: use shared_ptr<task> rather than lw_shared_ptr Prepare for defining per compaction type tasks derived from compaction_manager::task. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	cb2403e917	compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once Like all other maintenance operations, acquire the _maintenance_ops_sem once for the whole task, rather than for each sstable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	d0f693a517	compaction_manager: use compaction_state::lock only to synchronize major and regular compaction Maintenance operations like cleanup, upgrade, reshape, and reshard are serialized with major compaction using the _maintenance_ops_sem and they need no further synchronization with regular compaction by acquiring the per-table read lock. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Botond Dénes	fed5b73147	sstables/sstable: expose v2 variant of write_components() In parallel to the existing v1 one. In the next patches we start migrating users to the v2 variant incrementally and finally remove the v1 variant.	2022-03-10 07:03:49 +02:00
Botond Dénes	105bf8888a	sstables: convert mx writer to v2 The sstables::sstable class has two methods for writing sstables: 1) sstable_writer get_writer(...); 2) future<> write_components(flat_mutation_reader, ...); (1) directly exposes the writer type, so we have to update all users of it (there are not that many) in this same patch. We defer updating users of (2) to follow-up commits.	2022-03-10 07:03:49 +02:00
Botond Dénes	11adb404c6	sstables/metadata_collector: use position_in_partition for min/max keys Instead of naked clustering keys. Working with the latter is dangerous because it cannot accurately represent the entire clustering domain: it cannot represent positions between (before/after) keys. For this reason the metadata collector had a separate update_min_max_components() overload for range tombstones because the positions of these cannot be represented by clustering keys alone. Moving to position_in_partition solves this problem and it is now enough to have a single overload with position_in_partition_view. This is also more future proof as it will work with range tombstone changes without any additional changes.	2022-03-10 07:03:49 +02:00
Botond Dénes	2057db54ab	test/boost/mutation_test: test_compactor_range_tombstone_spanning_many_pages extend to check v2 output too	2022-03-10 07:03:49 +02:00
Botond Dénes	7a37e30310	mutation_reader: convert compacting reader v2 Its input was already a v2 reader, now itself is also a v2 reader. With this commit, compaction.cc is finally v2 all-the-way.	2022-03-10 07:03:46 +02:00
Botond Dénes	ad435dcf57	mutation_compactor: add v2 output The output version is selected via compactor_output_format, which is a template parameter of `compact_mutation_state` and all downstream types. This is to ensure a compaction state created to emit a v2 stream will not be accidentally used with a v1 consumer. When using a v2 output, the current active tombstone has to be tracked separately for the regular and for the gc consumer (if any), so that each can be closed properly on EOS. The current effective tombstone is tracked separately from these two. The reason is that purged tombstones are still applied to data, but are not emitted to the regular consumer.	2022-03-10 06:46:46 +02:00
Botond Dénes	1ccaeb2a1a	mutation_compactor: make _last_clustering_pos track last input Instead of updating _last_clustering_pos whenever a clustering fragment is pushed to the consumers, we now update it whenever a clustering fragment enters the compactor. Not only is this much more robust, but it also makes more sense. Just because a range tombstone is purged (and therefore the consumer doesn't see it), it still moves the logical clustering position in the stream. Also, tracking the input side avoids any ambiguity related to cases where we have two consumers (regular + gc consumer).	2022-03-10 06:46:46 +02:00
Botond Dénes	b2b6f03a5d	range_tombstone_change: add set_tombstone()	2022-03-10 06:46:46 +02:00
Botond Dénes	6544da342a	test/lib/mutation_source_test: log name of each run_mutation_source() Although we have a log in run_mutation_reader_tests(), it is useful to know where it was called from, when trying to find the test scenario that failed.	2022-03-10 06:46:46 +02:00
Avi Kivity	a4756334ce	Merge "tools/scylla-types: improve documentation" from Botond " Add per-action help content for each action. Main description now points to these for more details. " * 'scylla-types-improvements/v1' of https://github.com/denesb/scylla: tools/types: update main description tools/scylla-types: per-action help content tools/scylla-types: description: remove -- from action listing tools/scylla-types: use fmt::print() instead of std::cout <<	2022-03-09 19:37:00 +02:00
Avi Kivity	e1c326a5ba	Merge "Convert multishard writer to v2" from Botond " Also convert the foreign_reader used by it in the process. Tests: unit(dev) " * 'multishard-writer-v2/v1' of https://github.com/denesb/scylla: mutation_writer/multishard_writer: remove now unused v1 factory overloads test/boost/mutation_writer_test: test the v2 variant of distribute_reader_and_consume_on_shards() flat_mutation_reader: add v2 variant of make_generating_reader() mutation_reader: multishard_writer: migrate implementation to v2 mutation_reader: convert foreign_reader to v2 streaming/consumer: convert to v2 mutation_writer/multishard_writer: add v2 variant of distribute_reader_and_consume_on_shards()	2022-03-09 19:28:05 +02:00
Avi Kivity	fbb3904b67	Merge 'query: transform asserts into `on_internal_error` in `forward_result::merge`' from Michał Sala It was a suggestion from @psarna, done to get more info about the abort from #10174. Closes #10185 * github.com:scylladb/scylla: query: do not assert in `operator<<(ostream&, const forward_result::printer&)` query: transform asserts into on_internal_error in forward_result::merge	2022-03-09 16:30:32 +02:00
Tomasz Grabiec	8fa704972f	loading_cache: Make invalidation take immediate effect There are two issues with current implementation of remove/remove_if: 1) If it happens concurrently with get_ptr(), the latter may still populate the cache using value obtained from before remove() was called. remove() is used to invalidate caches, e.g. the prepared statements cache, and the expected semantic is that values calculated from before remove() should not be present in the cache after invalidation. 2) As long as there is any active pointer to the cached value (obtained by get_ptr()), the old value from before remove() will be still accessible and returned by get_ptr(). This can make remove() have no effect indefinitely if there is persistent use of the cache. One of the user-perceived effects of this bug is that some prepared statements may not get invalidated after a schema change and still use the old schema (until next invalidation). If the schema change was modifying UDT, this can cause statement execution failures. CQL coordinator will try to interpret bound values using old set of fields. If the driver uses the new schema, the coordinator will fail to process the value with the following exception: User Defined Type value contained too many fields (expected 5, got 6) The patch fixes the problem by making remove()/remove_if() erase old entries from _loading_values immediately. The predicate-based remove_if() variant has to also invalidate values which are concurrently loading to be safe. The predicate cannot be evaluated on values which are not ready. This may invalidate some values unnecessarily, but I think it's fine. Fixes #10117 Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>	2022-03-09 16:13:07 +02:00
Michał Sala	538cff651e	query: do not assert in `operator<<(ostream&, const forward_result::printer&)` Printing invalid forward_result should not cause Scylla to stop.	2022-03-09 14:58:11 +01:00
Michał Sala	51362e4e5e	query: transform asserts into on_internal_error in forward_result::merge It was done to show more context in case of forward_result::merge arguments size mismatch and also to prevent aborts caused by other nodes sending malformed data.	2022-03-09 14:58:11 +01:00
Nadav Har'El	397dd64dea	test/cql-pytest: avoid "run" warnings caused by pytest bug This patch gets rid of the annoying pytest configuration warnings when running test/cql-pytest/run. These started to happen after commit `afab1a97c6`, due to a pytest bug: In that commit, we added new "--scylla-path" and "--workdir" parameters to our pytest tests, and test/cql-pytest/run started passing them, and test/cql-pytest/run sometest runs pytest as: pytest --host something --workdir somedir --scylla-path somepath sometest Pytest wants to find a configuration file (pytest.ini or tox.ini) in the directory where the tests live, but its logic to find that directory is buggy: It (_pytest/config/findpaths.py::determine_setup()) looks at the command line for directory names, and looks for config files in these directories or any of their parents. It ignores parameters beginning with "-", but in our case the various arguments like "--scylla-path" are each followed by another option, and this one is not ignored! So instead of looking for the config file in sometest's parent directories (and finding test/cql-pytest/pytest.ini), pytest sees the directory given after "scylla-path", and finds the completely irrelevant tox.ini there - and uses that, which (depending what you have installed) can generate warnings. The solution is to change the run script to use "--scylla-path=..." as one parameter instead of "--scylla-path ..." as two parameters. When it's just one parameter, the pytest determine_setup() logic skips it entirely, and finds just the actual test directory. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220309132726.2311721-1-nyh@scylladb.com>	2022-03-09 15:37:08 +02:00
Nadav Har'El	733672fc54	Merge 'types: fix is_string for reversed types' from Piotr Sarna Checking if the type is string is subtly broken for reversed types, and these types will not be recognized as strings, even though they are. As a result, if somebody creates a column with DESC order and then tries to use operator LIKE on it, it will fail because the type would not be recognized as a string. Fixes #10183 Closes #10181 * github.com:scylladb/scylla: test: add a case for LIKE operator on a descending order column types: fix is_string for reversed types	2022-03-09 10:04:42 +02:00
Piotr Sarna	05b66102e9	test: add a case for LIKE operator on a descending order column This case is a regression test for issue #10181, where it turned out that a clustering column with descending order is not properly recognized as a string. This test case used to fail with: cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="LIKE is allowed only on string types, which b is not" ...until it got fixed by the previous commit.	2022-03-09 08:56:22 +01:00
Piotr Sarna	0a068cddb1	types: fix is_string for reversed types Checking if the type is string is subtly broken for reversed types, and these types will not be recognized as strings, even though they are. As a result, if somebody creates a column with DESC order and then tries to use operator LIKE on it, it will fail because the type would not be recognized as a string.	2022-03-09 08:18:33 +01:00
Benny Halevy	11ea2ffc3c	compaction_manager: rewrite_sstables: do not acquire table write lock Since regular compaction may run in parallel no lock is required per-table. We still acquire a read lock in this patch, for backporting purposes, in case the branch doesn't contain `6737c88045`. But it can be removed entirely in master in a follow-up patch. This should solve some of the slowness in cleanup compaction (and likely in upgrade sstables seen in #10060, and possibly #10166). Fixes #10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10177	2022-03-09 09:13:46 +02:00
Nadav Har'El	c8152e78d7	Merge 'CQL3: fromJson accepts string as bool' from Jadw1 The problem was incompatibility with Cassandra, which accepts bool as a string in `fromJson()` UDF. The remaining difference between Cassandra and Scylla is that Scylla accepts whitespace around the word in the string, while Cassandra doesn't. Both are case insensitive. Fixes: https://github.com/scylladb/scylla/issues/7915 Closes #10134 * github.com:scylladb/scylla: CQL3/pytest: Updating test_json CQL3: fromJson accepts string as bool	2022-03-08 16:27:40 +02:00
Avi Kivity	1622995900	Merge 'Allow empty partition keys in views' from Nadav Har'El Cassandra generally does not allow empty strings as partition keys (note, by the way, that empty strings are allowed as clustering keys, as well as in individual components of a compound partition key). However, Cassandra does allow empty strings in _regular_ columns - and those regular columns can be indexed by a secondary index, or become an empty partition-key column in a materialized view. As noted in issues #9375 and #9364 and verified in a few xfailing cql-pytest tests, Scylla didn't allow these cases - and this patch series fixes that. Before the last patch in this series finally enables empty-string partition keys in materialized views, we first need to solve a couple of bugs in our code related to handling empty partition keys: The first patch fixes issue #10178 - a bug in `key_view::tri_compare()` where comparing two empty keys returned a random result instead of "equal". The second patch fixes issue #9352: our tokenizer has an inconsistency where for an empty string key, two variants of the same function return different results: 1. One variant `murmur3_partitioner::get_token(bytes_view key)` returned `minimum_token()` for the empty string. 2. Another variant `murmur3_partitioner::get_token(const schema& s, partition_key_view key)` did not have this special case, and called the normal hash-function calculation on the empty string (the resulting token is 0). Variant 2 was an unintentional bug, because Cassandra always does what variant 1 does. So the "obvious" fix here would be to fix variant 2 to do what variant 1 does. Nevertheless, we decided to do the opposite: Change variant 1 to match variant 2. The reasoning is as follows: The `minimum_token()` is `token{token::kind::before_all_keys, 0 }` - it's not a real token. Since we intend in this patch to allow real data to exist with the empty key, we need this real data to have a real token. 
For example, this token needs to be located on the token ring (so the empty-key partition will have replicas) and also belong to one of the shards, and it's not clear that `minimum_token()` will be handled correctly in this context. After changing the token of the empty string to 0, we note that some places in the code assume that `dht::decorated_key(dht::minimum_token(), partition_key::make_empty())` is a legal decorated key. However, as far as I can tell, none of these places actually assume that the partition-key part (the `make_empty()`) really matches the token - this decorated key is only used to start an iteration (ignoring this key itself) or to indicate a non-existent key (in modern code `std::optional` should be used for that). While normally changing the token of a key is a big faux-pas, which can result in old data no longer being readable, in this case this change is safe because: 1. Scylla previously disallowed empty partition keys (in both base tables and views), so we cannot have had such a partition key saved in any sstable. 2. Cassandra does allow empty partition keys in _views_ and _secondary indexes_, but we do not support migrating sstables of those into Scylla - users are expected to only migrate the base table and then re-create the view or index. So however Cassandra writes those empty-key partitions, we don't care. The third patch finally fixes the materialized views implementation to not drop view rows with an empty-string partition key (#9375). This means we basically revert commit `ec8960df45` - which fixed #3262 by disallowing empty partition keys in views, whereas this patch fixes the same problem by handling the empty partition keys correctly. The fix for the secondary index bug (#9364) comes "for free" because it is based on materialized views. We already had xfailing test cases for empty strings in materialized views and indexes, and after this series they begin to pass so the "xfail" mark is removed. 
The series also adds additional test cases that validate additional corner cases discovered during the debugging. Fixes #9352 Fixes #9364 Fixes #9375 Fixes #10178 Closes #10170 * github.com:scylladb/scylla: compound_compat.hh: add missing methods of iterator materialized views: allow empty strings in views and indexes murmur3: fix inconsistent token for empty partition key compound_compat.hh: fix bug iterating on empty singular key	2022-03-08 15:55:55 +02:00
Nadav Har'El	674d3a5a84	compound_compat.hh: add missing methods of iterator While debugging legacy_compound_view, I noticed that it cannot be used as a C++20 std::ranges::input_range because it is missing some trivial methods. So let's fix this, and make the life of future developers a little bit easier. The two trivial methods we need to implement: 1. A postfix increment operator. We already had a prefix increment operator, but the C++20 concept weakly_incrementable also needs postfix. 2. By mistake (this will be corrected in https://wg21.link/P2325R3), weakly_incrementable also required the default_initializable concept, so our iterator type also needs a default constructor. We'll never actually use this silly constructor, and when this C++20 standard mistake is corrected, we can remove this constructor. After this patch, a legacy_compound_view is accepted for the C++20 ranges::input_range concept. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-03-08 15:37:03 +02:00
Nadav Har'El	ef43531fb6	materialized views: allow empty strings in views and indexes Although Cassandra generally does not allow empty strings as partition keys (note they are allowed as clustering keys!), it does allow empty strings in regular columns to be indexed by a secondary index, or to become an empty partition-key column in a materialized view. As noted in issues #9375 and #9364 and verified in a few xfailing cql-pytest tests, Scylla didn't allow these cases - and this patch fixes that. The patch mostly removes unnecessary code: In one place, code prevented an sstable with an empty partition key from being written. Another piece of removed code was a function is_partition_key_empty() which the materialized-view code used to check whether the view's row will end up with an empty partition key, which was supposedly forbidden. But in fact, such rows should have been allowed, as they are allowed in Cassandra and required for the secondary-index implementation, so the entire function wasn't necessary. Note that the removed function is_partition_key_empty() was NOT required for the "IS NOT NULL" feature of materialized views - this continues to work as expected after this patch, and we add another test to confirm it. Being null and being an empty string are two different things. This patch also removes a part of a unit test which enshrined the wrong behavior. After this patch we are left with one interesting difference from Cassandra: Though Cassandra allows a user to create a view row with an empty-string partition key, and this row is fully visible when scanning the view, this row can not be queried individually because "WHERE v=''" is forbidden when v is the partition key (of the view). Scylla does not reproduce this anomaly - and such point query does work in Scylla after this patch. We add a new test to check this case, and mark it "cassandra_bug", i.e., it's a Cassandra behavior which we consider wrong and don't want to emulate. 
This patch relies on #9352 and #10178 having been fixed in previous patches, otherwise the WHERE v='' does not work when reading from sstables. We add to the already existing tests we had for empty materialized-views keys a lookup with WHERE v='' which failed before fixing those two issues. Fixes #9364 Fixes #9375 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-03-08 15:34:26 +02:00
Nadav Har'El	bc4d0fd5ad	murmur3: fix inconsistent token for empty partition key Traditionally in Scylla and in Cassandra, an empty partition key is mapped to minimum_token() instead of the empty key's usual hash function (0). The reasons for this are unknown (to me), but one possibility is that having one known key that maps to the minimal token is useful for various iterations. In murmur3_partitioner.cc we have two variants of the token calculation function - the first is get_token(bytes_view) and the second is get_token(schema, partition_key_view). The first includes that empty-key special case, but the second was missing this special case! As Kamil first noted in #9352, the second variant is used when looking up partitions in the index file - so if a partition with an empty-string key is saved under one token, it will be looked up under a different token and not found. I reproduced exactly this problem when fixing issues #9364 and #9375 (empty-string keys in materialized views and indexes) - where a partition with an empty key was visible in a full-table scan but couldn't be found by looking up its key because of the wrong index lookup. I also tried an alternative fix - changing both implementations to return minimum_token (and not 0) for the empty key. But this is undesirable - minimum_token is not supposed to be a valid token, so the tokenizer and sharder may not return a valid replica or shard for it, so we shouldn't store data under such token. We also have code (such as an increasing-key sanity check in the flat mutation reader) which assumes that no real key in the data can be minimum_token, and our plan is to start allowing data with an empty key (at least for materialized views). This patch does not risk a backward-incompatible disk format change for two reasons: 1. In the current Scylla, there was no valid case where an empty partition key may appear. 
CQL and Thrift forbid such keys, and materialized-views and indexes also (incorrectly - see #9364, #9375) drop such rows. 2. Although Cassandra does allow empty partition keys, they are only allowed in materialized views and indexes - and we don't support reading materialized views generated by Cassandra (the user must re-generate them in Scylla). When #9364 and #9375 are fixed by the next patch, empty partition keys will start appearing in Scylla (in materialized views and in the materialized view backing a secondary index), and this fix will become important. Fixes #9352 Refs #9364 Refs #9375 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-03-08 14:15:03 +02:00
Nadav Har'El	f8807e24f4	compound_compat.hh: fix bug iterating on empty singular key When iterating over a compound key with legacy_compound_view<>, when the key is "singular" (i.e., a single column) we need to iterate over just the component's actual bytes - without the two length bytes or end-of-component byte. In particular, when the component is an empty string, the iteration should return zero bytes. In other words, we should have begin() == end(). Unfortunately, this is not what happened - for an empty singular key, the iterator returned for begin() was slightly different from end() - so code using this iterator would not know there is nothing to iterate. So in this patch we fix begin() and end() to return the same thing if we have an empty singular key. The bug in legacy_compound_view<> (which we fix here) caused a bug in sstables::key_view::tri_compare(const schema& s, partition_key_view other), causing it to return wrong results when comparing two empty keys. As a result we were unable to retrieve a partition with an empty key from the sstable index. So this patch is necessary to fix support for empty-string keys in sstables (part of issue #9375). This patch also includes a unit-test for this bug. We test it in the context of sstables::key_view::tri_compare(), where it was first discovered, and also test the legacy_compound_view itself. The included test used to fail in both places before this patch, and pass after it. Fixes #10178 Refs #9375 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-03-08 14:14:18 +02:00
Botond Dénes	f2060ac03b	Merge "Remove sub-mode booleans from storage service" from Pavel Emelyanov " There's a _operation_mode enum sitting on storage_service that indicates the top-level state of the scylla node. Next to it there's a bunch of booleans that define (and duplicate) some sub-modes. These booleans just make the code more obscure and complicated. This set removes all those booleans and patches all the relevant checks/calls/methods to rely only on the operation mode. Also, the switching between modes is simplified down to some bare minimum. tests: unit(dev) dtest.simple_boot_shutdown(dev) manual(dev) Manual test included start-stop, nodetool enablegossip, disablegossip and drain commands, scylla-cly is_initialized and is_joined calls As noticed in v2, this set changes the log messages that are checked by dtests. The fix for dtest, that's compatible with both -- current scylla and this patchset -- is already in dtest master. " * 'br-remove-bools-from-storage-service-3-rebase' of https://github.com/xemul/scylla: storage_service: Relax operation modes switch storage_service: Remove _ms_stopped storage_service: Remove _is_bootstrap_mode storage_service: Remove _initialized and is_initialized() storage_service: Remove _joined and is_joined() storage_service: Replace is_starting() with get_operation_mode() storage_service: Make get_operation_mode() return mode itself storage_service: Relax repeating set_mode-s	2022-03-07 15:27:03 +02:00
Avi Kivity	e2f3e9791b	Update seastar submodule * seastar 1d81c8e5aa...4e42a60199 (11): > condition_variable: Add coroutine-only "when" operation folding waiter into parent frame > condition_variable: Add simple test > condition_variable: Make std::chrono timeout operations templated > condition_variable: Remove semaphore usage, keep internal wait queue > condition_variable: Add concept checks to predicated wait methods > io_queue: Don't let preemption overlap requests > io_queue: Pending needs to keep capacity instead of ticket > io_queue: Extend grab_capacity() return codes > scattered_message: allow appending temporary buffers directly > util: file: include reactor.hh > tests: coroutines_test: Check scheduling group set with switch_to() is inherited and restored	2022-03-07 14:30:52 +02:00
Avi Kivity	8ab20bae68	Merge 'prepared_statements: Invalidate batch statement too' from Eliran Sinvani It seems that batch prepared statements always return false for depends_on_keyspace and depends_on_column_family, which in turn renders the removal criteria from the cache always false, which results in the queries not being evicted. Here we change the functions to return the true state, meaning they will return true if any of the sub queries is dependent upon the keyspace or column family. In this fix we first make the API more coherent and then use this new API to implement the batch statement's dependency test. Fixes #10129 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #10132 * github.com:scylladb/scylla: prepared_statements: Invalidate batch statement too cql3 statements: Change dependency test API to express better it's purpose	2022-03-07 14:00:05 +02:00
Pavel Emelyanov	190385551c	storage_service: Relax operation modes switch The set_mode() tries to combine mode switching and extended logging, but there are no places left that do need this flexibility. It's simpler and nicer to make set_mode() _just_ switch the mode and log some generic "entering ... mode" message. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	0941098b39	storage_service: Remove _ms_stopped This boolean protects do_stop_ms from re-entrancy. However, this method is only called from stop_transport(), which handles re-entering itself, so the _ms_stopped can be just removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	74212286f8	storage_service: Remove _is_bootstrap_mode This "state" is the sub-state of the STARTING mode that's activated when the storage_service::bootstrap() is called. Instead of the separate boolean the new mode can be used. To stop it from reverting the BOOTSTRAP mode back to JOINING some calls to set_mode() should be converted into regular logging. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	dbaca825ec	storage_service: Remove _initialized and is_initialized() This bit is hairy. First, it indicates that the storage service entered the init_server() method. But, once the node is up and running it also indicates whether the gossiper is enabled or not via the API call. To rely on the operation mode, first, the NONE mode is introduced at which the server starts. Then in init_server() it switches to STARTING. Second change is to stop using the bit in enable/disable gossiper API call, instead -- check the gossiper.is_enabled() itself. To keep the is_initialized API call compatible, when the operation mode is NORMAL it would return true/false according to the status of the gossiper. This change is simple because storage service API handlers already have the gossiper instance hanging around. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	ffbfa3b542	storage_service: Remove _joined and is_joined() The is_joined() status can be obtained with get_operation_mode(). Since it indicates that the operation mode is JOINING, NORMAL or anything above, the operation mode enum class should be shuffled to get the simple >= comparison. Another needed change is to set the mode a few steps earlier than it happens now to cover the non-bootstrap startup case. And the third change is to partially revert the `d49aa7ab` that made the .is_joined() method be future-less. Nowadays the is_joined() is called only from the API which is happy with being future-full in all other storage service state checks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	ca03fd3145	storage_service: Replace is_starting() with get_operation_mode() This is a trivial change, since the only user is in API and the get_operation_mode + mode values are at hand. One thing to pay attention to -- the new method checks the mode to be <= STARTING, not for equality. For now this is an equivalent change, but the next patch will introduce NONE mode that should be reported as is_starting() too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	c385fe7d79	storage_service: Make get_operation_mode() return mode itself Now it reports back formatted mode. For future convenience it's needed to return the raw value, all the more so since the mode enum class is already public. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Pavel Emelyanov	968b07052d	storage_service: Relax repeating set_mode-s In several places the call to set_mode(...) is used as a (format-less) replacement for regular logging. Mode doesn't really change there, because it had been changed before. Patch all those places to use regular logging, next patches will make full use of it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-07 13:29:47 +03:00
Benny Halevy	c7de2e0682	compaction: log info message when interrupting compaction Info messages are logged when compaction jobs start and finish but there is no message logged when the job is interrupted, e.g. when stopped by the compaction_manager. Refs scylladb/scylla-dtest#2468 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-07 11:43:58 +02:00
Botond Dénes	f1b2ff1722	Merge 'service: storage_service: announce new CDC generation immediately with RBNO' from Kamil Braun When a new CDC generation is created (during bootstrap or otherwise), it is assigned a timestamp. The timestamp must be propagated as soon as possible, so all live nodes can learn about the generation before their clocks reach the generation's timestamp. The propagation mechanism for generation timestamps is gossip. When bootstrap RBNO was enabled this was not the case: the generation timestamp was inserted into gossiper state too late, after the repair phase finished. Fix this. Also remove an obsolete comment. Fixes https://github.com/scylladb/scylla/issues/10149. Closes #10154 * github.com:scylladb/scylla: service: storage_service: announce new CDC generation immediately with RBNO service: storage_service: fix indentation	2022-03-07 11:28:00 +02:00
Benny Halevy	a085ef74ff	atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal Following up on `a57c087c89`, compare_atomic_cell_for_merge should compare the ttl value in the reverse order since, when comparing two cells that are identical in all attributes but their ttl, we want to keep the cell with the smaller ttl value rather than the larger ttl, since it was written at a later (wall-clock) time, and so would remain longer after it expires, until purged after gc_grace seconds. Fixes #10173 Test: mutation_test.test_cell_ordering, unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>	2022-03-07 11:05:30 +02:00
Avi Kivity	9359a2caad	Merge 'cql3: expr: Replace column_value::sub with subscript struct in expression' from Jan Ciołek Currently a subscripted column is expressed using the struct `column_value`: ```c++ /// A column, optionally subscripted by a value (eg, c1 or c2['abc']). struct column_value { const column_definition* col; std::optional<expression> sub; ///< If present, this LHS is col[sub], otherwise just col. } ``` It would be better to have a generic AST node for expressing arbitrary subscripted values: ```c++ /// A subscripted value, eg list_colum[2], val[sub] struct subscript { expression val; expression sub; }; ``` The `subscript` struct would allow us to express more, for example: * subscripted `column_identifier`, not only `column_definition` (needed to get rid of `relation` class) * nested subscripts: `col[1][2]` Adding `subscript` to the `expression` variant immediately would require implementing all `expr::visit` handlers in the same commit, so I took a different approach. At first the struct is just there and visit handlers are implemented one by one in advance, then at the end `subscript` is added to the `expression`. This way all the new code can be neatly divided into commits and everything is still bisectable. There were a few cases where the existing behaviour seemed to make little sense, but I didn't change it to keep the PR focused on refactoring. I left `FIXME` comments there and I will submit separate patches to fix them. 
Closes #10139 * github.com:scylladb/scylla: cql3: expr: Remove sub from column_value cql3: Create a subscript in single_column_relation cql3: expr: Add subscript to expression cql3: Handle subscript in multi_column_range_accumulator cql3: Handle subscript in selectable_process_selection cql3: expr: Handle subscript in test_assignment cql3: expr: Handle subscript in prepare_expression cql3: Handle subscript in prepare_selectable cql3: expr: Handle subscript in extract_clustering_prefix_restrictions cql3: expr: Handle subscript in extract_partition_range cql3: expr: Handle subscript in fill_prepare_context cql3: expr: Handle subscript in evaluate cql3: expr: Handle subscript in extract_single_column_restrictions_for_column cql3: expr: Handle subscript in search_and_replace cql3: expr: Handle subscript in recurse_until cql3: expr: Implement operator<< for subscript cql3: expr: Handle subscript in possible_lhs_values cql3: expr: Handle subscript in is_supported_by cql3: expr: Handle subscript in is_satisifed_by cql3: expr: Remove unused attribute cql3: expr: Use column_maybe_subscripted in is_one_of() cql3: expr: Use column_maybe_subscripted in limits() cql3: expr: Use column_maybe_subscripted in equal() cql3: expr: add get_subscripted_column(column_maybe_subscripted) cql3: expr: Add as_column_maybe_subscripted cql3: expr: Make get_value_comparator work with column_maybe_subscripted cql3: expr: Make get_value work with column_maybe_subscripted cql3: expr: Add column_maybe_subscripted cql3: expr: Add get_subscripted_column cql3: expr: Add subscript struct	2022-03-06 19:03:38 +02:00
Gleb Natapov	108e7fcc4e	raft: enter candidate state immediately when starting a singleton cluster When a node starts it does not immediately become a candidate since it waits to learn about an already existing leader and randomizes the time it becomes a candidate to prevent dueling candidates if several nodes are started simultaneously. If a cluster consists of only one node there is no point in waiting before becoming a candidate though because the two cases above cannot happen. This patch checks that the node belongs to a singleton cluster where the node itself is the only voting member and becomes candidate immediately. This reduces the starting time of single node clusters, which are often used in testing. Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>	2022-03-04 20:30:52 +01:00
Kamil Braun	1c5ab5d80c	test: raft: randomized_nemesis_test: when setting up clusters, only create the first server with singleton configuration When setting up clusters in regression tests, a bunch of servers were created, each starting with a singleton configuration containing itself. This is wrong: servers joining to an existing cluster should be started with an empty configuration. It 'worked' because the first server, which we wait for to become a leader before creating the other servers, managed to override the logs and configurations of other servers before they became leaders in their configurations. But if we want to change the logic so that servers in single-server clusters elect themselves as leaders immediately, things start to break. So fix the bug. Message-Id: <20220303100344.6932-1-kbraun@scylladb.com>	2022-03-04 20:29:19 +01:00
Jadw1	213dace26e	CQL3/pytest: Updating test_json Referring to issue #7915, cassandra also works with unprepared statement. There was missing `fromJson()`, the test was inserting string into boolean column.	2022-03-04 14:18:42 +01:00
Jadw1	1902dbc9ff	CQL3: fromJson accepts string as bool The problem was incompatibility with Cassandra, which accepts bool as a string in `fromJson()` UDF. The difference between Cassandra and Scylla now is that Scylla accepts whitespace around the word in the string and Cassandra doesn't. Both are case insensitive. Fixes: #7915	2022-03-04 14:18:34 +01:00
Benny Halevy	eff5076dd5	sstables: close_files: auto-remove temporary sstable directory If the sstable is marked for deletion, e.g. when writing the sstable fails for any reason before it's sealed, make sure to remove the sstable's temporary directory, if present, in addition to the sstable's files. This condition is benign as these empty temp dirs are removed when scylla starts up, but they do accumulate and we had better remove them too. Fixes #9522 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302161827.2448980-1-bhalevy@scylladb.com>	2022-03-03 16:13:03 +02:00
Michael Livshin	0caa21079d	sstables: refrain from throwing on host id mismatch This makes host id mismatch cause a warning and stop being fatal, to un-break node replacement dtests. Should be revisited if/when the underlying problem (double setting of local host id on a replacing node) is fixed. Refs #10148 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>	2022-03-03 15:53:19 +02:00
Benny Halevy	a57c087c89	atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge currently returns std::strong_ordering::equal if two cells are equal in every way except their ttls. The problem with that is that the cells' hashes are different and this will cause repair to keep trying to repair discrepancies caused by the ttl being different. This may be triggered by e.g. the spark migrator that computes the ttl based on the expiry time by subtracting the expiry time from the current time to produce a respective ttl. If the cell is migrated multiple times at different times, it will generate cells that have the same expiry (by design) but different ttl values. Fixes #10156 Test: mutation_test.test_cell_ordering, unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>	2022-03-03 15:27:16 +02:00
Piotr Grabowski	d3673f2b29	types/map.hh: add missing const qualifiers Add missing const qualifiers in serialize_to_bytes and serialize_to_managed_bytes. Lack of those qualifiers caused GCC compilation error: ./types/map.hh: In instantiation of ‘static bytes map_type_impl::serialize_to_bytes(const Range&) [with Range = std::map<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false>, serialized_compare>; bytes = seastar::basic_sstring<signed char, unsigned int, 31, false>]’: cql3/type_json.cc:138:45: required from here ./types/map.hh:72:41: error: loop variable ‘elem’ of type ‘const std::pair<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >&’ binds to a temporary constructed from type ‘const std::pair<const seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >’ [-Werror=range-loop-construct] 72 \| for (const std::pair<bytes, bytes>& elem : map_range) { \| ^~~~ ./types/map.hh:72:41: note: use non-reference type ‘const std::pair<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >’ to make the copy explicit or ‘const std::pair<const seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >&’ to prevent copying Adding those const qualifiers there is correct, as the definition of those functions specifies that the range is of std::pair<const bytes, bytes> elements, not std::pair<bytes, bytes> (before the change): requires std::convertible_to<std::ranges::range_value_t<Range>, std::pair<const bytes, bytes>> Note that there are some GCC compilation problems still left apart from this one. Closes #10157	2022-03-03 14:24:05 +02:00
Benny Halevy	d43da5d6dc	atomic_cell: compare_atomic_cell_for_merge: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302113833.2308533-2-bhalevy@scylladb.com>	2022-03-03 14:13:14 +02:00
Benny Halevy	be865a29b8	atomic_cell: compare_atomic_cell_for_merge: simplify expiry/deltion_time comparison No need to check first that the cells' expiry is different or that deletion_time is different before comparing them with `<=>`. If they are the same the function returns std::strong_ordering::equal anyhow and that is the same as `<=>` comparing identical values. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>	2022-03-03 14:12:44 +02:00
Nadav Har'El	3d0bd523b5	Merge 'CQL3: fromJson out of range integer cause as error' from Jadw1 Passing an integer which exceeds the corresponding type's bounds to `fromJson()` was causing silent overflow, e.g. inserting `fromJson('2147483648')` to an `int` column stored `-2147483648`. Now, this will cause marshal_exception. All integer types are tested against their bounds. Tests referring to issue https://github.com/scylladb/scylla/issues/7914 in `test/cql-pytest/cassandra_tests/validation/entities/json_test.py` won't pass because the expected error messages differ from the thrown ones. I was wondering what the message should be, because expected messages in tests aren't consistent, for instance: - bigint overflow expects `Expected a bigint value, but got a` message - short overflow expects `Unable to make short from` message For now the message is `Value {} out of bound`. Fixes: https://github.com/scylladb/scylla/issues/7914 Closes #10145 * github.com:scylladb/scylla: CQL3/pytest: Updating test_json CQL3: fromJson out of range integer cause as error	2022-03-03 13:46:16 +02:00
Piotr Grabowski	0544973b15	utils/rjson.cc: ignore buggy GCC warning When compiling utils/rjson.cc on GCC, the compilation triggers the following warning (which becomes a compilation error): utils/rjson.cc: In function ‘seastar::future<> rjson::print(const value&, seastar::output_stream<char>&, size_t)’: utils/rjson.cc:239:15: error: typedef ‘using Ch = char’ locally defined but not used [-Werror=unused-local-typedefs] 239 \| using Ch = char; \| ^~ This warning is a false positive. 'using Ch' is actually used internally by rapidjson::Writer. This is a known GCC bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61596), which has not been fixed since 2014. I disabled this warning only locally as other code is not affected by this warning and no other code already disables this warning. Note that there are some GCC compilation problems still left apart from this one. Closes #10158	2022-03-02 19:10:58 +02:00
Pavel Emelyanov	6a154305d7	gossiper: Remove db::config reference from gossiper Also const-ify the db::config reference argument and std::move the gossip_config argument while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 18:34:55 +03:00
Pavel Emelyanov	0c24087007	gossiper: Keep live-updateable options on gossiper These options need to have updateable_value<> instance referencing them from gossiper itself. The updateable_value<> is shard-aware in the sense that it should be constructed on correct shard. This patch does this -- the db::config reference is carried all the way down to the gossiper constructor, then each instance gets its shard-local construction of the updateable_value<>s. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 18:34:55 +03:00
Pavel Emelyanov	271ceb57b9	gossiper: Keep immutable options on gossip_config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 18:34:55 +03:00
Piotr Grabowski	e99f487d31	raft: add missing include Add missing include of "<experimental/source_location>" which caused compile errors on GCC: In file included from raft/fsm.hh:12, from raft/fsm.cc:8: raft/raft.hh:251:30: error: ‘std::experimental’ has not been declared 251 \| state_machine_error(std::experimental::source_location l = std::experimental::source_location::current()) \| ^~~~~~~~~~~~ raft/raft.hh:251:59: error: expected ‘)’ before ‘l’ 251 \| state_machine_error(std::experimental::source_location l = std::experimental::source_location::current()) \| ~ ^~ Note that there are some GCC compilation problems still left apart from this one. Closes #10155	2022-03-02 16:33:43 +01:00
Jadw1	742efc4992	CQL3/pytest: Updating test_json Added test for bigint overflow.	2022-03-02 15:36:09 +01:00
Kamil Braun	09357c784f	service: storage_service: announce new CDC generation immediately with RBNO When a new CDC generation is created (during bootstrap or otherwise), it is assigned a timestamp. The timestamp must be propagated as soon as possible, so all live nodes can learn about the generation before their clocks reach the generation's timestamp. The propagation mechanism for generation timestamps is gossip. When bootstrap RBNO was enabled this was not the case: the generation timestamp was inserted into gossiper state too late, after the repair phase finished. Fix this. Also remove an obsolete comment. Fixes #10149.	2022-03-02 14:55:49 +01:00
Benny Halevy	3b5ba5c1a9	compaction_manager: stop_tasks: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302081547.2205813-3-bhalevy@scylladb.com>	2022-03-02 15:44:10 +02:00
Benny Halevy	95cf4c1c6f	compaction_manager: coroutinize stop_tasks Simplify the function by implementing it as a coroutine, ensuring the input vector, holding the shared task ptrs, is kept alive throughout the lifetime of the function (instead of using do_with to achieve that) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302081547.2205813-2-bhalevy@scylladb.com>	2022-03-02 15:44:10 +02:00
Benny Halevy	d1d3c620b2	compaction_manager: embed task_stop into stop_tasks task_stop is called exclusively from stop_tasks. Now that stop_tasks calls task::stop() directly, there is no need for this separation, so open-code task_stop in stop_tasks, using coroutines. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302081547.2205813-1-bhalevy@scylladb.com>	2022-03-02 15:44:10 +02:00
Kamil Braun	c8e3d5f69d	service: storage_service: fix indentation	2022-03-02 14:43:08 +01:00
Benny Halevy	0764e511bb	compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group It was assumed that offstrategy compaction is always triggered by streaming/repair where it would inherit the caller's scheduling group. However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see how the expiration of this timer will inherit anything from streaming/repair. Also, since `d309a86`, offstrategy compaction may be triggered by the api where it will run in the default scheduling group. The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`. Fixes #10151 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>	2022-03-02 15:36:28 +02:00
Jadw1	0fd7ffb8c1	CQL3: fromJson out of range integer cause as error Passing an integer which exceeds the corresponding type's bounds to `fromJson()` was causing silent overflow, e.g. inserting `fromJson('2147483648')` to an `int` column stored `-2147483648`. Now, this will cause marshal_exception with a value out of bound message. Also, all integer types are tested against their bounds. Fixes: #7914	2022-03-02 14:30:03 +01:00
Botond Dénes	92eb02c301	Merge "Sanitize join_token_ring pre-bootstrap waiter" from Pavel Emelyanov " The set puts the code in question into a helper, coroutinizes it, removes some code duplication, improves a corner case and relaxes logging. tests: unit(dev), dtest.simple_boot_shutdown(v1, dev) " * 'br-join-ring-wait-sanitize-2' of https://github.com/xemul/scylla: storage_service: De-bloat waiting logs storage_service: Indentation fix after previous changes storage_service: Negate loop breaking check storage_service: Fix off-by-one-second waiting storage_service: Pack schema waiting loop storage_service: Out-line schema waiting code storage_service: Make int delay be std::chrono::milliseconds	2022-03-02 15:14:53 +02:00
Mikołaj Sielużycki	f4c57cbe87	memtable: Convert partition_snapshot_flat_reader to v2. This is a facade change only, the make_partition_snapshot_flat_reader function calls upgrade_to_v2 internally. Closes #10152	2022-03-02 15:07:36 +02:00
Pavel Emelyanov	bb1c4adb7c	storage_service: De-bloat waiting logs First thing is that logging can be done with logger methods, not with set_mode() because the mode is already set at this place. Second thing is that pre-update_pending_ranges logs are excessive, as the update_pending_ranges logs its progress itself. Third is that post-logging is also excessive -- there are more logs after those lines. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:55:30 +03:00
Pavel Emelyanov	cb0d298cc4	storage_service: Indentation fix after previous changes Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:55:30 +03:00
Pavel Emelyanov	829ffe630b	storage_service: Negate loop breaking check In simple words turn while { if (continue) { do_something } else { break } } into while { if (!continue) { break; } do_something } Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:55:30 +03:00
Pavel Emelyanov	463aa66b75	storage_service: Fix off-by-one-second waiting The waiting loop needs to abort once a minute passes and does it in one second steps. However, the expiration check happens after sleep, which effectively throws this last second away. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:55:30 +03:00
Pavel Emelyanov	60b53732e5	storage_service: Pack schema waiting loop The newly created method looks like this wait_for_schema_agreement update_pending_ranges while (consistent_range_movement) { pause wait_for_schema_agreement update_pending_range } This patch packs the wait_for_schema_agreement+update_pending_range pairs into a single loop. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:55:30 +03:00
Pavel Emelyanov	d5b75a24a5	storage_service: Out-line schema waiting code And coroutinize while moving. No other changes. While the code in question runs in a thread context and can enjoy synchronous .get() calls, it's still better if it doesn't make any assumptions about its environment. The ring joining code is changing and new intermediate helpers should better be on the safe side from the very beginning. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:53:22 +03:00
Pavel Emelyanov	3ea7539d27	storage_service: Make int delay be std::chrono::milliseconds It's milliseconds and is converted back and forth in join_token_ring(). Having a chrono type for it makes things (mostly code reading) simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 11:51:47 +03:00
Benny Halevy	c6e0245f87	compaction_manager: get rid of the disable method It is unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302080632.2183782-1-bhalevy@scylladb.com>	2022-03-02 11:13:39 +03:00
Nadav Har'El	fa7a302130	cross-tree: split coordinator_result from exceptions.hh Recently, coordinator_result was introduced as an alternative for exceptions. It was placed in the main "exceptions/exceptions.hh" header, which virtually every single source file in Scylla includes. But unfortunately, it brings in some heavy header files and templates, leading to a lot of wasted build time - ClangBuildAnalyzer measured that we include exceptions.hh in 323 source files, taking almost two seconds each on average. In this patch, we split the coordinator_result feature into a separate header file, "exceptions/coordinator_result.hh", and only the few places which need it include the header file. Unfortunately, some of these few places are themselves headers, so the new header file ends up being included in 100 source files - but 100 is still much less than 323 and perhaps we can reduce this number later. After this patch, the total Scylla object-file size is reduced by 6.5% (the object size is a proxy for build time, which I didn't directly measure). ClangBuildAnalyzer reports that now each of the 323 includes of exceptions.hh only takes 80ms, coordinator_result.hh is only included 100 times, and virtually all the cost to include it comes from Boost's result.hh (400ms per inclusion). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220228204323.1427012-1-nyh@scylladb.com>	2022-03-02 10:12:57 +02:00
Raphael S. Carvalho	2dba0670ad	compaction: Fix time_window_backlog_tracker::replace_sstables() Introduced in commit: `ddd693c6d7` We're not emplacing newer windows in the tracker, causing std::out_of_range when replacing sstables for windows. Let's fix the logic and add a unit test to cover this. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>	2022-03-02 10:08:40 +02:00
Botond Dénes	b2061688a5	mutation_writer/multishard_writer: remove now unused v1 factory overloads	2022-03-02 09:58:38 +02:00
Botond Dénes	70e95a9cf7	test/boost/mutation_writer_test: test the v2 variant of distribute_reader_and_consume_on_shards() The underlying implementation behind the v1 and v2 variants of said methods is the same, but we want to move to using the v2 variant in the test as the v1 variant is going away soon.	2022-03-02 09:57:24 +02:00
Botond Dénes	7a119080ee	flat_mutation_reader: add v2 variant of make_generating_reader()	2022-03-02 09:56:50 +02:00
Botond Dénes	bbf8e26a3a	mutation_reader: multishard_writer: migrate implementation to v2	2022-03-02 09:56:10 +02:00
Botond Dénes	cdf7e74da8	mutation_reader: convert foreign_reader to v2	2022-03-02 09:55:38 +02:00
Botond Dénes	ad1b157452	streaming/consumer: convert to v2 At least on the API level, internally there are still conversions, but these are going to be sorted out in the next patches too.	2022-03-02 09:55:09 +02:00
Benny Halevy	1e15caa158	compaction_manager: setup_new_compaction: allow setting output_run_identifier Currently the output_run_identifier is assigned right after calling setup_new_compaction. Move setting the uuid to setup_new_compaction to simplify the flow. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220301083643.1845096-1-bhalevy@scylladb.com>	2022-03-02 09:50:59 +02:00
Michael Livshin	a389cc520b	system_keyspace, sstable: log local host id in key places Specifically: when it is generated, when it is loaded from `system.local`, and when there is a mismatch during sstable validation; in the latter case log the in-sstable host id also. Refs #10148 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220301123925.257766-1-michael.livshin@scylladb.com>	2022-03-02 09:49:37 +02:00
Benny Halevy	c9e06f1246	compaction_manager: task: get rid of the stopping member Instead, rely solely on the compaction_data.abort source that task::stop now uses to stop the task. This makes task stopping permanent, so it can't be undone (as used to be the case where task_stop set stopping to false after waiting for compaction_done, to allow rewrite_sstables's task to be created before calling run_with_compaction_disabled, and start running after it - which is no longer the case) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220301083535.1844829-1-bhalevy@scylladb.com>	2022-03-01 16:46:09 +02:00
Benny Halevy	222389e0f5	compaction_manager: rewrite_sstables: retrieve sstable with compaction disabled before making task Currently, rewrite_sstables retrieves the sstables under run_with_compaction_disabled, after it's created a task for itself. This makes little sense as this task has not started running yet and therefore does not need to be stopped by run_with_compaction_disabled. This is currently worked around by setting task->stopping = false in task_stop(). This change just moves task creation in rewrite_sstables till after the sstables are retrieved and the deferred cleanup of _stats.pending_tasks till after it's first adjusted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220301083409.1844500-1-bhalevy@scylladb.com>	2022-03-01 16:45:33 +02:00
Nadav Har'El	7cf2e5ee5c	Merge 'directory_lister: drop abort method and simplify close semantics' from Benny Halevy This series contains: - lister: move to utils - tidy up the clutter in the root dir Based on Avi's feedback to `[PATCH 1/1] utils: directory_lister: close: always abort queue` that was sent to the mailing list: - directory_lister: drop abort method - lister: do not require get after close to fail - test: lister_test: test_directory_lister_close simplify indentation - cosmetic cleanup Closes #10142 * github.com:scylladb/scylla: test: lister_test: test_directory_lister_close simplify indentation lister: do not require get after close to fail directory_lister: drop abort method lister: move to utils	2022-03-01 16:23:47 +02:00
Botond Dénes	cfa3910509	Merge 'Memtable - scanning and flush readers now implement flat_mutation_reader_v2::impl' from Michael Livshin This PR consists of two changes. The first fixes the flat_mutation_reader and flat_mutation_reader_v2, so that they can be destructed without being closed (if no action has been initiated). This has been discussed in the referenced issue. The second one changes scanning and flush readers so that they implement the second version of the API. It also contains unit test fixes, dealing with flat mutation reader assertions (where the v1 asserter failed to consume range tombstones intelligently enough in some flows) and several sstable_3_x tests (where sstables that contain range tombstones were expected to be byte-by-byte equivalent to a reference, aside from semantic validation). Fixes #9065. Closes #9669 * github.com:scylladb/scylla: flat_reader_assertions: do not accumulate out-of-range tombstones flat_reader_assertions: refactor resetting accumulated tombstone lists flat_mutation_reader_test: fix "test_flat_mutation_reader_consume_single_partition" memtable::make_flush_reader(): return flat_mutation_reader_v2 memtable::make_flat_reader(): return flat_mutation_reader_v2 flat_mutation_reader_v2: add consume_partitions() introduce the MutationConsumer concept mutation_source: clone shortcut constructors for flat_mutation_reader_v2 flat_mutation_reader_v2: add delegating_reader_v2 memtable: upgrade scanning_reader and flush_reader to v2 flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO. tests: stop comparing sstables with range tombstones to C* reference tests: flat_reader_assertions: improve range tombstone checking	2022-02-28 17:23:20 +02:00
Michael Livshin	fb6c79015a	flat_reader_assertions: do not accumulate out-of-range tombstones Also remove the incorrect difference in range tombstone checking behavior between `produces_range_tombstone()` and `produces(const range_tombstone&)` by having both turn on checking. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	9fa4d9a2bb	flat_reader_assertions: refactor resetting accumulated tombstone lists Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	2221aeff0e	flat_mutation_reader_test: fix "test_flat_mutation_reader_consume_single_partition" Since `flat_reader_assertions::produces(const range_tombstone&,...)` records the range tombstone for checking, be sure to explicitly pass in a clustering range that does not extend beyond the mock-read part of the mutation. Also (provisionally) change the assertion method to accept clustering ranges. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	34ed752885	memtable::make_flush_reader(): return flat_mutation_reader_v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	9bacce4359	memtable::make_flat_reader(): return flat_mutation_reader_v2 This is just a facade change. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	8da28d0902	flat_mutation_reader_v2: add consume_partitions() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	ce8f34f5a0	introduce the MutationConsumer concept Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	68cfb6261f	mutation_source: clone shortcut constructors for flat_mutation_reader_v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	fbbe27051e	flat_mutation_reader_v2: add delegating_reader_v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michał Radwański	2a3bd40c69	memtable: upgrade scanning_reader and flush_reader to v2 This change is a part of effort to migrate existing readers from old API to the new one. The corresponding make_flush_reader and make_flat_reader functions still return flat_mutation_reader.	2022-02-28 17:11:54 +02:00
Michał Radwański	9ada63a9cb	flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO. In functions such as upgrade_to_v2 (excerpt below), if the constructor of transforming_reader throws, r needs to be destroyed, however it hasn't been closed. However, if a reader didn't start any operations, it is safe to destruct such a reader. This issue can potentially manifest itself in many more readers and might be hard to track down. This commit adds a bool indicating whether a close is anticipated, thus avoiding errors in the destructor. Code excerpt: flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) { class transforming_reader : public flat_mutation_reader_v2::impl { // ... }; return make_flat_mutation_reader_v2<transforming_reader>(std::move(r)); } Fixes #9065.	2022-02-28 17:11:54 +02:00
Michael Livshin	67c3c31a6e	tests: stop comparing sstables with range tombstones to C* reference As flat mutation reader {up,down}grades get added to the write path, comparing range-tombstone-containing (at least) sstables byte-by-byte to a reference is starting to seem like a fool's errand. * When a flat mutation reader is {up,down}graded, information may get lost while splitting range tombstones. Making those splits revertable should in theory be possible but would surely make {up,down}graders slower and more complex, and may also possibly entail adding information to in-memory representation of range tombstones and range tombstone changes. Such investment for the sake of 7 unit tests does not seem wise, given that the plan is to get rid of reader {up,down}grade logic once the move to flat mutation reader v2 is completed. * All affected tests also validate their written sstables semantically. * At least some of the offending reference sstables are not "canonical" wrt range tombstones to begin with -- they contain range tombstones that overlap with clustering rows. The fact that Scylla does not "canonicalize" those in some way seems purely incidental. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	2337d48b41	tests: flat_reader_assertions: improve range tombstone checking `produces_range_tombstone()` is smart enough to not just try to read one range tombstone from the input and compare it to the passed reference, but to read as many range tombstones as the reader is looking at (including none) using `may_produce_tombstones()` and record those appropriately. When `produces(const schema&, const mutation_fragment&)` is passed a range tombstone as the second argument, it does not do anything special -- it just reads one fragment, disregards it (!), and applies its second argument to both "expected" and "encountered" range tombstone lists. The right thing here is to use the same logic as `produces_range_tombstone()`; upcoming memtable-related reader changes (which result in more split range tombstones) cause some unit tests to fail without fixing this. Refactor the relevant logic into a private method (`apply_rt()`) and use that in both places. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Nadav Har'El	f84094320d	exceptions: de-inline exception constructors The header file "exceptions/exceptions.hh" and the exception types in it is used by virtually every source file in Scylla, so excessive includes and templated code generation in this header could slow down the build considerably. Before this patch, all of the exceptions' constructors were inline in exceptions.hh, so source file using one of these exceptions will need to recompile the code, which is fairly heavy, using the fmt templates for various types. According to ClangBuildAnalyzer, 323 source files needed to materialize prepare_message<db::consistency_level,int&,int&>, taking 0.3 seconds each. So this patch moves the exception constructors from the header file exceptions.hh to the source file exceptions.cc. The header file no longer uses fmt. Unfortunately, the actual build-time savings from this patch is tiny - around 0.1%... It turns out that most of the prepare_message<> compilation time comes from fmt compilation time, and since virtually all source files use fmt for other header reasons (intentionally or through other headers), no compilation time can be saved. Nevertheless, I hope that as we proceed with more cleanups like this and eliminate more unnecessary code-generation-in-headers, we'll start seeing build time drop. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-28 14:47:41 +02:00
Botond Dénes	f8da0a8d1e	Merge "Conceptualize some static assertions" From Pavel Emelyanov " Some templates put constraints onto the involved types with the help of static assertions. Having them in form of concepts is much better. tests: unit(dev) " * 'br-static-assert-to-concept' of https://github.com/xemul/scylla: sstables: Remove excessive type-match assertions mutation_reader: Sanitize invocable asserion and concept code: Convert is_future result_of assertions into invoke_result concept code: Convert is_same+result_of assertions into invocable concepts code: Convert nothrow construction assertions into concepts code: Convert is_integral assertions to concepts	2022-02-28 13:58:01 +02:00
Nadav Har'El	b650ff5808	test/cql-pytest: test another corner-case of scientific-notation integers In a previous patch, we added a test for the case of Scylla trying to assign the JSON value 1e6 into an integer - which should be allowed because 1e6 is indeed a whole number, in the range of int. We already fixed that in commit `efe7456f0a`, but this patch adds another test which demonstrates that an even more esoteric problem remains: If we are reading a JSON value into a bigint (CQL's 64-bit integer), and if the number is between 2^53 and 2^63-1 and if the number is written using scientific notation, e.g., 922337203685477580.7e1 (which is 2^63-1), then the bigint is set incorrectly, with some digits being lost. The problem is that RapidJSON reads this integer into the "double" type, which only keeps 53 significant bits. Because this is an open issue (#10137), the test included here is marked as expected failure (xfail). The test is also known to fail in Cassandra - which doesn't allow scientific notation for JSON integers at all despite the JSON standard - so the test is also marked "cassandra_bug". Refs #10137 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-28 13:52:56 +02:00
Benny Halevy	1768aae603	compaction_manager: rewrite_sstables: construct compacting_sstable_registration with compaction_manager& Rather than using a std::optional<compacting_sstable_registration> for lazy construction, construct the object early and call register_compacting when the sstables to register are available. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:52:03 +02:00
Benny Halevy	1584c50710	compaction_manager: compacting_sstable_registration: keep a compaction_manager& Rather than a compaction_manager* so that in the next patch it could be constructed with just that and the caller can call register_compacting when it has the sstables to register ready. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:52:03 +02:00
Benny Halevy	c008fb137b	compaction_manager: use unordered_set for compacting sstables registration It is more efficient than using a vector as the interface. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:52:03 +02:00
Benny Halevy	9c89c2df37	test: lister_test: test_directory_lister_close simplify indentation There's no need anymore for an indented block to destroy the directory_lister since the other sub-case was deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:00:03 +02:00
Benny Halevy	41d097ef47	lister: do not require get after close to fail Currently, the lister test expected get() to always fail after close(), but it unexpectedly succeeded if get() was never called before close, as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4587/artifact/testlog/x86_64_debug/lister_test.test_directory_lister_close.4001.log ``` random-seed=1475104835 Generated 719 dir entries Getting 565 dir entries Closing directory_lister Getting 0 dir entries Closing directory_lister test/boost/lister_test.cc(190): fatal error: in "test_directory_lister_close": exception std::exception expected but not raised ``` This change relaxes this requirement to keep close() simple, based on Avi's feedback: > The user should call close(), and not do it while get() is running, and > that's it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 12:59:08 +02:00
Benny Halevy	00327bfae3	directory_lister: drop abort method Based on Avi's feedback: > We generally have a public abort() only if we depend on an external > event (like data from a tcp socket) that we don't control. But here > there are no such external events. So why have a public abort() at all? If needed in the future, we can consider adding get(abort_source&) to allow aborting get() via an external event. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 12:52:47 +02:00
Benny Halevy	ebbbf1e687	lister: move to utils There's nothing specific to scylla in the lister classes, they could (and maybe should) be part of the seastar library. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 12:36:03 +02:00
Botond Dénes	d27259ca5b	mutation_writer/multishard_writer: add v2 variant of distribute_reader_and_consume_on_shards() Just the factory function itself. The underlying machinery stays v1 for now. Behind the scenes the v2 variant still invokes the v1 one, with the necessary conversions. This allows migrating users to the v2 interface, migrating the machinery later.	2022-02-28 10:48:08 +02:00
Jan Ciolek	e086201420	cql3: expr: Remove sub from column_value column_value::sub has been replaced by the subscript struct everywhere, so we can finally remove it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 22:02:39 +01:00
Jan Ciolek	b80f9e6cf8	cql3: Create a subscript in single_column_relation When `val[sub]` is parsed, it used to be the case that column_value with a sub field was created. Now this has been changed to creating a subscript struct. This is the only place where a subscripted value can be created. All the code regarding subscripts now operates using only the subscript struct, so we will be able to remove column_value::sub soon. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 22:02:39 +01:00
Jan Ciolek	cf6e81e731	cql3: expr: Add subscript to expression All handlers for subscript have finally been implemented and subscript can now be added to expression without any trouble. All the commented out code that waited for this moment can now be uncommented. Every such piece of code had a `TODO(subscript)` note and by grepping this phrase we can make sure that we didn't forget any of them. Right now there are two ways to express a subscripted column - either by a column_value with a sub field or by using a subscript struct. The grammar still uses the old column_value way, but column_value.sub will be removed soon and everything will move to the subscript struct. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 22:02:29 +01:00
Jan Ciolek	0a7636b2d4	cql3: Handle subscript in multi_column_range_accumulator Same case as column_value. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	818e3544bb	cql3: Handle subscript in selectable_process_selection Selected values can't be subscripted, the grammar in Cql.g doesn't allow it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	ec6f93d0c7	cql3: expr: Handle subscript in test_assignment test_assignment can't be passed a column_value, so a subscript won't work as well. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	ab89fc316b	cql3: expr: Handle subscript in prepare_expression column_value can't be prepared, so subscript can't be prepared as well. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	1a653f8f36	cql3: Handle subscript in prepare_selectable Selected values can't be subscripted, the grammar in Cql.g doesn't allow it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	ef2acddcb9	cql3: expr: Handle subscript in extract_clustering_prefix_restrictions extract_clustering_prefix_restrictions collects restrictions on clustering key columns. In case we encounter col[sub] we treat it as a restriction on col and add it to the result. This seems to make some sense and is in line with the current behaviour which doesn't check whether a column is subscripted at all. The code has been copied from column_value& handler. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	ec57c2516e	cql3: expr: Handle subscript in extract_partition_range extract_partition_range collects restrictions on partition key columns. In case we encounter col[sub] we treat it as a restriction on col and add it to the result. This seems to make some sense and is in line with the current behaviour which doesn't check whether a column is subscripted at all. The code has been copied from column_value& handler. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	c39498537c	cql3: expr: Handle subscript in fill_prepare_context fill_prepare_context collects useful information about the expression involved in query restrictions. We should collect this information from subscript as well, just like we do from column_value and its sub. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	811685ad6a	cql3: expr: Handle subscript in evaluate A column_value can't be evaluated, so a subscripted column can't be evaluated either. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	db8990436a	cql3: expr: Handle subscript in extract_single_column_restrictions_for_column extract_single_column_restrictions_for_column finds all restrictions for a column and puts them in a vector. In case we encounter col[sub] we treat it as a restriction on col and add it to the result. This seems to make some sense and is in line with the current behaviour which doesn't check whether a column is subscripted at all. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	2eaa39e1c8	cql3: expr: Handle subscript in search_and_replace Prepare a handler for subscript in search_and_replace. Some of the code must be commented out for now because subscript hasn't been added to expression yet. It will be uncommented later. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	e2d983f659	cql3: expr: Handle subscript in recurse_until Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	2d4174dc46	cql3: expr: Implement operator<< for subscript expression can be printed using operator<<. We need to handle subscript there. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	02c3b78e25	cql3: expr: Handle subscript in possible_lhs_values possible_lhs_values returns set of possible values for a column given some restrictions. Current behaviour in case of a subscripted column is to just ignore the subscript and treat the restriction as if it were on just the column. This seems wrong, or at least confusing, but I won't change it in this patch to preserve the existing behaviour. Trying to change this to something more reasonable breaks other code which assumes that possible_lhs_values returns a list of values. (See partition_ranges_from_EQs() in cql3/restrictions/statement_restrictions.cc) Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	07fbf74a97	cql3: expr: Handle subscript in is_supported_by is_supported_by checks whether the given expression is supported by some index. The current behaviour seems wrong, but I kept it to avoid making changes in a refactor PR. Scylla doesn't have indexes on map entries yet, so for a subscript the answer is always no. I think we should just return false there. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	fb59f488df	cql3: expr: Handle subscript in is_satisifed_by For the most part subscript can be handled in the same way as column_value. column_value has a sub argument and all called functions evaluate lhs value using get_value() which is prepared to handle subscripted columns. These functions now take column_maybe_subscripted so we can pass &subscript to them without a problem. The difference is in CONTAINS, CONTAINS_KEY and LIKE. contains() and contains_key() throw an exception when the passed column has a subscript, so now we just throw an exception immediately. like() doesn't have a check for subscripted value, but from reading its code it's clear that it's not ready to handle such values, so an exception is now thrown as well. It shouldn't break any tests because when one tries to perform a query like: `select * from t where m[0] like '%' allow filtering;` an exception is thrown somewhere earlier in the code. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	1edaa3ef0d	cql3: expr: Remove unused attribute Functions that were previously marked as unused to make the code compile are now used and we can remove the markings. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	cf839807ac	cql3: expr: Use column_maybe_subscripted in is_one_of() is_one_of() used to take column_value which could be subscripted as an argument. column_value.sub will be removed so this function needs to take column_maybe_subscripted now. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	75c8b2ec6c	cql3: expr: Use column_maybe_subscripted in limits() limits() used to take column_value which could be subscripted as an argument. column_value.sub will be removed so this function needs to take column_maybe_subscripted now. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	bc8c298be3	cql3: expr: Use column_maybe_subscripted in equal() equal() used to take column_value which could be subscripted as an argument. column_value.sub will be removed so this function needs to take column_maybe_subscripted now. To get lhs value the code uses get_value() which is ready to handle subscripted columns. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	ca423a455e	cql3: expr: add get_subscripted_column(column_maybe_subscripted) Add a function that extracts the column_value from column_maybe_subscripted. There were already overloads for expression and subscript, but this one will be needed as well. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	6d42ff580d	cql3: expr: Add as_column_maybe_subscripted Add a convenience function that allows to convert a reference to expression to column_maybe_subscripted. It will be useful in a moment. For now part of it must be commented out because subscript is not in the expression variant yet. It will be uncommented once subscript is finally added to expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	d577af0f0c	cql3: expr: Make get_value_comparator work with column_maybe_subscripted There is get_value_comparator(column_value) but soon we will also need get_value_comparator(column_maybe_subscripted). Implement it by copying code from get_value_comparator(column_value). Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	a8287a158a	cql3: expr: Make get_value work with column_maybe_subscripted There is a get_value(column_value), but soon we will also need get_value(column_maybe_subscripted). Implement get_value(column_maybe_subscripted) by checking whether the argument is a column_value or subscript and calling the right code. Code for handling the subscript case is copied from get_value(column_value) where sub has value. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	feee6e4ffb	cql3: expr: Add column_maybe_subscripted column_maybe_subscripted is a variant that can be either a column_value or a subscript. It will be used as an argument to functions which used to take column_value. Right now column_value has a sub field, but this will be removed soon once the subscript struct takes over. Changing the argument type is a smaller change than rewriting all these functions, although if they were rewritten the resulting code would probably be nicer. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:41 +01:00
Jan Ciolek	4d7438d30a	cql3: expr: Add get_subscripted_column Even though the new subscript allows for subscripting anything, the only thing that is really allowed to be subscripted is a column. Add a utility function that extracts the column_value from an expression which is a column_value or subscript. It will come in handy in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 21:56:30 +01:00
Benny Halevy	132c9d5933	main: shutdown: do not abort on certain system errors Currently any unhandled error during deferred shutdown is rethrown in a noexcept context (in ~deferred_action), generating a core dump. The core dump is not helpful if the cause of the error is "environmental", i.e. in the system, rather than in scylla itself. This change detects several such errors and calls _Exit(255) to exit the process early, without leaving a coredump behind. Otherwise, call abort() explicitly, rather than letting terminate() be called implicitly by the destructor exception handling code. Fixes #9573 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>	2022-02-27 16:26:48 +02:00
Nadav Har'El	5df6e56fbf	Update seastar submodule * seastar 2849a8a8...1d81c8e5 (3): > Merge "make semaphore and shared_promise abortable" from Gleb > Fix io_tester.cc compilation with clang > Revert "Merge "make semaphore and shared_promise abortable" from Gleb"	2022-02-27 13:00:41 +02:00
Eliran Sinvani	4eb0398457	prepared_statements: Invalidate batch statement too It seems that batch prepared statements always return false for depends_on. This in turn makes the removal criterion for the prepared statements cache always false, so the queries are never evicted. Here we change the function to return the true state, meaning it returns true if one of the sub-queries depends on the keyspace and/or column family. Fixes #10129 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-02-27 11:48:03 +02:00
Eliran Sinvani	bf50dbd35b	cql3 statements: Change dependency test API to express better it's purpose Cql statements used to have two API functions, depends_on_keyspace and depends_on_column_family. The latter took only a table name as a parameter, which makes no sense. There could be multiple tables with the same name, each in a different keyspace, and it doesn't make sense to generalize the test, i.e. to ask "Does a statement depend on any table named XXX?" In this change we unify the two calls into one, depends_on, which takes a keyspace name and optionally also a table name. That way every logical dependency test that makes sense is supported by a single API call.	2022-02-27 11:48:03 +02:00
Jan Ciolek	a5bcd4f7f2	cql3: expr: Add subscript struct Add a struct called subscript, which will be used in expression variant to represent subscripted values e.g col[x], val[sub]. It will replace the sub field of column_value. Having a separate struct in AST for this purpose is cleaner and allows to express subscripting values other than column_value. It is not added to the expression variant yet, because that would require immediately implementing all visitors. The following commits will implement individual visitors and then subscript will finally be added to expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-27 01:32:59 +01:00
MaciekCisowski	439001b8c2	service_level_controller: fix small typo in exception message Closes #10136	2022-02-26 22:23:26 +02:00
Tomasz Grabiec	7719f4cd91	Merge "Group 0 discovery: persist and restore peers" from Kamil We add a `peers()` method to `discovery` which returns the peers discovered until now (including seeds). The caller of functions which return an output -- `tick` or `request` -- is responsible for persisting `peers()` before returning the output of `tick`/`request` (e.g. before sending the response produced by `request` back). The user of `discovery` is also responsible for restoring previously persisted peers when constructing `discovery` again after a restart (e.g. if we previously crashed in the middle of the algorithm). The `persistent_discovery` class is a wrapper around `discovery` which does exactly that. For storage we use a simple local table. A simple bugfix is also included in the first patch. * kbr/discovery-persist-v3: service: raft: raft_group0: persist discovered peers and restore on restart db: system_keyspace: introduce discovery table service: raft: discovery: rename `get_output` to `tick` service: raft: discovery: stop returning peer_list from `request` after becoming leader	2022-02-25 17:23:08 +01:00
Avi Kivity	ff2cd72766	Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec cached_page::on_evicted() is invoked in the LSA allocator context, set in the reclaimer callback installed by the cache_tracker. However, cached_pages are allocated in the standard allocator context (note: page content is allocated inside LSA via lsa_buffer). The LSA region will happily deallocate these, thinking that these are large objects which were delegated to the standard allocator. But the _non_lsa_memory_in_use metric will underflow. When it underflows enough, shard_segment_pool.total_memory() will become 0 and memory reclamation will stop doing anything, leading to apparent OOM. The fix is to switch to the standard allocator context inside cached_page::on_evicted(). evict_range() was also given the same treatment as a precaution, it currently is only invoked in the standard allocator context. The series also adds two safety checks to LSA to catch such problems earlier. Fixes #10056 \cc @slivne @bhalevy Closes #10130 * github.com:scylladb/scylla: lsa: Abort when trying to free a standard allocator object not allocated through the region lsa: Abort when _non_lsa_memory_in_use goes negative tests: utils: cached_file: Validate occupancy after eviction test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch utils: cached_file: Fix alloc-dealloc mismatch during eviction	2022-02-25 18:19:04 +02:00
Botond Dénes	daf0f7cee5	tools/types: update main description Remove examples and instead point user to action-specific help for more information about specific actions.	2022-02-25 15:02:07 +02:00
Botond Dénes	af19d5ccf1	tools/scylla-types: per-action help content Just like scylla-sstable, have a separate --help content for each action. The existing description is shortened and is demoted to summary: this now only appears in the listing in the main description.	2022-02-25 15:01:02 +02:00
Botond Dénes	629a5c3ed6	tools/scylla-types: description: remove -- from action listing Actions are commands, not switches now, update the listing in the description accordingly.	2022-02-25 15:00:47 +02:00
Botond Dénes	05bd6b2bce	tools/scylla-types: use fmt::print() instead of std::cout << `std::cout <<` makes for very hard-to-read (and hard-to-write) code. Replace with `fmt::print()`.	2022-02-25 15:00:21 +02:00
Botond Dénes	d8833de3bb	Merge "Redefine Compaction Backlog to tame compaction aggressiveness" From Raphael S. Carvalho " Problem statement ================= Today, compaction can act much more aggressive than it really has to, because the strategy and its definition of backlog are completely decoupled. The backlog definition for size-tiered, which is inherited by all strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the world must reach the state of zero amplification. But that's unrealistic and goes against the intent amplification defined by the compaction strategy. For example, size tiered is a write oriented strategy which allows for extra space amplification for compaction to keep up with the high write rate. It can be seen today, in many deployments, that compaction shares is either close to 1000, or even stuck at 1000, even though there's nothing to be done, i.e. the compaction strategy is completely satisfied. When there's a single sstable per tier, for example. This means that whenever a new compaction job kicks in, it will act much more aggressive because of the high shares, caused by false backlog of the existing tables. This translates into higher P99 latencies and reduced throughput. Solution ======== This problem can be fixed, as proposed in the document "Fixing compaction aggressiveness due to suboptimal definition of zero backlog by controller" [1], by removing backlog of tiers that don't have to be compacted now, like a tier that has a single file. That's about coupling the strategy goal with the backlog definition. So once strategy becomes satisfied, so will the controller. Low-efficiency compaction, like compacting 2 files only or cross-tier, only happens when system is under little load and can proceed at a slower pace. Once efficient jobs show up, ongoing compactions, even if inefficient, will get more shares (as efficient jobs add to the backlog) so compaction won't fall behind. 
With this approach, throughput and latency are improved as cpu time is no longer stolen (unnecessarily) from the foreground requests. [1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ Results ======= Test sequentially populates 3 tables and then runs a mixed workload on them, where disk:memory ratio (usage) reaches ~30:1 at the peak. Please find graphs here: https://user-images.githubusercontent.com/1409139/153687219-32368a35-ac63-461b-a362-64dbe8449a00.png 1) Patched version started at ~01:30 2) On population phase, throughput increase and lower P99 write latency can be clearly observed. 3) On mixed phase, throughput increase and lower P99 write and read latency can also be clearly observed. 4) Compaction CPU time sometimes reaches ~100% because of the delay between each loader. 5) On unpatched version, it can be seen that backlog keeps growing even though strategies become satisfied, so compaction is using much more CPU time in comparison. Patched version correctly clears the backlog. Can also be found at: github.com/raphaelsc/scylla.git compaction-controller-v5 tests: UNIT(dev, debug). " * 'compaction-controller-v5' of https://github.com/raphaelsc/scylla: tests: Add compaction controller test test/lib/sstable_utils: Set bytes_on_disk for fake SSTables compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component compaction: Redefine compaction backlog to tame compaction aggressiveness compaction_backlog_tracker: Batch changes through a new replacement interface table: Disable backlog tracker when stopping table compaction_backlog_tracker: make disable() public compaction_backlog_tracker: Clear tracker state when disabled compaction: Add normalized backlog metric compaction: make size_tiered_compaction_strategy static	2022-02-25 09:21:08 +02:00
Pavel Emelyanov	40078a6f8c	types.hh: Nitpick on <=> usage tri_compare_opt can avoid casting bool to int for spaceshipping int - int <=> 0 looks nicer and shorter as int <=> int data_type::compare from serialized_tri_compare already returns strong_ordering tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220224125556.13138-1-xemul@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	c26230943b	alternator ttl: add metrics This patch adds metrics to the Alternator TTL feature (aka the "expiration service"). I put these metrics deliberately in their own object in ttl.{hh,cc}, and also with their own prefix ("expiration_") - and *not* together with the rest of the Alternator metrics (alternator/stats.{hh,cc}). This is because later we may want to use the expiration service not only in Alternator but also in CQL - to support per-item expiration with CDC events also in CQL. So the implementation of this feature should not be too tangled with that of Alternator. The patch currently adds four metrics, and opens the path to easily add more in the future. The metrics added now are: 1. scylla_expiration_scan_passes: The number of scan passes over the entire table. We expect this to grow by 1 every alternator_ttl_period_in_seconds seconds. 2. scylla_expiration_scan_table: The number of table scans. In each scan pass, we scan all the tables that have the Alternator TTL feature enabled. Each scan of each table is counted by this counter. 3. scylla_expiration_items_deleted: Counts the number of items that the expiration service expired (deleted). Please remember that each item is considered for expiration - and then expired - on only one node, so each expired item is counted only once - not RF times. 4. scylla_expiration_secondary_ranges_scanned: If this counter is incremented, it means this node took over some other node's expiration scanning duties while the other node was down. This patch also includes a couple of unrelated comment fixes. I tested the new metrics manually - they aren't yet tested by the Alternator test suite because I couldn't make up my mind if such tests would belong in test_ttl.py or test_metrics.py :-) Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220224092419.1132655-1-nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Asias He	ec59f7a079	repair: Do not flush hints and batchlog if tombstone_gc_mode is not repair The flush of hints and batchlog are needed only for the table with tombstone_gc_mode set to repair mode. We should skip the flush if the tombstone_gc_mode is not repair mode. Fixes #10004 Closes #10124	2022-02-25 07:26:11 +02:00
Nadav Har'El	d1b4cbfbc3	test/cql-pytest: add reproducer for LWT bug with static-column conditions This patch adds a reproducing test for issue #10081. That issue is about a conditional (LWT) UPDATE operation that chose a non-existent row via WHERE, and its condition refers to both static and regular columns: In that case, the code incorrectly assumes that because it didn't read any row, all columns are null - and forgets that the static column is not null. The test, test_lwt.py::test_lwt_missing_row_with_static passes on Cassandra but fails on Scylla, so is marked xfail. Refs #10081 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220215215243.660087-1-nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Avi Kivity	8f2bc838af	Update seastar submodule * seastar ea6a6820ed...2849a8a8ba (1): > Merge "make semaphore and shared_promise abortable" from Gleb include fixup from Gleb added.	2022-02-25 07:26:11 +02:00
Benny Halevy	e2894bc762	compaction_manager: task: use plain UUID Now that a null uuid is defined to be logically false there's no need to use an optional UUID. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	db7b11cfc4	alternator: make TTL expiration scanner bypass cache The background scan for expired Alternator items (the TTL feature) should bypass the cache to avoid polluting it with the entire content of the table being scanned. I tested that the flag added in this patch really works by adding a printout to the code in table.cc which creates the reader. Although we do have a metric for uses of BYPASS CACHE, unfortunately this metric counts usage of "BYPASS CACHE" in CQL statements - and does not account for the low-level calls that we use in the ttl scanner. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	e06b5d9306	alternator: updated compatibility.md about TTL feature The document docs/alternator/compatibility.md suggested that Alternator does not support the TTL feature at all. The real situation is more optimistic - this feature is supported, but as an experimental feature. So let's update compatibility.md with the real status of this feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Nadav Har'El	49a8164fb7	alternator: add configurable scan period to TTL expiration Before this patch, the experimental TTL (expiration time) feature in Alternator scans tables for expiration in a tight loop - starting the next scan one second after the previous one completed. In this patch we introduce a new configuration option, alternator_ttl_period_in_seconds, which determines how frequently to start the scan. The default is 24 hours - meaning that the next scan is started 24 hours after the previous one started. The tests (test/alternator/run) change this configuration back to one second, so that expiration tests finish as quickly as possible. Please note that the scan is not slowed down to fill this 24 hours - if it finishes in one hour, it will then sleep for 23 hours. Additional work would be needed to slow down the scan to not finish too quickly. One idea not yet implemented is to move the expiration service from the "maintenance" scheduling group which it uses today to a new scheduling group, and modifying the number of shares that this group gets. Another thing worth noting about the configurable period (which defaults to 24 hours) is that when TTL is enabled on an Alternator table, it can take that amount of time until its scan starts and items start expiring from it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
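The scheduling rule described in 49a8164fb7 (the next scan starts one period after the previous scan started, so a one-hour scan is followed by 23 hours of sleep) can be illustrated with a minimal sketch. The helper name `sleep_after_scan` is hypothetical and is not taken from the Alternator sources; the behavior for a scan that runs longer than the period is not stated in the commit message, so starting the next scan immediately is an assumption made here.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper (not the Alternator implementation): how long to sleep
// after a scan finishes so that the next scan starts one full period after
// the previous scan STARTED.
std::chrono::seconds sleep_after_scan(std::chrono::seconds period,
                                      std::chrono::seconds scan_duration) {
    // A non-positive period makes no sense for a periodic scan.
    assert(period.count() > 0);
    // Assumption: a scan that overran the period is followed by no sleep at all.
    return std::max(period - scan_duration, std::chrono::seconds::zero());
}
```

With a 24-hour period and a one-hour scan this returns 23 hours, matching the example in the commit message; the scan itself is not slowed down to fill the period.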
Tomasz Grabiec	1d75a8c843	lsa: Abort when trying to free a standard allocator object not allocated through the region It indicates alloc-dealloc mismatch, and can cause other problems in the system, like being unable to reclaim memory. We want to catch this at the deallocation site to be able to quickly identify the offender. Misbehavior of this sort can cause fake OOMs due to underflow of _non_lsa_memory_in_use. When it underflows enough, shard_segment_pool.total_memory() will become 0 and memory reclamation will stop doing anything. Refs #10056	2022-02-25 01:42:15 +01:00
Tomasz Grabiec	9dd4153c16	lsa: Abort when _non_lsa_memory_in_use goes negative It indicates alloc-dealloc mismatch, and can cause other problems in the system, like being unable to reclaim memory. Catch early. Refs #10056	2022-02-25 01:42:15 +01:00
Tomasz Grabiec	ca09a72597	tests: utils: cached_file: Validate occupancy after eviction Reproducer for #10056 Catches alloc-dealloc mismatch leading to the underflow of _non_lsa_memory_in_use.	2022-02-25 01:42:15 +01:00
Tomasz Grabiec	b0d5bb334c	test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch The test was allocating entries in the standard allocator, but they are evicted in the LSA allocator context. Fix by allocating under LSA.	2022-02-25 01:42:15 +01:00
Raphael S. Carvalho	2a7939ee4d	tests: Add compaction controller test There's no automated test for controller, it's time to have one. Let's start with a basic one that verifies the assumption that perfectly compacted tiers should produce 0 backlog. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 18:57:45 -03:00
Raphael S. Carvalho	96cfe7d530	test/lib/sstable_utils: Set bytes_on_disk for fake SSTables Not precise, as bytes_on_disk accounts for all components, but good enough for testing purposes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 18:57:45 -03:00
Raphael S. Carvalho	a8caa67937	compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component For describing data size, we use unsigned types. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 18:57:45 -03:00
Raphael S. Carvalho	1d9f53c881	compaction: Redefine compaction backlog to tame compaction aggressiveness Today, compaction can act much more aggressive than it really has to, because the strategy and its definition of backlog are completely decoupled. The backlog definition for size-tiered, which is inherited by all strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the world must reach the state of zero amplification. But that's unrealistic and goes against the intent amplification defined by the compaction strategy. For example, size tiered is a write oriented strategy which allows for extra space amplification for compaction to keep up with the high write rate. It can be seen today, in many deployments, that compaction shares is either close to 1000, or even stuck at 1000, even though there's nothing to be done, i.e. the compaction strategy is completely satisfied. When there's a single sstable per tier, for example. This means that whenever a new compaction job kicks in, it will act much more aggressive because of the high shares, caused by false backlog of the existing tables. This translates into higher P99 latencies and reduced throughput. Solution ======== This problem can be fixed, as proposed in the document "Fixing compaction aggressiveness due to suboptimal definition of zero backlog by controller" [1], by removing backlog of tiers that don't have to be compacted now, like a tier that has a single file. That's about coupling the strategy goal with the backlog definition. So once strategy becomes satisfied, so will the controller. Low-efficiency compaction, like compacting 2 files only or cross-tier, only happens when system is under little load and can proceed at a slower pace. Once efficient jobs show up, ongoing compactions, even if inefficient, will get more shares (as efficient jobs add to the backlog) so compaction won't fall behind. 
With this approach, throughput and latency are improved as cpu time is no longer stolen (unnecessarily) from the foreground requests. [1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ Fixes #4588. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 18:57:38 -03:00
Raphael S. Carvalho	ddd693c6d7	compaction_backlog_tracker: Batch changes through a new replacement interface This new interface allows table to communicate multiple changes in the SSTable set with a single call, which is useful on compaction completion for example. With this new interface, the size tiered backlog tracker will be able to know when compaction completed, which will allow it to recompute tiers and their backlog contribution, if any. Without it, tiered tracker would have to recompute tiers for every change, which would be terribly expensive. The old remove/add interfaces are being removed in favor of the new one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 15:34:16 -03:00
Pavel Emelyanov	3f884fbdd7	sstables: Remove excessive type-match assertions The primitive_consumer method templates overcomplicate the declaration of the fact that one of the method arguments is the sub-type of a template argument Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:49:20 +03:00
Pavel Emelyanov	b1843e50de	mutation_reader: Sanitize invocable assertion and concept Both are present in the filtering_reader template; leave only the concept and convert it into a one-line invocable check Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:48:37 +03:00
Pavel Emelyanov	ffbf19ee3c	code: Convert is_future result_of assertions into invoke_result concept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:47:32 +03:00
Pavel Emelyanov	645896335d	code: Convert is_same+result_of assertions into invocable concepts Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:46:10 +03:00
Pavel Emelyanov	063da81ab7	code: Convert nothrow construction assertions into concepts The small_vector also has N>0 constraint that's also converted Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:44:50 +03:00
Pavel Emelyanov	b8401f2ddd	code: Convert is_integral assertions to concepts Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:44:29 +03:00
Raphael S. Carvalho	84d843697b	table: Disable backlog tracker when stopping table Backlog tracker is managed by compaction strategy, and we'd like to have it disabled in table::stop(), to make sure that all state is cleared. For example, a reference to a shared sstable, in the tracker implementation, could prevent the sstable manager from being stopped as it relies on all sstables managed by it being closed first. By calling tracker's disable() method, table::stop() will guarantee that state is cleared by completion. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:41:05 -03:00
Raphael S. Carvalho	26350c8591	compaction_backlog_tracker: make disable() public Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:50 -03:00
Raphael S. Carvalho	c15e055612	compaction_backlog_tracker: Clear tracker state when disabled If the tracker is disabled, we never get to access the underlying implementation anymore. It makes sense to clear _impl on disable(). So table::stop() can call its backlog tracker's disable method, clearing all its state. This is important for clean shutdown, as any sstable in tracker state may cause sstable manager to hang when being stopped. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:39 -03:00
Raphael S. Carvalho	a70ce7ecb3	compaction: Add normalized backlog metric Normalized backlog metric is important for understanding the controller behavior as the controller acts on normalized backlog for yielding an output, not the raw backlog value in bytes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:33 -03:00
Raphael S. Carvalho	89eb563c94	compaction: make size_tiered_compaction_strategy static Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:29 -03:00
Tomasz Grabiec	e68cf55514	utils: cached_file: Fix alloc-dealloc mismatch during eviction on_evicted() is invoked in the LSA allocator context, set in the reclaimer callback installed by the cache_tracker. However, cached_pages are allocated in the standard allocator context (note: page content is allocated inside LSA via lsa_buffer). The LSA region will happily deallocate these, thinking that these are large objects which were delegated to the standard allocator. But the _non_lsa_memory_in_use metric will underflow. When it underflows enough, shard_segment_pool.total_memory() will become 0 and memory reclamation will stop doing anything, leading to apparent OOM. The fix is to switch to the standard allocator context inside cached_page::on_evicted(). evict_range() was also given the same treatment as a precaution, it currently is only invoked in the standard allocator context. Fixes #10056	2022-02-23 18:38:05 +01:00
Asias He	680195564d	repair: Unify repair uuid report in the log More and more places are using the repair[uuid]: format for logging repair jobs with the uuid. Convert more places to use the new format to unify the log format. This makes it easier to grep a specific repair job in the log. Closes #10125	2022-02-23 09:13:12 +02:00
Avi Kivity	cbba80914d	memtable: move to replica module and namespace Memtables are a replica-side entity, and so are moved to the replica module and namespace. Memtables are also used outside the replica, in two places: - in some virtual tables; this is also in some way inside the replica, (virtual readers are installed at the replica level, not the coordinator), so I don't consider it a layering violation - in many sstable unit tests, as a convenient way to create sstables with known input. This is a layering violation. We could make memtables their own module, but I think this is wrong. Memtables are deeply tied into replica memory management, and trying to make them a low-level primitive (at a lower level than sstables) will be difficult. Not least because memtables use sstables. Instead, we should have a memtable-like thing that doesn't support merging and doesn't have all other funky memtable stuff, and instead replace the uses of memtables in sstable tests with some kind of make_flat_mutation_reader_from_unsorted_mutations() that does the sorting that is the reason for the use of memtables in tests (and live with the layering violation meanwhile). Test: unit (dev) Closes #10120	2022-02-23 09:05:16 +02:00
Avi Kivity	5d4213e1b8	Update seastar submodule * seastar c18cc5dc68...ea6a6820ed (7): > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz Fixes #9061. > Merge "Export IO rate-limiter tokens metrics" from Pavel E > Merge "Fix block device configuration and RWF_NOWAIT support" from Pavel E > code: Sanitize fs magic values usage > Fix build error with c++17 > coroutine: introduce coroutine::switch_to > Broader catching of bpo command-line parsing errors in app-template	2022-02-22 20:58:25 +03:00
Avi Kivity	75fb45df1b	Merge 'Propagate CQL coordinator timeouts and failures for reads' from Piotr Dulikowski This PR propagates the read coordinator logic so that read timeout and read failure exceptions are propagated without throwing on the coordinator side. This PR is only concerned with exceptions which were originally thrown by the coordinator (in read resolvers). Exceptions propagated through RPC and RPC timeouts will still throw, although those exceptions will be caught and converted into exceptions-as-values by read resolvers. This is a continuation of work started in #10014. Results of `perf_simple_query --smp 1 --operations-per-shard 1000000` (read workload), compared with merge base (`10880fb0a7`): ``` BEFORE: 125085.13 tps ( 80.2 allocs/op, 12.2 tasks/op, 49010 insns/op) 125645.88 tps ( 80.2 allocs/op, 12.2 tasks/op, 49008 insns/op) 126148.85 tps ( 80.2 allocs/op, 12.2 tasks/op, 49005 insns/op) 126044.40 tps ( 80.2 allocs/op, 12.2 tasks/op, 49005 insns/op) 125799.75 tps ( 80.2 allocs/op, 12.2 tasks/op, 49003 insns/op) AFTER: 127557.21 tps ( 80.2 allocs/op, 12.2 tasks/op, 49197 insns/op) 127835.98 tps ( 80.2 allocs/op, 12.2 tasks/op, 49198 insns/op) 127749.81 tps ( 80.2 allocs/op, 12.2 tasks/op, 49202 insns/op) 128941.17 tps ( 80.2 allocs/op, 12.2 tasks/op, 49192 insns/op) 129276.15 tps ( 80.2 allocs/op, 12.2 tasks/op, 49182 insns/op) ``` The PR does not introduce additional allocations on the read happy-path. The number of instructions used grows by about 200 insns/op. The increase in TPS is probably just a measurement error. 
Closes #10092 * github.com:scylladb/scylla: indexed_table_select_statement: return some exceptions as exception messages result_combinators: add result_wrap_unpack select_statement: return exceptions as errors in execute_without_checking_exception_message select_statement: return exceptions without throwing in do_execute select_statement: implement execute_without_checking_exception_message select_statement: introduce helpers for working with failed results query_pager: resultify relevant methods storage_proxy: resultify (do_)query storage_proxy: resultify query_singular storage_proxy: propagate failed results through query_partition_key_range storage_proxy: resultify query_partition_key_range_concurrent storage_proxy: modify handle_read_error to also handle exception containers abstract_read_executor: return result from execute() abstract_read_executor: return and handle result from has_cl() storage_proxy: resultify handling errors from read-repair abstract_read_executor::reconcile: resultise handling of data_resolver->done() abstract_read_executor::execute: resultify handling of data_resolver->done() result_combinators: add result_discard_value abstract_read_executor: resultify _result_promise abstract_read_executor: return result from done() abstract_read_resolver: fail promises by passing exception as value abstract_read_resolver: resultify promises exceptions: make it possible to return read_{timeout,failure}_exception as value result_try: add as_inner/clone_inner to handle types result_try: relax ConvertWithTo constraint exception_container: switch impl to std::shared_ptr and make copyable result_loop: add result_repeat result_loop: add result_do_until result_loop: add result_map_reduce utils/result: add utilities for checking/creating rebindable results	2022-02-22 20:58:25 +03:00
Nadav Har'El	eec39e1258	Merge 'api: keyspace_scrub: validate params' from Benny Halevy Refs #10087 Add validation of all params for the keyspace_scrub api. The validation method is generic and should be used by all apis eventually, but I'm leaving that as follow-up work. While at it, fixed the exception types thrown on invalid `scrub_mode` or `quarantine_mode` values from `std::runtime_error` to `httpd::bad_param_exception` so to generate the `bad_request` http status. And added unit tests to verify that, and the handling of an unknown parameter. Test: unit(dev) DTest: nodetool_additional_test.py::TestNodetool::{test_scrub_with_one_node_expect_data_loss,test_scrub_with_multi_nodes_expect_data_rebuild,test_scrub_sstable_with_invalid_fragment,test_scrub_ks_sstable_with_invalid_fragment,test_scrub_segregate_sstable_with_invalid_fragment,test_scrub_segregate_ks_sstable_with_invalid_fragment} Closes #10090 * github.com:scylladb/scylla: api: storage_service: scrub: validate parameters api: storage_service: refactor parse_tables api: storage_service: refactor validate_keyspace test: rest_api: add test_storage_service_keyspace_scrub tests api: storage_service: scrub: throw httpd::bad_param_exception for invalid param values	2022-02-22 20:58:25 +03:00
Nadav Har'El	364bd00136	test/cql-pytest: confirm that table names cannot include non-Latin letters In CQL table names must be composed only of letters, digits, or underscores, but some Cassandra documentation is unclear whether these "letters" refer only to the Latin alphabet, or maybe UTF-8 names composed of letters in other alphabets should be allowed too. This patch adds a test that confirms that both Scylla and Cassandra only accept the Latin alphabet in table names, and for example UTF-8 names with French or Hebrew letters are rejected. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220222134220.972413-1-nyh@scylladb.com>	2022-02-22 20:58:25 +03:00
Nadav Har'El	1a940a1003	test/cql-pytest: remove "xfail" mark from scientific-notation tests that now pass After issue #10100 was fixed, the two tests reproducing it now pass, so remove their "xfail" marker. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220222131809.970592-1-nyh@scylladb.com>	2022-02-22 20:58:25 +03:00
Nadav Har'El	be84a8def3	Merge 'Allow integers in scientific format in `INSERT JSON` ' from Piotr Grabowski Add support for specifying integers in scientific format (for example 1.234e8) in INSERT JSON statement: ``` INSERT INTO table JSON '{"int_column": 1e7}'; ``` Before the JSON parsing library was switched to RapidJSON from JsonCpp, this statement used to work correctly, because JsonCpp transparently casts double to integer value. Inserting a floating-point number ending with .0 is allowed, as the fractional part is zero. Non-zero fractional part (for example 12.34) is disallowed. A new test is added to test all those behaviors. This behavior differs from Cassandra, which disallows those types of numbers (1e7, 123.0 and 12.34), however some users rely on that behavior and JSON specification itself does not distinguish between floating-point numbers and integer numbers (only a single "number" type is defined). This PR also fixes two minor issues I noticed while looking at the code: wrong blob validation and missing `IsString()` checks that could result in assertion error. Fixes #10100 Fixes #10114 Fixes #10115 Closes #10101 * github.com:scylladb/scylla: type_json: support integers in scientific format type_json: add missing IsString() checks type_json: fix wrong blob JSON validation	2022-02-22 20:58:25 +03:00
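The acceptance rule described in be84a8def3 (1e7 and 123.0 are accepted as integers, 12.34 is rejected) can be sketched roughly as follows. This is an illustration only: the function name `integral_json_number` is hypothetical, it operates on a plain `double` rather than on RapidJSON values, and the range check against the 64-bit integer limits is an assumption that the commit message does not spell out.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical helper (not the type_json code): accept a JSON number that was
// parsed as a double only if it denotes a whole number that fits in int64_t.
std::optional<std::int64_t> integral_json_number(double value) {
    double int_part = 0.0;
    // std::modf returns the fractional part and stores the integral part.
    if (!std::isfinite(value) || std::modf(value, &int_part) != 0.0) {
        return std::nullopt;  // NaN, infinity, or a non-zero fraction such as 12.34
    }
    // Assumption: values outside [-2^63, 2^63) are rejected instead of wrapped.
    if (int_part < -9223372036854775808.0 || int_part >= 9223372036854775808.0) {
        return std::nullopt;
    }
    return static_cast<std::int64_t>(int_part);
}
```

Under these assumptions, `integral_json_number(1e7)` and `integral_json_number(123.0)` yield integer values, while `integral_json_number(12.34)` yields an empty optional.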
Botond Dénes	3aa05f7f03	Merge "Make system.clients table virtual" from Pavel Emelyanov " The table lists connected clients. For this the clients are stored in real table when they connect, update their statuses when needed and remove^w tombstone themselves when they disconnect. On start the whole table is cleared. This looks weird. Here's another approach (inspired by the hackathon project) that makes this table a pure virtual one. The schema is preserved so is the data returned. The benefits of doing it virtual are - no on-disk updates while processing clients - no potentially failing updates on non-failing disconnect - less usage of the global qctx thing - less calls to global storage_proxy - simpler support for thrift and alternator clients (today's table implementation doesn't track them) - the need to make virtual tables reg/unreg dynamic branch: https://github.com/xemul/scylla/tree/br-clients-virtual-table-4 tests: manual(dev), unit(dev) The manual test used 80-shards node and 1M connections from 1k different IP addresses. " * 'br-clients-virtual-table-4' of https://github.com/xemul/scylla: test: Add cql-pytest sanity test for system.clients table client_data: Sanitize connection_notifier transport: Indentation fix after previous patch code: Remove old on-disk version of system.clients table system_keyspace: Add clients_v virtual table protocol_server: Add get_client_data call transport: Track client state for real transport: Add stringifiers to client_data class generic_server: Gentle iterator generic_server: Type alias docs: Add system.clients description	2022-02-22 20:58:25 +03:00
Piotr Dulikowski	ddf049738d	indexed_table_select_statement: return some exceptions as exception messages Adjusts the indexed_table_select_statement so that it uses the result-aware methods in storage_proxy and propagates failed results as result_message::exception.	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	091b20019b	result_combinators: add result_wrap_unpack Adds a helper combinator utils::result_wrap_unpack which, in contrast to utils::result_wrap, uses futurize_apply instead of futurize_invoke to call the wrapped callable. In short, if utils::result_wrap is used to adapt code like this: f.then([] {}) -> f_result.then(utils::result_wrap([] {})) Then utils::result_wrap_unpack works for the following case: f.then_unpack([] (arg1, arg2) {}) -> f_result.then(utils::result_wrap_unpack([] (arg1, arg2) {}))	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	c5bcfee28f	select_statement: return exceptions as errors in execute_without_checking_exception_message Modifies the remaining logic of execute_without... (apart from the do_execute call) so that the result-aware versions of storage_proxy's methods are called and failed results are converted to result_message::exception.	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	5106c60cd0	select_statement: return exceptions without throwing in do_execute Modifies do_execute so that it uses the result-aware versions of the query_pager's methods and returns them as result_message::exception.	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	3a4d3f3175	select_statement: implement execute_without_checking_exception_message The select_statement will be able to propagate coordinator failures without throwing, so it's important to override the default implementations of execute and execute_without... so that the first calls the latter and not the other way around.	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	df7668797b	select_statement: introduce helpers for working with failed results Adds: - Includes for result-related helper methods (to be used in later commits), - Alias for coordinator_result, - The wrap_result_to_error_message function - a bit similar to utils::result_wrap. Adapts a callable T -> shared_ptr<result_message> to take result<T> -> shared_ptr<result_message>. If the result is failed, it converts it into result_message::exception and returns.	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	c96c8e4813	query_pager: resultify relevant methods Now, the relevant methods of all query pagers properly propagate failed results.	2022-02-22 16:25:21 +01:00
Piotr Dulikowski	e5922e650e	storage_proxy: resultify (do_)query Adjusts do_query so that it propagates and returns failed results. The query_result method is added which is result-aware, and the old query method was changed to call query_result.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	e39c5b6eba	storage_proxy: resultify query_singular Now, query_singular propagates and returns failed results without rethrowing them.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	2f5f746ae2	storage_proxy: propagate failed results through query_partition_key_range Now, query_partition_key_range propagates the failed result from query_partition_key_range_concurrent.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	608032b2b5	storage_proxy: resultify query_partition_key_range_concurrent Now, query_partition_key_range_concurrent propagates and returns exceptions as values, if possible.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	10923d9d58	storage_proxy: modify handle_read_error to also handle exception containers Now, storage_proxy::handle_read_error can work with both exception containers and exception_ptrs.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	89fe804a1a	abstract_read_executor: return result from execute()	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	15fa5e30f5	abstract_read_executor: return and handle result from has_cl() The has_cl() method is changed to return a future with a result. The result returned from has_cl() is handled without throwing.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	68b5b84fbe	storage_proxy: resultify handling errors from read-repair Now, failed results returned from read-repair are handled without throwing.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	dd860c70ce	abstract_read_executor::reconcile: resultise handling of data_resolver->done() Now, the logic of handling exceptions returned in reconcile() from data_resolver->done() was changed so that the failed result does not need to be converted to an exceptional future.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	5accfd8dae	abstract_read_executor::execute: resultify handling of data_resolver->done() Now, the logic of handling exceptions returned in execute() from data_resolver->done() was changed so that the failed result does not need to be converted to an exceptional future.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	a304fbfed3	result_combinators: add result_discard_value Adds a utils::result_discard_value, which is an alternative to future::discard_result which just ignores the "success" value of the provided result and does not ignore the exception.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	ee2c4725c3	abstract_read_executor: resultify _result_promise Adjusts the type of _result_promise so that it holds a result.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	e7f960d041	abstract_read_executor: return result from done()	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	28d562ddf6	abstract_read_resolver: fail promises by passing exception as value Now, on read timeouts and failures, _cl_promise and _done_promise are set to a failed result instead of an exceptional promise.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	5438973c9d	abstract_read_resolver: resultify promises Changes the types of _done_promise and _cl_promise so that they hold a result.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	4c58683102	exceptions: make it possible to return read_{timeout,failure}_exception as value Adds read_timeout_exception and read_failure_exception to the list of exceptions supported by the coordinator_exception_container. Those exceptions are not yet returned-as-value anywhere, but they will be in the commits that follow.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	1e1f5b4a48	result_try: add as_inner/clone_inner to handle types Adds two methods to result_try's exception handles: - as_inner: returns a {l,r}-value reference either to the exception container, or the exception_ptr. This allows them to be used in operations which work on both types, e.g. logging. - clone_inner: returns a copy of the underlying exception container or exception ptr.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	1ed416906e	result_try: relax ConvertWithTo constraint Currently, the catch handlers in result_futurize_try are required to return a future, although they are always being called with seastar::futurize_invoke, so if their result is not future it could be converted to one anyway. This commit relaxes the ConvertsWithTo constraint in order to allow this conversion.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	e87cf08591	exception_container: switch impl to std::shared_ptr and make copyable The exception_container is supposed to be a cheaper, but possibly harder to use alternative to std::exception_ptr. Before this commit, the exception was kept behind foreign_ptr<std::unique_ptr<>> so that moving the container is very cheap. However, the original std::exception_ptr supports copying in a thread-safe manner, and it turns out that some of the read coordinator logic intentionally copies the pointer in order to be able to fail two different promises with the same exception. The pointer type is changed to std::shared_ptr. Although it uses atomics for reference counting, this is also probably what std::exception_ptr does, so the performance should not be worse. The exception stored inside the container is immutable, so this allows for a non-throwing implementation of copying. To encourage moves instead of copying, the copy constructor is deleted and instead the `clone()` method should be used if it is really necessary.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	7afea88dfc	result_loop: add result_repeat Adds a result-aware counterpart to seastar::repeat. The new function is not based on seastar::repeat, but rather is a rewrite of the original (using a coroutine instead of an open-coded task). The main consequence of using a coroutine is that exceptions from AsyncAction need to be thrown once more.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	32cbc89779	result_loop: add result_do_until Adds a result-aware counterpart to seastar::do_until. The new function is not based on seastar::do_until, but rather is a rewrite of the original (using a coroutine instead of an open-coded task). The main consequence of using a coroutine is that exceptions from StopCondition or AsyncAction need to be thrown once more.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	4f0a98a829	result_loop: add result_map_reduce Adds result-aware counterparts to all seastar::map_reduce overloads. Fortunately, it was possible to implement the functions by basing them on seastar::map_reduce and get the same number of allocations. The only exception happens when reducer::get() returns a non-ready future, which doesn't seem to happen on the read coordinator path.	2022-02-22 16:08:52 +01:00
Piotr Dulikowski	b3a0480439	utils/result: add utilities for checking/creating rebindable results Adds: - ResultRebindableTo<L, R>: concept which is satisfied by a pair of results which do not necessarily share the same value, but have the same error and policy types; a failed result L can be converted to a failed result R. - rebind_result<T, R>: given a value type T and another result R, returns a result which can hold T as value and both the same error and policy as R.	2022-02-22 16:08:45 +01:00
Piotr Grabowski	efe7456f0a	type_json: support integers in scientific format Add support for specifying integers in scientific format (for example 1.234e8) in INSERT JSON statement: INSERT INTO table JSON '{"int_column": 1e7}'; Inserting a floating-point number ending with .0 is allowed, as the fractional part is zero. Non-zero fractional part (for example 12.34) is disallowed. A new test is added to test all those behaviors. Before the JSON parsing library was switched to RapidJSON from JsonCpp, this statement used to work correctly, because JsonCpp transparently casts double to integer value. This behavior differs from Cassandra, which disallows those types of numbers (1e7, 123.0 and 12.34). Fix typo in if condition: "if (value.GetUint64())" to "if (value.IsUint64())". Fixes #10100	2022-02-22 12:55:38 +01:00
Avi Kivity	d1a394fd97	loading_cache: fix indentation of timestamped_val and two nested type aliases timestamped_val (and two other type aliases) are nested inside loading_cache, but indented as if they were top-level names. Adjust the indent to avoid confusion. Closes #10118	2022-02-22 12:20:36 +02:00
Botond Dénes	2afacf9609	mutation_reader: drop now unused v1 multishard_combining_reader and friends Friends: shard_reader and evictable_reader. All these have been supplanted by their respective v2 variants. Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220222071925.223718-1-bdenes@scylladb.com>	2022-02-22 10:51:08 +03:00
Pavel Emelyanov	dfb980e5f5	Merge 'compaction_manager: allow stopping sleeping tasks' from Benny Halevy Use exponential_backoff_retry::retry(abort_source&) when sleeping between retries and request abort when the task is stopped. Fixes #10112 Test: unit(dev) Closes #10113 * github.com:scylladb/scylla: compaction_manager: allow stopping sleeping tasks compaction_manager: task: add make_compaction_stopped_exception compaction_manager: task: refactor stop	2022-02-22 10:39:47 +03:00
Wojciech Mitros	7f590a3686	sstables: index_reader: optimize single partition reads All entries from a single partition can be found in a single summary page. Because of that, in cases when we know we want to read only one partition, we can limit the underlying file input_stream to the range of the page. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-02-22 02:16:52 +01:00
Wojciech Mitros	c81992c665	sstables: use read-aheads in the index reader Currently, when advancing one of index_reader's bounds, we're creating a new index_consume_entry_context with a new underlying file input_stream for each new page. For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing. This patch adds an index_consume_entry_context to each of index_reader's bounds, so that for each new page, the same file input_stream is used. As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the input_stream's read aheads, decreasing the number of blocking reads and increasing the throughput of the index_reader. Fixes #2388 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-02-22 01:51:33 +01:00
Benny Halevy	57f97046a7	compaction_manager: allow stopping sleeping tasks Use exponential_backoff_retry::retry(abort_source&) when sleeping between retries and request abort when the task is stopped. Fixes #10112 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-21 21:01:56 +02:00
Benny Halevy	f21b985872	compaction_manager: task: add make_compaction_stopped_exception Provide a function to make a sstables::compaction_stopped_exception based on the information in the stopped task. To be reused by the next patch that will also throw this exception from the retry sleep path, when the task is stopped. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-21 18:09:49 +02:00
Benny Halevy	91514c20ec	compaction_manager: task: refactor stop Refactor compaction_manager::task::stop out of compaction_manager::task_stop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-21 18:04:06 +02:00
Piotr Grabowski	649ab70936	type_json: add missing IsString() checks Add missing IsString() checks to parsing date, time, uuid and inet types by introducing validated_to_string_view function which checks whether the value is of string type and otherwise throws marshal_exception. Without this check, a malformed input to those types would result in nasty ServerError with RapidJSON assertion instead of marshal_exception with detail about the problem. Add new tests checking passing non-string values for those types. Fixes #10115	2022-02-21 16:58:13 +01:00
Piotr Grabowski	f8b67c9bd1	type_json: fix wrong blob JSON validation Fixes wrong condition for validating whether a JSON string representing blob value is valid. Previously, strings such as "6" or "0392fa" would pass the validation, even though they are too short or don't start with "0x". Add those test cases to json_cql_query_test.cc. Fixes #10114	2022-02-21 16:58:12 +01:00
Botond Dénes	10880fb0a7	tools/scylla-sstable: fix description template Quote '{' and '}' used in CQL example, so format doesn't try to interpret it. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220221140652.173015-1-bdenes@scylladb.com>	2022-02-21 17:14:41 +02:00
Nadav Har'El	7181a6757a	test/cql-pytest: add a couple of tests for static columns This patch adds two tests for two interesting edge cases in the behavior of static columns in Scylla. We already have a lot of tests for static columns in other frameworks (C++ unit tests, cql and dtest), but the two cases here are issues where specifically we weren't sure how Cassandra behaves in those cases - and this can most easily be checked in the test/cql-pytest framework. The first test, test_static_not_selected, is a reproducer for issue #10091. This issue was reported by a user @aohotnik, who was surprised by the fact that Scylla returns empty values, instead of nothing, when selecting regular columns of a non-existent row if the partition has a static column set. The test demonstrates a difference between Scylla and Cassandra, so it is marked "xfail" - it passes on Cassandra and fails on Scylla. If later we decide that both Scylla's and Cassandra's behaviours are reasonable and both can be considered "correct", we can change this test to accept Scylla's result as well and it will begin to pass. The second test, test_missing_row_with_static, shows that SELECT of a non-existent row returns nothing - even if the partition has a static column. The behavior in this case is identical in Scylla and Cassandra, so this test passes. This contrasts with the analogous situation in LWT UPDATE from issue #10081, where the IF condition is expected to see the static column value. Refs #10081 Refs #10091 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220220120418.831540-1-nyh@scylladb.com>	2022-02-21 16:04:57 +02:00
Avi Kivity	adc08d0ab9	Merge "Drop v1 input support for mutation compactor" from Botond " Currently the mutation compactor supports v1 and v2 input and has a v1 output. The next step is to add a v2 output but this would lead to a full conversion matrix which we want to avoid. So in preparation we drop the v1 input support. Most inputs were already v2, but there were some notable exceptions: tests, the compacting reader and the multishard query code. The former two were a simple mechanical update but the latter required some further work because it turned out the v2 version of evictable reader wasn't used yet and thus it managed to hide some bugs and dropped features. While at it, we migrate all evictable and multishard reader users to the v2 variant of the respective readers and drop the v1 variant completely. With this the road is open to a v2 compactor output and therefore to a v2 sstable writer. Tests: unit(dev, release), dtest(paging_additional_test.py) " * 'compact-mutation-v2-only-input/v5' of https://github.com/denesb/scylla: test/lib/test_utils: return OK from check() variants repair/row_level: use evictable reader v2 db/view/view_updating_consumer: migrate to v2 test/boost/mutation_reader_test: add v2 specific evictable reader tests test: migrate to evictable reader v2 and multishard combining reader v2 compact_mutation: drop support for v1 input test: pass v2 input to mutation_compaction test/boost/mutation_test: simplify test_compaction_data_stream_split test mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones multishard_mutation_query: migrate to v2 mutation_fragment_v2: range_tombstone_change: add memory_usage() evictable_reader_v2: terminate active range tombstones on reader recreation evictable_reader_v2: restore handling of non-monotonically increasing positions evictable_reader_v2: simplify handling of reader recreation mutation: counter_write_query: use v2 reader mutation: migrate consume() to v2 
mutation_fragment_v2,flat_mutation_reader_v2: mirror v1 concept organization mutation_reader: compacting_reader: require a v2 input reader db/view/view_builder: use v2 reader test/lib/flat_mutation_reader_assertions: adjust has_monotonic_positions() to v2 spec	2022-02-21 14:32:55 +02:00
Botond Dénes	841b982e51	test/lib/test_utils: return OK from check() variants The various require() and check() methods in test_utils.hh were introduced to replace BOOST_REQUIRE() and BOOST_CHECK() respectively in multi-shard concurrent tests, specifically those in tests/boost/multishard_mutation_query_test.cc. This was done literally, just replacing BOOST_REQUIRE() with require() and BOOST_CHECK() with check(). The problem is that check() is missing a feature BOOST_CHECK() had: while BOOST_CHECK() doesn't cause an immediate test failure, just logging an error if the condition fails, it remembers this failure and will fail the test in the end. check() did not have this feature and this caused potential errors to just be logged while the test could still pass fine, causing false-positive test passes. This patch fixes this by returning a [[nodiscard]] bool from the check() methods. The caller can & these together over all calls to check() methods and manually fail the test in the end. We choose this method over a hidden global (like BOOST_CHECK() does) for simplicity's sake.	2022-02-21 12:29:25 +02:00
Botond Dénes	4aa9b90ba9	repair/row_level: use evictable reader v2	2022-02-21 12:29:24 +02:00
Botond Dénes	05c48ee0cc	db/view/view_updating_consumer: migrate to v2 Not a completely mechanical transition. The consumer has to generate its mutation via a mutation_rebuilder_v2 as mutation fragment v2 cannot be applied to mutations directly yet.	2022-02-21 12:29:24 +02:00
Botond Dénes	014a23bf2a	test/boost/mutation_reader_test: add v2 specific evictable reader tests One is a reincarnation of the recently removed test_multishard_combining_reader_non_strictly_monotonic_positions. The latter was actually targeting the evictable reader but through the multishard reader, probably for historic reasons (evictable reader was part of the multishard reader family). The other one checks that active range tombstone changes are properly terminated when the partition ends abruptly after recreating the reader.	2022-02-21 12:29:24 +02:00
Botond Dénes	e3c618beba	test: migrate to evictable reader v2 and multishard combining reader v2 All reads are now using the v2 version of these readers, test them instead of the old v1.	2022-02-21 12:29:24 +02:00
Botond Dénes	f1e9e3b3b7	compact_mutation: drop support for v1 input	2022-02-21 12:29:24 +02:00
Botond Dénes	284ed9154f	test: pass v2 input to mutation_compaction	2022-02-21 12:29:24 +02:00
Botond Dénes	dec4e5659b	test/boost/mutation_test: simplify test_compaction_data_stream_split test This test has very elaborate infrastructure essentially duplicating mutation, mutation::apply() and mutation::operator==. Drop all this extra code and use mutations directly instead. This makes migrating the test to v2 easier.	2022-02-21 12:29:24 +02:00
Botond Dénes	2941803da0	mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones The comment on the public methods calling said method promises to do so but doesn't actually follow through. This patch fixes this for row tombstones, to mirror the behaviour of the mutation compactor. This is especially important for tests that compare mutations compacted with different methods.	2022-02-21 12:29:24 +02:00
Botond Dénes	f2e2b84038	multishard_mutation_query: migrate to v2 Mostly mechanical transformation. The main difference is in the detached compaction state, from which we now get the range tombstone change, instead of the range tombstone list. The code around this is a bit awkward, will become simpler when compactor drops v1 support.	2022-02-21 12:29:24 +02:00
Botond Dénes	b330cba792	mutation_fragment_v2: range_tombstone_change: add memory_usage()	2022-02-21 12:29:24 +02:00
Botond Dénes	9e48237b86	evictable_reader_v2: terminate active range tombstones on reader recreation Reader recreation messes with the continuity of the mutation fragment stream because it breaks snapshot isolation. We cannot guarantee that a range tombstone or even the partition started before will continue after too. So we have to make sure to wrap up all loose threads when recreating the reader. We already close uncontinued partitions. This commit also takes care of closing any range tombstone started by unconditionally emitting a null range tombstone. This is legal to do, even if no range tombstone was in effect.	2022-02-21 12:29:24 +02:00
Botond Dénes	6db08ddeb2	evictable_reader_v2: restore handling of non-monotonically increasing positions We thought that unlike v1, v2 will not need this. But it does. Handled similarly to how v1 did it: we ensure each buffer represents forward progress, when the last fragment in the buffer is a range tombstone change: * Ensure the content of the buffer represents progress w.r.t. _next_position_in_partition, thus ensuring the next time we recreate the reader it will continue from a later position. * Continue reading until the next (peeked) fragment has a strictly larger position. The code is just much nicer because it uses coroutines.	2022-02-21 12:29:24 +02:00
Botond Dénes	498d03836b	evictable_reader_v2: simplify handling of reader recreation The evictable reader has a handful of flags dictating what to do after the reader is recreated: what to validate, what to drop, etc. We actually need a single flag telling us if the reader was recreated or not, all other things can be derived from existing fields. This patch does exactly that. Furthermore it folds do_fill_buffer() into fill_buffer() and replaces the awkward to use `should_drop_fragment()` with `examine_first_fragments()`, which does a much better job of encapsulating all validation and fragment dropping logic. This code reorganization also fixes two bugs introduced by the v2 conversion: * The loop in `do_fill_buffer()` could become infinite in certain circumstances due to a difference between the v1 and v2 versions of `is_end_of_stream()`. * The position of the first non-dropped fragment was not validated (this was integrated into the range tombstone trimming which was thrown out by the conversion).	2022-02-21 12:29:24 +02:00
Botond Dénes	d4ac473f7d	mutation: counter_write_query: use v2 reader	2022-02-21 12:27:55 +02:00
Botond Dénes	fcda35d08e	mutation: migrate consume() to v2 The underlying mutation format is still v1, so consume() ends up doing an online conversion. This allows converting all downstream code to v2, leaving the conversion close to the code that is yet to be migrated to v2 native: the mutation itself.	2022-02-21 12:27:55 +02:00
Botond Dénes	1fa6537a2f	mutation_fragment_v2,flat_mutation_reader_v2: mirror v1 concept organization Currently all concepts are in mutation_fragment_v2.hh and flat_mutation_reader_v2.hh. Organize concepts similar to how the v1 ones are: move high-level consume concepts into mutation_consumer_concepts.hh.	2022-02-21 12:27:55 +02:00
Botond Dénes	fb0e0ec7c1	mutation_reader: compacting_reader: require a v2 input reader Before we add a v2 output option to the compactor, we want to get rid of all the v1 inputs to make it simpler. This means that for a while the compacting reader will be in a strange place of having a v2 input and a v1 output. Hopefully, not for long.	2022-02-21 12:27:55 +02:00
Botond Dénes	45b36d91c6	db/view/view_builder: use v2 reader	2022-02-21 12:27:55 +02:00
Botond Dénes	bba20f5cce	test/lib/flat_mutation_reader_assertions: adjust has_monotonic_positions() to v2 spec The v2 spec allows for non-strictly monotonically increasing positions, but has_monotonic_positions() tried to enforce it. Relax the check so it conforms to the spec.	2022-02-21 12:27:55 +02:00
Benny Halevy	f5259e048c	test: sstable_compaction_test: stop compaction manager and test table using deferred action To make sure they are properly stopped also on exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220220120939.2362590-2-bhalevy@scylladb.com>	2022-02-21 12:06:32 +02:00
Benny Halevy	9a308bc496	test: lib: register_compaction: do not allow null table Require to pass the table to be compacted so register_compaction finds the real compaction state rather than making a bogus one. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220220120939.2362590-1-bhalevy@scylladb.com>	2022-02-21 12:06:32 +02:00
Yaron Kaikov	23bb0761bf	SCYLLA-VERSION-GEN:set release-version value length Noticed this issue during my debug sessions while building Scylla on x86 and Arm (https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/build/670/artifact/) from x86 log (https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/build/670/artifact/output-build-x86_64.txt) ``` [883/2823] cd tools/python3 && ./reloc/build_reloc.sh --version $(<../../build/SCYLLA-PRODUCT-FILE)-$(<../../build/SCYLLA-VERSION-FILE)-$(<../../build/SCYLLA-RELEASE-FILE) --nodeps --packages "python3-pyyaml python3-urwid python3-pyparsing python3-requests python3-pyudev python3-setuptools python3-psutil python3-distro python3-click python3-six" --pip-packages "scylla-driver geomet" 5.1.dev-0.20220209.23da2b58796 ``` from arm log (https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/build/670/artifact/output-build-aarch64.txt) ``` [244/2823] cd tools/python3 && ./reloc/build_reloc.sh --version $(<../../build/SCYLLA-PRODUCT-FILE)-$(<../../build/SCYLLA-VERSION-FILE)-$(<../../build/SCYLLA-RELEASE-FILE) --nodeps --packages "python3-pyyaml python3-urwid python3-pyparsing python3-requests python3-pyudev python3-setuptools python3-psutil python3-distro python3-click python3-six" --pip-packages "scylla-driver geomet" 5.1.dev-0.20220209.23da2b587 ``` Related to git config parameter core.abbrev which is not defined so default is set for auto (based: https://git-scm.com/docs/git-config#Documentation/git-config.txt-coreabbrev) Fixes: https://github.com/scylladb/scylla/issues/10108 Closes #10109	2022-02-21 13:28:04 +02:00
Pavel Emelyanov	49c5d5b7e8	Merge 'lister: add directory_lister' from Benny Halevy directory_lister provides a simpler interface compared to lister. After creating the directory_lister, its async get() method should be called repeatedly, returning a std::optional<directory_entry> each call, until it returns a disengaged entry or an error. This is especially suitable for coroutines as demonstrated in the unit tests that were added. For example: ```c++ auto dl = directory_lister(path); while (auto de = co_await dl.get()) { co_await process(de); } ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #9835 github.com:scylladb/scylla: sstable_directory: process_sstable_dir: use directory_lister sstable_directory: process_sstable_dir: fixup indentation sstable_directory: coroutinize process_sstable_dir lister: add directory_lister	2022-02-21 12:24:28 +03:00
Nadav Har'El	4349514064	test/alternator: add smaller reproducer for Limit-less reverse query The regression test we have for Alternator's issue #9487 (where a reverse query without a Limit given was broken into 100MB pages instead of the expected 1MB) is test_query.py::test_query_reverse_long. But this is a very long test requiring a 100MB partition, and because of its slowness isn't run by default. This patch adds another version of that test, test_query_reverse_longish, which reproduces the same issue #9487 with a partition 50 times shorter (2MB) so it only takes a fraction of a second and can be enabled by default. It also requires much less network traffic which is important when running these tests non-locally. We leave the original test test_query_reverse_long behind, it can be still useful to stress Scylla even beyond the 100MB boundary, but it remains in @veryslow mode so won't run in default test runs. Refs #9487 Refs #7586 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220220161905.852994-1-nyh@scylladb.com>	2022-02-21 09:12:16 +01:00
Avi Kivity	8eb5d6ed31	frozen_schema: avoid allocating contiguous memory A frozen schema can be quite large (in #10071 we measured 500 bytes per column, and there can be thousands of columns in extreme tables). This can cause large contiguous allocations and therefore memory stalls or even failures to allocate. Switch to bytes_ostream as the internal representation. Fortunately frozen_schema is internally implemented as bytes_ostream, so the change is minimal. Ref #10071. Test: unit (dev) Closes #10105	2022-02-21 01:39:02 +01:00
Amnon Heiman	c764f0d0f8	gms/gossiper.cc: Add gauge for live and unreachable nodes This patch adds two gauges: scylla_gossip_live - how many live nodes the gossiper sees scylla_gossip_unreachable - how many nodes the gossiper tries to connect to but cannot. Both metrics are reported once per node (i.e., per node, not per shard). This gives visibility into how a specific node sees the cluster. For example, take a split-brain 6-node cluster (3 and 3). Each node would report that it sees 2 nodes, but the monitoring system would see that there are, in fact, 6 nodes. Example of a two-node cluster, both running: `` scylla_gossip_live{shard="0"} 1.000000 scylla_gossip_unreachable{shard="0"} 0.000000 `` Example of a two-node cluster, one is down: `` scylla_gossip_live{shard="0"} 0.000000 scylla_gossip_unreachable{shard="0"} 1.000000 `` Fixes #10102 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #10103 [avi: remove whitespace change and correct spelling]	2022-02-20 19:42:58 +02:00
Wojciech Mitros	0a1500acd2	sstables: index_reader: remove unused members from index reader context The _file_name and _index_file fields in index_consume_entry_context are no longer used anywhere in the class (_file_name isn't even set, and _index_file was previously used when creating a promoted_index, which doesn't store the file object anymore)	2022-02-20 16:24:27 +01:00
Nadav Har'El	6476a64185	test/cql-pytest: reproducer for JSON scientific-notation integer problem The JSON standard specifies numbers without making a distinction of what is "an integer" and what is "floating point". The value 1e6 is a valid number, and although it is customary in C that 1e6 is a floating-point constant, as a JSON constant there is nothing inherently "non-integer" about it - it is a whole number. This is why I believe CQL commands such as CREATE TABLE t(pk int PRIMARY KEY, v int); INSERT INTO t JSON '{"pk": 1, "v": 1e6}'; should be allowed, as 1e6 is a whole number and fits in the range of Scylla's int. The included tests show that, unfortunately, 1e6 is not currently allowed to be assigned to an integer. The tests currently fail on both Scylla and Cassandra - and we believe this failure to be a bug in both, so the tests are marked with xfail (known to fail) and cassandra-bug (known failure on Cassandra considered to be a bug). Refs #10100 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220220141602.843783-1-nyh@scylladb.com>	2022-02-20 17:01:22 +02:00
Nadav Har'El	d3ac9a5790	Merge 'cql3: expr: Fix expr::visit so that it works with references' from Jan Ciołek There is a bug in `expr::visit`. When trying to return a reference from a visitor it actually returns a reference to some temporary location. So trying to do something like: ```c++ const expression e = new_bind_variable(123); const bind_variable& ref = visit(overloaded_functor { [](const bind_variable& bv) -> const bind_variable& { return bv; }, [](const auto&) -> const bind_variable& { throw std::runtime_error("Unreachable"); } }, e); std::cout << ref << std::endl; ``` Would actually print a random stack location instead of the value inside of `e`. Additionally trying to return a non-const reference doesn't compile. Current implementation of `expr::visit` is: ```c++ auto visit(invocable_on_expression auto&& visitor, const expression& e) { return std::visit(visitor, e._v->v); } ``` For reference, `std::visit` looks like this: ```c++ template<typename _Res, typename _Visitor, typename... _Variants> constexpr _Res visit(_Visitor&& __visitor, _Variants&&... __variants) { return std::__do_visit<_Res>(std::forward<_Visitor>(__visitor), std::forward<_Variants>(__variants)...); } ``` The problem is that `auto` can evaluate to `int` or `float`, but not to `int&`. It has now been changed to `decltype(auto)`, which is able to express references. I also added a missing `std::forward` on the visitor argument. The new version looks like this: ```c++ template <invocable_on_expression Visitor> decltype(auto) visit(Visitor&& visitor, const expression& e) { return std::visit(std::forward<Visitor>(visitor), e._v->v); } ``` I added some tests of `expr::visit` in `boost/expr_test`, but sadly they are not as thorough as they could be. Ideally I could return a reference from `std::visit` and `expr::visit` and then check that they both point to the same address in memory. I can't do this because it would require accessing a private field of `expression`. 
Some tests pass before the fix, even though they shouldn't, but I'm not sure how to make them better without making a field of expression public. I played around with some code, it can be found here: https://github.com/cvybhu/attached-files/blob/main/visit/visit_playground.cpp Closes #10073 * github.com:scylladb/scylla: cql3: expr: Add a test to show that std::forward is needed in expr::visit cql3: expr: add std::forward in expr::visit cql3: expr: Add tests for expr::visit cql3: expr: Fix expr::visit so that it works with references	2022-02-20 12:09:57 +02:00
Jan Ciolek	353ab8f438	cql3: expr: Add a test to show that std::forward is needed in expr::visit Adds a test with a visitor that can only be used as an rvalue. Without std::forward in expr::visit this test doesn't compile. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-18 14:19:49 +01:00
Jan Ciolek	7234cc851c	cql3: expr: add std::forward in expr::visit expr::visit was missing std::forward on the visitor. In cases where the visitor was passed as an rvalue it wouldn't be properly forwarded to std::visit. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-18 14:19:49 +01:00
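Why the missing `std::forward` matters can be shown with a visitor whose call operator is rvalue-qualified. The sketch below uses a plain `std::variant` as a stand-in for `expression`; `value`, `rvalue_only_visitor`, `forwarding_visit` and `visit_with_temporary` are hypothetical names, not Scylla code.

```cpp
#include <cassert>
#include <utility>
#include <variant>

// Hypothetical stand-in for the expression variant.
using value = std::variant<int, double>;

// A visitor that may only be invoked as an rvalue: operator() is &&-qualified.
struct rvalue_only_visitor {
    int operator()(int i) && { return i; }
    int operator()(double) && { return -1; }
};

// Forwards the visitor with its original value category. Passing `visitor`
// without std::forward would hand std::visit an lvalue, and the &&-qualified
// call operators above would then not be callable, so this would not compile.
template <typename Visitor>
decltype(auto) forwarding_visit(Visitor&& visitor, const value& v) {
    return std::visit(std::forward<Visitor>(visitor), v);
}

// Usage: the temporary visitor stays an rvalue all the way to the call.
int visit_with_temporary(const value& v) {
    return forwarding_visit(rvalue_only_visitor{}, v);
}
```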
Jan Ciolek	46367eec55	cql3: expr: Add tests for expr::visit Add tests for new expr::visit to ensure that it is working correctly. expr::visit had a hidden bug where trying to return a reference actually returned a reference to freed location on the stack, so now there are tests to ensure that everything works. Sadly the test `expr_visit_const_ref` also passes before the fix, but at least expr_visit_ref doesn't compile before the fix. It would be better to test this by taking references returned by std::visit and expr::visit and checking that they point to the same address in memory, but I can't do this because I would have to access a private field of expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-18 14:16:55 +01:00
Pavel Emelyanov	9c06897ec3	test: Add cql-pytest sanity test for system.clients table Check that SELECT {columns} FROM system.clients returns back only local connection of cql type (because there are no others during the test). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 15:02:26 +03:00
Pavel Emelyanov	de6c60c1c9	client_data: Sanitize connection_notifier Now the connection_notifier is all gone, only the client_data bits are left. To keep it consistent -- rename the files. Also, while at it, brush up the header dependencies and remove the not really used constexprs for client states. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 15:02:26 +03:00
Pavel Emelyanov	d63ba87266	transport: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 15:02:26 +03:00
Pavel Emelyanov	971c431a23	code: Remove old on-disk version of system.clients table This includes most of the connection_notifier stuff as well as the auxiliary code from system_keyspace.cc and a bunch of updating calls from the client state changing. Other than less code and less disk updates on clients connection paths, this removes one usage of the nasty global qctx thing. Since the system.clients goes away rename the system.clients_v here too so the table is always present out there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 15:02:26 +03:00
Pavel Emelyanov	0c9ed01716	system_keyspace: Add clients_v virtual table This table mirrors the existing clients one but temporarily has its own name. The schema is the same as in system.clients. The table gets client_data's from the registered protocol servers, which in turn are obtained from the storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 15:02:26 +03:00
Pavel Emelyanov	7bc697ec99	protocol_server: Add get_client_data call The call returns a chunked_vector with client_data's. For now only the native transport implements it, others return empty vector. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:08 +03:00
Pavel Emelyanov	0046cdc6cb	transport: Track client state for real Right now when the client state changes the respective update is performed on the system.clients table. While doing it some bits from this state are lost from the in-memory structures. For the sake of exporting this information we need to track whether the connected client goes authenticating or is already ready. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:08 +03:00
Pavel Emelyanov	00ce9b1c36	transport: Add stringifiers to client_data class There are two fields on the client_data that are not mapped to string with the help of standard fmt library. Add two methods that turn client state and type into strings. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:08 +03:00
Pavel Emelyanov	f035313b16	generic_server: Gentle iterator Add the ability to iterate over the list of connections in a "gentle" manner, i.e. -- preempting the loop when required. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:08 +03:00
Pavel Emelyanov	661c12066b	generic_server: Type alias For simpler future patching Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:07 +03:00
Pavel Emelyanov	d586805054	docs: Add system.clients description There's a document that sums up the tables from system keyspace and it's missing the clients table. This set is going to reimplement the table keeping the schema intact, so it's a good time to document it right at the beginning. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:07 +03:00
Nadav Har'El	f292d3d679	alternator: make schema modifications in CreateTable atomic The Alternator CreateTable operation currently performs several schema-changing operations separately - one by one: It creates a keyspace, a table in that keyspace and possibly also multiple views, and it sets tags on the table. A consequence of this is that concurrent CreateTable and DeleteTable operations (for example) can result in unexpected errors or inconsistent states - for example CreateTable wants to create the table in the keyspace it just created, but a concurrent DeleteTable deleted it. We have two issues about this problem (#6391 and #9868) and three tests (test_table.py::test_concurrent_create_and_delete_table) reproducing it. In this patch we fix these problems by switching to the modern Scylla schema-changing API: Instead of doing several schema-changing operations one by one, we create a vector of schema mutations performing all these operations - and then perform all these mutations together. When the experimental Raft-based schema modification is enabled, this completely solves the races, and the tests begin to pass. However, if the experimental Raft mode is not enabled, these tests continue to fail because there is still no locking while applying the different schema mutations (not even on a single node). So I put a special fixture "fails_without_raft" on these tests - which means that the tests xfail if run without raft, and are expected to pass when run with Raft. Indeed, after this patch test/alternator/run --raft test_table.py::test_concurrent_create_and_delete_table shows three passing tests (they also pass if we drastically increase the number of iterations), while test/alternator/run test_table.py::test_concurrent_create_and_delete_table shows three xfailing tests. All other Alternator tests pass as before with this patch, verifying that the handling of new tables, new views, tags, and CDC log tables, all happens correctly even after this patch. 
A note about the implementation: Before this patch, the CreateTable code used high-level functions like prepare_new_column_family_announcement(). These high-level functions become unusable if we write multiple schema operations to one list of mutations, because for example this function validates that the keyspace had already been created - when it hasn't and that's the whole point. So instead we had to use lower-level functions like add_table_or_view_to_schema_mutation() and before_create_column_family(). However, despite being lower level, these functions were public so I think it's reasonable to use them, and we probably have no other alternative. Fixes #6391 Fixes #9868 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-18 09:03:52 +02:00
Nadav Har'El	46120ca4f4	Merge 'tools/scylla-sstable: change output of dump commands to JSON' from Botond Dénes Replacing the previous text output with the exception of the dump-data command. The text output was supposed to be human-friendly but it is not really human friendlier than a well formatted JSON, the latter having the additional advantage of being machine friendly too. Although the text output already exists, having just one output format makes the code much simpler and easier to maintain so we chose not to pay the higher maintenance price for a format that is not expected to see much (if any) use. Although the JSON written by the tool is not formatted, it can easily be formatted by e.g. piping it through `jq`. The latter also allows lookup of specific field(s). The JSON schema of each command is documented in the --help output of the respective command (e.g. scylla sstable data-dump --help) . We keep the text output of the dump-data command as this is using scylla's built-in printer that we also use in logging and tests. Some people might be used to this format, so leave it in: the code already exists for it and lives in scylla core, so we don't need to maintain it separately. The default output-format of dump-data is now JSON. A smoke test suite is added for the dump commands too. The tests only check that some output is present and that it is valid JSON. Refs: #9882 Tests: unit(dev) Also on: https://github.com/denesb/scylla.git scylla-sstable-json/v2 Changelog v3: * Rebase on recent master (which has the required seastar fixes for debug tests) v2: * Document the JSON schema of each command. * Use the SAX-style API of rapidjson to generate streaming JSON, instead of hand-generating it. 
Closes #10074 * github.com:scylladb/scylla: test/cql-pytest: add tests for scylla-sstable's dump commands test/cql-pytest: prepare for tool tests tools/schema_loader: auto-create the keyspace for all statements tools/scylla-sstable: change output of dump-scylla-metadata to json tools/scylla-sstable: change output of dump-statistics to json tools/scylla-sstable: change output of dump-summary to json tools/scylla-sstable: change output of dump-compression-info to json tools/scylla-sstable: change output of dump-index to json tools/scylla-sstable: add json support in --dump-data tools/scylla-sstable: add json_writer tools/scylla-sstable: use fmt::print in --dump-data tools/scylla-sstable: prepare --dump-data for multiple output formats	2022-02-18 07:52:19 +02:00
Jan Ciolek	8676f60724	cql3: expr: Fix expr::visit so that it works with references expr::visit had a bug where if we wanted to return a reference in the visitor, the reference would be to a temporary stack location instead of the passed argument. So trying to do something like this: ``` const bind_variable& ref = visit(overloaded_functor { [](const bind_variable& bv) -> const bind_variable& { return bv; }, [](const auto&) -> const bind_variable& { ... } }, e); std::cout << ref << std::endl; ``` Would actually print a random location on stack instead of valid value inside of e. Additionally trying to return a non-const reference doesn't even compile. The problem was that the return type of expr::visit was defined as `auto`, which can be `int`, but not `int&`. This has been changed to `decltype(auto)` which can be both `int` and `int&` New version of `expr::visit` works for `const expression&` and `expression&` no matter what the visitor returns. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-17 17:29:28 +01:00
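The difference between an `auto` and a `decltype(auto)` return type described above can be reproduced outside Scylla. The following is a minimal sketch with a plain `std::variant` in place of `expression`; `value`, `int_ref_visitor`, `visit_auto` and `visit_decltype_auto` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical stand-in for the expression variant; not Scylla's type.
using value = std::variant<int, double>;

// Visitor whose overloads all return a reference.
struct int_ref_visitor {
    const int& operator()(const int& i) const { return i; }
    const int& operator()(const double&) const {
        static const int fallback = 0;
        return fallback;
    }
};

// Plain `auto` deduces a value type, so the returned reference decays to a copy.
template <typename Visitor>
auto visit_auto(Visitor&& visitor, const value& v) {
    return std::visit(std::forward<Visitor>(visitor), v);
}

// `decltype(auto)` keeps the exact type of the returned expression.
template <typename Visitor>
decltype(auto) visit_decltype_auto(Visitor&& visitor, const value& v) {
    return std::visit(std::forward<Visitor>(visitor), v);
}

static_assert(std::is_same_v<
    decltype(visit_auto(int_ref_visitor{}, std::declval<const value&>())), int>);
static_assert(std::is_same_v<
    decltype(visit_decltype_auto(int_ref_visitor{}, std::declval<const value&>())), const int&>);

// The reference obtained through decltype(auto) points into the variant itself.
void check_reference_is_preserved() {
    const value v = 42;
    const int& ref = visit_decltype_auto(int_ref_visitor{}, v);
    assert(&ref == std::get_if<int>(&v));
}
```

This address comparison is the kind of check the commit message says could not be written against `expression` without access to its private field; with a bare variant nothing is hidden, so it is straightforward.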
Botond Dénes	1e038b40cf	test/cql-pytest: add tests for scylla-sstable's dump commands The tests are smoke-tests: they mostly check that scylla doesn't crash while dumping and it produces some output. When dumping json, the test checks that it is valid json.	2022-02-17 15:24:24 +02:00
Botond Dénes	afab1a97c6	test/cql-pytest: prepare for tool tests We want to add tool tests. These tests will have to invoke scylla executable (as tools are hosted by the latter) and they want access to the scylla data directories. Propagate the scylla path and data directory used from `run` into the test suite via pytest request parameters.	2022-02-17 15:24:24 +02:00
Botond Dénes	96082631c8	tools/schema_loader: auto-create the keyspace for all statements Currently the keyspace is only auto-created for create type statements. However the keyspace is needed even without UDTs being involved: for example if the table contains a collection type. So auto-create the keyspace unconditionally before preparing the first statement. Also add a test-case with a create table statement which requires the keyspace to be present at prepare time.	2022-02-17 15:24:24 +02:00
Botond Dénes	59ce247164	tools/scylla-sstable: change output of dump-scylla-metadata to json	2022-02-17 15:24:24 +02:00
Botond Dénes	2a7ed8212f	tools/scylla-sstable: change output of dump-statistics to json	2022-02-17 15:24:24 +02:00
Botond Dénes	a617e66878	tools/scylla-sstable: change output of dump-summary to json	2022-02-17 14:17:11 +02:00
Botond Dénes	fb6b7c8036	tools/scylla-sstable: change output of dump-compression-info to json	2022-02-17 14:17:11 +02:00
Botond Dénes	f5c6d7e12e	tools/scylla-sstable: change output of dump-index to json	2022-02-17 14:17:11 +02:00
Botond Dénes	bdbbda29c1	tools/scylla-sstable: add json support in --dump-data But keep the old text output-format too. One can switch between the two with the --output-format flag, which defaults to "json".	2022-02-17 14:17:11 +02:00
Botond Dénes	03bbf1b362	tools/scylla-sstable: add json_writer Wrapping a rapidjson::Writer<> and mirrors the latter's API, providing more convenient overloads for the Key() and String() methods, as well as providing some extra, scylla-sstable specific methods too.	2022-02-17 14:17:11 +02:00
Botond Dénes	72f27c8782	tools/scylla-sstable: use fmt::print in --dump-data The rest of the code is standardizing on fmt::print(), bring the code for --dump-data in line.	2022-02-17 14:17:11 +02:00
Botond Dénes	ba2a61b2bc	tools/scylla-sstable: prepare --dump-data for multiple output formats Extract the actual dumping code into a separate class, which also implements sstable_consumer interface. The dumping consumer now just forwards calls to actual dumper through the abstract consumer interface, allowing different concrete dumpers to be instantiated.	2022-02-17 14:17:11 +02:00
Piotr Dulikowski	adfd9d2f7a	abstract_read_resolver::fail_request: make non-virtual This method is not overridden by any of the derived classes, so it does not need to be virtual. (cherry picked from commit b7fb93dc46531bca8db535301a069df52991f9d9)	2022-02-17 12:34:37 +02:00
Michael Livshin	f8d4bafa5a	to_string.hh: include <map> The code uses `std::map`, so it should include the definition explicitly. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-17 08:53:48 +02:00
Michael Livshin	a657dc9787	scylla-gdb.py: set source language to c++ When you interrupt a process in gdb using Ctrl-C or attach gdb to a running process, usually gdb will show the current frame as `syscall()` (no source information). But in some less usual setups gdb may happen to know that `syscall()` is implemented in assembly, and even knows which line is current in which assembly file. An unfortunate effect of gdb knowing that the current frame's source language is assembly is that since assembly is not C++, gdb's expression parser switches to "auto" while in the `syscall()` stack frame. And in the "auto" language explicit C++ global namespace references like "::debug::the_database" are not syntactically valid, which renders much of scylla-gdb.py unusable unless you remember to go up the call stack before doing anything. But since scylla-gdb.py is there to help debug Scylla, and Scylla is written in C++, we can just set gdb source language to "c++" and avoid the problem. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220216235301.1206341-1-michael.livshin@scylladb.com>	2022-02-17 08:43:59 +02:00
Botond Dénes	948bc359c2	Merge "ME sstable format support" from Michael Livshin " This series implements support for the ME sstable format (introduced in C* 3.11.11). Tests: unit(dev) " * tag 'me-sstable-format-v5' of https://github.com/cmm/scylla: sstables: validate originating host id sstable: add is_uploaded() predicate config: make the ME sstable format default scylla-gdb.py: recognize ME sstables sstables: store originating host id in stats metadata system_keyspace: cache local host id before flushing database_test: ensure host id continuity sstables_manager: add get_local_host_id() method and support sstables_manager: formalize inheritability system_keyspace, main: load (or create) local host id earlier sstable_3_x_test: test ME sstable format too add "ME_SSTABLE" cluster feature add "sstable_format" config add support for the ME sstable format scylla-sstable: add ability to dump optionals and utils::UUID sstables: add ability to write and parse optionals globalize sstables::write(..., utils::UUID)	2022-02-16 18:28:16 +02:00
Michael Livshin	79bf79ebd3	sstables: validate originating host id Add an additional sstable validation step to check that originating host id matches the local host id. This is only done for ME-and-up sstables, which do not come from upload/, and when the local host id is known. When local host id is unknown, check that the sstable belongs to a system keyspace, i.e. whether it is plausible that Scylla is still booting up and hasn't loaded/generated the local host id yet. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	3511d7cd21	sstable: add is_uploaded() predicate Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	3bf1e137fc	config: make the ME sstable format default Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0ca58096cf	scylla-gdb.py: recognize ME sstables Also use the opportunity to unify two closely-related lists into a dictionary. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	dd4e330cc5	sstables: store originating host id in stats metadata With this change, ME sstables start carrying their originating host id, which makes ME format feature-complete so it can be made default. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0ccd56e036	system_keyspace: cache local host id before flushing Later in this series the ME sstable format is made default, which means that `system.local` will likely be written as ME. Since, in ME, originating host id is a part of sstable stats metadata, the local host id needs to either already be cached by the time `system.local` is flushed, or to somehow be special-case-ignored when flushing `system.local`. The former (done here) is optimistic (cache before flush), but the alternative would be an abstraction violation and would also cost a little time upon each sstable write. (Cache-before-flush could be undone by catching any exceptions during flush and un-caching, but inability to `co_await` in catch clauses makes the code look rather awkward. And there is no need to bother because bootstrap failures should be fatal anyway) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	d8cc535297	database_test: ensure host id continuity The "populate_from_quarantine_works" test case creates sstables with one db config, then reads them with another. Ensure that both configs have the same host id so the sstables pass validation. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	3fef604075	sstables_manager: add get_local_host_id() method and support Since ME sstable format includes originating host id in stats metadata, local host id needs to be made available for writing and validation. Both Scylla server (where local host id comes from the `system.local` table) and unit tests (where it is fabricated) must be accommodated. Regardless of how the host id is obtained, it is stored in the db config instance and accessed through `sstables_manager`. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0895188851	sstables_manager: formalize inheritability The class is already inherited from in tests (along with overriding a non-virtual method), so this seems to be called for. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	7d2af177eb	system_keyspace, main: load (or create) local host id earlier We want it to be cached before any sstable is written, so do it right after system_keyspace::minimal_setup(). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	387c882dc7	sstable_3_x_test: test ME sstable format too Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	d370558279	add "ME_SSTABLE" cluster feature Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0b1447c702	add "sstable_format" config Initialize it to "md" until ME format support is complete (i.e. storing originating host id in sstable stats metadata is implemented), so at present there is no observable change by default. Also declare "enable_sstables_md_format" unused -- the idea, going forward, being that only "sstable_format" controls the written sstable file format and that no more per-format enablement config options shall be added. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	c96708d262	add support for the ME sstable format The ME format has been introduced in Cassandra 3.11.11: `11952fae77/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java (L123)` `d84c6e9810` It adds originating host id to sstable metadata in support of fixing loss of commit log data when moving sstables between nodes: https://issues.apache.org/jira/browse/CASSANDRA-16619 In Scylla: * The supported way to ingest sstables is via upload/, where stored commit log replay position should be disregarded (but see https://github.com/scylladb/scylla/issues/10080). * A later commit in this series implements originating host id validation for native ME sstables. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	3712a82ca7	scylla-sstable: add ability to dump optionals and utils::UUID Needed for the ME sstable format. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	26bae0cd39	sstables: add ability to write and parse optionals (that is, instances of `std::optional`). The ME sstable format includes optional originating host id in stats metadata. We know how to write and parse uuids, but not how to write and parse optionals. The format is (used by C* in this case, and also happens to be consistent with how booleans are serialized): first a boolean indicating whether the contents are present (0 or 1, as a byte), then the contents (if any). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:23 +02:00
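The on-disk layout described above (a one-byte presence flag, 0 or 1, followed by the contents if present) can be sketched as follows. The payload here is a 32-bit integer written big-endian purely as a stand-in; the real payload is a UUID, and the byte order, the `bytes` alias and the names `write_optional` / `parse_optional` are assumptions of this illustration rather than Scylla's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using bytes = std::vector<std::uint8_t>;  // hypothetical output buffer type

// Writes the presence flag as a single byte, then the payload if present.
void write_optional(bytes& out, const std::optional<std::uint32_t>& v) {
    out.push_back(v ? 1 : 0);
    if (v) {
        // Assumption: big-endian payload, chosen only for the example.
        for (int shift = 24; shift >= 0; shift -= 8) {
            out.push_back(static_cast<std::uint8_t>(*v >> shift));
        }
    }
}

// Reads the flag byte at `pos`, then the payload only if the flag is set.
// Advances `pos` past everything consumed; throws std::out_of_range on
// truncated input (via vector::at).
std::optional<std::uint32_t> parse_optional(const bytes& in, std::size_t& pos) {
    const bool present = in.at(pos++) != 0;
    if (!present) {
        return std::nullopt;
    }
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v = (v << 8) | in.at(pos++);
    }
    return v;
}
```

An absent value therefore costs exactly one byte, and a present one costs one byte plus the payload size.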
Michael Livshin	c00d272b16	globalize sstables::write(..., utils::UUID) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:23 +02:00
Benny Halevy	5a63026932	api: storage_service: scrub: validate parameters Validate all parameters, rejecting unsupported parameters. Refs #10087 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-16 17:01:46 +02:00
Benny Halevy	16afde46e7	api: storage_service: refactor parse_tables Prepare for string-based parsing and validation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-16 16:53:18 +02:00
Benny Halevy	cce6810615	api: storage_service: refactor validate_keyspace Prepare for string-based validation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-16 16:53:18 +02:00
Benny Halevy	eef131ea10	test: rest_api: add test_storage_service_keyspace_scrub tests Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-16 16:53:16 +02:00
Benny Halevy	fc2e9abeba	api: storage_service: scrub: throw httpd::bad_param_exception for invalid param values Throwing std::runtime_error results in http status 500 (internal_server_error), but the problem is with the request parameters, not with the server. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-16 15:39:17 +02:00
Benny Halevy	b7b0c19fdc	test: uuid: cement the assumption that default and null uuid are equal Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220216081623.830627-2-bhalevy@scylladb.com>	2022-02-16 10:19:47 +02:00
Benny Halevy	489e50ef3a	utils: uuid: make operator bool explicit Following up on `69fcc053bb` To prevent unintentional implicit conversions e.g. to a number. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220216081623.830627-1-bhalevy@scylladb.com>	2022-02-16 10:19:47 +02:00
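What making `operator bool` explicit buys can be checked at compile time. This is a minimal sketch with a hypothetical two-field `uuid` struct, not `utils::UUID`; treating "non-null" as true is an assumption based on the null_uuid predicate mentioned in the earlier commit.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical minimal UUID, for illustration only.
struct uuid {
    std::uint64_t most_sig_bits = 0;
    std::uint64_t least_sig_bits = 0;

    // Assumption: a UUID is "true" when it is not the null (all-zero) UUID.
    explicit operator bool() const noexcept {
        return most_sig_bits != 0 || least_sig_bits != 0;
    }
};

// With `explicit`, there is no implicit conversion to a number or to bool...
static_assert(!std::is_convertible_v<uuid, int>);
static_assert(!std::is_convertible_v<uuid, bool>);
// ...but direct initialization and contextual conversion still work.
static_assert(std::is_constructible_v<bool, uuid>);

bool is_set(const uuid& id) {
    if (id) {  // contextual conversion to bool is allowed
        return true;
    }
    return false;
}
```

Without `explicit`, an expression such as `id + 1` would compile by converting the UUID to bool and then to int, which is the kind of accident the commit guards against.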
Piotr Dulikowski	742f2abfd8	exception_container: do not throw in accept This commit changes the behavior of `exception_container::accept`. Now, instead of throwing an `utils::bad_exception_container_access` exception when the container is empty, the provided visitor is invoked with that exception instead. There are two reasons for this change: - The exception_container is supposed to allow handling exceptions without using the costly C++'s exception runtime. Although an empty container is an edge case, I think the new behavior is more aligned with the class' purpose. The old behavior can be simulated by providing a visitor which throws when called with bad access exception. - The new behavior fixes a bug in `result_try`/`result_futurize_try`. Before the change, if the `try` block returned a failed result with an empty exception container, a bad access exception would either be thrown or returned as an exceptional future without being handled by the `catch` clauses. Although nobody is supposed to return such result<>s on purpose, a moved out result can be returned by accident and it's important for the exception handling logic to be correct in such a situation. Tests: unit(dev) Closes #10086	2022-02-16 10:06:10 +02:00
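The new `accept` behavior (hand the bad-access exception to the visitor instead of throwing it) can be sketched with a small variant-based container. `exception_box`, `bad_container_access` and `describe` are hypothetical names for this illustration and do not reflect the real `utils::exception_container` implementation.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical counterpart of the bad-access exception.
struct bad_container_access : std::exception {
    const char* what() const noexcept override { return "empty container"; }
};

// Hypothetical container holding one of several exception types, or nothing.
template <typename... Exs>
class exception_box {
    std::variant<std::monostate, Exs...> _v;
public:
    exception_box() = default;
    template <typename Ex>
    exception_box(Ex ex) : _v(std::move(ex)) {}

    // Never throws on its own: an empty box is reported to the visitor
    // as a bad_container_access value.
    template <typename Visitor>
    decltype(auto) accept(Visitor&& visitor) const {
        return std::visit([&](const auto& held) -> decltype(auto) {
            if constexpr (std::is_same_v<std::decay_t<decltype(held)>, std::monostate>) {
                return visitor(bad_container_access{});
            } else {
                return visitor(held);
            }
        }, _v);
    }
};

// Example visitor handling both a stored exception and the empty case.
struct describe {
    std::string operator()(const std::runtime_error& e) const {
        return std::string("runtime_error: ") + e.what();
    }
    std::string operator()(const bad_container_access& e) const {
        return std::string("bad access: ") + e.what();
    }
};
```

As the commit message notes, the old behavior remains available: a visitor that throws when given the bad-access exception reproduces it.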
Nadav Har'El	7be3129458	cdc: don't need current keyspace to create the log table CDC registers to the table-creation hook (before_create_column_family) to add a second table - the CDC log table - to the same keyspace. The handler function (on_before_update_column_family() in cdc/log.cc) wants to retrieve the keyspace's definition, but that does NOT WORK if we create the keyspace and table in one operation (which is exactly what we intend to do in Alternator to solve issue #9868) - because at the time of the hook, the keyspace does not yet exist in the schema. It turns out that on_before_update_column_family() does not REALLY need the keyspace. It needed it to pass it on to make_create_table_mutations() but that function doesn't use the keyspace parameter passed to it! All it needs is the keyspace's name - which is in the schema anyway and doesn't need to be looked up. So in this patch we fix make_create_table_mutations() to not require the unused keyspace parameter - and fix the CDC code not to look for the keyspace that is no longer needed. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220215162342.622509-1-nyh@scylladb.com>	2022-02-16 08:38:56 +02:00
Benny Halevy	69fcc053bb	utils: uuid: add null_uuid and respective bool predicate and operator and unit test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220215113438.473400-1-bhalevy@scylladb.com>	2022-02-15 18:02:54 +02:00
Avi Kivity	817d1aade8	tools: toolchain: regenerate with libstdc++-11.2.1-9.fc34.x86_64	2022-02-15 18:02:54 +02:00
Benny Halevy	3e20fee070	cql3: result_set: remove std::ref from comperator& Applying std::ref on `RowComparator& cmp` hits the following compilation error on Fedora 34 with libstdc++-devel-11.2.1-9.fc34.x86_64 ``` FAILED: build/dev/cql3/statements/select_statement.o clang++ -MD -MT build/dev/cql3/statements/select_statement.o -MF build/dev/cql3/statements/select_statement.o.d -I/home/bhalevy/dev/scylla/seastar/include -I/home/bhalevy/dev/scylla/build/dev/seastar/gen/include -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_LOCALE -DFMT_SHARED -I/usr/include/p11-kit-1 -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_ENABLE_WASMTIME -iquote. -iquote build/dev/gen --std=gnu++20 -ffile-prefix-map=/home/bhalevy/dev/scylla=. -march=westmere -DBOOST_TEST_DYN_LINK -Iabseil -fvisibility=hidden -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-sometimes-uninitialized -Wno-return-stack-address -Wno-missing-braces -Wno-unused-lambda-capture -Wno-overflow -Wno-noexcept-type -Wno-error=cpp -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-defaulted-function-deleted -Wno-redeclared-class-member -Wno-unsupported-friend -Wno-unused-variable -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-uninitialized-const-reference -Wno-psabi -Wno-narrowing -Wno-array-bounds -Wno-nonnull -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DHAVE_LZ4_COMPRESS_DEFAULT -c -o build/dev/cql3/statements/select_statement.o cql3/statements/select_statement.cc In file included from cql3/statements/select_statement.cc:14: In file included from 
./cql3/statements/select_statement.hh:16: In file included from ./cql3/statements/raw/select_statement.hh:16: In file included from ./cql3/statements/raw/cf_statement.hh:16: In file included from ./cql3/cf_name.hh:16: In file included from ./cql3/keyspace_element_name.hh:16: In file included from /home/bhalevy/dev/scylla/seastar/include/seastar/core/sstring.hh:25: In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/algorithm:74: In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/pstl/glue_algorithm_defs.h:13: In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/functional:58: /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: error: exception specification of 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' uses itself = decltype(reference_wrapper::_S_fun(std::declval<_Up>()))> ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: note: in instantiation of exception specification for 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' requested here /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:321:2: note: in instantiation of default argument for 'reference_wrapper<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const 
std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' required here reference_wrapper(_Up&& __uref) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1017:57: note: while substituting deduced template arguments into function template 'reference_wrapper' [with _Up = __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, $1 = (no value), $2 = (no value)] = __bool_constant<__is_nothrow_constructible(_Tp, _Args...)>; ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1023:14: note: in instantiation of template type alias '__is_nothrow_constructible_impl' requested here : public __is_nothrow_constructible_impl<_Tp, _Args...>::type ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:153:14: note: in instantiation of template class 'std::is_nothrow_constructible<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here : public conditional<_B1::value, _B2, _B1>::type ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:298:11: note: (skipping 8 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all) return __and_<typename 
_Base::_Local_storage, ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1933:13: note: in instantiation of function template specialization 'std::__partial_sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here std::__partial_sort(__first, __last, __last, __comp); ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1954:9: note: in instantiation of function template specialization 'std::__introsort_loop<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, long, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here std::__introsort_loop(__first, __last, ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:4875:12: note: in instantiation of function template specialization 'std::__sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const 
std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here std::__sort(__first, __last, __gnu_cxx::__ops::__iter_comp_iter(__comp)); ^ ./cql3/result_set.hh:168:14: note: in instantiation of function template specialization 'std::sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>' requested here std::sort(_rows.begin(), _rows.end(), std::ref(cmp)); ^ cql3/statements/select_statement.cc:773:21: note: in instantiation of function template specialization 'cql3::result_set::sort<std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>' requested here rs->sort(_ordering_comparator); ^ 1 error generated. ninja: build stopped: subcommand failed. ``` Fixes #10079. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220215071955.316895-3-bhalevy@scylladb.com>	2022-02-15 10:57:23 +02:00
Benny Halevy	41b5c266db	cql3: result_set: add concept for RowComparator Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220215071955.316895-2-bhalevy@scylladb.com>	2022-02-15 10:57:19 +02:00
Benny Halevy	ee59b851b4	cql3: result_set: define internal types Define the types for column, row, and vector of rows and reuse correspondingly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220215071955.316895-1-bhalevy@scylladb.com>	2022-02-15 10:57:18 +02:00
Michael Livshin	04c1286a94	Add "me" sstables for the multi-format tests Prerequisite for the "ME sstable format support" series (which has been posted to the mailing list) -- to be merged or rejected together with that. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9939	2022-02-15 09:24:09 +02:00
Benny Halevy	67580c0855	sstables: get rid of remove_sstable_with_temp_toc It is unused since `e40aa042a7` (version 4.2) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214140029.1513522-2-bhalevy@scylladb.com>	2022-02-14 18:57:40 +02:00
Benny Halevy	e5fc4b6f5d	sstables: coroutinize remove_by_toc_name Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214140029.1513522-1-bhalevy@scylladb.com>	2022-02-14 18:57:39 +02:00
Nadav Har'El	4e3038b57f	alternator: add FIXME for schema changes requiring a loop In commit `a664ac7ba5`, the Alternator schema-modifying code (e.g., delete_table()) was reorganized to support the new Raft-based schema modifications. Schema modifications now work with an "optimistic locking" approach: We retrieve the current schema version id ("group0_guard"), read the current schema and verify that the changes we want can be made, and then apply them with mm.announce(group0_guard) - which will fail if the schema version is no longer current because some other concurrent modification beat us in the race. This means that we need to do this whole read-modify-write (group0_guard, checking the schema, creating mutations, calling mm.announce()) in a retry loop. We have such a loop in the CQL code but it's missing in the Alternator code. In this patch we don't add the loop yet, but add FIXMEs to remind us where it's missing. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220214154435.544125-1-nyh@scylladb.com>	2022-02-14 18:24:16 +02:00
Nadav Har'El	212c321c55	test/alternator: add reproducers for non-atomic table creation We add reproducing tests for two known Alternator issues, #6391 and #9868, which involve the non-atomicity of table creation. Creating a table currently involves multiple steps - creating a keyspace, a table, materialized views, and tags. If some of these steps succeed and some fail, we get an InternalServerError and potentially leave behind some half-built table. Both issues will be solved by making better use of the new Raft-based capabilities of making multiple modifications to the schema atomically, but this patch doesn't fix the problem - it just proves it exists. The new tests involve two threads - one repeatedly trying to create a table with a GSI or with tags - and the other thread repeatedly trying to delete the same table under its feet. Both bugs are reproduced almost immediately. Note that like all test/alternator tests, the new tests are usually run on just one node. So when we fix the bug and these tests begin to pass, it will not be a proof that concurrent schema modification works safely on different nodes. To prove that, we will also need a multi-node test. However, this test can prove that we used Raft-based schema modification correctly - and if we assume that the Raft-based schema modification feature is itself correct, then we can be sure that CreateTable will be correct also across multiple nodes. Still, it won't hurt to check it directly. Refs #6391 Refs #9868 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220207223100.207074-1-nyh@scylladb.com>	2022-02-14 18:21:21 +02:00
Benny Halevy	244df07771	large_data_handler: use only basename to identify the sstable SSTables may be created in one directory (e.g. staging) and be removed from another directory (base table dir, or quarantine if scrub moved them there), so identify the sstable by its unique component basename rather than the full path. Fixes #10075 Test: unit(dev) DTest: wide_rows_test.py (w/ https://github.com/scylladb/scylla-dtest/pull/2606) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214131923.1468870-1-bhalevy@scylladb.com>	2022-02-14 17:57:49 +02:00
Benny Halevy	19ea228cf8	replica: table: coroutinize move_sstables_from_staging Test: unit(dev) DTest: test_drop_mv_during_base_table_writes Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214102911.1314022-1-bhalevy@scylladb.com>	2022-02-14 17:52:27 +02:00
Benny Halevy	8f417b8021	sstable: coroutinize seal_sstable Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214105214.1337361-1-bhalevy@scylladb.com>	2022-02-14 17:49:52 +02:00
Benny Halevy	c75e63e480	sstable: coroutinize move_to_new_dir Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214154403.1590022-1-bhalevy@scylladb.com>	2022-02-14 17:47:09 +02:00
Benny Halevy	b131f94fc3	large_data_handler: maybe_delete_large_data_entries: data_size is unused Since `64a4ffc579` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214115258.1354372-1-bhalevy@scylladb.com>	2022-02-14 13:58:44 +02:00
Kamil Braun	4147ef94b0	service: raft: raft_group0: persist discovered peers and restore on restart We add a `peers()` method to `discovery` which returns the peers discovered until now (including seeds). The caller of functions which return an output -- `tick` or `request` -- is responsible for persisting `peers()` before returning the output of `tick`/`request` (e.g. before sending the response produced by `request` back). The user of `discovery` is also responsible for restoring previously persisted peers when constructing `discovery` again after a restart (e.g. if we previously crashed in the middle of the algorithm). The `persistent_discovery` class is a wrapper around `discovery` which does exactly that.	2022-02-14 12:05:18 +01:00
Kamil Braun	5dbf86fa29	db: system_keyspace: introduce discovery table This table will be used to persist the list of peers discovered by the `discovery` algorithm that is used for creating Raft group 0 when bootstrapping a fresh cluster.	2022-02-14 12:05:18 +01:00
Kamil Braun	02d4087c6e	service: raft: discovery: rename `get_output` to `tick` The name `get_output` suggests that this is the only way to get output from `discovery`. But there is a second public API: `request`, which also provides us with a different kind of output. Rename it to `tick`, which describes what the API is used for: periodically ticking the discovery state machine in order to make progress.	2022-02-14 12:04:37 +01:00
Kamil Braun	586ef8fc23	service: raft: discovery: stop returning peer_list from `request` after becoming leader In `raft_group0::discover_group0`, when we detect that we became a leader, we destroy the `discovery` object, create a group 0, and respond with the group 0 information to all further requests. However, there is a small time window after becoming a leader but before destroying the `discovery` object where we still answer discovery requests by returning peer lists, without informing the requester that we became a leader. This is unsafe, and the algorithm specification does not allow this. For example, consider the seed graph 0 --> 1. It satisfies the property required by the algorithm, i.e. that there exists a vertex reachable from every other vertex. Now `1` can become a leader before `0` contacts it. When `0` contacts `1`, it should learn from `1` that `1` created a group 0, so `0` does not become a leader itself and create another group 0. However, with the current implementation, it may happen that `0` contacts `1` and receives a peer list (instead of group 0 information), and also becomes a leader because it has the smallest ID, so we end up with two group 0s. The correct thing to do is to stop returning peer lists to requests immediately after becoming a leader. This is what we fix in this commit.	2022-02-14 12:04:37 +01:00
Benny Halevy	795d4a0bad	batchlog_manager: batchlog_replay_loop: ignore broken_semaphore if abort_requested drain() breaks _sem, causing do_batch_log_replay to throw broken_semaphore. Ignore this error in batchlog_replay_loop as it's expected on shutdown. https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/1073/testReport/junit/thrift_tests/TestCompactStorageThriftAccesses/test_get/ ``` E AssertionError: Unexpected errors found: [('node1', ['ERROR 2022-02-14 06:55:44,263 [shard 0] batchlog_manager - Exception in batch replay: seastar::broken_semaphore (Semaphore broken)'])] ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214090607.1213740-1-bhalevy@scylladb.com>	2022-02-14 11:34:16 +02:00
Raphael S. Carvalho	a9427f150a	Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME" This reverts commit `4c05e5f966`. Moving cleanup to the maintenance group made its operation time up to 10x slower than in the previous release. It's a blocker for the 4.6 release, so let's revert it until we figure this all out. Probably this happens because the maintenance group is fixed at a relatively small constant, and cleanup may be incrementally generating backlog for regular compaction, where the former is fighting for resources against the latter. Fixes #10060. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>	2022-02-13 21:48:20 +02:00
Avi Kivity	13cf66d3ef	Revert "schema_registry: Increase grace period for schema version cache" This reverts commit `23da2b5879`. It causes the node to quickly run out of memory when many schema changes are made within a small time window. Fixes #10071.	2022-02-13 19:38:24 +02:00
Avi Kivity	7cc43f8aa8	Merge 'utils: add result_try and result_futurize_try' from Piotr Dulikowski Adds `utils::result_try` and `utils::result_futurize_try` - functions which make it possible to convert existing try..catch blocks into a version which handles C++ exceptions, failed results with exception containers and, depending on the function variant, exceptional futures using the same exception handling logic. For example, you can convert the following try..catch block: try { return a_function_that_may_throw(); } catch (const my_exception& ex) { return 123; } catch (...) { throw; } ...to this: return utils::result_try([&] { return a_function_that_may_throw_or_return_a_failed_result(); }, utils::result_catch<my_exception>([&] (const my_exception&) { return 123; }), utils::result_catch_dots([&] (auto&& handle) { return handle.into_result(); })); Similarly, `utils::result_futurize_try` can be used to migrate `then_wrapped` or `f.handle_exception()` constructs. As an example of the usability of the new constructs, two places in the current code which need to simultaneously handle exceptions and failed results are converted to use `result_try` and `result_futurize_try`. Results of `perf_simple_query --smp 1 --operations-per-shard 1000000 --write`: ``` 127041.61 tps ( 67.2 allocs/op, 14.2 tasks/op, 52422 insns/op) 126958.60 tps ( 67.2 allocs/op, 14.2 tasks/op, 52409 insns/op) 127088.37 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op) 127560.84 tps ( 67.2 allocs/op, 14.2 tasks/op, 52424 insns/op) 127826.61 tps ( 67.2 allocs/op, 14.2 tasks/op, 52406 insns/op) 126801.02 tps ( 67.2 allocs/op, 14.2 tasks/op, 52420 insns/op) 125371.51 tps ( 67.2 allocs/op, 14.2 tasks/op, 52425 insns/op) 126498.51 tps ( 67.2 allocs/op, 14.2 tasks/op, 52427 insns/op) 126359.41 tps ( 67.2 allocs/op, 14.2 tasks/op, 52423 insns/op) 126298.27 tps ( 67.2 allocs/op, 14.2 tasks/op, 52410 insns/op) ``` The number of tasks and allocations is unchanged. 
The number of instructions per operation seems similar; it may have increased slightly (by 10-20), but it's hard to tell for sure because of the noisiness of the results. Tests: unit(dev) Closes #10045 * github.com:scylladb/scylla: transport: use result_try in process_request_one storage_proxy: use result_futurize_try in mutate_end storage_proxy: temporarily throw exception from result in mutate_end utils: add result_try and result_futurize_try	2022-02-13 19:38:13 +02:00
Avi Kivity	45bdb57b05	Update seastar submodule * seastar 299c9474d...c18cc5dc6 (5): > log: Fix silencer to be shard-local and logger-global Fixes #9784. > Fix alloc-dealloc-mismatch error in DPDK mode > Fix stack buffer overflow when using native network inteface > test: coroutines: test_scheduling_group: fixup indentation > test: coroutines: test_scheduling_group: destroy temporary scheduling_group when done	2022-02-13 16:47:25 +02:00
Avi Kivity	6572b297a2	treewide: clean up stray license blurbs After the mechanical change in `fcb8d040e8` ("treewide: use Software Package Data Exchange (SPDX) license identifiers"), a few stray license blurbs or fragments thereof remain. In two cases these were extra blurbs in code generators intended for the generated code, in others they were just missed by the script. Clean them up, adding an SPDX license identifier where needed. Closes #10072	2022-02-13 14:16:16 +02:00
Avi Kivity	6b380121e0	Merge 'utils/result: optimize result_parallel_for_each' from Piotr Dulikowski This PR rewrites the `utils::result_parallel_for_each`'s implementation to resemble the original `seastar::parallel_for_each` more closely instead of using the less efficient `seastar::map_reduce`. It uses fewer tasks and allocations now, as demonstrated in the results from the `perf_simple_query` benchmark, attached at the end of the cover letter. The main drawback of the new implementation is that it needs to rethrow exceptions propagated as exceptional futures from the parallel sub-invocations. Contrary to the original `seastar::parallel_for_each` which uses a custom task to collect results, the new `utils::result_parallel_for_each` uses a coroutine and there doesn't currently seem to be a way to co_await for a future and inspect its state without either rethrowing or handling it in then_wrapped (which allocates a continuation). Fortunately, rethrowing is not needed for exceptions returned in failed result<>, which are already intended to be a more performant alternative to regular exceptions. As a bonus, definitions from `utils/result.hh` are now split across three different headers in order to improve (re)compilation times. Results from `perf_simple_query --smp 1 --operations-per-shard 1000000 --write` (before vs. 
after): ``` 126872.54 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op) 126532.13 tps ( 67.2 allocs/op, 14.2 tasks/op, 52408 insns/op) 126864.99 tps ( 67.2 allocs/op, 14.2 tasks/op, 52428 insns/op) 127073.10 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op) 126895.85 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op) 127894.02 tps ( 66.2 allocs/op, 13.2 tasks/op, 52036 insns/op) 127671.51 tps ( 66.2 allocs/op, 13.2 tasks/op, 52042 insns/op) 127541.42 tps ( 66.2 allocs/op, 13.2 tasks/op, 52044 insns/op) 127409.10 tps ( 66.2 allocs/op, 13.2 tasks/op, 52052 insns/op) 127831.30 tps ( 66.2 allocs/op, 13.2 tasks/op, 52043 insns/op) ``` Test: unit(dev, debug) Closes #10053 * github.com:scylladb/scylla: utils/result: optimize result_parallel_for_each utils/result: split into `combinators` and `loop` file	2022-02-13 12:04:40 +02:00
Avi Kivity	52e707f978	Merge 'gms: gossiper: coroutinize code (continued)' from Pavel Solodovnikov This series continues the effort of https://github.com/scylladb/scylla/pull/9844 to reduce `seastar::async` usage and coroutinize the gossiper code. These are mostly trivial conversions from using `.get()` to `co_await`, where appropriate, as well as elimination of `seastar::async()` wrappers. A few more functions are not yet converted, though (e.g. `apply_new_states`, `do_apply_state_locally`, `apply_state_locally`, `apply_state_locally_without_listener_notification`, and maybe a few others). The motivation is to be able to call every public API function of the `gossiper` class without requiring `seastar::async` context. Tests: unit(debug, dev), dtest (topology-related tests) Closes #10032 * github.com:scylladb/scylla: gms: gossiper: coroutinize `wait_for_gossip` gms: gossiper: coroutinize `advertise_token_removed` gms: gossiper: coroutinize `advertise_removing` gms: gossiper: don't wrap `convict` calls into `seastar::async` gms: gossiper: coroutinize `handle_major_state_change` gms: gossiper: coroutinize `handle_shutdown_msg` gms: gossiper: coroutinize `mark_as_shutdown` and `convict` gms: gossiper: remove comment about requiring thread context in `mark_alive` gms: gossiper: don't use `seastar::async` in `mark_alive` gms: gossiper: coroutinize `do_on_change_notifications` gms: gossiper: coroutinize `do_before_change_notifications` gms: gossiper: coroutinize `real_mark_alive` gms: gossiper: coroutinize `mark_dead`	2022-02-13 11:51:44 +02:00
Piotr Dulikowski	dd3284ec38	utils/result: optimize result_parallel_for_each It now resembles the original parallel_for_each more, but uses a coroutine instead of a custom `task` to collect not-ready futures. Although the usage of a coroutine saves on allocations, the drawback is that there is currently no way to co_await on a future and handle its exception without throwing or without unconditionally allocating a then_wrapped or handle_exception continuation - so it introduces a rethrow. Furthermore, now failed results and exceptions are treated as equals. Previously, in case one parallel invocation returned failed result and another returned an exception, the exception would always be returned. Now, the failed result/exception of the invocation with the lowest index is always preferred, regardless of the failure type. The reimplementation manages to save about 350-400 instructions, one task and one allocation in the perf_simple_query benchmark in write mode. Results from `perf_simple_query --smp 1 --operations-per-shard 1000000 --write` (before vs. after): ``` 126872.54 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op) 126532.13 tps ( 67.2 allocs/op, 14.2 tasks/op, 52408 insns/op) 126864.99 tps ( 67.2 allocs/op, 14.2 tasks/op, 52428 insns/op) 127073.10 tps ( 67.2 allocs/op, 14.2 tasks/op, 52404 insns/op) 126895.85 tps ( 67.2 allocs/op, 14.2 tasks/op, 52411 insns/op) 127894.02 tps ( 66.2 allocs/op, 13.2 tasks/op, 52036 insns/op) 127671.51 tps ( 66.2 allocs/op, 13.2 tasks/op, 52042 insns/op) 127541.42 tps ( 66.2 allocs/op, 13.2 tasks/op, 52044 insns/op) 127409.10 tps ( 66.2 allocs/op, 13.2 tasks/op, 52052 insns/op) 127831.30 tps ( 66.2 allocs/op, 13.2 tasks/op, 52043 insns/op) ``` Test: unit(dev), unit(result_utils_test, debug)	2022-02-10 18:19:08 +01:00
Piotr Dulikowski	6abeec6299	utils/result: split into `combinators` and `loop` file Segregates result utilities into: - result.hh - basic definitions related to results with exception containers, - result_combinators.hh - combinators for working with results in conjunction with futures, - result_loop.hh - loop-like combinators, currently has only result_parallel_for_each. The motivation for the split is: 1. In headers, usually only result.hh will be needed, so there is no need to force most .cc files to compile definitions from other files, 2. Fewer files need to be recompiled when a combinator is added to result_combinators or result_loop. As a bonus, `result_with_exception` was moved from `utils::internal` to just `utils`.	2022-02-10 18:19:05 +01:00
Piotr Dulikowski	049564bd2d	transport: use result_try in process_request_one Adapts the exception handling logic in process_request_one so that it uses utils::result_try to handle both C++ exceptions and failed results in a unified way.	2022-02-10 17:35:32 +01:00
Piotr Dulikowski	98bde8d6d2	storage_proxy: use result_futurize_try in mutate_end Adapts the mutate_end exception handling logic so that it uses the new utils::result_futurize_try function to handle both exceptional futures and failed results in a unified way.	2022-02-10 17:35:32 +01:00
Piotr Dulikowski	d5d24a5140	storage_proxy: temporarily throw exception from result in mutate_end Temporarily removes the logic which handles failed results in a non-throwing way. Exceptions from failed results are thrown and handled in try..catch. The reason for this change is that it makes the following commit, which migrates the whole try..catch block to utils::result_futurize_try much nicer. The next commit will also bring back the non-throwing handling of the failed result.	2022-02-10 17:35:32 +01:00
Piotr Dulikowski	8d52ceca50	utils: add result_try and result_futurize_try Adds result_try and result_futurize_try - functions which make it possible to convert existing try..catch blocks into a version which handles C++ exceptions, failed results with exception containers and, depending on the function variant, exceptional futures.	2022-02-10 17:35:32 +01:00
Juliusz Stasiewicz	00a6fda7b9	tracing: Trace slow queries on replicas wrt. parent's clock Secondary tracing sessions used to compute the execution time from the point of their own `begin()`, not the parent session's `begin()`. As a result, a replica reported a slow query only if the query exceeded the entire threshold on that replica as well. This change augments `trace_info` with the timestamp of the parent session's starting point, to be used as a reference on replicas. Fixes #9403 Closes #10005	2022-02-10 12:03:53 +01:00
Pavel Solodovnikov	e892170c86	raft: add raft tables to `extra_durable_tables` list `system.raft`, `system.raft_snapshots` and `system.raft_config` were missing from the `extra_durable_tables` list, so that `set_wait_for_sync_to_commitlog(true)` was not enabled when the tables were re-created via `create_table_from_mutations`. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20220210073418.484843-1-pa.solodovnikov@scylladb.com>	2022-02-10 11:47:41 +02:00
Botond Dénes	ef34c10a94	main: run scylla main to when there are no arguments main() has some logic to select the main function it will delegate to based on argv[1]. The intent is that when the value of argv[1] suggests that the user did not specify a specific app to run, we default to "server" (scylla proper). This logic currently breaks down when there are no arguments at all: in this case the following error is printed and scylla refuses to start: error: unrecognized first argument: expected it to be "server", a regular command-line argument or a valid tool name (see `scylla --list-tools`), but got Fix this by checking for empty argv[1] and defaulting to "server" in that case. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220210092125.293682-1-bdenes@scylladb.com>	2022-02-10 11:47:20 +02:00
Benny Halevy	c8cf545fdc	sstable_directory: process_sstable_dir: use directory_lister Simplify the implementation by using directory_lister get() rather than lister::scan_dir. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-10 11:41:50 +02:00
Benny Halevy	6b59c5bccd	sstable_directory: process_sstable_dir: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-10 11:41:50 +02:00
Benny Halevy	8b654afc1c	sstable_directory: coroutinize process_sstable_dir Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-10 11:41:50 +02:00
Benny Halevy	207174c692	lister: add directory_lister directory_lister provides a simpler interface compared to lister. After creating the directory_lister, its async get() method should be called repeatedly, returning a std::optional<directory_entry> each call, until it returns a disengaged entry or an error. This is especially suitable for coroutines as demonstrated in the unit tests that were added. For example: auto dl = directory_lister(path); while (auto de = co_await dl.get()) { co_await process(*de); } Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-10 11:41:50 +02:00
Avi Kivity	2e2b54254c	Merge 'docs: update theme 1.1' from David Garcia Related issue https://github.com/scylladb/sphinx-scylladb-theme/issues/310 ScyllaDB Sphinx Theme 1.1 is now released 🥳 We’ve updated all our dependencies to the latest version and introduced new directives you can use to write great docs. You can read more about all notable changes [here](https://sphinx-theme.scylladb.com/master/upgrade/CHANGELOG.html#february-2022). Before, the theme installed [poetry 1.1.x](https://python-poetry.org/) as a dependency to manage Python dependencies. However, ``poetry 1.2.x`` changed the installation method. Therefore, we've decided to [#307 Make poetry a prerequisite](https://github.com/scylladb/sphinx-scylladb-theme/issues/307) so that you can decide to install the poetry version you prefer. To preview the docs locally, you should uninstall the previous version of poetry. Then, install the latest version: 1. Uninstall Poetry 1.1.x. ``` curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | POETRY_UNINSTALL=1 python - ``` 2. Install Poetry 1.2.x. For detailed instructions, see [Poetry installation](https://python-poetry.org/docs/master/#installation). 1. Clone this PR. For more information, see [Cloning pull requests locally](https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally). 2. Uninstall poetry 1.1 and install poetry 1.2. For more information, see Breaking changes notice above. 3. Enter the docs folder, and run: ``` make preview ``` 4. Open http://127.0.0.1:5500/ with your favorite browser. 
The doc should render without errors, and the Sphinx Theme version (see the footer) must be ``1.1.x``: ![image](https://user-images.githubusercontent.com/9107969/152107446-52b167d8-c607-4431-a7a4-92579153d024.png) Closes #10054 * github.com:scylladb/scylla: Add missing lexer docs: update theme 1.1	2022-02-10 11:14:02 +02:00
Botond Dénes	54b27a6dec	Update seastar submodule * seastar d27bf8b5...299c9474 (1): > core/app_template: print debug warning to std::cerr	2022-02-10 09:51:41 +02:00
Nadav Har'El	4937270803	test/alternator: add option to run with Raft-based schema changes This patch adds a "--raft" option to test/alternator/run to enable the experimental Raft-based schema changes ("--experimental-features=raft") when running Scylla for the tests. This is the same option we added to test/cql-pytest/run in a previous patch. Note that we still don't have any Alternator tests that pass or fail differently in these two modes - these will probably come later as we fix issues #9868 and #6391. But in order to work on fixing those issues we need to be able to run the tests in Raft mode. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220209123144.321344-1-nyh@scylladb.com>	2022-02-10 09:43:10 +02:00
Nadav Har'El	8409a42baa	merge: Convert table::compact_sstables to coroutines Patch series by Mikołaj Sielużycki compaction: Fix indentation in table::compact_sstables. compaction: Convert table::compact_sstables to coroutines.	2022-02-10 09:10:24 +03:00
Nadav Har'El	a1635b553e	cql-pytest: fix detection of "raft" experimental feature In a previous patch we fixed the output of experimental features list (issue #10047), so we also need to fix the test code which detects the "raft" experimental feature - to use the string "raft" and not the silly byte 4 we had there before. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220209104331.312999-1-nyh@scylladb.com>	2022-02-10 09:10:24 +03:00
Nadav Har'El	de586ef856	test/cql-pytest: mechanism for tests requiring raft-based schema updates Issue #8968 no longer exists when Raft-based schema updates are enabled in Scylla (with --experimental-features=raft). Before we can close this issue we need a way to re-run its test test_keyspace.py::test_concurrent_create_and_drop_keyspace with Raft and see it pass. But we also want the tests to continue to run by default the older raft-less schema updates - so that this mode doesn't regress during the potentially-long duration that it's still the default! The solution in this patch is: 1. Introduce a "--raft" option to test/cql-pytest/run, which runs the tests against a Scylla with the raft experimental feature, while the default is still to run without it. 2. Introduce a test fixture "fails_without_raft" which marks a test which is expected to fail with the old pre-raft code, but is expected to pass in the new code. 3. Mark the test test_concurrent_create_and_drop_keyspace with this new "fails_without_raft". After this patch, running test/cql-pytest/run --raft test_keyspace.py::test_concurrent_create_and_drop_keyspace Passes, which shows that issue 8968 was fixed (in Raft mode) - so we can say: Fixes #8968 Running the same test without "--raft" still xfails (an expected failure). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220208162732.260888-1-nyh@scylladb.com>	2022-02-10 09:10:24 +03:00
Nadav Har'El	fef7934a2d	config: fix some types in system.config virtual table The system.config virtual table prints each configuration variable of type T based on the JSON printer specified in the config_type_for<T> in db/config.cc. For two variable types - experimental_features and tri_mode_restriction, the specified converter was wrong: We used value_to_json<string> or value_to_json<vector<string>> on something which was not a string. Unfortunately, value_to_json silently cast the given objects into strings, and the result was garbage: For example as noted in #10047, for experimental_features instead of printing a list of features names, e.g., "raft", we got a bizarre list of one-byte strings with each feature's number (which isn't documented or even guaranteed to not change) as well as carriage-return characters (!?). So the solution is a new printable_to_json<T> which works on a type T that can be printed with operator<< - as in fact the above two types can - and the type is converted into a string or vector of strings using this operator<<, not a cast. Also added a cql-pytest test for reading system.config and in particular options of the above two types - checking that they contain sensible strings and not "garbage" like before this patch. Fixes #10047. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220209090421.298849-1-nyh@scylladb.com>	2022-02-10 09:10:24 +03:00
David Garcia	e092bf3bad	Add missing lexer	2022-02-09 11:25:10 +00:00
Mikołaj Sielużycki	ee386213c2	compaction: Fix indentation in table::compact_sstables.	2022-02-09 12:19:23 +01:00
Mikołaj Sielużycki	ec91192525	compaction: Convert table::compact_sstables to coroutines.	2022-02-09 12:19:23 +01:00
David Garcia	24b5584941	docs: update theme 1.1	2022-02-09 11:13:38 +00:00
Avi Kivity	7f0dec9227	Update seastar submodule * seastar 0d250d15a...d27bf8b5a (5): > Merge "Clean internal namespace in io_queue.cc" from Pavel E > Making par.._for_each and max_conc.._for_each compatible with move-only views (like generators) > tests: Perf test for smp::submit_to efficiency > Merge "Auto-increase IO latency goal from reactor" from Pavel E > reactor: Fix default task-quota-ms to be 0.5ms	2022-02-09 10:17:26 +02:00
Tomasz Grabiec	23da2b5879	schema_registry: Increase grace period for schema version cache If version is absent in cache, it will be fetched from the coordinator. This is not expensive, but if the version is not known, it must be also "synced". It means that the node will do a full schema pull from the coordinator. This pull is expensive and can take seconds. If the coordinator we pull from is at an old version, the pull will do nothing and current node will soon forget the old version, initiating another pull. If some nodes stay at an old version for a long time for some reason, this will make new coordinators initiate pulls frequently. Increase the expiration period to 15 minutes to reduce the impact in such scenarios. Fixes #10042. Message-Id: <20220207122317.674241-1-tgrabiec@scylladb.com>	2022-02-09 09:27:07 +02:00
Tomasz Grabiec	7ae947b7e1	Merge "raft: bootstrap nodes as non-voter" from Alejo Make only the first node in group0 start as a voter. Subsequent nodes start as non-voters and request a change to voter once bootstrap is successful. Add support for this in raft and a couple of minor fixes. * alejo/raft-join-non-voting-v6: raft: nodes joining as non-voters raft: group 0: use cfg.contains() for config check raft: modify_config: support voting state change raft: minor: fix log format string	2022-02-09 09:27:07 +02:00
Raphael S. Carvalho	d208d33636	Fix quadratic behavior and compaction inefficiency when adding new files With trigger_compaction() being called after each new sstable is added to the set, we'll get quadratic behavior because strategies like tiered will sort all the candidates before iterating on them, so complexity is ~ ((N - 1) * N * logN). Additionally, compaction may be inefficient as we're not waiting for the sstable set to settle, so table may end up missing files that would allow for more efficient jobs. The latter isn't a big problem because we have reshape running in an earlier phase, so data layout should satisfy the strategy almost. Boot is not affected by these problems because it temporarily disables auto compaction, so trigger_compaction() is a no-op for it. So refresh remains as the only one affected. Fixes #10046. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220208151154.72606-1-raphaelsc@scylladb.com>	2022-02-09 09:27:07 +02:00
Alejo Sanchez	a0c2bc0df2	raft: nodes joining as non-voters Except for the first node creating the group0, make other nodes join as non-voters and make them voters after successful bootstrap. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-02-08 09:16:30 -04:00
Avi Kivity	5099b1e272	Merge 'Propagate coordinator timeouts for regular writes and batches without throwing' from Piotr Dulikowski Currently, most of the failures that occur during CQL reads or writes are reported using C++ exceptions. Although the seastar framework avoids most of the cost of unwinding by keeping exceptions in futures as `std::exception_ptr`s, the exceptions need to be inspected at various points for the purposes of accounting metrics or converting them to a CQL error response. Analyzing the value and type of an exception held by `std::exception_ptr`'s cannot be done without rethrowing the exception, and that can be very costly even if the exception is immediately caught. Because of that, exceptions are not a good fit for reporting failures which happen frequently during overload, especially if the CPU is the bottleneck. This PR introduces facilities for reporting exceptions as values using the boost::outcome library. As a first step, the need to use exceptions for reporting timeouts was eliminated for regular and batch writes, and no exceptions are thrown between creation of a `mutation_write_timeout_exception` and its serialization as a CQL response in the `cql_server`. The types and helpers introduced here can be reused in order to migrate more exceptions and exception paths in a similar fashion. Results of `perf_simple_query --smp 1 --operations-per-shard 1000000`: Master (`00a9326ae7`) 128789.53 tps ( 82.2 allocs/op, 12.2 tasks/op, 49245 insns/op) This PR 127072.93 tps ( 82.2 allocs/op, 12.2 tasks/op, 49356 insns/op) The new version seems to be slower by about 100 insns/op, fortunately not by much (about 0.2%). 
Tests: unit(dev), unit(result_utils_test, debug) Closes #10014 * github.com:scylladb/scylla: cql_test_env: optimize handling result_message::exception transport/server: handle exceptions from coordinator_result without throwing transport/server: propagate coordinator_result to the error handling code transport/server: unwrap the exception result_message in process_xyz_internal query_processor: add exception-returning variants of execute_ methods modification_statement: propagate failed result through result_message::exception batch_statement: propagate failed result through result_message::exception cql_statement: add `execute_without_checking_exception_message` result_message: add result_message::exception storage_proxy: change mutate_with_triggers to return future<result<>> storage_proxy: add mutate_atomically_result storage_proxy: return result<> from mutate_result storage_proxy: return result<> from mutate_internal storage_proxy: properly propagate future from mutate_begin to mutate_end storage_proxy: handle exceptions as values in mutate_end storage_proxy: let mutate_end take a future<result<>> storage_proxy: resultify mutate_begin storage_proxy: use result in the _ready future of write handlers storage_proxy: introduce helpers for dealing with results exceptions: add coordinator_exception_container and coordinator_result utils: add result utils utils: add exception_container	2022-02-08 14:27:09 +02:00
Alejo Sanchez	2d9f40f716	raft: group 0: use cfg.contains() for config check There will be nodes in non-voting state in configuration, so can_vote() is not a good check. Use newer cfg.contains(). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-02-08 08:00:07 -04:00
Alejo Sanchez	627275945f	raft: modify_config: support voting state change Handle requests to change voting for servers already present in the current configuration. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-02-08 08:00:07 -04:00
Alejo Sanchez	a40417df08	raft: minor: fix log format string Fix format string for log line. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-02-08 08:00:07 -04:00
Piotr Dulikowski	ffd439d908	cql_test_env: optimize handling result_message::exception The single_node_cql_env uses query_processor::execute_xyz family of methods to perform operations. Due to previous commits in this series, they allocate one more task than before - a continuation that converts result_message::exception into an exceptional future. We can recover that one task by using variants of those methods which do not perform a conversion, and turn .finally() invocations into .then()s which perform conversion manually.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	81968f2c3a	transport/server: handle exceptions from coordinator_result without throwing Instead of throwing the exception contained in failed `result<>`, it is now inspected with a visitor which avoids the need for throwing.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	4cc5d582e3	transport/server: propagate coordinator_result to the error handling code Now, the failed `result<>` is throwlessly propagated to the continuation which converts exceptions to CQL response messages, and is thrown there.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	c750f7895f	transport/server: unwrap the exception result_message in process_xyz_internal At the point where `result_message` is converted to a `cql_server::response`, now the result message is inspected and returned as failed `result<>` if it contained an error. For now, the failed `result<>` is thrown as exception in `process` and `process_on_shard`, but that will change in the next commit.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	53f3feb103	query_processor: add exception-returning variants of execute_ methods Adds variants of the execute_prepared, execute_direct and execute_batch which are allowed to return exceptions as `result_message::exception`. Because the `result_message::exception` must be explicitly handled by the receiver, new variants are introduced in order not to accidentally ignore the exception, which would be very bad.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	2572104dfe	modification_statement: propagate failed result through result_message::exception Modifies the modification_statement code so that it converts failed `result<>` into a `result_message::exception` without involving the C++ exception runtime.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	f9d1914e1c	batch_statement: propagate failed result through result_message::exception Modifies the batch_statement code so that it converts failed `result<>` into a `result_message::exception` without involving the C++ exception runtime.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	e1d762b110	cql_statement: add `execute_without_checking_exception_message` Adds a new virtual method to the cql_statement with a wordy name. The new method is a variant of `execute`, but it is allowed to return errors via the `result_message::exception` object. The reason for an additional method is that there are many places in the code which call `execute` but do not check the result in any way. Because ignoring an exception unintentionally is a very bad thing, the new method needs to be explicitly implemented by statements which can return a `result_message::exception`, and explicitly called in the code which is prepared to handle a `result_message::exception`.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	e4ff22b4ca	result_message: add result_message::exception In order to propagate exceptions as values through the CQL layer with minimal modifications to the interfaces, a new result_message type is introduced: result_message::exception. Similarly to result_message::bounce_to_shard, this is an internal type which is supposed to be handled before being returned to the client.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	4c1eae7600	storage_proxy: change mutate_with_triggers to return future<result<>> Changes the interface of `mutate_with_triggers` so that it returns `future<result<>>` instead of `future<>`. No intermediate `mutate_with_triggers_result` method is introduced because all call sites will be changed in this PR so that they properly handle failed `result<>`s with exceptions-as-values.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	7ed668a177	storage_proxy: add mutate_atomically_result Similarly to `mutate_result` introduced in the previous commit, `mutate_atomically_result` is introduced which returns some exceptions inside `result<>`. The pre-existing `mutate_atomically` keeps the same interface but uses `mutate_atomically_result` internally, converting failed `result<>` to exceptional future if needed.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	f9ff5e7692	storage_proxy: return result<> from mutate_result In order to be able to propagate exceptions-as-values from storage_proxy but without having to modify all call sites of `mutate`, an in-between method `mutate_result` is introduced which returns some exceptions inside `result<>`. Now, `mutate` just calls the latter and converts those exceptions to exceptional future if needed.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	f02b8614af	storage_proxy: return result<> from mutate_internal Changes the interface of `mutate_internal` so that it returns a `future<result<>>` instead of `future<>`.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	f8bbf67e64	storage_proxy: properly propagate future from mutate_begin to mutate_end Modifies all call sites of `mutate_begin` and `mutate_end` so that the failed result<> created in the former is properly propagated to the latter.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	e2893368a7	storage_proxy: handle exceptions as values in mutate_end Instead of stupidly rethrowing the exception in failed result<>, the `storage_proxy::mutate_end` function now inspects it with a visitor, which does not involve any rethrows. Moreover, mutate_end now also returns a `future<result<>>` instead of just `future<>`.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	5c00b27662	storage_proxy: let mutate_end take a future<result<>> Changes the `storage_proxy::mutate_end` method to accept a `future<result<>>` instead of `future<>`. For the time being, all call sites of that method pass a future which is either exceptional or contains a result<> with a value. Moreover, in case of a failed result<>, mutate_end just rethrows the exception. Both of these will change in the upcoming commits of this PR.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	59efe085af	storage_proxy: resultify mutate_begin Changes the `storage_proxy::mutate_begin` method to return a future<result<>>.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	3a92513ef6	storage_proxy: use result in the _ready future of write handlers Changes the type of the _ready promise in abstract_write_response_handler - a promise used by the coordinator logic to wait until the write operation is complete - to keep a `result<>` instead of `void`. Now, a timeout is signalled by setting the promise to a value containing a `result<>` with a mutation write timeout exception - previously it was signalled by setting the promise to an exceptional value. This is just a first step on a long road of throwless propagation of the error to the cql_server - for now, a failed result is immediately converted to an exceptional future in `storage_proxy::response_wait`.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	6ac98f26e0	storage_proxy: introduce helpers for dealing with results Adds a number of typedefs in order to make working with coordinator exceptions-as-values easier.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	9304791ce5	exceptions: add coordinator_exception_container and coordinator_result Adds coordinator_exception_container which is a typedef over exception_container and is meant to hold exceptions returned from the coordinator code path. Currently, it can only hold mutation write timeout exceptions, because only that kind of error will be returned by value as a result of this PR. In the future, more exception types can be added. Adds coordinator_result which is a boost::outcome::result that uses coordinator_exception_container as the error type.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	11cb670881	utils: add result utils Adds a number of utilities for working with boost::outcome::result combined with exception_container. The utilities are meant to help with migration of the existing code to use the boost::outcome::result: - `exception_container_throw_policy` - a NoValuePolicy meant to be used as a template parameter for the boost::outcome::result. It protects the caller of `result::value()` and `result::error()` methods - if the caller wishes to get a value but the result has an error (exception_container in our case), the exception in the container will be thrown instead. In case it's the other way around, boost::outcome::bad_result_access is thrown. - `result_parallel_for_each` - a version of `parallel_for_each` which is aware of results and returns a failed result in case any of the parallel invocations return a failed result. - `result_into_future` - converts a result into a future. If the result holds a value, converts it into make_ready_future; if it holds an exception, the exception is returned as make_exception_future. - `then_ok_result` takes a `future<T>` and converts it into a `future<result<T>>`. - `result_wrap` adapts a callable of type `T -> future<result<T>>` and returns a callable of type `result<T> -> future<result<T>>`.	2022-02-08 11:08:42 +01:00
Raphael S. Carvalho	38f83d8862	compaction_manager: Don't mix member functions and variables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220204190911.37276-1-raphaelsc@scylladb.com>	2022-02-07 18:40:48 +02:00
Botond Dénes	9cfde98cce	Merge "Move is_replacing/get_replace_address from database" from Pavel Emelyanov " This is the continuation of `3e31126b` (Brush up the initial tokens generation code). The replica::database is still used as the configuration provider, and two of those bits can be easily fixed. " tests: unit(dev) * 'br-database-no-replacing-config' of https://github.com/xemul/scylla: database: Move is_replacing() and get_replace_address() (back) into storage_service bootstrapper: Get 'is-replacing' via argument too bootstrapper: Get replace address via argument	2022-02-07 18:40:48 +02:00
Nadav Har'El	9982a28007	alternator: allow REMOVE of non-existent nested attribute DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x exists in the item, but x.y doesn't - the removal silently does nothing. Alternator incorrectly generated an error in this case, and unfortunately we didn't have a test for this case. So in this patch we add the missing test (which fails on Alternator before this patch - and passes on DynamoDB) and then fix the behavior. After this patch, "REMOVE x.y" will remain an error if "x" doesn't exist (saying "document paths not valid for this item"), but if "x" exists and is a map, but "x.y" doesn't, the removal will silently do nothing and will not be an error. Fixes #10043. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220207133652.181994-1-nyh@scylladb.com>	2022-02-07 18:40:48 +02:00
Benny Halevy	31f4cd21eb	shard_reader: close: degrade error message to warning 1. There's nothing we can do about this error. 2. It doesn't affect any query. 3. No need to report timeout errors here. Refs #10029 Note that in 4.6.rc4-0.20220203.34d470967a0 (where the issue above was opened against) the error is likely to be related to read_ahead failure which is already reported as a warning in master since `fc729a804b`. When backported, this patch should be applied after: `fc729a804b` `d7a993043d` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220207080041.174934-1-bhalevy@scylladb.com>	2022-02-07 18:40:48 +02:00
Kamil Braun	93eed6d0c7	service: storage_service: leave Raft group 0 before `stop_transport` in `decommission` Leaving group 0 in `decommission` would previously fail with RPC exception because it happened after messaging service was shutdown. Fixes #9845. Message-Id: <20220201112743.9705-1-kbraun@scylladb.com>	2022-02-07 18:40:48 +02:00
Piotr Sarna	5a13ff09e9	expression: fix get_value for mismatched column definitions As observed in #10026, after schema changes it somehow happened that a column definition that does not match any of the base table columns was passed to expression verification code. The function that looks up the index of a column happens to return -1 when it doesn't find anything, so using this returned index without checking if it's nonnegative results in accessing invalid vector data, and a segfault or silent memory corruption. Therefore, an explicit check is added to see if the column was actually found. This serves two purposes: - avoiding segfaults/memory corruption - making it easier to investigate the root cause of #10026 Closes #10039	2022-02-07 18:40:48 +02:00
Nadav Har'El	203291f7ba	cql: reject a map literal with the same key twice The CQL parser currently accepts a command like: ALTER KEYSPACE ksname WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc1' : 3 } But because these options are read into an std::map, one of the definitions of 'dc1' is silently ignored (counter-intuitively, it is the first setting which is kept, and the second setting is ignored.) But this is most likely a user's typo, so a better choice is to report this as a parse error instead of arbitrarily and silently keeping just one of the settings. This is what Cassandra does since version 3.11 (see https://issues.apache.org/jira/browse/CASSANDRA-13369 and Cassandra commit 1a83efe2047d0138725d5e102cc40774f3b14641), and this is what we do in this patch. The unit test cassandra_tests/validation/operations/alter_test.py:: testAlterKeyspaceWithMultipleInstancesOfSameDCThrowsSyntaxException, translated from Cassandra's unit tests, now passes. Fixes #10037. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220207113709.78613-1-nyh@scylladb.com>	2022-02-07 18:40:48 +02:00
Pavel Emelyanov	66b9a53808	database: Move is_replacing() and get_replace_address() (back) into storage_service Both helpers (naturally) used to be storage-service methods, but then were moved to database because bootstrapper code wanted to know this info. Now the bootstrapper is equipped with the necessary arguments. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-07 12:43:08 +03:00
Pavel Emelyanov	469ded71a9	bootstrapper: Get 'is-replacing' via argument too This also removes the only usage of this helper outside of the storage service. The place that needs it is the use_strict_sources_for_ranges() checker and all the callers of it are aware of whether replacing is happening or not. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-07 12:41:02 +03:00
Pavel Emelyanov	9770f54789	bootstrapper: Get replace address via argument This removes the only usage of db.get_replace_address outside of storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-07 12:39:51 +03:00
Nadav Har'El	cc57ac8c1c	cql3: add a cql3::util::quote() function The function cql3::util::maybe_quote() is used throughout Scylla to convert identifier names (column names, table names, etc.) into strings that can be embedded in CQL commands. maybe_quote() sometimes needs to quote these identifier names, but when the identifier name is lowercase, and not a CQL keyword, it is not quoted. Not quoting identifier names when not needed is nice and pretty, but has a forward-compatibility problem: If some CQL command with an unquoted identifier is saved somewhere, and a new version of Scylla adds this identifier as a new reserved keyword - the CQL command will break. So this patch introduces a new function, cql3::util::quote(), which unconditionally quotes the given identifier. The new function is not yet used in Scylla, but we add a unit test (based on the test of maybe_quote()) to confirm it behaves correctly. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220118161217.231811-2-nyh@scylladb.com>	2022-02-07 11:33:57 +02:00
Nadav Har'El	5d2f694a90	cql3: fix cql3::util::maybe_quote() for keywords cql3::util::maybe_quote() is a utility function formatting an identifier name (table name, column name, etc.) that needs to be embedded in a CQL statement - and might require quoting if it contains non-alphanumeric characters, uppercase characters, or a CQL keyword. maybe_quote() made an effort to only quote the identifier name if necessary, e.g., a lowercase name usually does not need quoting. But lowercase names that are CQL keywords - e.g., to or where - cannot be used as identifiers without quoting. This can cause problems for code that wants to generate CQL statements, such as the materialized-view problem in issue #9450 - where a user had a column called "to" and wanted to create a materialized view for it. So in this patch we fix maybe_quote() to recognize invalid identifiers by using the CQL parser, and quote them. This will quote reserved keywords, but not so-called unreserved keywords, which are allowed as identifiers and don't need quoting. This addition slows down maybe_quote(), but maybe_quote() is anyway only used in heavy operations which need to generate CQL. This patch also adds two tests that reproduce the bug and verify its fix: 1. Add to the low-level maybe_quote() test (a C++ unit test) also tests that maybe_quote() quotes reserved keywords like "to", but doesn't quote unreserved keywords like "int". 2. Add a test reproducing issue #9450 - creating a materialized view whose key column is a keyword. This new test passes on Cassandra, failed on Scylla before this patch, and passes after this patch. It is worth noting that maybe_quote() now has a "forward compatibility" problem: If we save CQL statements generated by maybe_quote(), and a future version introduces a new reserved keyword, the parser of the future version may not be able to parse the saved CQL statement that was generated with the old maybe_quote() and didn't quote what is now a keyword. 
This problem can be solved in two ways: 1. Try hard not to introduce new reserved keywords. Instead, introduce unreserved keywords. We've been doing this even before recognizing this maybe_quote() future-compatibility problem. 2. In the next patch we will introduce quote() - which unconditionally quotes identifier names, even if lowercase. These quoted names will be uglier for lowercase names - but will be safe from future introduction of new keywords. So we can consider switching some or all uses of maybe_quote() to quote(). Fixes #9450 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220118161217.231811-1-nyh@scylladb.com>	2022-02-07 11:33:56 +02:00
Nadav Har'El	b3cfd4ce07	cql-pytest: translate Cassandra's tests for ALTER operations This is a translation of Cassandra's CQL unit test source file validation/operations/AlterTest.java into our cql-pytest framework. This test file includes 24 tests for various types of ALTER operations (of keyspaces, tables and types). Two additional tests which required multiple data centers to test were dropped with a comment explaining why. All 24 tests pass on Cassandra, with 8 failing on Scylla reproducing one already known Scylla issue and 5 previously-unknown ones: Refs #8948: Cassandra 3.11.10 uses "class" instead of "sstable_compression" for compression settings by default Refs #9929: Cassandra added "USING TIMESTAMP" to "ALTER TABLE", we didn't. Refs #9930: Forbid re-adding static columns as regular and vice versa Refs #9935: Scylla stores un-expanded compaction class name in system tables. Refs #10036: Reject empty options while altering a keyspace Refs #10037: If there are multiple values for a key, CQL silently chooses last value Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220206163820.1875410-2-nyh@scylladb.com>	2022-02-07 10:57:43 +02:00
Nadav Har'El	b61876f4ff	test/cql-pytest: implement nodetool.compact() Implement the nodetool.compact() function, requesting a major compaction of the given table. As usual for the nodetool.* functions, this is implemented with the REST API if available (i.e., testing Scylla), or with the external "nodetool" command if not (for testing Cassandra). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220206163820.1875410-1-nyh@scylladb.com>	2022-02-07 10:57:42 +02:00
Konstantin Osipov	caeaba60f9	cql_repl: use POSIX primitives to reset input/output Seastar uses POSIX IO for output in addition to C++ iostreams, e.g. in print_safe(), where it write()s directly to stdout. Instead of manipulating C++ output streams to reset stdout/log files, reopen the underlying file descriptors to output/log files. Fixes #9962 "cql_repl prints junk into the log" Message-Id: <20220204205032.1313150-1-kostja@scylladb.com>	2022-02-07 10:53:20 +02:00
Nadav Har'El	c020ed7383	merge: test.py: assorted fixes Merged patch series by Konstantin Osipov: Assorted fixes in test.py in preparation for cluster testing: - better logging - async search for unit test cases - ubuntu fixes test.py: highlight the failure cause test.py: clean up setting of scylla executable test.py: speed up search for tests cases, use async test.py: make case cache global test.py: make --cpus option work on Ubuntu test.py: create an own TestSuite instance for each path/mode combo test.py: do not fail entire run if list-content fails due to ASAN test.py: print subtest name on cancel test.py: fix flake8 complaints	2022-02-06 10:13:36 +02:00
Pavel Solodovnikov	dce3159156	gms: gossiper: coroutinize `wait_for_gossip` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:34:52 +03:00
Pavel Solodovnikov	ab41151a41	gms: gossiper: coroutinize `advertise_token_removed` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:33:32 +03:00
Pavel Solodovnikov	4416070f56	gms: gossiper: coroutinize `advertise_removing` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:33:13 +03:00
Pavel Solodovnikov	e9f5da9507	gms: gossiper: don't wrap `convict` calls into `seastar::async` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:32:14 +03:00
Pavel Solodovnikov	e26829e202	gms: gossiper: coroutinize `handle_major_state_change` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	705a759891	gms: gossiper: coroutinize `handle_shutdown_msg` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	9ce0e2efa3	gms: gossiper: coroutinize `mark_as_shutdown` and `convict` Since these two functions call each other, convert to coroutines and eliminate the dependency on `seastar::async` for both of them at the same time. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	c584a9cc1f	gms: gossiper: remove comment about requiring thread context in `mark_alive` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	ee30d0a385	gms: gossiper: don't use `seastar::async` in `mark_alive` Since `real_mark_alive` does not require `seastar::async` now, we can eliminate the wrapping async call, as well. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	529f4d0f98	gms: gossiper: coroutinize `do_on_change_notifications` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	37066039df	gms: gossiper: coroutinize `do_before_change_notifications` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	231d8a3ad4	gms: gossiper: coroutinize `real_mark_alive` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	c929f23b8d	gms: gossiper: coroutinize `mark_dead` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:20 +03:00
Piotr Dulikowski	80f6224959	utils: add exception_container Adds `exception_container` - a helper type used to hold exceptions as a value, without involving the std::exception_ptr. The motivation behind this type is that it allows inspecting exception's type and value without having to rethrow that exception and catch it, unlike std::exception_ptr. In our current codebase, some exception handling paths need to rethrow the exception multiple times in order to account for it in metrics or encode it as an error response to the CQL client. Some types of exceptions can be thrown very frequently in case of overload (e.g. timeouts) and inspecting those exceptions with rethrows can make the overload even worse. For those kinds of exceptions it is important to handle them as cheaply as possible, and exception_container used in conjunction with boost::outcome::result can help achieve that.	2022-02-04 20:18:00 +01:00
Konstantin Osipov	dee6da53b3	test.py: highlight the failure cause Use color palette to highlight the exception which aborted the harness.	2022-02-04 17:15:52 +03:00
Konstantin Osipov	56aaabfa31	test.py: clean up setting of scylla executable Now that suites are per mode, set scylla executable path once per suite, not once per test. Ditto for scylla env.	2022-02-04 17:15:52 +03:00
Konstantin Osipov	e9ec69494e	test.py: speed up search for tests cases, use async Search for test cases in parallel. This speeds up the search for test cases from 30 to 4-5 seconds in absence of test case cache and from 4 to 3 seconds if case cache is present.	2022-02-04 17:15:52 +03:00
Konstantin Osipov	45270f5ad2	test.py: make case cache global test.py runs each unit test's test case in a separate process. The list of test cases is built at start, by running --list-cases for each unit test. The output is cached, so that if one uses --repeat option, we don't list the cases again and again. The cache, however, was only useful for --repeat, because it was only caching the last test's output, not all tests' output, so if I, for example, run tests like: ./test.py foo bar foo .. the cache was unused. Make the cache global which simplifies its logic and makes it work in more cases.	2022-02-04 16:47:35 +03:00
Konstantin Osipov	445f90dc3b	test.py: make --cpus option work on Ubuntu The used API is only available in python3, so use it explicitly.	2022-02-04 16:16:17 +03:00
Konstantin Osipov	60fde39880	test.py: create an own TestSuite instance for each path/mode combo To run tests in a given mode we will need to start off scylla clusters, which we would want to pool and reuse between many tests. TestSuite class was designed to share resources of common tests. One can't pool together scylla servers compiled with different tests, so create an own TestSuite instance for each mode.	2022-02-04 16:16:17 +03:00
Konstantin Osipov	efd7b9f4a3	test.py: do not fail entire run if list-content fails due to ASAN If list-content of one test fails with ASAN error, do not abort.	2022-02-04 16:16:17 +03:00
Konstantin Osipov	c63e0ee271	test.py: print subtest name on cancel	2022-02-04 16:16:17 +03:00
Konstantin Osipov	8cc7c1a5bb	test.py: fix flake8 complaints It's good practice to use linters and style formatters for all scripted languages. Python community is more strict about formatting guidelines than others, and using formatters (like flake8 or black) is almost universally accepted. test.py was adhering to flake8 standards at some point, but later this was spoiled by random commits.	2022-02-04 16:16:17 +03:00
Raphael S. Carvalho	755cec1199	table: Close reader if flush fails to peek into fragment An OOM failure while peeking into fragment, to determine if reader will produce any fragment, causes Scylla to abort as flat_mutation_reader expects reader to be closed before destroyed. Let's close it if peek() fails, to handle the scenario more gracefully. Fixes #10027. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220204031553.124848-1-raphaelsc@scylladb.com>	2022-02-04 12:48:36 +02:00
Avi Kivity	fe65122ccd	Merge 'Distribute `select count()` queries' from Michał Sala This pull request speeds up execution of `count()` queries. It does so by splitting given query into sub-queries and distributing them across some group of nodes for parallel execution. New level of coordination was added. Node called super-coordinator splits aggregation query into sub-queries and distributes them across some group of coordinators. Super-coordinator is also responsible for merging results. To develop a mechanism for speeding up `count()` queries, there was a need to detect which queries have a `count()` selector. Due to this pull request being a proof of concept, detection was realized rather poorly. It only allows catching the simplest cases of `count()` queries (with only one selector and no column name specified). After detecting that a query is a `count()` it should be split into sub-queries and sent to other coordinators. Splitting part wasn't that difficult, it has been achieved by limiting original query's partition ranges. Sending modified query to another node was much harder. The easiest scenario would be to send whole `cql3::statements::select_statement`. Unfortunately `cql3::statements::select_statement` can't be [de]serialized, so sending it was out of the question. Even more unfortunately, some non-[de]serializable members of `cql3::statements::select_statement` are required to start the execution process of this statement. Finally, I have decided to send a `query::read_command` paired with required [de]serializable members. Objects that cannot be [de]serialized (such as query's selector) are mocked on the receiving end. When a super-coordinator receives a `count()` query, it splits it into sub-queries. It does so by splitting original query's partition ranges into list of vnodes, grouping them by their owner and creating sub-queries with partition ranges set to successive results of such grouping. 
After creation, each sub-query is sent to the owner of its partition ranges. Owner dispatches received sub-query to all of its shards. Shards slice partition ranges of the received sub-query, so that they will only query data that is owned by them. Each shard becomes a coordinator and executes so prepared sub-query. 3 node cluster set up on powerful desktops located in the office (3x32 cores) Filled the cluster with ~2 10^8 rows using scylla-bench and run: ``` time cqlsh <ip> <port> --request-timeout=3600 -e "select count() from scylla_bench.test using timeout 1h;" ``` master: 68s * this branch: 2s 3 node cluster (each node had 2 shards, `murmur3_ignore_msb_bits` was set to 1, `num_tokens` was set to 3) ``` > cqlsh -e 'tracing on; select count() from ks.t; Now Tracing is enabled count ------- 1000 (1 rows) Tracing session: e5852020-7fc3-11ec-8600-4c4c210dd657 activity \| timestamp \| source \| source_elapsed \| client ---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2022-01-27 22:53:08.770000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 1] \| 2022-01-27 22:53:08.770451 \| 127.0.0.1 \| -- \| 127.0.0.1 Processing a statement [shard 1] \| 2022-01-27 22:53:08.770487 \| 127.0.0.1 \| 36 \| 127.0.0.1 Dispatching forward_request to 3 endpoints [shard 1] \| 2022-01-27 22:53:08.770509 \| 127.0.0.1 \| 58 \| 127.0.0.1 Sending forward_request to 127.0.0.1:0 [shard 1] \| 2022-01-27 22:53:08.770516 \| 127.0.0.1 \| 64 \| 127.0.0.1 Executing forward_request [shard 1] \| 2022-01-27 22:53:08.770519 \| 127.0.0.1 \| -- \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770528 \| 127.0.0.1 \| 9 \| 127.0.0.1 Start querying token range ({-4242912715832118944, end}, {-4075408479358018994, end}] [shard 1] \| 2022-01-27 22:53:08.770531 \| 127.0.0.1 \| 12 \| 127.0.0.1 
Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770537 \| 127.0.0.1 \| 18 \| 127.0.0.1 Scanning cache for range ({-4242912715832118944, end}, {-4075408479358018994, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770541 \| 127.0.0.1 \| 22 \| 127.0.0.1 Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.770589 \| 127.0.0.1 \| 70 \| 127.0.0.1 Sending forward_request to 127.0.0.2:0 [shard 1] \| 2022-01-27 22:53:08.770600 \| 127.0.0.1 \| 149 \| 127.0.0.1 Sending forward_request to 127.0.0.3:0 [shard 1] \| 2022-01-27 22:53:08.770608 \| 127.0.0.1 \| 157 \| 127.0.0.1 Executing forward_request [shard 0] \| 2022-01-27 22:53:08.770627 \| 127.0.0.1 \| -- \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.770639 \| 127.0.0.1 \| 11 \| 127.0.0.1 Start querying token range ({2507462623645193091, end}, {3897266736829642805, end}] [shard 0] \| 2022-01-27 22:53:08.770643 \| 127.0.0.1 \| 15 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.770646 \| 127.0.0.1 \| 19 \| 127.0.0.1 Scanning cache for range ({2507462623645193091, end}, {3897266736829642805, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.770649 \| 127.0.0.1 \| 22 \| 127.0.0.1 Executing forward_request [shard 1] \| 2022-01-27 22:53:08.770658 \| 127.0.0.2 \| -- \| 127.0.0.1 Executing forward_request [shard 1] \| 2022-01-27 22:53:08.770674 \| 127.0.0.3 \| 5 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770698 \| 127.0.0.2 \| 40 \| 127.0.0.1 Start querying token range [{4611686018427387904, start}, {5592106830937975806, end}] [shard 1] \| 2022-01-27 22:53:08.770704 \| 127.0.0.2 \| 46 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770710 \| 127.0.0.2 \| 52 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770712 \| 127.0.0.3 \| 43 \| 
127.0.0.1 Scanning cache for range [{4611686018427387904, start}, {5592106830937975806, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770714 \| 127.0.0.2 \| 56 \| 127.0.0.1 Start querying token range [{-4611686018427387904, start}, {-4242912715832118944, end}] [shard 1] \| 2022-01-27 22:53:08.770718 \| 127.0.0.3 \| 49 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770739 \| 127.0.0.3 \| 70 \| 127.0.0.1 Scanning cache for range [{-4611686018427387904, start}, {-4242912715832118944, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770743 \| 127.0.0.3 \| 73 \| 127.0.0.1 Page stats: 17 partition(s), 0 static row(s) (0 live, 0 dead), 17 clustering row(s) (17 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.770814 \| 127.0.0.3 \| 145 \| 127.0.0.1 Executing forward_request [shard 0] \| 2022-01-27 22:53:08.770846 \| 127.0.0.3 \| -- \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.770862 \| 127.0.0.3 \| 16 \| 127.0.0.1 Page stats: 71 partition(s), 0 static row(s) (0 live, 0 dead), 71 clustering row(s) (71 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.770865 \| 127.0.0.1 \| 238 \| 127.0.0.1 Start querying token range ({-6683686776653114062, end}, {-6473446911791631266, end}] [shard 0] \| 2022-01-27 22:53:08.770867 \| 127.0.0.3 \| 21 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.770874 \| 127.0.0.3 \| 28 \| 127.0.0.1 Scanning cache for range ({-6683686776653114062, end}, {-6473446911791631266, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.770879 \| 127.0.0.3 \| 33 \| 127.0.0.1 Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.770880 \| 127.0.0.2 \| 222 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.770888 \| 127.0.0.1 \| 369 \| 127.0.0.1 read_data: querying 
locally [shard 1] \| 2022-01-27 22:53:08.770909 \| 127.0.0.1 \| 390 \| 127.0.0.1 Start querying token range ({-4075408479358018994, end}, {-3391415989210253693, end}] [shard 1] \| 2022-01-27 22:53:08.770911 \| 127.0.0.1 \| 392 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770914 \| 127.0.0.1 \| 395 \| 127.0.0.1 Scanning cache for range ({-4075408479358018994, end}, {-3391415989210253693, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770936 \| 127.0.0.1 \| 418 \| 127.0.0.1 Executing forward_request [shard 0] \| 2022-01-27 22:53:08.770951 \| 127.0.0.2 \| -- \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.770966 \| 127.0.0.2 \| 15 \| 127.0.0.1 Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.770969 \| 127.0.0.3 \| 123 \| 127.0.0.1 Start querying token range (-inf, {-6683686776653114062, end}] [shard 0] \| 2022-01-27 22:53:08.770969 \| 127.0.0.2 \| 18 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.770974 \| 127.0.0.2 \| 23 \| 127.0.0.1 Scanning cache for range (-inf, {-6683686776653114062, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.770977 \| 127.0.0.2 \| 26 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.770993 \| 127.0.0.3 \| 324 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770998 \| 127.0.0.3 \| 329 \| 127.0.0.1 Start querying token range ({-3391415989210253693, end}, {0, start}) [shard 1] \| 2022-01-27 22:53:08.771001 \| 127.0.0.3 \| 332 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.771004 \| 127.0.0.3 \| 335 \| 127.0.0.1 Scanning cache for range ({-3391415989210253693, end}, {0, start}) and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.771007 \| 127.0.0.3 \| 338 \| 127.0.0.1 Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 
clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.771044 \| 127.0.0.1 \| 525 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.771069 \| 127.0.0.1 \| 442 \| 127.0.0.1 On shard execution result is [71] [shard 0] \| 2022-01-27 22:53:08.771145 \| 127.0.0.1 \| 518 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.771308 \| 127.0.0.1 \| 789 \| 127.0.0.1 On shard execution result is [60] [shard 1] \| 2022-01-27 22:53:08.771351 \| 127.0.0.1 \| 832 \| 127.0.0.1 Page stats: 127 partition(s), 0 static row(s) (0 live, 0 dead), 127 clustering row(s) (127 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.771379 \| 127.0.0.2 \| 427 \| 127.0.0.1 Page stats: 183 partition(s), 0 static row(s) (0 live, 0 dead), 183 clustering row(s) (183 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.771385 \| 127.0.0.3 \| 716 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.771402 \| 127.0.0.3 \| 556 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.771403 \| 127.0.0.2 \| 745 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.771408 \| 127.0.0.2 \| 750 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.771409 \| 127.0.0.3 \| 563 \| 127.0.0.1 Start querying token range ({5592106830937975806, end}, +inf) [shard 1] \| 2022-01-27 22:53:08.771411 \| 127.0.0.2 \| 754 \| 127.0.0.1 Start querying token range ({-6272011798787969456, end}, {-4611686018427387904, start}) [shard 0] \| 2022-01-27 22:53:08.771412 \| 127.0.0.3 \| 566 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.771415 \| 127.0.0.3 \| 569 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.771415 \| 127.0.0.2 \| 757 \| 127.0.0.1 Scanning cache for range ({5592106830937975806, end}, +inf) and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.771419 \| 127.0.0.2 \| 761 \| 127.0.0.1 Scanning cache 
for range ({-6272011798787969456, end}, {-4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.771419 \| 127.0.0.3 \| 573 \| 127.0.0.1 Received forward_result=[131] from 127.0.0.1:0 [shard 1] \| 2022-01-27 22:53:08.771454 \| 127.0.0.1 \| 1003 \| 127.0.0.1 Page stats: 74 partition(s), 0 static row(s) (0 live, 0 dead), 74 clustering row(s) (74 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.771764 \| 127.0.0.3 \| 918 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.771768 \| 127.0.0.3 \| 922 \| 127.0.0.1 Start querying token range [{0, start}, {2507462623645193091, end}] [shard 0] \| 2022-01-27 22:53:08.771771 \| 127.0.0.3 \| 925 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.771775 \| 127.0.0.3 \| 929 \| 127.0.0.1 Scanning cache for range [{0, start}, {2507462623645193091, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.771779 \| 127.0.0.3 \| 933 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.771935 \| 127.0.0.3 \| 1265 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.771950 \| 127.0.0.2 \| 998 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.771956 \| 127.0.0.2 \| 1004 \| 127.0.0.1 Start querying token range ({-6473446911791631266, end}, {-6272011798787969456, end}] [shard 0] \| 2022-01-27 22:53:08.771959 \| 127.0.0.2 \| 1008 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.771963 \| 127.0.0.2 \| 1011 \| 127.0.0.1 Scanning cache for range ({-6473446911791631266, end}, {-6272011798787969456, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.771966 \| 127.0.0.2 \| 1014 \| 127.0.0.1 Page stats: 13 partition(s), 0 static row(s) (0 live, 0 dead), 13 clustering row(s) (13 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.772008 \| 127.0.0.2 \| 1057 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 
22:53:08.772012 \| 127.0.0.2 \| 1061 \| 127.0.0.1 Start querying token range ({3897266736829642805, end}, {4611686018427387904, start}) [shard 0] \| 2022-01-27 22:53:08.772014 \| 127.0.0.2 \| 1063 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.772016 \| 127.0.0.2 \| 1065 \| 127.0.0.1 Scanning cache for range ({3897266736829642805, end}, {4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.772019 \| 127.0.0.2 \| 1067 \| 127.0.0.1 On shard execution result is [200] [shard 1] \| 2022-01-27 22:53:08.772053 \| 127.0.0.3 \| 1384 \| 127.0.0.1 Page stats: 56 partition(s), 0 static row(s) (0 live, 0 dead), 56 clustering row(s) (56 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.772138 \| 127.0.0.2 \| 1186 \| 127.0.0.1 Page stats: 190 partition(s), 0 static row(s) (0 live, 0 dead), 190 clustering row(s) (190 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.772364 \| 127.0.0.2 \| 1706 \| 127.0.0.1 Page stats: 149 partition(s), 0 static row(s) (0 live, 0 dead), 149 clustering row(s) (149 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.772407 \| 127.0.0.3 \| 1561 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772417 \| 127.0.0.3 \| 1571 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.772418 \| 127.0.0.2 \| 1760 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772426 \| 127.0.0.2 \| 1475 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772428 \| 127.0.0.2 \| 1476 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772449 \| 127.0.0.3 \| 1604 \| 127.0.0.1 On shard execution result is [196] [shard 0] \| 2022-01-27 22:53:08.772555 \| 127.0.0.2 \| 1603 \| 127.0.0.1 On shard execution result is [238] [shard 1] \| 2022-01-27 22:53:08.772674 \| 127.0.0.2 \| 2016 \| 127.0.0.1 On shard execution result is [235] [shard 0] \| 2022-01-27 22:53:08.772770 \| 127.0.0.3 \| 1924 \| 
127.0.0.1 Received forward_result=[435] from 127.0.0.3:0 [shard 1] \| 2022-01-27 22:53:08.772933 \| 127.0.0.1 \| 2482 \| 127.0.0.1 Received forward_result=[434] from 127.0.0.2:0 [shard 1] \| 2022-01-27 22:53:08.773110 \| 127.0.0.1 \| 2658 \| 127.0.0.1 Merged result is [1000] [shard 1] \| 2022-01-27 22:53:08.773111 \| 127.0.0.1 \| 2660 \| 127.0.0.1 Done processing - preparing a result [shard 1] \| 2022-01-27 22:53:08.773114 \| 127.0.0.1 \| 2663 \| 127.0.0.1 Request complete \| 2022-01-27 22:53:08.772666 \| 127.0.0.1 \| 2666 \| 127.0.0.1 ``` Fixes #1385 Closes #9209 github.com:scylladb/scylla: docs: add parallel aggregations design doc db: config: add a flag to disable new parallelized aggregation algorithm test: add parallelized select count test forward_service: add metrics forward_service: parallelize execution across shards forward_service: add tracing cql3: statements: introduce parallelized_select_statement cql3: query_processor: add forward_service reference to query_processor gms: add PARALLELIZED_AGGREGATION feature service: introduce forward_service storage_proxy: extract query_ranges_to_vnodes_generator to a separate file messaging_service: add verb for count() request forwarding cql3: selection: detect if a selection represents count()	2022-02-04 12:34:19 +02:00
Nadav Har'El	b54e85088d	Merge 'snapshots: Fix snapshot-ctl to include snapshots of dropped tables' from Benny Halevy Snapshot-ctl methods fetch information about snapshots from column family objects. The problem with this is that we get rid of these objects once the table gets dropped, while the snapshots might still be present (the auto_snapshot option is specifically made to create this kind of situation). This commit switches from relying on column family interface to scanning every datadir that the database knows of in search for "snapshots" folders. This PR is a rebased version of #9539 (and slightly cleaned-up, cosmetically) and so it replaces the previous PR. Fixes #3463 Closes #7122 Closes #9884 * github.com:scylladb/scylla: snapshots: Fix snapshot-ctl to include snapshots of dropped tables table: snapshot: add debug messages	2022-02-04 12:34:19 +02:00
Piotr Sarna	c613d1ce87	alternator: migrate expression parsers to string_view Following the advice in the FIXME note, helper functions for parsing expressions are now based on string views to avoid a few unnecessary conversions to std::string. Tests: unit(dev) Closes #10013	2022-02-04 12:34:19 +02:00
Nadav Har'El	87e48d61a7	build: rebuild relocatable packages if version changed In commit `d72465531e` we fixed the building of relocatable packages of submodules (tools/java, etc.) to use the top-level Scylla's version. However, if on an active working directory Scylla's version changes - as we just did from 4.7 to 5.0 - these relocatable packages are not rebuilt with the new version number, and as a result some of our scripts (such as the docker build) can't find them. Because the build-submodule-reloc rule depends on the files build/SCYLLA-{PRODUCT,VERSION,RELEASE}-FILE (which is what the aforementioned commit did), in this patch we add those files as a dependency whenever build-submodule-reloc is used. This means that if any of these files change, we rebuild the relocatable packages and anything depending on them (e.g., Debian packages). Fixes #10018. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220202131248.1610678-1-nyh@scylladb.com>	2022-02-03 10:19:15 +02:00
Botond Dénes	996e2f8048	Merge 'Handle serialized_action trigger exceptions' from Benny Halevy " which is currently unhandled from multiple call sites, leading to the following warning as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/1094/artifact/logs-all.release.2/1643794928169_materialized_views_test.py%3A%3ATestInterruptBuildProcess%3A%3Atest_interrupt_build_process_and_resharding_half_to_max_test/node2.log ``` Scylla version 5.0.dev-0.20220201.a026b4ef4 with build-id cebf6dca8edd8df843a07e0f01a1573f1d0a6dfc starting ... WARN 2022-02-02 09:31:56,616 [shard 2] seastar - Exceptional future ignored: seastar::sleep_aborted (Sleep is aborted), backtrace: 0x463b65e 0x463bb50 0x463be58 0x426c165 0x230c744 0x42adad4 0x42aeea7 0x42cdb55 0x4281a2a /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libpthread.so.0+0x9298 /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libc.so.6+0x100352 -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false> >(seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> ``` Decoded: ``` void seastar::backtrace(seastar::current_backtrace_tasklocal()::$_3&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59 (inlined by) seastar::current_backtrace_tasklocal() at 
./build/release/seastar/./seastar/src/util/backtrace.cc:86 seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:137 seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:170 seastar::report_failed_future(std::__exception_ptr::exception_ptr const&) at ./build/release/seastar/./seastar/src/core/future.cc:210 (inlined by) seastar::report_failed_future(seastar::future_state_base::any&&) at ./build/release/seastar/./seastar/src/core/future.cc:218 seastar::future_state_base::any::check_failure() at ././seastar/include/seastar/core/future.hh:567 (inlined by) seastar::future_state::clear() at ././seastar/include/seastar/core/future.hh:609 (inlined by) ~future_state at ././seastar/include/seastar/core/future.hh:614 (inlined by) ~future at ././seastar/include/seastar/core/scheduling.hh:43 (inlined by) void seastar::futurize >::satisfy_with_result_of::then_wrapped_nrvo, seastar::future::finally_body >(seastar::future::finally_body&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}::operator()(seastar::internal::promise_base_with_type, seastar::internal::promise_base_with_type&&, seastar::future_state::finally_body&&::monostate>) const::{lambda()#1}>(seastar::internal::promise_base_with_type, seastar::future::finally_body&&) at ././seastar/include/seastar/core/future.hh:2120 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1667 (inlined by) seastar::continuation, seastar::future::finally_body, seastar::future::then_wrapped_nrvo, serialized_action::trigger(bool)::{lambda()#2}>(serialized_action::trigger(bool)::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767 seastar::reactor::run_tasks(seastar::reactor::task_queue&) at 
./build/release/seastar/./seastar/src/core/reactor.cc:2344 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2754 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2923 operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4128 (inlined by) void std::__invoke_impl(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:61 (inlined by) std::enable_if, void>::type std::__invoke_r(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:111 (inlined by) std::_Function_handler::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:291 std::function::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:560 (inlined by) seastar::posix_thread::start_routine(void) at ./build/release/seastar/./seastar/src/core/posix.cc:60 ``` This series handles exception handling to serialized actions triggers that don't handle exceptions. Test: unit(dev) " tag 'handle-serialized_action-trigger-exception-v1' of https://github.com/bhalevy/scylla: migration_manager: passive_announce(version): handle exception view_builder: do_build_step: handle unexpected exceptions storage_service: no need to include utils/serialized_action.hh	2022-02-03 10:17:59 +02:00
Yaron Kaikov	e6ea0e04ed	release: prepare for 5.1.dev	2022-02-03 08:11:24 +02:00
Calle Wilund	1e66043412	commitlog: Fix double clearing of _segment_allocating shared_future. Fixes #10020 Previous fix `445e1d3` tried to close one double invocation, but added another, since it failed to ensure all potential nullings of the opt shared_future happened before a new allocator could reset it. This simplifies the code by making clearing the shared_future a pre-requisite for resolving its contents (as read by waiters). Also removes any need for try-catch etc. Closes #10024	2022-02-02 23:26:17 +02:00
Nadav Har'El	cb6630040d	docker: don't repeat "--alternator-address" option twice If the Docker startup script is passed both "--alternator-port" and "--alternator-https-port", a combination which is supposed to be allowed, it passes to Scylla the "--alternator-address" option twice. This isn't necessary, and worse - not allowed. So this patch fixes the scyllasetup.py script to only pass this parameter once. Fixes #10016. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>	2022-02-02 23:26:11 +02:00
Michał Sala	4903f7a314	docs: add parallel aggregations design doc Added document describes the design of a mechanism that parallelizes execution of aggregation queries.	2022-02-02 17:52:22 +01:00
Benny Halevy	b94c9ed3e6	migration_manager: passive_announce(version): handle exception Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-02 14:54:19 +02:00
Benny Halevy	b56b10a4bb	view_builder: do_build_step: handle unexpected exceptions Exceptions are handled by do_build_step in principle, yet if an unhandled exception escapes handling (e.g. get_units(_sem, 1) fails on a broken semaphore) we should warn about it since the _build_step.trigger() calls do not handle exceptions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-02 14:54:19 +02:00
Benny Halevy	71a9524175	storage_service: no need to include utils/serialized_action.hh	2022-02-02 14:42:05 +02:00
Botond Dénes	d309a86708	Merge 'Add keyspace_offstrategy_compaction api' from Benny Halevy This series adds methods to perform offstrategy compaction, if needed, returning a future<bool> so the caller can wait on it until compaction completes. The returned value is true iff offstrategy compaction was needed. The added keyspace_offstrategy_compaction calls perform_offstrategy_compaction on the specified keyspace and tables, and returns the number of tables that required offstrategy compaction. A respective unit test was added to the rest_api pytest. This PR replaces https://github.com/scylladb/scylla/pull/9095 that suggested adding an option to `keyspace_compaction`, since the offstrategy compaction triggering logic is different enough from major compaction to merit a new api. Test: unit (dev) Closes #9980 * github.com:scylladb/scylla: test: rest_api: add unit tests for keyspace_offstrategy_compaction api api: add keyspace_offstrategy_compaction compaction_manager: get rid of submit_offstrategy table: add perform_offstrategy_compaction compaction_manager: perform_offstrategy: print ks.cf in log messages compaction_manager: allow waiting on offstrategy compaction	2022-02-02 13:15:31 +02:00
Nadav Har'El	79776ff2ff	alternator: fix error handling during Alternator startup A recent restructuring of the startup of Alternator (and also other protocol servers) led to incorrect error-handling behavior during startup: If an error was detected on one of the shards of the sharded service (in alternator/server.cc), the sharded service itself was never stopped (in alternator/controller.cc), leading to an assertion failure instead of the desired error message. A common example of this problem is when the requested port for the server was already taken (this was issue #9914). So in this patch, exception handling is removed from server.cc - the exception will propagate to the code in controller.cc, which will properly stop the server (including the sharded services) before returning. Fixes #9914. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220130131709.1166716-1-nyh@scylladb.com>	2022-02-02 10:35:57 +01:00
Piotr Wojtczak	0dd7739716	snapshots: Fix snapshot-ctl to include snapshots of dropped tables Snapshot-ctl methods fetch information about snapshots from column family objects. The problem with this is that we get rid of these objects once the table gets dropped, while the snapshots might still be present (the auto_snapshot option is specifically made to create this kind of situation). This commit switches from relying on column family interface to scanning every datadir that the database knows of in search for "snapshots" folders. Fixes #3463 Closes #7122 Closes #9884 Signed-off-by: Piotr Wojtczak <piotr.m.wojtczak@gmail.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-01 22:31:43 +02:00
Benny Halevy	2a90896b79	table: snapshot: add debug messages Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-01 22:31:37 +02:00
Michał Sala	b439d6e710	db: config: add a flag to disable new parallelized aggregation algorithm Just in case the new algorithm turns out to be buggy, add a flag to fall-back to the old algorithm.	2022-02-01 21:26:25 +01:00
Michał Sala	140bab279c	test: add parallelized select count test Added test that checks if a SELECT COUNT(*) query was transformed and processed in a parallel way. Checking is done by looking at the cql statistics and comparing subsequent counts of parallelized aggregation SELECT query executions.	2022-02-01 21:14:41 +01:00
Michał Sala	e6e9553b4a	forward_service: add metrics Introduces metrics for `forward_service`. 3 counters were created, which allow checking how many requests have been dispatched or executed.	2022-02-01 21:14:41 +01:00
Michał Sala	354f7a1c34	forward_service: parallelize execution across shards Coordinators processed each vnode sequentially on shards when executing a `forward_request` sent by super-coordinator. This commit changes this behavior and parallelizes execution of `forward_request` across shards. It does that by adding additional layer of dispatching to `forward_service`. When a coordinator receives a `forward_request`, it forwards it to each of its shards. Shards slice `forward_request`'s partition ranges so that they will only query data that is owned by them. Implementation of slicing partition ranges was based on @nyh's `token_ranges_owned_by_this_shard` from `alternator/ttl.cc`.	2022-02-01 21:14:41 +01:00
Michał Sala	aec96be553	forward_service: add tracing	2022-02-01 21:14:41 +01:00
Michał Sala	f344bd0aaa	cql3: statements: introduce parallelized_select_statement Detect whether a statement is a count() query in prepare time. If so, instantiate a new `select_statement` subclass - `parallelized_select_statement`. This subclass has a different execution logic, that enables it to distribute count() queries across a cluster. Also, a new counter was added - `select_parallelized` that counts the number of parallelized aggregation SELECT query executions.	2022-02-01 21:14:41 +01:00
Michał Sala	66a93d3000	cql3: query_processor: add forward_service reference to query_processor	2022-02-01 21:14:41 +01:00
Michał Sala	3789a4d02b	gms: add PARALLELIZED_AGGREGATION feature This new feature will be used to determine whether the whole cluster is ready to parallelize execution of aggregation queries.	2022-02-01 21:14:41 +01:00
Michał Sala	a6cf3f52bd	service: introduce forward_service The new service is responsible for: * spreading forward_request execution across multiple nodes in the cluster * collecting forward_request execution results and merging them `forward_service::dispatch` method takes forward_request as an argument, and forwards its execution to a group of other nodes (using the rpc verb added in previous commits). Each node (in the group chosen by the dispatch method) is provided with a forward_request, which is no different from the original argument except for changed partition ranges. They are changed so that the vnodes contained in them are owned by the recipient node. Executing forward_request is realized in the `forward_service::execute` method, which is registered to be called on FORWARD_REQUEST verb receipt. The process of executing forward_request consists of mocking a few non-serializable objects (such as `cql3::selection`) in order to create `service::pager::query_pagers::pager` and `cql3::selection::result_set_builder`. After pager and result_set_builder creation, the execution process resembles what might be seen in select_statement's execution path.	2022-02-01 21:14:41 +01:00
Michał Sala	0fe59082ec	storage_proxy: extract query_ranges_to_vnodes_generator to a separate file Such separation allows using query_ranges_to_vnodes_generator by other services without needing a storage_proxy dependency.	2022-02-01 21:14:41 +01:00
Michał Sala	fff454761a	messaging_service: add verb for count() request forwarding Except for the verb addition, this commit also defines forward_request and forward_result structures, used as an argument and result of the new rpc. forward_request is used to forward information about select statement that does count() (or other aggregating functions such as max, min, avg in the future). Due to the inability to serialize cql3::statements::select_statement, I chose to include query::read_command, dht::partition_range_vector and some configuration options in forward_request. They can be serialized and are sufficient enough to allow creation of service::pager::query_pagers::pager.	2022-02-01 21:14:41 +01:00
Michał Sala	bb7edf3785	cql3: selection: detect if a selection represents count() The way that this detection works is a bit clunky, but it does its job given the simplest cases e.g. "SELECT COUNT() FROM ks.t". It fails when there are multiple selectors, or when there is a column name specified ("SELECT COUNT(column_name) FROM ks.t").	2022-02-01 21:14:41 +01:00
Pavel Emelyanov	a026b4ef49	config: Add option to disable config updates via CQL The system.config table allows changing config parameters, but this change doesn't survive restarts and is considered to be dangerous (sometimes). Add an option to disable the table updates. The option is LiveUpdate and can be set to false via CQL too (once). fixes #9976 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220201121114.32503-1-xemul@scylladb.com>	2022-02-01 14:30:47 +02:00
Takuya ASADA	c2ccdac297	move cloud related code from scylla repository to scylla-machine-image Currently, cloud related code has cross-dependencies between scylla and scylla-machine-image. This is not a good way to implement it, and a single change can break both packages. To resolve the issue, we need to move all cloud related code to scylla-machine-image, and remove it from the scylla repository. Change list: - move cloud part of scylla_util.py to scylla-machine-image - move cloud part of scylla_io_setup to scylla-machine-image - move scylla_ec2_check to scylla-machine-image - move cloud part of scylla_bootparam_setup to scylla-machine-image Closes #9957	2022-02-01 11:26:59 +02:00
Tomasz Grabiec	00a9326ae7	Merge "raft: let `modify_config` finish on a follower that removes itself" from Kamil When forwarding a reconfiguration request from follower to a leader in `modify_config`, there is no reason to wait for the follower's commit index to be updated. The only useful information is that the leader committed the configuration change - so `modify_config` should return as soon as we know that. There is a reason not to wait for the follower's commit index to be updated: if the configuration change removes the follower, the follower will never learn about it, so a local waiter will never be resolved. `execute_modify_config` - the part of `modify_config` executed on the leader - is thus modified to finish when the configuration change is fully complete (including the dummy entry appended at the end), and `modify_config` - which does the forwarding - no longer creates a local waiter, but returns as soon as the RPC call to the leader confirms that the entry was committed on the leader. We still return an `entry_id` from `execute_modify_config` but that's just an artifact of the implementation. Fixes #9981. A regression test was also added in randomized_nemesis_test. * kbr/modify-config-finishes-v1: test: raft: randomized_nemesis_test: regression test for #9981 raft: server: don't create local waiter in `modify_config`	2022-01-31 20:14:50 +01:00
Kamil Braun	97ff98f3a7	service: migration_manager: retry schema change command on transient failures The call to `raft::server::add_entry` in `announce_with_raft` may fail e.g. due to a leader change happening when we try to commit the entry. In cases like this it makes sense to retry the command so we don't prematurely report an error to the client. This may result in double application of the command. Fortunately, the schema change command is idempotent thanks to the group 0 state ID mechanism (originally used to prevent conflicting concurrent changes from happening). Indeed, once a command passes the state ID check, it changes the group 0 history last state ID, causing all later applications of that same command to fail the check. Similarly, once a command fails the state ID check, it means that the last state ID is different than the one observed when the command was being constructed, so all further applications of the command will also fail the check (it is not possible for the last state ID to change from X to Y then back to X). Note that this reasoning only works for commands with `prev_state_id` engaged, such as the ones which we're using in `migration_manager::announce_with_raft`. It would not work with "unconditional commands" where `prev_state_id` is `nullopt` - for those commands no state ID check is performed. It could still be safe to retry those commands if they are idempotent for a different reason. (Note: actually, our schema commands are already idempotent even without the state ID check, because they simply apply a set of mutations, and applying the same mutations twice is the same as applying them once.) Message-Id: <20220131152926.18087-1-kbraun@scylladb.com>	2022-01-31 19:49:31 +01:00
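The idempotency argument in the message above can be illustrated with a small sketch. This is a toy model in Python, not ScyllaDB code: `Group0`, `Command`, and their fields are hypothetical names chosen for illustration, and only the state ID check itself is modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    # Hypothetical stand-in for a group 0 schema change command.
    prev_state_id: Optional[str]  # None models an "unconditional" command
    new_state_id: str
    mutations: dict


@dataclass
class Group0:
    # Hypothetical toy model of the group 0 state machine.
    last_state_id: str = "s0"
    schema: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply(self, cmd: Command) -> bool:
        # State ID check: reject the command if the state has moved on
        # since the command was constructed.
        if cmd.prev_state_id is not None and cmd.prev_state_id != self.last_state_id:
            return False
        self.schema.update(cmd.mutations)
        self.last_state_id = cmd.new_state_id
        self.history.append(cmd.new_state_id)
        return True


group0 = Group0()
cmd = Command(prev_state_id="s0", new_state_id="s1", mutations={"ks.t": "v1"})
first = group0.apply(cmd)    # passes the check and advances the last state ID
retried = group0.apply(cmd)  # same command again: fails the check, no effect
```

Once the first application succeeds, the last state ID is no longer `s0`, so a retried copy of the same command is rejected and the history grows by exactly one entry.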
Takuya ASADA	218dd3851c	scylla_swap_setup: add --swap-size-bytes Currently, --swap-size is not able to specify an exact file size because the option takes its parameter only in GB. To fix the limitation, let's add --swap-size-bytes to specify the swap size in bytes. We need this to implement swapfile preallocation while building the IaaS image. see scylladb/scylla-machine-image#285 Closes #9971	2022-01-31 18:32:32 +02:00
Benny Halevy	4272dd0b28	storage_proxy: mutate_counter_on_leader_and_replicate: use container to get to shard proxy Rather than using the global helper, get_local_storage_proxy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220131151516.3461049-2-bhalevy@scylladb.com>	2022-01-31 18:14:31 +02:00
Benny Halevy	8acdc6ebdc	storage_proxy: paxos: don't use global storage_proxy Rather than calling get_local_storage_proxy(), use paxos_response_handler::_proxy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220131151516.3461049-1-bhalevy@scylladb.com>	2022-01-31 18:14:31 +02:00
Calle Wilund	445e1d3e41	commitlog: Ensure we never have more than one new_segment call at a time Refs #9896 Found by @eliransin. Call to new_segment was wrapped in with_timeout. This means that if primary caller timed out, we would leave new_segment calls running, but potentially issue new ones for next caller. This could lead to reserve segment queue being read simultaneously. And it is not what we want. Change to always use the shared_future wait, all callers, and clear it only on result (exception or segment) Closes #10001	2022-01-31 16:50:22 +02:00
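The pattern described above (one shared in-flight allocation that every caller waits on, cleared only when it yields a result) can be sketched with Python's asyncio as an analogy. This is not the Seastar implementation: `SegmentAllocator` and `demo` are hypothetical names, and `asyncio.shield` stands in for waiting on a shared future with a per-caller timeout.

```python
import asyncio


class SegmentAllocator:
    # Hypothetical analogue of the commitlog segment manager.
    def __init__(self, allocate):
        self._allocate = allocate  # coroutine function producing a segment
        self._pending = None       # the single in-flight allocation, if any

    async def _run(self):
        try:
            return await self._allocate()
        finally:
            # Cleared only once the allocation has a result (value or exception).
            self._pending = None

    async def new_segment(self, timeout):
        if self._pending is None:
            self._pending = asyncio.ensure_future(self._run())
        # shield(): a waiter that times out stops waiting without cancelling
        # the shared allocation, so later callers reuse it instead of
        # starting a second one.
        return await asyncio.wait_for(asyncio.shield(self._pending), timeout)


async def demo():
    calls = []

    async def slow_allocate():
        calls.append(1)
        await asyncio.sleep(0.05)
        return "segment"

    allocator = SegmentAllocator(slow_allocate)
    results = await asyncio.gather(
        allocator.new_segment(timeout=0.01),  # this caller times out
        allocator.new_segment(timeout=1.0),   # this caller gets the segment
        return_exceptions=True,
    )
    return results, len(calls)
```

In `demo`, the first caller times out while the second still receives the segment, and the underlying allocation runs only once.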
Nadav Har'El	8a745593a2	Merge 'alternator: fill UnprocessedKeys for failed batch reads' from Piotr Sarna DynamoDB protocol specifies that when getting items in a batch failed only partially, unprocessed keys can be returned so that the user can perform a retry. Alternator used to fail the whole request if any of the reads failed, but right now it instead produces the list of unprocessed keys and returns them to the user, as long as at least 1 read was successful. This series comes with a test based on Scylla's error injection mechanism, and thus is only useful in modes which come with error injection compiled in. In release mode, expect to see the following message: SKIPPED (Error injection not enabled in Scylla - try compiling in dev/debug/sanitize mode) Fixes #9984 Closes #9986 * github.com:scylladb/scylla: test: add total failure case for GetBatchItem test: add error injection case for GetBatchItem test: add a context manager for error injection to alternator alternator: add error injection to BatchGetItem alternator: fill UnprocessedKeys for failed batch reads	2022-01-31 15:28:24 +02:00
Piotr Sarna	c87126198d	test: add total failure case for GetBatchItem The test verifies that if all reads from a batch operation failed, the result is an error, and not a success response with UnprocessedKeys parameter set to all keys.	2022-01-31 14:21:55 +01:00
Piotr Sarna	e79c2943fc	test: add error injection case for GetBatchItem The new test case is based on Scylla error injection mechanism and forces a partial read by failing some requests from the batch.	2022-01-31 14:21:55 +01:00
Piotr Sarna	99c5bec0e2	test: add a context manager for error injection to alternator With the new context manager it's now easier to request an error to be injected via REST API. Note that error injection is only enabled in certain build modes (dev, debug, sanitize) and the test case will be skipped if it's not possible to use this mechanism.	2022-01-31 14:21:55 +01:00
Tomasz Grabiec	8297ae531d	Merge "Automatically retry CQL DDL statements in presence of concurrent changes" from Kamil Schema changes on top of Raft do not allow concurrent changes. If two changes are attempted concurrently, one of them gets `group0_concurrent_modification` exception. Catch the exception in CQL DDL statement execution function and retry. In addition, improve the description of CQL DDL statements in group 0 history table. Add a test which checks that group 0 history grows iff a schema change does not throw `group0_concurrent_modification`. Also check that the retry mechanism works as expected. * kbr/ddl-retry-v1: test: unit test for group 0 concurrent change protection and CQL DDL retries cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes	2022-01-31 14:12:35 +01:00
Tomasz Grabiec	b78bab7286	Merge "raft: fixes and improvements to the library and nemesis test" from Kamil Raft randomized nemesis test was improved by adding some more chaos: randomizing the network delay, server configuration, ticking speed of servers. This allowed to catch a serious bug, which is fixed in the first patch. The patchset also fixes bugs in the test itself and adds quality of life improvements such as better diagnostics when inconsistency is detected. * kbr/nemesis-random-v1: test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency test: raft: randomized_nemesis_test: print details when detecting inconsistency test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine` test: raft: randomized_nemesis_test: keep server id in impure_state_machine test: raft: randomized_nemesis_test: frequent snapshotting configuration test: raft: randomized_nemesis_test: tick servers at different speeds in generator test test: raft: randomized_nemesis_test: simplify ticker test: raft: randomized_nemesis_test: randomize network delay test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()` test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside test: raft: randomized_nemesis_test: fix obsolete comment raft: fsm: print configuration entries appearing in the log raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration` raft: server: abort snapshot applications before waiting for rpc abort raft: server: logging fix raft: fsm: don't advance commit index beyond matched entries	2022-01-31 13:25:27 +01:00
Calle Wilund	7ca72ffd19	database: Make wrapped version of timed_out_error a timed_out_error Refs #9919 in `a6202ae` throw_commitlog_add_error was added to ensure we had more info on errors generated writing to commit log. However, several call sites catch timed_out_error explicitly, not checking for nested etc. `97bb1be` and `868b572` tried to deal with it, by using check routines. It turns out there are call sites left, and while these should be changed, it is safer and quicker for now to just ensure that iff we have a timed_out_error, we throw yet another timed_out_error. Closes #10002	2022-01-31 14:15:23 +02:00
Piotr Sarna	d50ed944f2	alternator: add error injection to BatchGetItem When error injection is enabled at compile time, it's now possible to inject an error into BatchGetItem in order to produce a partial read, i.e. when only part of the items were retrieved successfully.	2022-01-31 12:56:00 +01:00
Piotr Sarna	31f4f062a2	alternator: fill UnprocessedKeys for failed batch reads DynamoDB protocol specifies that when getting items in a batch failed only partially, unprocessed keys can be returned so that the user can perform a retry. Alternator used to fail the whole request if any of the reads failed, but right now it instead produces the list of unprocessed keys and returns them to the user, as long as at least 1 read was successful. NOTE: tested manually by compiling Scylla with error injection, which fails every nth request. It's rather hard to figure out an automatic test case for this scenario. Fixes #9984	2022-01-31 12:56:00 +01:00
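The partial-failure behaviour described above can be sketched as follows. This is a simplified Python illustration, not Alternator's C++ code: `batch_get_item`, `flaky_read`, and the shape of the returned dictionary are assumptions made for the example, with only the `UnprocessedKeys` name taken from the message.

```python
def batch_get_item(keys, read_one):
    """Read every key; report failed keys instead of failing the whole batch.

    `read_one` is a hypothetical per-key read function that may raise.
    """
    responses = []
    unprocessed = []
    last_error = None
    for key in keys:
        try:
            responses.append(read_one(key))
        except Exception as exc:
            unprocessed.append(key)
            last_error = exc
    # If nothing succeeded, surface the error rather than returning a
    # "successful" response that lists every key as unprocessed.
    if last_error is not None and not responses:
        raise last_error
    return {"Responses": responses, "UnprocessedKeys": unprocessed}


def flaky_read(key):
    # Stand-in for error injection: even keys fail, odd keys succeed.
    if key % 2 == 0:
        raise IOError(f"injected failure for key {key}")
    return {"p": key}


partial = batch_get_item([1, 2, 3], flaky_read)
```

Here `partial` carries the two successful reads plus key `2` under `UnprocessedKeys`, so a client can retry just that key. A batch in which every read fails raises instead.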
Mikołaj Sielużycki	93d6eb6d51	compacting_reader: Support fast_forward_to position range. Fast forwarding is delegated to the underlying reader and assumes that it's supported. The only corner case requiring special handling that has shown up in the tests is producing partition start mutation in the forwarding case if there are no other fragments. compacting state keeps track of uncompacted partition start, but doesn't emit it by default. If end of stream is reached without producing a mutation fragment, partition start is not emitted. This is invalid behaviour in the forwarding case, so I've added a public method to compacting state to force marking partition as non-empty. I don't like this solution, as it feels like breaking an abstraction, but I didn't come across a better idea. Tests: unit(dev, debug, release) Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>	2022-01-31 13:37:36 +02:00
Nadav Har'El	a25e265373	test/alternator: improve comment on why we need "global_random" Improve the comment that explains why we needed to use an explicitly shared random sequence instead of the usual "random". We now understand that we need this workaround to undo what the pytest-randomly plugin does. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220130155557.1181345-1-nyh@scylladb.com>	2022-01-31 10:07:56 +01:00
Nadav Har'El	59fe6a402c	test/cql-pytest: use unique keys instead of random keys Some of the tests in test/cql-pytest share the same table but use different keys to ensure they don't collide. Before this patch we used a random key, which was usually fine, but we recently noticed that the pytest-randomly plugin may cause different tests to run through the same sequence of random numbers and ruin our intent that different tests use different keys. So instead of using a random key, let's use a unique key. We can achieve this uniqueness trivially - using a counter variable - because anyway the uniqueness is only needed inside a single temporary table - which is different in every run. Another benefit is that it will now be clearer that the tests are deterministic and not random - the intent of a random_string() key was never to randomly walk the entire key space (random_string() anyway had a pretty narrow idea of what a random string looks like) - it was just to get a unique key. Refs #9988 (fixes it for cql-pytest, but not for test/alternator) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-01-31 09:01:23 +02:00
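A minimal sketch of the idea above, contrasting a counter-based key with a random key under reseeding. The helper names (`unique_key_string`, `random_key_string`, `keys_after_reseed`) are hypothetical and written for this illustration. The explicit `random.seed` calls only model the effect attributed to pytest-randomly in the message; they are not that plugin's API.

```python
import itertools
import random

# A process-wide counter: every call yields a key never returned before.
_key_counter = itertools.count(1)


def unique_key_string() -> str:
    return f"key{next(_key_counter)}"


def random_key_string(length: int = 10) -> str:
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(random.choice(letters) for _ in range(length))


def keys_after_reseed(make_key, seed: int = 1234):
    """Generate one key for each of two 'tests' that start from the same seed."""
    random.seed(seed)  # models reseeding before test A
    key_a = make_key()
    random.seed(seed)  # models reseeding before test B
    key_b = make_key()
    return key_a, key_b


random_a, random_b = keys_after_reseed(random_key_string)  # identical keys
unique_a, unique_b = keys_after_reseed(unique_key_string)  # distinct keys
```

Uniqueness from the counter holds only within one process, which matches the message's point that keys need to be unique only inside a single temporary table.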
Benny Halevy	1c25934399	test: rest_api: add unit tests for keyspace_offstrategy_compaction api Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-30 20:40:40 +02:00
Benny Halevy	f6431824a7	api: add keyspace_offstrategy_compaction Perform offstrategy compaction via the REST API with a new `keyspace_offstrategy_compaction` option. This is useful for performing offstrategy compaction post repair, after repairing all token ranges. Otherwise, offstrategy compaction will only be auto-triggered after a 5 minutes idle timeout. Like major compaction, the api call returns the offstrategy compaction task future, so it's waited on. The `long` result counts the number of tables that required offstrategy compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-30 20:40:39 +02:00
Benny Halevy	02bd84fe79	compaction_manager: get rid of submit_offstrategy Now that the table layer is using perform_offstrategy, submit_offstrategy is no longer in use and can be deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-30 20:09:35 +02:00
Benny Halevy	e03b6eeff8	table: add perform_offstrategy_compaction Expose an async method to perform offstrategy compaction, if needed. Returns a future<bool> that is resolved when offstrategy_compaction completes. The future value is true iff offstrategy compaction was required. To be used in a following patch by the storage_service api. Call it from `trigger_offstrategy_compaction` that triggers offstrategy compaction in the background and warn about ignored failures. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-30 20:09:35 +02:00
Benny Halevy	6b8e88d047	compaction_manager: perform_offstrategy: print ks.cf in log messages So it would be easier to relate the messages to the table for which it was submitted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-30 20:09:35 +02:00
Benny Halevy	69883d464e	compaction_manager: allow waiting on offstrategy compaction Return a future from perform_offstrategy, resolved when the offstrategy compaction completes so that callers can wait on it. submit_offstrategy still submits the offstrategy compaction in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-30 20:09:35 +02:00
Tomasz Grabiec	b734615f51	util: cached_file: Fix corruption after memory reclamation was triggered from population If memory reclamation is triggered inside _cache.emplace(), the _cache btree can get corrupted. Reclaimers erase from it, and emplace() assumes that the tree is not modified during its execution. It first locates the target node and then does memory allocation. Fix by running emplace() under allocating section, which disables memory reclamation. The bug manifests with assert failures, e.g: ./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed. Fixes #9915 Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>	2022-01-30 19:57:35 +02:00
Benny Halevy	3cee0f8bd9	shared_token_metadata: mutate_token_metadata: bump cloned copy ring_version Currently this is done only in storage_service::get_mutable_token_metadata_ptr but it needs to be done here as well for code paths calling mutate_token_metadata directly. Currently, it is only called from network_topology_strategy_test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220130152157.2596086-1-bhalevy@scylladb.com>	2022-01-30 18:15:08 +02:00
Piotr Sarna	471205bdcf	test/alternator: use a global random generator for all test cases It was observed (perhaps it depends on the Python implementation) that an identical seed was used for multiple test cases, which violated the assumption that generated values are in fact unique. Using a global generator instead makes sure that it was only seeded once. Tests: unit(dev) # alternator tests used to fail for me locally before this patch was applied Message-Id: <315d372b4363f449d04b57f7a7d701dcb9a6160a.1643365856.git.sarna@scylladb.com>	2022-01-30 16:40:20 +02:00
Tomasz Grabiec	3e31126bdf	Merge "Brush up the initial tokens generation code" from Pavel Emelyanov On start the storage_service sets up initial tokens. Some dangling variables, checks and code duplication had accumulated over time. * xemul/br-storage-service-bootstrap-leftovers: dht: Use db::config to generate initial tookens database, dht: Move get_initial_tokens() storage_service: Factor out random/config tokens generation storage_service: No extra get_replace_address checks storage_service: Remove write-only local variable	2022-01-28 15:54:45 +01:00
Pavel Emelyanov	89a7c750ea	Merge "Deglobalize repair_meta_map" from Benny This series moves the static thread_local repair_meta_map instances into the repair_service shards. Refs #9809 Test: unit(release) (including scylla-gdb) Dtest: repair_additional_test.py::TestRepairAdditional::{test_repair_disjoint_row_2nodes,test_repair_joint_row_3nodes_2_diff_shard_count} replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap[rbo_enabled](release) * git@github.com:bhalevy/scylla.git deglobalize-repair_meta_map-v1 repair_service: deglobalize get_next_repair_meta_id repair_service: deglobalize repair_meta_map repair_service: pass reference to service to row_level_repair_gossip_helper repair_meta: define repair_meta_ptr repair_meta: move static repair_meta map functions out of line repair_meta: make get_set_diff a free function repair: repair_meta: no need to keep sharded<netw::messaging_service> repair: repair_meta: derive subordinate services from repair_service repair: pass repair_service to repair_meta	2022-01-28 14:12:33 +02:00
Avi Kivity	34252eda26	Update seastar submodule * seastar 5524f229b...0d250d15a (6): > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed Fixes #9982 > Merge "Enhance io-tester and its rate-limited job" from Pavel E > queue: pop: assert that the queue is not empty > io_queue: properly declare io_queue_for_tests > reactor: Fix off-by-end-of-line misprint in legacy configuration > fair_queue: Fix move constructor	2022-01-28 14:12:33 +02:00
Tomasz Grabiec	7ee79fa770	logalloc: Add more logging Message-Id: <20220127232009.314402-1-tgrabiec@scylladb.com>	2022-01-28 14:12:33 +02:00
Kamil Braun	d10b508380	test: raft: randomized_nemesis_test: regression test for #9981	2022-01-27 17:50:40 +01:00
Kamil Braun	28b5792481	raft: server: don't create local waiter in `modify_config` When forwarding a reconfiguration request from follower to a leader in `modify_config`, there is no reason to wait for the follower's commit index to be updated. The only useful information is that the leader committed the configuration change - so `modify_config` should return as soon as we know that. There is a reason not to wait for the follower's commit index to be updated: if the configuration change removes the follower, the follower will never learn about it, so a local waiter will never be resolved. `execute_modify_config` - the part of `modify_config` executed on the leader - is thus modified to finish when the configuration change is fully complete (including the dummy entry appended at the end), and `modify_config` - which does the forwarding - no longer creates a local waiter, but returns as soon as the RPC call to the leader confirms that the entry was committed on the leader. We still return an `entry_id` from `execute_modify_config` but that's just an artifact of the implementation. Fixes #9981.	2022-01-27 17:49:40 +01:00
Pavel Emelyanov	1525c04db3	dht: Use db::config to generate initial tokens The replica::database is passed into the helper just to get the config from it. Better to use config directly without messing with the database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-27 16:41:29 +03:00
Pavel Emelyanov	77532a6a36	database, dht: Move get_initial_tokens() The helper in question has nothing to do with replica/database and is only used by dht to convert config option to a set of tokens. It sounds like the helper deserves living where it's needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-27 16:41:29 +03:00
Pavel Emelyanov	50170366ea	storage_service: Factor out random/config tokens generation There's a place in normal node start that parses the initial_token option or generates num_tokens random tokens. This code is used almost unchanged since being ported from its java version. Later there appeared the dht::get_bootstrap_token() with the same internal logic. This patch generalizes these two places. Logging messages are unified too (dtest seem not to check those). The change improves a corner case. The normal node startup code doesn't check if the initial_token is empty and num_tokens is 0 generating empty bootstrap_tokens set. It fails later with an obscure 'remove_endpoint should be used instead' message. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-27 16:41:29 +03:00
Pavel Emelyanov	7b521405e4	storage_service: No extra get_replace_address checks The get_replace_address() returns optional<inet_address>, but in many cases it's used under if (is_replacing()) branch which, in turn, returns bool(get_replace_address()) and this is only executed if the returned optional is engaged. Extra checks can be removed making the code tiny bit shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-27 16:41:29 +03:00
Pavel Emelyanov	330f2cfcfc	storage_service: Remove write-only local variable The set of tokens used to be used after being filled, but now it's write-only. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-27 16:41:25 +03:00
Kamil Braun	4a52b802ac	test: unit test for group 0 concurrent change protection and CQL DDL retries Check that group 0 history grows iff a schema change does not throw `group0_concurrent_modification`. Check that the CQL DDL statement retry mechanism works as expected.	2022-01-27 11:26:15 +01:00
Kamil Braun	edd8344706	cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes Schema changes on top of Raft do not allow concurrent changes. If two changes are attempted concurrently, one of them gets `group0_concurrent_modification` exception. Catch the exception in CQL DDL statement execution function and retry. In addition, the description of CQL DDL statements in group 0 history table was improved.	2022-01-27 11:26:14 +01:00
Benny Halevy	f8db9e1bd8	repair_service: deglobalize get_next_repair_meta_id Rather than using a static uint32_t next_id, move the next_id variable into repair_service shard 0 and manage it there. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 11:34:21 +02:00
Benny Halevy	90ba9013be	repair_service: deglobalize repair_meta_map Move the static repair_meta_map into the repair_service and expose it from there. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 11:01:47 +02:00
Benny Halevy	e6b6fdc9a0	repair_service: pass reference to service to row_level_repair_gossip_helper Note that we can't pass the repair_service container() from its ctor since it's not populated until all shards start. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 11:00:26 +02:00
Benny Halevy	3008ecfd4e	repair_meta: define repair_meta_ptr Keep repair_meta in repair_meta_map as shared_ptr<repair_meta> rather than lw_shared_ptr<repair_meta> so it can be defined in the header file and use only forward-declared class repair_meta. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 09:18:14 +02:00
Benny Halevy	fdc0a9602c	repair_meta: move static repair_meta map functions out of line Define the static {get,insert,remove}_repair_meta functions out of the repair_meta class definition, on the way of moving them, along with the repair_meta_map itself, to repair_service. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 09:15:09 +02:00
Benny Halevy	b5427cc6d1	repair_meta: make get_set_diff a free function Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 09:13:09 +02:00
Benny Halevy	224e7497e0	repair: repair_meta: no need to keep sharded<netw::messaging_service> All repair_meta needs is the local instance. It's a peering service, so container() can be used if needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 09:13:09 +02:00
Benny Halevy	c4ac92b2b7	repair: repair_meta: derive subordinate services from repair_service Use repair_service as the authoritative source for the database, messaging_service, system_distributed_keyspace, and view_update_generator, similar to repair_info. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 09:12:53 +02:00
Benny Halevy	a71d6333e4	repair: pass repair_service to repair_meta Prepare for holding the repair_meta_map in repair_service. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-27 09:12:51 +02:00
Tomasz Grabiec	ba6c02b38a	Merge "Clear old entries from group 0 history when performing schema changes" from Kamil When performing a change through group 0 (which right now means schema changes), clear entries from group 0 history table which are older than one week. This is done by including an appropriate range tombstone in the group 0 history table mutation. * kbr/g0-history-gc-v2: idl: group0_state_machine: fix license blurb test: unit test for clearing old entries in group0 history service: migration_manager: clear old entries from group 0 history when announcing	2022-01-26 16:12:40 +01:00
Kamil Braun	95ac8ead4f	test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency	2022-01-26 16:09:41 +01:00
Kamil Braun	e249ea5aef	test: raft: randomized_nemesis_test: print details when detecting inconsistency If the returned result is inconsistent with the constructed model, print the differences in detail instead of just failing an assertion.	2022-01-26 16:09:41 +01:00
Kamil Braun	1170e47af4	test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine` Useful for debugging.	2022-01-26 16:09:41 +01:00
Kamil Braun	b8158e0b43	test: raft: randomized_nemesis_test: keep server id in impure_state_machine Will be used for logging.	2022-01-26 16:09:41 +01:00
Kamil Braun	3c01449472	test: raft: randomized_nemesis_test: frequent snapshotting configuration With probability 1/2, run the test with a configuration that causes servers to take snapshots frequently.	2022-01-26 16:09:41 +01:00
Kamil Braun	7546a9ebb5	test: raft: randomized_nemesis_test: tick servers at different speeds in generator test Previously all servers were ticked at the same moment, every 10 network/timer ticks. Now we tick each server with probability 1/10 on each network/timer tick. Thus, on average, every server is ticked once per 10 ticks. But now we're able to obtain more interesting behaviors. E.g. we can now observe servers which are stalling for as long as 10 ticks and servers which temporarily speed up to tick once per each network tick.	2022-01-26 16:09:41 +01:00
Kamil Braun	5d986b2682	test: raft: randomized_nemesis_test: simplify ticker Instead of taking a set of functions with different periods, take a single function that is called on every tick. The periodicity can be implemented easily on the user side.	2022-01-26 16:09:41 +01:00
Kamil Braun	173fb2bf36	test: raft: randomized_nemesis_test: randomize network delay As a side effect, this causes messages to be delivered in a different order than they were sent, adding even more chaos.	2022-01-26 16:09:41 +01:00
Kamil Braun	00c18adbb0	test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()` The lambda attached to `_crash_fiber` was a coroutine. The coroutine would use `this` captured by the lambda after the `co_await`, where the lambda object (hence its captures) was already destroyed. No idea why it worked before and sanitizers did not complain in debug mode.	2022-01-26 16:09:41 +01:00
Kamil Braun	4c68e6a04c	test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions Two-way RPC functions such as `send_snapshot` had a guard object which was captured in a lambda passed to `with_gate`. The guard object, on destruction, accessed the `rpc` object. Unfortunately, the guard object could outlive the `rpc` object. That's because the lambda, and hence the guard object, was destroyed after `with_gate` finished (it lived in the frame of the caller of `with_gate`, i.e. `send_snapshot` and others), so it could be destroyed after `rpc` (the gate prevents `rpc` from being destroyed). Make sure that the guard object is destroyed before `with_gate` finishes by creating it inside the lambda body - not capturing inside the object.	2022-01-26 16:09:41 +01:00
Kamil Braun	871f0d00ce	test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside The `raft::rpc` interface functions are called by `raft::server_impl` and the exceptions may be propagated outside the server, e.g. through the `add_entry` API. Translate the internal `gate_closed_exception` to an external `raft::stopped_error`.	2022-01-26 16:09:41 +01:00
Kamil Braun	9da4ffc1c7	test: raft: randomized_nemesis_test: fix obsolete comment	2022-01-26 16:09:41 +01:00
Kamil Braun	22092d110a	raft: fsm: print configuration entries appearing in the log When appending or committing configuration entries, print them (on TRACE level). Useful for debugging.	2022-01-26 16:09:41 +01:00
Kamil Braun	44a1a8a8b0	raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration` Useful for debugging. Had to make `configuration` constructor explicit. Otherwise the `operator<<` implementation for `configuration` would implicitly convert the `server_address` to `configuration` when trying to output it, causing infinite recursion. Removed implicit uses of the constructor.	2022-01-26 16:09:41 +01:00
Kamil Braun	46f6a0cca5	raft: server: abort snapshot applications before waiting for rpc abort The implementation of `rpc` may wait for all snapshot applications to finish before it can finish aborting. This is what the randomized_nemesis_test implementation did. This caused rpc abort to hang in some scenarios. In this commit, the order of abort calls is modified a bit. Instead of waiting for rpc abort to finish and then aborting existing snapshot applications, we call `rpc::abort()` and keep the future, then abort snapshot applications, then wait on the future. Calling `rpc::abort()` first is supposed to prevent new snapshot applications from starting; a comment was added at the interface definition. The nemesis test implementation had this property, and `raft_rpc` in group registry was adjusted appropriately. Aborting the snapshot applications then allows `rpc::abort()` to finish.	2022-01-26 16:06:45 +01:00
Kamil Braun	5577ad6c34	raft: server: logging fix	2022-01-26 15:54:14 +01:00
Kamil Braun	1216f39977	raft: fsm: don't advance commit index beyond matched entries Otherwise it was possible to incorrectly mark obsolete entries from earlier terms as committed, leading to inconsistencies between state machine replicas. Fixes #9965.	2022-01-26 15:53:13 +01:00
Avi Kivity	df22396a34	Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA We found that the monitor mode of mdadm does not work on RAID0, and this is not a bug but expected behavior, according to a RHEL developer. Therefore, we should stop enabling mdmonitor when RAID0 is specified. Fixes #9540 ---- This reverts `0d8f932` and introduces the correct fix. Closes #9970 * github.com:scylladb/scylla: scylla_raid_setup: use mdmonitor only when RAID level > 0 Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"	2022-01-26 15:34:47 +02:00
Takuya ASADA	32f2eb63ac	scylla_raid_setup: use mdmonitor only when RAID level > 0 We found that the monitor mode of mdadm does not work on RAID0, and this is not a bug but expected behavior, according to a RHEL developer. Therefore, we should stop enabling mdmonitor when RAID0 is specified. Fixes #9540	2022-01-26 22:33:07 +09:00
Takuya ASADA	cd57815fff	Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8" This reverts commit `0d8f932f0b`, because a RHEL developer explained that this is not a bug but expected behavior. (mdadm --monitor does not start when RAID level is 0) see: https://bugzilla.redhat.com/show_bug.cgi?id=2031936 So we should stop downgrading the mdadm package and modify our script not to enable mdmonitor.service on RAID0, using it only for RAID5.	2022-01-26 22:33:06 +09:00
Gleb Natapov	579dcf187a	raft: allow an option to persist commit index Raft does not need to persist the commit index since a restarted node will either learn it from an append message from a leader or (if entire cluster is restarted and hence there is no leader) new leader will figure it out after contacting a quorum. But some users may want to be able to bring their local state machine to a state as up-to-date as it was before restart as soon as possible without any external communication. For them this patch introduces new persistence API that allows saving and restoring last seen committed index. Message-Id: <YfFD53oS2j1My0p/@scylladb.com>	2022-01-26 14:06:39 +01:00
Calle Wilund	43f51e9639	commitlog: Ensure we don't run continuation (task switch) with queues modified Fixes #9955 In #9348 we handled the problem of failing to delete segment files on disk, and the need to recompute disk footprint to keep data flow consistent across intermittent failures. However, because _reserve_segments and _recycled_segments are queues, we have to empty them to inspect the contents. One would think it is ok for these queues to be empty for a while, whilst we do some recalculating, including disk listing -> continuation switching. But then one (i.e. I) misses the fact that these queues use the pop_eventually mechanism, which does _not_ handle a scenario where we push something into an empty queue, thus triggering the future that resumes a waiting task, but then pop the element immediately, before the waiting task is run. In fact, _iff_ one does this, not only will things break, they will in fact start creating undefined behaviour, because the underlying std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push operations -> we will pop an empty queue, immediately making it non-empty, but using undefined memory (with luck null/zeroes). Strictly speaking, seastar::queue::pop_eventually should be fixed to handle the scenario, but nonetheless we can fix the usage here as well, by simply copying the objects and doing the calculation "in background" while we potentially start popping the queue again. Closes #9966	2022-01-26 13:51:01 +02:00
Avi Kivity	f5cd6ec419	Update tools/python3 submodule (relicensed to Apache License 2.0) * tools/python3 8a77e76...f725ec7 (2): > Relicense to Apache 2.0 > treewide: use Software Package Data Exchange (SPDX) license identifiers	2022-01-25 18:50:39 +02:00
Kamil Braun	f3c0c73d36	idl: group0_state_machine: fix license blurb	2022-01-25 17:48:46 +01:00
Kamil Braun	bf91dcd1e3	idl: group0_state_machine: fix license blurb	2022-01-25 13:14:47 +01:00
Kamil Braun	b863a63b08	test: unit test for clearing old entries in group0 history We perform a bunch of schema changes with different values of `migration_manager::_group0_history_gc_duration` and check if entries are cleared according to this setting.	2022-01-25 13:13:35 +01:00
Kamil Braun	e9083433a8	service: migration_manager: clear old entries from group 0 history when announcing When performing a change through group 0 (which right now only covers schema changes), clear entries from group 0 history table which are older than one week. This is done by including an appropriate range tombstone in the group 0 history table mutation.	2022-01-25 13:11:14 +01:00
Botond Dénes	eb42213db4	compact_mutation: close active range tombstone on page end The compactor recently acquired the ability to consume a v2 stream. The v2 spec requires that all streams end with a null tombstone. `range_tombstone_assembler`, the component the compactor uses for converting the v2 input into its v1 output enforces this with a check on `consume_end_of_partition()`. Normally the producer of the stream the compactor is consuming takes care of closing the active tombstone before the stream ends. The compactor however (or its consumer) can decide to end the consume early, e.g. to cut the current page. When this happens the compactor must take care of closing the tombstone itself. Furthermore it has to keep this tombstone around to re-open it on the next page. This patch implements this mechanism which was left out of `134601a15e`. It also adds a unit test which reproduces the problems caused by the missing mechanism. The compactor now tracks the last clustering position emitted. When the page ends, this position will be used as the position of the closing range tombstone change. This ensures the range tombstone only covers the actually emitted range. Fixes: #9907 Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>	2022-01-25 09:52:30 +02:00
Gleb Natapov	e56e96ac5a	raft: do not add new wait entries after abort Abort signals stopped_error on all awaited entries, but if an entry is added after this it will be destroyed without signaling and will cause a waiter to get broken_promise. Fixes #9688 Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>	2022-01-25 09:52:30 +02:00
Tomasz Grabiec	c89b1953f8	Merge "Enforce linearizability of group 0 operations using state IDs" from Kamil We introduce a new table, `system.group0_history`. This table will contain a history of all group 0 changes applied through Raft. With each change is an associated unique ID, which also identifies the state of all group 0 tables (including schema tables) after this change is applied, assuming that all such changes are serialized through Raft (they will be eventually). Group 0 commands, additionally to mutations which modify group 0 tables, contain a "previous state ID" and a "new state ID". The group 0 state machine will only modify state during command application if the provided "previous state ID" is equal to the last state ID present in the history table. Otherwise, the command will be a no-op. To ensure linearizability of group 0 changes, the performer of the change must first read the last state ID, only then read the state and send a command for the state machine. If a concurrent change races with this command and manages to modify the state, we will detect that the last state ID does not match during `apply`; all calls to `apply` are serialized, and `apply` adds the new entry to the history table at the end, after modifying the group 0 state. The details of this mechanism are abstracted away with `group0_guard`. To perform a group 0 change, one needs to call `announce`, which requires a `group0_guard` to be passed in. The only way to obtain a `group0_guard` is by calling `start_group0_operation`, which underneath performs a read barrier on group 0, obtains the last state ID from the history table, and constructs a new state ID that the change will append to the history table. The read barrier ensures that all previously completed changes are visible to this operation. The caller can then perform any necessary validation, construct mutations which modify group 0 state, and finally call `announce`. 
The guard also provides a timestamp which is used by the caller to construct the mutations. The timestamp is obtained from the new state ID. We ensure that it is greater than the timestamp of the last state ID. Thus, if the change is successful, the applied mutations will have greater timestamps than the previously applied mutations. We also add two locks. The more important one, used to ensure correctness, is `read_apply_mutex`. It is held when modifying group 0 state (in `apply` and `transfer_snapshot`) and when reading it (it's taken when obtaining a `group0_guard` and released before a command is sent in `announce`). Its goal is to ensure that we don't read partial state, which could happen without it because group 0 state consists of many parts and `apply` (or `transfer_snapshot`) potentially modifies all of them. Note: this doesn't give us 100% protection; if we crash in the middle of `apply` (or `transfer_snapshot`), then after restart we may read partial state. To remove this possibility we need to ensure that commands which were being applied before restart but not finished are re-applied after restart, before anyone can read the state. I left a TODO in `apply`. The second lock, `operation_mutex`, is used to improve liveness. It is taken when obtaining a `group0_guard` and released after a command is applied (compare to `read_apply_mutex` which is released before a command is sent). It is not taken inside `apply` or `transfer_snapshot`. This lock ensures that multiple fibers running on the same node do not attempt to modify group0 concurrently - this would cause some of them to fail (due to the concurrent modification protection described above). This is mostly important during first boot of the first node, when services start for the first time and try to create their internal tables. This lock serializes these attempts, ensuring that all of them succeed. 
* kbr/schema-state-ids-v4: service: migration_manager: `announce`: take a description parameter service: raft: check and update state IDs during group 0 operations service: raft: group0_state_machine: introduce `group0_command` service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table service: migration_manager: convert migration request handler to coroutine db: system_keyspace: introduce `system.group0_history` table treewide: require `group0_guard` when performing schema changes service: migration_manager: introduce `group0_guard` service: raft: pass `storage_proxy&` to `group0_state_machine` service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot` service: raft: rename `schema_raft_state_machine` to `group0_state_machine` service: migration_manager: rename `schema_read_barrier` to `start_group0_operation` service: migration_manager: `announce`: split raft and non-raft paths to separate functions treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions service: migration_manager: put notifier call inside `async` service: migration_manager: remove some unused and disabled code db: system_distributed_keyspace: use current time when creating mutations in `start()` redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only	2022-01-25 09:52:30 +02:00
Avi Kivity	a105b09475	build: prepare for Scylla 5.0 We decided to name the next version Scylla 5.0, in honor of Raft based schema management.	2022-01-25 09:52:30 +02:00
Avi Kivity	277303a722	build_indexes_virtual_reader: convert to flat_mutation_reader_v2 Since it doesn't handle range tombstones in any way, the conversion consists of just using the new type names. Closes #9948	2022-01-25 09:52:30 +02:00
Avi Kivity	007145e033	validation: complete transition to data_dictionary module The API was converted in `00de5f4876`, but some #includes remain. Remove them. Closes #9947	2022-01-25 09:52:30 +02:00
Avi Kivity	e74f570eda	alternator: streams: fix use-after-free of data_dictionary in describe_stream() In `4aa9e86924` ("Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity"), we changed alternator to use data_dictionary instead of replica::database. However, data_dictionary::database objects are different from replica::database objects in that they don't have a stable address and need to be captured by value (they are pointer-like). One capture in describe_stream() was capturing a data_dictionary::database by reference and so caused a use-after-free when the previous continuation was deallocated. Fix by capturing by value. Fixes #9952. Closes #9954	2022-01-25 09:52:30 +02:00
Kamil Braun	044e05b0d9	service: migration_manager: `announce`: take a description parameter The description parameter is used for the group 0 history mutation. The default is empty, in which case the mutation will leave the description column as `null`. I filled the parameter in some easy places as an example and left the rest for a follow-up. This is how it looks now in a fresh cluster with a single statement performed by the user: cqlsh> select * from system.group0_history ; key \| state_id \| description ---------+--------------------------------------+------------------------------------------------------ history \| 9ec29cac-7547-11ec-cfd6-77bb9e31c952 \| CQL DDL statement history \| 9beb2526-7547-11ec-7b3e-3b198c757ef2 \| null history \| 9be937b6-7547-11ec-3b19-97e88bd1ca6f \| null history \| 9be784ca-7547-11ec-f297-f40f0073038e \| null history \| 9be52e14-7547-11ec-f7c5-af15a1a2de8c \| null history \| 9be335dc-7547-11ec-0b6d-f9798d005fb0 \| null history \| 9be160c2-7547-11ec-e0ea-29f4272345de \| null history \| 9bdf300e-7547-11ec-3d3f-e577a2e31ffd \| null history \| 9bdd2ea8-7547-11ec-c25d-8e297b77380e \| null history \| 9bdb925a-7547-11ec-d754-aa2cc394a22c \| null history \| 9bd8d830-7547-11ec-1550-5fd155e6cd86 \| null history \| 9bd36666-7547-11ec-230c-8702bc785cb9 \| Add new columns to system_distributed.service_levels history \| 9bd0a156-7547-11ec-a834-85eac94fd3b8 \| Create system_distributed(_everywhere) tables history \| 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a \| Create system_distributed_everywhere keyspace history \| 9bcec89a-7547-11ec-e1b4-34e0010b4183 \| Create system_distributed keyspace	2022-01-24 15:20:37 +01:00
Kamil Braun	6a00e790c7	service: raft: check and update state IDs during group 0 operations The group 0 state machine will only modify state during command application if the provided "previous state ID" is equal to the last state ID present in the history table. Otherwise, the command will be a no-op. To ensure linearizability of group 0 changes, the performer of the change must first read the last state ID, only then read the state and send a command for the state machine. If a concurrent change races with this command and manages to modify the state, we will detect that the last state ID does not match during `apply`; all calls to `apply` are serialized, and `apply` adds the new entry to the history table at the end, after modifying the group 0 state. The details of this mechanism are abstracted away with `group0_guard`. To perform a group 0 change, one needs to call `announce`, which requires a `group0_guard` to be passed in. The only way to obtain a `group0_guard` is by calling `start_group0_operation`, which underneath performs a read barrier on group 0, obtains the last state ID from the history table, and constructs a new state ID that the change will append to the history table. The read barrier ensures that all previously completed changes are visible to this operation. The caller can then perform any necessary validation, construct mutations which modify group 0 state, and finally call `announce`. The guard also provides a timestamp which is used by the caller to construct the mutations. The timestamp is obtained from the new state ID. We ensure that it is greater than the timestamp of the last state ID. Thus, if the change is successful, the applied mutations will have greater timestamps than the previously applied mutations. We also add two locks. The more important one, used to ensure correctness, is `read_apply_mutex`. 
It is held when modifying group 0 state (in `apply` and `transfer_snapshot`) and when reading it (it's taken when obtaining a `group0_guard` and released before a command is sent in `announce`). Its goal is to ensure that we don't read partial state, which could happen without it because group 0 state consists of many parts and `apply` (or `transfer_snapshot`) potentially modifies all of them. Note: this doesn't give us 100% protection; if we crash in the middle of `apply` (or `transfer_snapshot`), then after restart we may read partial state. To remove this possibility we need to ensure that commands which were being applied before restart but not finished are re-applied after restart, before anyone can read the state. I left a TODO in `apply`. The second lock, `operation_mutex`, is used to improve liveness. It is taken when obtaining a `group0_guard` and released after a command is applied (compare to `read_apply_mutex` which is released before a command is sent). It is not taken inside `apply` or `transfer_snapshot`. This lock ensures that multiple fibers running on the same node do not attempt to modify group0 concurrently - this would cause some of them to fail (due to the concurrent modification protection described above). This is mostly important during first boot of the first node, when services start for the first time and try to create their internal tables. This lock serializes these attempts, ensuring that all of them succeed.	2022-01-24 15:20:37 +01:00
Kamil Braun	509ac2130f	service: raft: group0_state_machine: introduce `group0_command` Objects of this type will be serialized and sent as commands to the group 0 state machine. They contain a set of mutations which modify group 0 tables (at this point: schema tables and group 0 history table), the 'previous state ID' which is the last state ID present in the history table when the operation described by this command has started, and the 'new state ID' which will be appended to the history table if this change is successful (successful = the previous state ID is still equal to the last state ID in the history table at the moment of application). It also contains the address of the node which constructed this command. The state ID mechanism will be described in more detail in a later commit.	2022-01-24 15:20:37 +01:00
Kamil Braun	cc0c54ea15	service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table The MIGRATION_REQUEST verb is currently used to pull the contents of schema tables (in the form of mutations) when nodes synchronize schemas. We will (ab)use the verb to fetch additional data, such as the contents of the group 0 history table, for purposes of group 0 snapshot transfer. We extend `schema_pull_options` with a flag specifying that the puller requests the additional data associated with group 0 snapshots. This flag is `false` by default, so existing schema pulls will do what they did before. If the flag is `true`, the migration request handler will include the contents of group 0 history table. Note that if a request is set with the flag set to `true`, that means the entire cluster must have enabled the Raft feature, which also means that the handler knows of the flag.	2022-01-24 15:20:37 +01:00
Kamil Braun	a944dd44ee	service: migration_manager: convert migration request handler to coroutine	2022-01-24 15:20:37 +01:00
Kamil Braun	fad72daeb4	db: system_keyspace: introduce `system.group0_history` table This table will contain a history of all group 0 changes applied through Raft. With each change is an associated unique ID, which also identifies the state of all group 0 tables (including schema tables) after this change is applied, assuming that all such changes are serialized through Raft (they will be eventually). We will use these state IDs to check if a given change is still valid at the moment it is applied (in `group0_state_machine::apply`), i.e. that there wasn't a concurrent change that happened between creating this change and applying it (which may invalidate it).	2022-01-24 15:20:37 +01:00
Kamil Braun	a664ac7ba5	treewide: require `group0_guard` when performing schema changes `announce` now takes a `group0_guard` by value. `group0_guard` can only be obtained through `migration_manager::start_group0_operation` and moved, it cannot be constructed outside `migration_manager`. The guard will be a method of ensuring linearizability for group 0 operations.	2022-01-24 15:20:35 +01:00
Kamil Braun	742f036261	service: migration_manager: introduce `group0_guard` This object will be used to "guard" group 0 operations. Obtaining it will be necessary to perform a group 0 change (such as modifying the schema), which will be enforced by the type system. The initial implementation is a stub and only provides a timestamp which will be used by callers to create mutations for group 0 changes. The next commit will change all call sites to use the guard as intended. The final implementation, coming later, will ensure linearizability of group 0 operations.	2022-01-24 15:12:50 +01:00
Kamil Braun	f908da919c	service: raft: pass `storage_proxy&` to `group0_state_machine` We'll use it to update the group 0 history table.	2022-01-24 15:12:50 +01:00
Kamil Braun	dce8ece4b6	service: raft: raft_state_machine: pass `snapshot_descriptor` to `transfer_snapshot` Currently it takes just the snapshot ID. Extend it by taking the whole snapshot descriptor. In following commits I use this to perform additional logging.	2022-01-24 15:12:50 +01:00
Kamil Braun	538cc6ecb9	service: raft: rename `schema_raft_state_machine` to `group0_state_machine` Generalize the name so it doesn't suggest that group 0 contains only schema state.	2022-01-24 15:12:50 +01:00
Kamil Braun	86762a1dd9	service: migration_manager: rename `schema_read_barrier` to `start_group0_operation` 1. Generalize the name so it mentions group 0, which schema will be a strict subset of. 2. Remove the fact that it performs a "read barrier" from the name. The function will be used in general to ensure linearizability of group0 operations - both reads and writes. "Read barrier" is Raft-specific terminology, so it can be thought of as an implementation detail.	2022-01-24 15:12:50 +01:00
Kamil Braun	0f24b907b7	service: migration_manager: `announce`: split raft and non-raft paths to separate functions	2022-01-24 15:12:50 +01:00
Kamil Braun	283ac7fefe	treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions The functions which prepare schema change mutations (such as `prepare_new_column_family_announcement`) would use internally generated timestamps for these mutations. When schema changes are managed by group 0 we want to ensure that timestamps of mutations applied through Raft are monotonic. We will generate these timestamps at call sites and pass them into the `prepare_` functions. This commit prepares the APIs.	2022-01-24 15:12:50 +01:00
Kamil Braun	f97edb1dbd	service: migration_manager: put notifier call inside `async` `get_notifier().before_update_column_family(...)` requires being inside `async`. Fix this.	2022-01-24 15:12:50 +01:00
Kamil Braun	3bab5c564a	service: migration_manager: remove some unused and disabled code `include_keyspace_and_announce` was no longer used. `do_announce_new_type` only had a declaration, it was not used and there was no definition.	2022-01-24 15:12:49 +01:00
Kamil Braun	0af5f74871	db: system_distributed_keyspace: use current time when creating mutations in `start()` When creating or updating internal distributed tables in `system_distributed_keyspace::start()`, hardcoded timestamps were used. There were two reasons for this: - to protect against issue #2129, where nodes would start without synchronizing schema with the existing cluster, creating the tables again, which would override any manual user changes to these tables. The solution was to use small timestamps (like api::min_timestamp) - the user-created schema mutations would always 'win' (because when they were created, they used current time). - to eliminate unnecessary schema sync. If two nodes created these tables concurrently with different timestamps, the schemas would formally be different and would need to merge. This could happen during upgrades when we upgraded from a version which doesn't have these tables or doesn't have some columns. The #2129 workaround is no longer necessary: when nodes start they always have to sync schema with existing nodes; we also don't allow bootstrapping nodes in parallel. The second problem would happen during parallel bootstrap, which we don't allow, or during parallel upgrade. The procedure we recommend is rolling upgrade - where nodes are upgraded one by one. In this case only one node is going to create/update the tables; following upgraded nodes will sync schema first and notice they don't need to do anything. So if procedures are followed correctly, the workaround is not needed. If someone doesn't follow the procedures and upgrades nodes in parallel, these additional schema synchronizations are not a big cost, so the workaround doesn't give us much in this case either. When schema changes are performed by Raft group 0, certain constraints are placed on the timestamps used for mutations. For this we'll need to be able to use timestamps which are generated based on current time.	2022-01-24 15:12:49 +01:00
Kamil Braun	63d3449bc3	redis: keyspace_utils: `create_keyspace_if_not_exists_impl`: call `announce` twice only The code would previously `announce` schema mutations once per each keyspace and once per each table. This can be reduced to two calls of `announce`: once to create all keyspaces, and once to create all tables. This should be further reduced to a single `announce` in the future. Left a FIXME. Motivation: after migrating to Raft, each `announce` will require a `read_barrier` to achieve linearizability of schema operations. This introduces latency, as it requires contacting a leader which then must contact a quorum. The fewer announce calls, the better. Also, if all sub-operations are reduced to a single `announce`, we get atomicity - either all of these sub-operations succeed or none do.	2022-01-24 15:12:46 +01:00
Benny Halevy	188cedd533	test: lister_test: test_lister_abort: generate at least one entry Without this fix, generate_random_content could generate 0 entries and the expected exception would never be injected. With it, we generate at least 1 entry and the test passes with the offending random-seed: ``` random-seed=1898914316 Generated 1 dir entries Aborting lister after 1 dir entries test/boost/lister_test.cc(96): info: check 'exception "expected_exception" raised as expected' has passed ``` Fixes #9953 Test: lister_test.test_lister_abort --random-seed=1898914316(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220123122921.14017-1-bhalevy@scylladb.com>	2022-01-23 17:52:44 +02:00
Gleb Natapov	d09864d61f	redis: check for tables existence before creating Do not create redis tables unconditionally on boot since this requires issuing a raft barrier and cannot be done without a quorum. Message-Id: <YefV0CqEueRL7G00@scylladb.com>	2022-01-23 17:52:44 +02:00
Benny Halevy	f439edca35	test: sstable_compaction_test: twcs_reshape_with_disjoint_set_test: take min_threshold into consideration Take into account that get_reshaping_job selects only buckets that have more than min_threshold sstables in them. Therefore, with 256 disjoint sstables in different windows, allow first or last windows to not be selected by get_reshaping_job that will return at least disjoint_sstable_count - min_threshold + 1 sstables, and not more than disjoint_sstable_count. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>	2022-01-23 17:52:44 +02:00
Avi Kivity	ae6fdf1599	Update seastar submodule * seastar 5025cd44ea...5524f229bb (3): > Merge "Simplify io-queue configuration" from Pavel E > fix sstring.find(): make find("") compatible with std::string > test: file_utils: test_non_existing_TMPDIR: no need to setenv Contains patch from Pavel Emelyanov <xemul@scylladb.com>: scylla-gdb: Remove _shares_capacity from fair-group debug This field is about to be removed in newer seastar, so it shouldn't be checked in scylla-gdb Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220121115643.6966-1-xemul@scylladb.com>	2022-01-21 17:38:05 +02:00
Piotr Jastrzebski	09d4438a0d	cdc: Handle compact storage correctly in preimage Base tables that use compact storage may have a special artificial column that has an empty type. `c010cefc4d` fixed the main CDC path to handle such columns correctly and to not include them in the CDC Log schema. This patch makes sure that generation of preimage ignores such empty column as well. Fixes #9876 Closes #9910 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2022-01-20 13:23:38 +01:00
Nadav Har'El	350c3d0f6a	alternator: update comment about default timeout The comment explaining where the default Alternator timeout is set became out-of-date. So fix it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220120092631.401563-1-nyh@scylladb.com>	2022-01-20 14:05:58 +02:00
Raphael S. Carvalho	5d654a6b9a	compaction: don't copy owned ranges in cleanup ctor Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220119142322.39791-1-raphaelsc@scylladb.com>	2022-01-20 14:05:58 +02:00
Botond Dénes	a65b38a9f7	reader_permit: release_base_resources(): also update _resources If the permit was admitted, _base_resources was already accounted in _resource and therefore has to be deducted from it, otherwise the permit will think it leaked some resources on destruction. Test: dtest(repair_additional_test.py.test_repair_one_missing_row_diff_shard_count) Refs: #9751 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220119132550.532073-1-bdenes@scylladb.com>	2022-01-20 14:05:58 +02:00
Nadav Har'El	7cb6250c40	Merge 'snapshot_ctl: true_snapshots_size: fix space accounting' from Benny Halevy This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size https://github.com/scylladb/scylla/issues/9897 https://github.com/scylladb/scylla/issues/9898 And adds a couple of unit tests to test the snapshot_ctl functionality. Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug) Closes #9899 * github.com:scylladb/scylla: table: get_snapshot_details: count allocated_size snapshot_ctl: cleanup true_snapshots_size snpashot_ctl: true_snapshots_size: do not map_reduce across all shards	2022-01-19 11:57:15 +02:00
Nadav Har'El	4aa9e86924	Merge 'alternator: move uses of replica module to data_dictionary' from Avi Kivity Alternator is a coordinator-side service and so should not access the replica module. In this series all but one of uses of the replica module are replaced with data_dictionary. One case remains - accessing the replication map which is not available (and should not be available) via the data dictionary. The data_dictionary module is expanded with missing accessors. Closes #9945 * github.com:scylladb/scylla: alternator: switch to data_dictionary for table listing purposes data_dictionary: add get_tables() data_dictionary: introduce keyspace::is_internal()	2022-01-19 11:34:25 +02:00
Avi Kivity	7399f3fae7	alternator: switch to data_dictionary for table listing purposes As a coordinator-side service, alternator shouldn't touch the replica module, so it is migrated here to data_dictionary. One use case still remains that uses replica::keyspace - accessing the replication map. This really isn't a replica-side thing, but it's also not logically part of the data dictionary, so it's left using replica::keyspace (using the data_dictionary::database::real_database() escape hatch). Figuring out how to expose the replication map to coordinator-side services is left for later.	2022-01-19 11:03:36 +02:00
Avi Kivity	f80d13c95c	data_dictionary: add get_tables() Unlike replica::database::get_column_families(), which it replaces, it returns a vector of tables rather than a map. Map-like access is provided by get_table(), so it's redundant to build a new map container to expose the same functionality.	2022-01-19 09:36:22 +02:00
Benny Halevy	94c2272c8e	table: get_snapshot_details: count allocated_size Rather than the logical file sizes so to account for metadata overhead. Fixes #9898 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-19 08:10:57 +02:00
Benny Halevy	5440739e1b	snapshot_ctl: cleanup true_snapshots_size Cleanup indentation and s/local_total/total/ as it is Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-19 07:50:53 +02:00
Benny Halevy	5db3cbe1e4	snpashot_ctl: true_snapshots_size: do not map_reduce across all shards snapshot_ctl uses map_reduce over all database shards, each counting the size of the snapshots directory, which is shared, not per-shard. So the total live size returned by it is multiplied by the number of shards. Add a unit test to test that. Fixes #9897 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-19 07:50:53 +02:00
Nadav Har'El	1ce73c2ab3	Merge 'utils::is_timeout_exception: Ensure we handle nested exception types' from Calle Wilund Fixes #9922 storage proxy uses is_timeout_exception to traverse different code paths. `a6202ae079` broke this (because bit rot and intermixing), by wrapping exception for information purposes. This adds check of nested types in exception handling, as well as a test for the routine itself. Closes #9932 * github.com:scylladb/scylla: database/storage_proxy: Use "is_timeout_exception" instead of catch match utils::is_timeout_exception: Ensure we handle nested exception types	2022-01-18 23:49:41 +02:00
Calle Wilund	868b572ec8	database/storage_proxy: Use "is_timeout_exception" instead of catch match Might miss cases otherwise. v2: Fix broken control flow v3: Avoid throw - use make_exception_future instead.	2022-01-18 15:40:41 +00:00
Avi Kivity	8350cabff3	data_dictionary: introduce keyspace::is_internal() Instead of the replica module's is_internal_keyspace(), provide it as part of data_dictionary. By making it a member of the keyspace class, it is also more future proof in that it doesn't depend on a static list of names.	2022-01-18 15:31:38 +02:00
Avi Kivity	5ed1a8217c	redis: switch from replica/database to data_dictionary redis uses replica/database only for data dictionary purposes; switch it to the much lighter weight data_dictionary module. Closes #9926	2022-01-18 13:26:29 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except for licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Raphael S. Carvalho	299ffb1e1a	compaction: make TWCS reshape on a time bucket with tons of files much more efficient Currently, when TWCS reshape finds a bucket containing more than 32 files, it will blindly resize that bucket to 32. That's very bad because it doesn't take into consideration that compaction efficiency depends on relative sizes of files being compacted together, meaning that a huge file can be compacted with a tiny one, producing lots of write amplification. To solve this problem, STCS reshape logic will now be reused in each time bucket. So only similar-sized files are compacted together and the time bucket will be considered reshaped once its size tiers are properly compacted, according to the reshape mode. Fixes #9938. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>	2022-01-18 12:33:54 +02:00
Botond Dénes	8ac7c4f523	docs/design-notes/IDL.md: fix typo: s/on only/only/ Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220118094416.242409-1-bdenes@scylladb.com>	2022-01-18 12:30:39 +02:00
Benny Halevy	84e80f7b99	table: snapshot: handle error from seal_snapshot If seal_snapshot fails we currently do not signal the manifest_write semaphore and shards waiting for it will be blocked forever. Also, call manifest_write.wait in a `finally` clause rather than in a `then` clause, even though `my_work` future never fails at the moment, to make this future proof. Fixes #9936 Test: database_test(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220117181733.3706764-1-bhalevy@scylladb.com>	2022-01-18 12:17:01 +02:00
Avi Kivity	7260d8abed	Merge "index_reader: improve verify_end_state()" from Botond " Said method should take care of checking that parsing stopped in a valid state. This patch-set expands the existing but very lacking implementation by improving the existing error message and adding an additional check for prematurely exiting the parser in the middle of parsing an index entry, something we've seen recently in #9446. To help in debugging such issues, some additional information is added to the trace messages. The series also fixes a bug in the error handling code of the partition index cache. Refs: #9446 Tests: unit(dev) " * 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla: sstables/index_reader: process_state(): add additional information to trace logging sstables/index_reader: verify_end_state(): add check for premature EOS sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception sstables/index_reader: add const sstable& to index_consume_entry_context sstables/index_reader: remove unused members from index_consume_entry_context	2022-01-18 12:13:08 +02:00
Benny Halevy	2ae69447b5	sstables: update_info_for_opened_data: accumulate allocated_size into bytes_on_disk bytes_on_disk is intended to reflect the bytes allocated for the sstable files on disk. Accumulating the files logical size, as done today, causes a discrepancy between information retrieved over the storage_service/sstables_info api, like nodetool status or nodetool cfstats and command line tools like df -H /var/lib/scylla. Fixes #9941 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220118070208.3963076-1-bhalevy@scylladb.com>	2022-01-18 11:33:36 +02:00
Botond Dénes	940874f3ff	sstables/index_reader: process_state(): add additional information to trace logging The amount of data available for parsing at the start of each entry, and the parsed key size.	2022-01-18 10:38:11 +02:00
Botond Dénes	afb14508c4	sstables/index_reader: verify_end_state(): add check for premature EOS Add a check which ensures that parsing ended in a valid state and not in the middle of a half-parsed entry.	2022-01-18 10:38:11 +02:00
Botond Dénes	36c0fe904e	sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception Errors during parsing are usually reported via malformed sstable exception to signify their gravity of potentially being caused by corrupt sstables. This patch converts the exception thrown in `index_consume_entry_context::verify_end_state()`. While at it the error message is improved as well. It currently suggests that parsing was ended prematurely because data ran out, while in fact the condition under which this error is thrown is the opposite: parsing ended but there is unconsumed data left. The current state is also added to the error message.	2022-01-18 10:38:11 +02:00
Botond Dénes	7508b4fd22	sstables/index_reader: add const sstable& to index_consume_entry_context To be used by the next patches to throw malformed sstable exception.	2022-01-18 10:38:11 +02:00
Botond Dénes	9f3e5ae801	sstables/index_reader: remove unused members from index_consume_entry_context The unused members are: _s and _file_name.	2022-01-18 10:38:11 +02:00
Avi Kivity	2754e29a9d	Merge "tools: make cli command-based" from Botond " Currently commands are regular switches. This has several disadvantages: * CLI programs nowadays use the command-based UX, so our tools are awkward to use to anybody used to that; * They don't stand out from regular options; * They are parsed at the same time as regular options, so all options have to be dumped to a single description; This series migrates the tools to the command based CLI. E.g. instead of scylla sstable --validate --merge /path/to/sst1 /path/to/sst2 we now have: scylla sstable validate --merge /path/to/sst1 /path/to/sst2 Which makes it much clearer that "validate" is the command and "merge" is an option. And it just looks better. Internally the command is parsed and popped from argv manually just as we do with the tool name in scylla main(). This means we know the command before even building the boost::program_options::options_description representation and thus before creating the seastar::app_template instance. Consequently we can tailor the options registered and the --help content (the application description) to the command run. So now "scylla sstable --help" prints only a general description of the tool and a list of the supported operations. Invoking "scylla sstable {operation} --help" will print a detailed description of the operation along with its specific options. This greatly improves the documentation and the usability of the tool. " Refs #9882 * 'tools-command-oriented-cli/v1' of https://github.com/denesb/scylla: tools/scylla-sstable: update general description tools/scylla-sstable: proper operation-specific --help tools/scylla-sstable: proper operation-specific options tools/scylla-sstable: s/dump/dump-data/ tools/utils: remove now unused get_selected_operation() overload tools: take operations (commands) as positional arguments tools/utils: add positional-argument based overload of get_selected_operation() tools: remove obsolete FIXMEs	2022-01-17 17:03:39 +02:00
Botond Dénes	518abe7415	test/lib/mutation_diff: force textual conversion If the compared mutations have binary keys, `colordiff` will declare the file as binary and will refuse to compare them, beyond a very unhelpful "binary files differ" summary. Add "-a" to the command line to force a treating all files as text. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220117131347.106585-1-bdenes@scylladb.com>	2022-01-17 15:27:53 +02:00
Michael Livshin	d7a993043d	shard_reader: check that _reader is valid before dereferencing After `fc729a804`, `shard_reader::close()` is not interrupted with an exception any more if read-ahead fails, so `_reader` may in fact be null. Fixes #9923 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220117120405.152927-1-michael.livshin@scylladb.com>	2022-01-17 14:39:11 +02:00
Konstantin Osipov	b96f9a3580	migration manager: fix compile error on Ubuntu 20 Thanks to an older boost, there is an ambiguity in name resolution between boost::placeholders and std::placeholders. Message-Id: <20220117094837.653145-2-kostja@scylladb.com>	2022-01-17 12:49:30 +02:00
Pavel Emelyanov	daf686739b	redis: Use local storage proxy The create_keyspace_if_not_exists_impl() gets the global instance of storage proxy, but its only caller (controller) already has it and can pass it via an argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220117104226.22833-1-xemul@scylladb.com>	2022-01-17 12:44:22 +02:00
Benny Halevy	17e006106b	token_metadata: update_normal_tokens: avoid unneeded sort when token ownership doesn't change Currently, we first delete all existing token mappings for the endpoint from _token_to_endpoint_map and then we add all updated token mappings for it and set should_sort_tokens if the token is newly inserted, but since we removed all existing mappings for the endpoint unconditionally, we will sort the tokens even if the token existed and its ownership did not change. This is worthwhile since there are scenarios where none of the token ownerships change. Searching and erasing tokens from the tokens unordered_set runs at constant time on average so doing it for n tokens is O(n), while sorting the tokens is O(n*log(n)). Test: unit(dev) DTest: replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap(dev,debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220117101242.122512-2-bhalevy@scylladb.com>	2022-01-17 12:18:42 +02:00
Benny Halevy	25977db7b4	token_metadata: remove update_normal_token entry point It's currently used only by unit tests and it is dangerous to use on a populated token_metadata as update_normal_tokens assumes that the set of tokens owned by the given endpoint is complete, i.e. previous tokens owned by the endpoint are no longer owned by it, but the single-token update_normal_token interface seems cumulative (and has no documentation whatsoever). It is better to remove this interface and calculate a complete map of endpoint->tokens from the tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220117101242.122512-1-bhalevy@scylladb.com>	2022-01-17 12:18:42 +02:00
Nadav Har'El	8fd5041092	cql: INSERT JSON should refuse empty-string partition key Add the missing partition-key validation in INSERT JSON statements. Scylla, following the lead of Cassandra, forbids an empty-string partition key (please note that this is not the same as a null partition key, and that null clustering keys are allowed). Trying to INSERT, UPDATE or DELETE a partition with an empty string as the partition key fails with a "Key may not be empty". However, we had a loophole - you could insert such empty-string partition keys using an "INSERT ... JSON" statement. The problem was that the partition key validation was done in one place - `modification_statement::build_partition_keys()`. The INSERT, UPDATE and DELETE statements all inherited this same method and got the correct validation. But the INSERT JSON statement - insert_prepared_json_statement overrode the build_partition_keys() method and this override forgot to call the validation function. So in this patch we add the missing validation. Note that the validation function checks for more than just empty strings - there is also a length limit for partition keys. This patch also adds a cql-pytest reproducer for this bug. Before this patch, the test passed on Cassandra but failed on Scylla. Reported by @FortTell Fixes #9853. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220116085216.21774-1-nyh@scylladb.com>	2022-01-17 09:46:18 +01:00
Calle Wilund	97bb1be6f7	utils::is_timeout_exception: Ensure we handle nested exception types Fixes #9922 storage proxy uses is_timeout_exception to traverse different code paths. `a6202ae079` broke this (because bit rot and intermixing), by wrapping exception for information purposes. This adds check of nested types in exception handling, as well as a test for the routine itself.	2022-01-17 08:43:41 +00:00
Avi Kivity	985403ab99	view: convert build_progress_virtual_reader to flat_mutation_reader_v2 build_progress_virtual_reader is a virtual reader that trims off the last clustering key column from an underlying base table. It is here converted to flat_mutation_reader_v2. Because range_tombstone_change uses position_in_partition, not clustering_key_prefix, we need a new adjust_ckey() overload. Note the transformation is likely incorrect. When trimming the last clustering key column, an inclusive bound should change to exclusive. However, the original code did not do this, so we don't fix it here. It's immaterial anyway since the base table doesn't include range tombstones. Test: unit (dev) (which has a test for this reader) Closes #9913	2022-01-17 10:31:37 +02:00
Gleb Natapov	2aedf79152	idl-compiler: remove no longer used variable types_with_const_appearances is no longer used. Remove it. Message-Id: <YeUnoZXNcW0AdWWK@scylladb.com>	2022-01-17 10:30:30 +02:00
Nadav Har'El	82005b91b6	test/cql-pytest: really flush() in translated Cassandra tests Some of the CQL tests translated from Cassandra into the test/cql-pytest framework used the flush() function to force a flush to sstables - presumably because this exercised yet another code path, or because it reproduced bugs that Cassandra once had that were only visible when reading from sstables - not from memtables. Until now, this flush() function was stubbed and did nothing. But we do have in test/cql-pytest a flush() implementation in nodetool.py - which uses the REST API if possible and if not (e.g., when running against Cassandra) uses the external "nodetool" command. So in this patch flush() starts to use nodetool.flush() instead of doing nothing. The tests continue to pass as before after this patch, and there is no noticeable slowdown (the flush does take time, but the few times it's done are negligible in these tests). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220117073112.83994-1-nyh@scylladb.com>	2022-01-17 10:22:04 +02:00
Gleb Natapov	d65427ad81	thrift: correctly check for keyspace existence `d9c315891a` broke the check for keyspace existence. The condition is opposite. Fix it. Fixes #9927 Message-Id: <YeUhtESDHQeMHiUW@scylladb.com>	2022-01-17 10:20:48 +02:00
Avi Kivity	fec0c09756	Merge "Convert scrub and validation to v2" from Botond " As a prerequisite the mutation fragment stream validator is converted to v2 as well (but it still supports v1). We get one step closer to eliminate conversions altogether from compaction.cc. Tests: unit(dev) " * 'scrub-v2/v1' of https://github.com/denesb/scylla: mutation_writer: remove v1 version segregate_by_partition() compaction/compaction: remove v1 version of validate and scrub reader factory methods tools/scylla-sstable: migrate to v2 test/boost/sstable_compaction_test: migrate validation tests to v2 test/boost/sstable_compaction_test: migrate scrub tests to v2 test/lib/simple_schema: add v2 of make_row() and make_static_row() compaction: use v2 version of mutation_writer::segregate_by_partition() mutation_writer: add v2 version of segregate_by_partition() compaction: migrate scrub and validate to v2 mutation_fragment_stream_validator: migrate validator to v2	2022-01-16 18:25:07 +02:00
Avi Kivity	52b7778ae6	Merge "repair: make sure there is one permit per repair with count res" from Botond " Repair obtains a permit for each repair-meta instance it creates. This permit is supposed to track all resources consumed by that repair as well as ensure concurrency limit is respected. However when the non-local reader path is used (shard config of master != shard config of follower), a second permit will be obtained -- for the shard reader of the multishard reader. This creates a situation where the repair-meta's permit can block the shard permit, creating a deadlock situation. This patch solves this by dropping the count resource on the repair-meta's permit when a non-local reader path is executed -- that is a multishard reader is created. Fixes: #9751 " * 'repair-double-permit-block/v4' of https://github.com/denesb/scylla: repair: make sure there is one permit per repair with count res reader_permit: add release_base_resource()	2022-01-16 18:22:29 +02:00
Nadav Har'El	a30e71e27a	alternator: doc, test: fix mentions of reverse queries Now that issues #7586 and #9487 were fixed, reverse queries - even in long partitions - work well, we can drop the claim in alternator/docs/compatibility.md that reverse queries are buggy for large partitions. We can also remove the "xfail" mark from the test that checks this feature, as it now passes. Refs #7586 Refs #9487 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #9831	2022-01-16 17:46:26 +02:00
Gleb Natapov	dc886d96d1	idl-compiler: update the documentation with new features added recently The series to move storage_proxy verbs to the IDL added new features to the IDL compiler, but was lacking documentation. This patch documents the features.	2022-01-16 15:12:07 +02:00
Mikołaj Sielużycki	f6d9d6175f	sstables: Harden bad_alloc handling during memtable flush. dirty_memory_manager monitors memory and triggers memtable flushing if there is too much pressure. If bad_alloc happens during the flush, it may break the loop and flushes won't be triggered automatically, leading to blocked writes as memory won't be automatically released. The solution is to add exception handling to the loop, so that the inner part always returns a non-exceptional future (meaning the loop will break only on node shutdown). try/catch is used around on_internal_error instead of on_internal_error_noexcept, as the latter doesn't have a version that accepts an exception pointer. To get the exception message from std::exception_ptr a rethrow is needed anyway, so this was a simpler approach. Fixes: #4174 Message-Id: <20220114082452.89189-1-mikolaj.sieluzycki@scylladb.com>	2022-01-14 16:09:21 +02:00
Botond Dénes	b6828e899a	Merge "Postpone reshape of SSTables created by repair" from Raphael " SSTables created by repair will potentially not conform to the compaction strategy layout goal. If node shuts down before off-strategy has a chance to reshape those files, node will be forced to reshape them on restart. That causes unexpected downtime. Turns out we can skip reshape of those files on boot, and allow them to be reshaped after node becomes online, as if the node never went down. Those files will go through same procedure as files created by repair-based ops. They will be placed in maintenance set, and be reshaped iteratively until ready for integration into the main set. " Fixes #9895. tests: UNIT(dev). * 'postpone_reshape_on_repair_originated_files' of https://github.com/raphaelsc/scylla: distributed_loader: postpone reshape of repair-originated sstables sstables: Introduce filter for sstable_directory::reshape table: add fast path when offstrategy is not needed sstables: add constant for repair origin	2022-01-14 14:05:09 +02:00
Botond Dénes	c727360eca	db: convert data listeners to v2 To remove yet another back-and-forth conversion in table::make_reader_v2(). Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220114085551.565752-1-bdenes@scylladb.com>	2022-01-14 13:57:44 +02:00
Avi Kivity	4995179c6f	Merge "Use data_dictionary in client_state and validation" from Pavel E " The main motivation for the set is to expel query_processor.proxy().local_db() calls from cql3/statements code. The only places that still use q.p. like this are those calling client_state::has_..._access() checkers. Those checks can go with the data_dictionary which is already available on the query processor. This is the continuation of the `9643f84d` ("Eliminate direct storage_proxy usage from cql3 statements") patch set. As a side effect the validation/ code, that's called from has_..._access checks, is also converted to use data_dictionary. tests: unit(dev, debug) " * 'br-cql3-dictionary' of https://github.com/xemul/scylla: validation: Make validate_column_family use data_dictionary::database client_state: Make has_access use data_dictionary::database client_state: Make has_schema_access use data_dictionary::database client_state: Make has_column_family_access use data_dictionary::database client_state: Make has_keyspace_access use data_dictionary::database	2022-01-14 13:55:22 +02:00
Raphael S. Carvalho	ae3b589f12	table: Reduce off-strategy space requirement if multiple compaction rounds are required Off-strategy compaction works by iteratively reshaping the maintenance set until it's ready for integration into the main set. As repair-based ops produce disjoint sstables only, off-strategy compaction can complete the reshape in a single round. But if reshape ends up requiring more than one round, space requirement for off-strategy to succeed can be high. That's because we're only deleting input SSTables on completion. SSTables from maintenance set can be only deleted on completion as we can only merge maintenance set into main one once we're done reshaping[1]. But an SSTable that was created by a reshape and later used as an input in another reshape can be deleted immediately as its existence is not needed anywhere. [1] We don't update maintenance set after each reshape round, because that would mess with its disjointness. We also don't iteratively merge maintenance set into main set, as the data produced by a single round is potentially not ready for integration into main set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220111202950.111456-1-raphaelsc@scylladb.com>	2022-01-14 13:46:31 +02:00
Botond Dénes	3005b9b5f8	Merge "move raft verbs to the IDL" from Gleb Natapov " The series moves raft verbs to the IDL and also fixes some verbs to be one way like they were intended to be. " * 'gleb/raft-idl' of github.com:scylladb/scylla-dev: raft service: make one way raft messages truly one way raft: move raft verbs to the IDL raft: split idl to rpc and storage idl-compiler: always produce const variant of serializers raft: simplify raft idl definitions	2022-01-14 13:40:20 +02:00
Pavel Emelyanov	00de5f4876	validation: Make validate_column_family use data_dictionary::database And instantly convert the validate_keyspace() as it's not called from anywhere but the validate_column_family(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-14 13:00:53 +03:00
Pavel Emelyanov	71c3a7525b	client_state: Make has_access use data_dictionary::database This db argument is only needed to be pushed into cdc::is_log_for_some_table() helper. All callers already have the d._d.::database at hands and convert it into .real_database() call-time, so this patch effectively generalizes those calls to the .real_database(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-14 12:59:35 +03:00
Pavel Emelyanov	f22eb22b8b	client_state: Make has_schema_access use data_dictionary::database It's now called with d._d.::database converted to .real_database() right in the argument passing, so this change can be treated as the generalization of that .real_database() call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-14 12:55:53 +03:00
Pavel Emelyanov	b6bc7a9b29	client_state: Make has_column_family_access use data_dictionary::database Straightforward replacement. Internals of the has_column_family_access() temporarily get .real_database(), but it will be changed soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-14 12:55:15 +03:00
Pavel Emelyanov	1ed237120a	client_state: Make has_keyspace_access use data_dictionary::database Straightforward replacement. Internals of the has_keyspace_access() temporarily get .real_database(), but it will be changed soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-14 12:54:01 +03:00
Botond Dénes	3ce526082f	mutation_writer: remove v1 version segregate_by_partition()	2022-01-14 10:19:56 +02:00
Botond Dénes	a7f4ab6b14	compaction/compaction: remove v1 version of validate and scrub reader factory methods	2022-01-14 10:19:56 +02:00
Botond Dénes	77c9f252a1	tools/scylla-sstable: migrate to v2	2022-01-14 08:54:26 +02:00
Botond Dénes	74d3a9223c	test/boost/sstable_compaction_test: migrate validation tests to v2	2022-01-14 08:54:26 +02:00
Botond Dénes	0e1bdca71b	test/boost/sstable_compaction_test: migrate scrub tests to v2	2022-01-14 08:54:26 +02:00
Botond Dénes	da0c5adcc3	test/lib/simple_schema: add v2 of make_row() and make_static_row()	2022-01-14 08:54:26 +02:00
Botond Dénes	d57634ad46	compaction: use v2 version of mutation_writer::segregate_by_partition()	2022-01-14 08:54:26 +02:00
Botond Dénes	e772326b10	mutation_writer: add v2 version of segregate_by_partition() Just a facade using converters behind the scenes. The actual segregator is not worth migrating to v2 while mutation and the flushing readers don't have v2 versions. Still, migrating all users to a v2 API allows the conversion to happen at a single point where more work is necessary, instead of scattered around all the users. We leave the v1 version in place to aid incremental migration to the v2 one.	2022-01-14 08:54:26 +02:00
Botond Dénes	b315d17c2a	compaction: migrate scrub and validate to v2 We add v2 version of external API but leave the old v1 in place to help incremental migration. The implementation is migrated to v2.	2022-01-14 08:54:26 +02:00
Botond Dénes	f61fcfbada	mutation_fragment_stream_validator: migrate validator to v2 Add support for validating v2 streams while still keeping the v1 support. Since the underlying logic is largely independent of the format version, this is simple to do and will allow incremental migration of users.	2022-01-14 08:54:26 +02:00
Kamil Braun	168c6f47f9	replica: database: allow disabling optimized TWCS queries through compaction strategy options As requested from field engineering, add a way to disable the optimized TWCS query algorithm (use regular query path) just in case a bug or a performance regression shows up in production. To disable the optimized query path, add 'enable_optimized_twcs_queries': 'false' to compaction strategy options, e.g. ``` alter table ks.t with compaction = {'class': 'TimeWindowCompactionStrategy', 'enable_optimized_twcs_queries': 'false'}; ``` Setting the `enable_optimized_twcs_queries` key to anything other than `'false'` (note: a boolean `false` expands to a string `'false'`) or skipping it (re)enables the optimized query path. Note: the flag can be set in a cluster in the middle of upgrade. Nodes which do not understand it simply ignore it, but they do store it in their schema tables (they store the entire `compaction` map). After these nodes are upgraded, they will understand the flag and act accordingly. Note: in the situation above, some nodes may use the optimized path and some may use the regular path. This may happen also in a fully upgraded cluster when compaction options are changed concurrently to reads; there is a short period of time where the schema change propagates and some nodes got the flag but some didn't. These should not be a problem since the optimization does not change the returned read results (unless there is a bug). Generally, the flag is not intended for normal use, but for field engineers to disable it in case of a serious problem. Ref #6418. Closes #9900	2022-01-14 07:10:02 +02:00
Kamil Braun	4c3fb9ac68	conf: update description of `reversed_reads_auto_bypass_cache` in scylla.yaml Message-Id: <20220111123937.10750-1-kbraun@scylladb.com>	2022-01-13 23:49:01 +01:00
Kamil Braun	fe0366f6bc	cdc: `check_and_repair_cdc_streams`: fix indentation	2022-01-13 23:10:18 +02:00
Juliusz Stasiewicz	ea46439858	cdc: `check_and_repair_cdc_streams`: regenerate if too many streams are present If the number of streams exceeds the number of token ranges it indicates that some spurious streams from decommissioned nodes are present. In such a situation - simply regenerate. Fixes #9772 Closes #9780	2022-01-13 23:10:18 +02:00
Nadav Har'El	a0cad9585f	merge: move tests to use new schema announcement API Merged patch series from Gleb Natapov: The series moves tests to use new schema announcement API and removes the old one. Gleb Natapov (7): test: convert database_test to new schema announcement api test use new schema announcement api in cql_test_env.cc test: move cql_query_test.cc to new schema announcement api test: move memtable_test.cc to new schema announcement api test: move schema_change_test.cc to new schema announcement api migration_manager: drop unused announce_ functions migration_manager: assert that raft ops are done on shard 0 service/migration_manager.hh \| 5 --- service/migration_manager.cc \| 52 ++++++++------------------------ test/boost/cql_query_test.cc \| 3 +- test/boost/database_test.cc \| 5 +-- test/boost/memtable_test.cc \| 2 +- test/boost/schema_change_test.cc \| 18 ++++++----- test/lib/cql_test_env.cc \| 2 +- 7 files changed, 31 insertions(+), 56 deletions(-)	2022-01-13 23:10:18 +02:00
Gleb Natapov	0169e4d7ed	migration_manager: assert that raft ops are done on shard 0 Now that all consumers run on shard zero we can assert it.	2022-01-13 23:10:18 +02:00
Gleb Natapov	1ff85020b5	migration_manager: drop unused announce_ functions	2022-01-13 23:10:18 +02:00
Gleb Natapov	f0a41c102a	test: move schema_change_test.cc to new schema announcement api	2022-01-13 23:10:18 +02:00
Gleb Natapov	512556914a	test: move memtable_test.cc to new schema announcement api	2022-01-13 23:10:13 +02:00
Botond Dénes	d6efe27545	Merge 'db: config: add a flag to disable new reversed reads algorithm' from Kamil Braun Just in case the new algorithm turns out to be buggy, or give a performance regression, add a flag to fall-back to the old algorithm for use in the field. Closes #9908 * github.com:scylladb/scylla: db: config: add a flag to disable new reversed reads algorithm replica: table: remove obsolete comment about reversed reads	2022-01-13 23:09:02 +02:00
Gleb Natapov	be46109af6	test: move cql_query_test.cc to new schema announcement api	2022-01-13 23:09:02 +02:00
Avi Kivity	63d254a8d2	Merge 'gms, service: futurize and coroutinize gossiper-related code' from Pavel Solodovnikov This series greatly reduces gossipers' dependence on `seastar::async` (yet, not completely). `i_endpoint_state_change_subscriber` callbacks are converted to return futures (again, to get rid of `seastar::async` dependency), all users are adjusted appropriately (e.g. `storage_service`, `cdc::generation_service`, `streaming::stream_manager`, `view_update_backlog_broker` and `migration_manager`). This includes futurizing and coroutinizing the whole function call chain up to the `i_endpoint_state_change_subscriber` callback functions. To aid the conversion process, a non-`seastar::async` dependent variant of `utils::atomic_vector::for_each` is introduced (`for_each_futurized`). A different name is used to clearly distinguish converted and non-converted code, so that the last step (remove `seastar::async()` wrappers around callback-calling code in gossiper) is easier. This is left for a follow-up series, though. Tests: unit(dev) Closes #9844 * github.com:scylladb/scylla: service: storage_service: coroutinize `set_gossip_tokens` service: storage_service: coroutinize `leave_ring` service: storage_service: coroutinize `handle_state_left` service: storage_service: coroutinize `handle_state_leaving` service: storage_service: coroutinize `handle_state_removing` service: storage_service: coroutinize `do_drain` service: storage_service: coroutinize `shutdown_protocol_servers` service: storage_service: coroutinize `excise` service: storage_service: coroutinize `remove_endpoint` service: storage_service: coroutinize `handle_state_replacing` service: storage_service: coroutinize `handle_state_normal` service: storage_service: coroutinize `update_peer_info` service: storage_service: coroutinize `do_update_system_peers_table` service: storage_service: coroutinize `update_table` service: storage_service: coroutinize `handle_state_bootstrap` service: storage_service: futurize `notify_*` functions service: storage_service: coroutinize `handle_state_replacing_update_pending_ranges` repair: row_level_repair_gossip_helper: coroutinize `remove_row_level_repair` locator: reconnectable_snitch_helper: coroutinize `reconnect` gms: i_endpoint_state_change_subscriber: make callbacks to return futures utils: atomic_vector: introduce future-returning `for_each` function utils: atomic_vector: rename `for_each` to `thread_for_each` gms: gossiper: coroutinize `start_gossiping` gms: gossiper: coroutinize `force_remove_endpoint` gms: gossiper: coroutinize `do_status_check` gms: gossiper: coroutinize `remove_endpoint`	2022-01-13 23:09:02 +02:00
Gleb Natapov	100b44f5ff	test use new schema announcement api in cql_test_env.cc	2022-01-13 23:09:02 +02:00
Avi Kivity	230eac439e	Update seastar submodule * seastar ae8d1c28a2...5025cd44ea (2): > Merge "Lazy IO capacity replenishment" from Pavel E Fixes #9893 > configure.py: don't use deprecated mktemp()	2022-01-13 23:09:02 +02:00
Gleb Natapov	5dffc8ed3e	test: convert database_test to new schema announcement api	2022-01-13 23:09:02 +02:00
Gleb Natapov	c500a90902	raft service: make one way raft messages truly one way Raft core does not expect replies for most messages it sends, but they are defined as two way by the IDL currently. Fix them to be one way.	2022-01-13 13:14:46 +02:00
Gleb Natapov	b1fea20d36	raft: move raft verbs to the IDL	2022-01-13 13:14:46 +02:00
Gleb Natapov	8a25b740df	raft: split idl to rpc and storage Storage uses only small part of the IDL, so it can include only the part that is relevant to it.	2022-01-13 13:14:46 +02:00
Gleb Natapov	b0dee71b34	idl-compiler: always produce const variant of serializers Currently const variant is produced only if a type and its const usage are in the same idl file, but a type can be defined in one file and used as const in another.	2022-01-13 13:14:46 +02:00
Gleb Natapov	c5474f9ac2	raft: simplify raft idl definitions We may use high level types in the IDL.	2022-01-13 13:14:46 +02:00
Nadav Har'El	f842f65794	Merge 'thrift: switch to replica::database uses to data_dictionary' from Avi Kivity replica::database is (as its name indicates) a replica-side service, while thrift is coordinator-side. Convert thrift's use of replica::database for data dictionary lookups to the data_dictionary module. Since data_dictionary was missing a get_keyspaces() operation, add that. Thrift still uses replica::database to get the schema version. That should be provided by migration_manager, but changing that is left for later. Closes #9888 * github.com:scylladb/scylla: thrift: switch from replica module to data_dictionary module thrift: simplify execute_schema_command() calling convention data_dictionary: add get_keyspaces() method	2022-01-13 10:52:30 +02:00
Nadav Har'El	343c521e28	alternator: avoid large contiguous allocation in BatchGetItem The BatchGetItem request can return a very large response - according to DynamoDB documentation up to 16 MB, but presently in Alternator, we allow even more (see #5944). The problem is that the existing code prepares the entire response as a large contiguous string, resulting in oversized allocation warnings - and potentially allocation failures. So in this patch we estimate the size of the BatchGetItem response, and if it is "big enough" (currently over 100 KB), we return it with the recently added streaming output support. This streaming output doesn't avoid the extra memory copies unfortunately, but it does avoid a contiguous allocation which is the goal of this patch. After this patch, one oversized allocation warning is gone from the test: test/alternator/run test_batch.py::test_batch_get_item_large (a second oversized allocation is still present, but comes from the unrelated BatchWriteItem issue #8183). Fixes #8522 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220111170541.637176-1-nyh@scylladb.com>	2022-01-13 09:46:08 +01:00
Kamil Braun	e98711cfcb	db: config: add a flag to disable new reversed reads algorithm Just in case the new algorithm turns out to be buggy, or give a performance regression, add a flag to fall-back to the old algorithm for use in the field.	2022-01-12 18:59:19 +01:00
Avi Kivity	6205d40d5f	thrift: switch from replica module to data_dictionary module Thrift is a coordinator-side service and should not touch the replica module. Switch it to data_dictionary. The switch is straightforward with two exceptions: - client_state still receives replica::database parameters. After this change it will be easier to adapt client_state too. - calls to replica::database::get_version() remain. They should be rerouted to migration_manager instead, as that deals with schema management.	2022-01-12 19:54:38 +02:00
Kamil Braun	7fb7a406e7	replica: table: remove obsolete comment about reversed reads	2022-01-12 17:57:08 +01:00
Avi Kivity	85061b694b	thrift: simplify execute_schema_command() calling convention execute_schema_command is always called with the same first two parameters, which are always defined from the thrift_handler instance that contains its caller. Simplify it by making it a member function. This simplifies migration to data_dictionary in the next patch.	2022-01-12 18:56:47 +02:00
Avi Kivity	631a19884d	data_dictionary: add get_keyspaces() method Mirroring replica::database::get_keyspaces(), for Thrift's use. We return a vector instead of a hash map. Random access is already available via database::find_keyspace(). The name is available via the keyspace metadata, and in fact Thrift ignores the map name and uses the metadata name. Using a simpler type reduces include dependencies for this heavily used module. The function is plumbed to replica::database::get_keyspaces() so it returns the same data.	2022-01-12 18:24:38 +02:00
Raphael S. Carvalho	a144d30162	distributed_loader: postpone reshape of repair-originated sstables SSTables created by repair will potentially not conform to the compaction strategy layout goal. If node shuts down before off-strategy has a chance to reshape those files, node will be forced to reshape them on restart. That causes unexpected downtime. Turns out we can skip reshape of those files on boot, and allow them to be reshaped after node becomes online, as if the node never went down. Those files will go through same procedure as files created by repair-based ops. They will be placed in maintenance set, and be reshaped iteratively until ready for integration into the main set. Fixes #9895. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-12 13:14:31 -03:00
Nadav Har'El	8bcd23fa02	Merge: move rest of internal ddl users to use raft from Gleb The patch series moves the rest of internal ddl users to do schema change over raft (if enabled). After that series only tests are left using old API. * 'gleb/raft-schema-rest-v6' of github.com:scylladb/scylla-dev: (33 commits) migration_manager: drop no longer used functions system_distributed_keyspace: move schema creation code to use raft auth: move table creation code to use raft auth: move keyspace creation code to use raft table_helper: move schema creation code to use raft cql3: make query_processor inherit from peering_sharded_service table_helper: make setup_table() static table_helper: co-routinize setup_keyspace() redis: move schema creation code to go through raft thrift: move system_update_column_family() to raft thrift: authenticate a statement before verifying in system_update_column_family() thrift: co-routinize system_update_column_family() thrift: move system_update_keyspace() to raft thrift: authenticate a statement before verifying in system_update_keyspace() thrift: co-routinize system_update_keyspace() thrift: move system_drop_keyspace() to raft thrift: authenticate a statement before verifying in system_drop_keyspace() thrift: co-routinize system_drop_keyspace() thrift: move system_add_keyspace() to raft thrift: co-routinize system_add_keyspace() ...	2022-01-12 18:09:08 +02:00
Raphael S. Carvalho	f9e33f7046	sstables: Introduce filter for sstable_directory::reshape This will be useful to allow sstable_directory user to filter out sstables that should not be reshaped. The default filter is implemented as including everything. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-12 11:54:17 -03:00
Gleb Natapov	2aec9009ef	migration_manager: drop no longer used functions	2022-01-12 16:40:06 +02:00
Gleb Natapov	9ce62bcc33	system_distributed_keyspace: move schema creation code to use raft	2022-01-12 16:40:06 +02:00
Gleb Natapov	50b7806c57	auth: move table creation code to use raft	2022-01-12 16:40:06 +02:00
Gleb Natapov	4273a3308c	auth: move keyspace creation code to use raft	2022-01-12 16:40:06 +02:00
Gleb Natapov	03184bd786	table_helper: move schema creation code to use raft	2022-01-12 16:40:06 +02:00
Gleb Natapov	eb62e81843	cql3: make query_processor inherit from peering_sharded_service This is how we can get to a distributed object from a shard-local one.	2022-01-12 16:40:06 +02:00
Gleb Natapov	e2a29d9239	table_helper: make setup_table() static It will make it easier to move schema creation to shard 0.	2022-01-12 16:40:06 +02:00
Gleb Natapov	3995f75b30	table_helper: co-routinize setup_keyspace() Also replace open-coded loops with more modern c++ alternatives.	2022-01-12 16:40:05 +02:00
Gleb Natapov	5b4982d01f	redis: move schema creation code to go through raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	dd36150a7d	thrift: move system_update_column_family() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	bcfdcc51d6	thrift: authenticate a statement before verifying in system_update_column_family() Otherwise it is possible to infer if a table exists without having proper credentials.	2022-01-12 16:33:16 +02:00
Gleb Natapov	aec413d0f7	thrift: co-routinize system_update_column_family()	2022-01-12 16:33:16 +02:00
Gleb Natapov	d9c315891a	thrift: move system_update_keyspace() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	7ffbdde554	thrift: authenticate a statement before verifying in system_update_keyspace() Otherwise it is possible to infer if a table exists without having proper credentials.	2022-01-12 16:33:16 +02:00
Gleb Natapov	1b4538f5bd	thrift: co-routinize system_update_keyspace()	2022-01-12 16:33:16 +02:00
Gleb Natapov	64b8f4fe50	thrift: move system_drop_keyspace() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	52fc815f24	thrift: authenticate a statement before verifying in system_drop_keyspace() Otherwise it is possible to infer if a table exists without having proper credentials.	2022-01-12 16:33:16 +02:00
Gleb Natapov	45ff7e30a1	thrift: co-routinize system_drop_keyspace()	2022-01-12 16:33:16 +02:00
Gleb Natapov	a17f82c647	thrift: move system_add_keyspace() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	3a3a3f693e	thrift: co-routinize system_add_keyspace()	2022-01-12 16:33:16 +02:00
Gleb Natapov	845b617256	thrift: move system_drop_column_family() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	9b6a9b104e	thrift: co-routinize system_drop_column_family()	2022-01-12 16:33:16 +02:00
Gleb Natapov	7cfedb50bb	thrift: move system_add_column_family() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	e4ac3c2777	thrift: authenticate a statement before verifying in system_add_column_family() Otherwise it is possible to infer if a table exists without having proper credentials.	2022-01-12 16:33:16 +02:00
Gleb Natapov	d5f14306d0	thrift: co-routinize system_add_column_family()	2022-01-12 16:33:16 +02:00
Gleb Natapov	1491cc2906	alternator: move create_table() to raft	2022-01-12 16:33:16 +02:00
Gleb Natapov	0cd6d283ad	alternator: move update_table() to raft	2022-01-12 16:33:15 +02:00
Gleb Natapov	7ee39ff94b	alternator: move validation in update_table() to the beginning	2022-01-12 16:33:15 +02:00
Gleb Natapov	740b2181e1	alternator: move update_tags() to raft	2022-01-12 16:33:15 +02:00
Gleb Natapov	57be1b773e	alternator: move delete_table() to raft	2022-01-12 16:33:15 +02:00
Gleb Natapov	0ac20b5494	alternator: make some functions static Make add_stream_options, supplement_table_info, supplement_table_stream_info static. They only need a pointer to storage_proxy, so pass it directly.	2022-01-12 16:33:15 +02:00
Gleb Natapov	2e4a8bdfaa	alternator: co-routinize delete_table()	2022-01-12 16:33:15 +02:00
Gleb Natapov	459539e812	migration_manager: do not allow creating keyspace with arbitrary timestamp This was needed to fix issue #2129, which only manifested itself with auto_bootstrap set to false. The option is ignored now and we always wait for schema to sync during boot.	2022-01-12 16:33:15 +02:00
Botond Dénes	bdcbf3f71b	Merge 'database: Add error message with mutation info on commit log apply failure' from Calle Wilund Fixes #9408 While it is rare, some customer issues have shown that we can run into cases where commit log apply (writing mutations to it) fails badly. In the known cases, due to oversized mutations. While these should have been caught earlier in the call chain really, it would probably help both end users and us (trying to figure out how they got so big and how they got so far) iff we added info to the errors thrown (and printed), such as ks, cf, and mutation content. Somewhat controversial, this makes the apply with CL decision path coroutinized, mainly to be able to do the error handling for the more informative wrapper exception easier/less ugly. Could perhaps do with futurize_invoke + then_wrapper also. But future is coroutines... This is as stated somewhat problematic, it adds an allocation to perf_simple_query::write path (because of crap clang cr frame folding?). However, tasks/op remain constant and actual tps (though unstable) remain more or less the same (on my crappy measurements). Counter path is unaffected, as coroutine frame alloc replaces with(...) dtest for the wrapped exception on separate pr. Closes #9412 * github.com:scylladb/scylla: database: Add error message with mutation info on commit log apply failure database: coroutinize do_apply and apply_with_commitlog	2022-01-12 16:16:29 +02:00
Raphael S. Carvalho	6aa221a247	table: add fast path when offstrategy is not needed If there's nothing in maintenance set, then there's no need to submit an offstrategy request to manager.	2022-01-12 11:15:54 -03:00
Raphael S. Carvalho	34be8842ad	sstables: add constant for repair origin Make comparisons easy and avoid duplication Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-12 11:13:58 -03:00
Calle Wilund	a6202ae079	database: Add error message with mutation info on commit log apply failure Fixes #9408 While it is rare, some customer issues have shown that we can run into cases where commit log apply (writing mutations to it) fails badly. In the known cases, due to oversized mutations. While these should have been caught earlier in the call chain really, it would probably help both end users and us (trying to figure out how they got so big and how they got so far) iff we added info to the errors thrown (and printed), such as ks, cf, and mutation content.	2022-01-12 14:04:23 +00:00
Calle Wilund	63ea666ca0	database: coroutinize do_apply and apply_with_commitlog Somewhat controversial. Making the apply with CL decision path coroutinized, mainly to be able to in next patch make error handling more informative (because we will have exceptions that are immediate and/or futurized). This is as stated somewhat problematic, it adds an allocation to perf_simple_query::write path (because of crap clang cr frame folding?). However, tasks/op remain constant and actual tps (though unstable) remain more or less the same (on my crappy measurements). Counter path is unaffected, as coroutine frame alloc replaces with(...) alloc, and all is same and dandy. I am hoping that the simpler error + verbose code will compensate for the extra alloc.	2022-01-12 14:04:15 +00:00
Nadav Har'El	23e93a26b3	Merge 'Alternator: stream results + chunk results to remove large allocations' from Calle Wilund Refs: #9555 When running the "Kraken" dynamodb streams test to provoke the issue observed by QA, I noticed on my setup mainly two things: Large allocation stalls (+ warnings) and timeouts on read semaphores in DB. This tries to address the first issue, partly by making query_result_view serialization using chunked vector instead of linear one, and by introducing a streaming option for json return objects, avoiding linearizing to string before wire. Note that the latter has some overhead issues of its own, mainly data copying, since we essentially will be triple buffering (local, wrapped http stream, and final output stream). Still, normal string output will typically do a lot of realloc which is potential extra copies as well, so... This is not really performance tested, but with these tweaks I no longer get large alloc stalls at least, so that is a plus. :-) Closes #9713 * github.com:scylladb/scylla: alternator::executor: Use streamed result for scan etc if large result alternator::streams: Use streamed result in get_records if large result executor/server: Add routine to make stream object return rjson: Add print to stream of rjson::value query_idl: Make qr_partition::rows/query_result::partitions chunked	2022-01-12 15:53:31 +02:00
Calle Wilund	f73ca9659b	alternator::executor: Use streamed result for scan etc if large result Avoids large allocations for larger scans. Todo: determine threshold	2022-01-12 13:34:49 +00:00
Calle Wilund	0c1ff5c2f5	alternator::streams: Use streamed result in get_records if large result If we have a reasonable result set to send back to client, use direct streaming of the object. Todo: determine threshold.	2022-01-12 13:34:49 +00:00
Calle Wilund	4a8a7ef8b4	executor/server: Add routine to make stream object return Simply retains result object and sets json::json_return_type to streaming callback.	2022-01-12 13:34:49 +00:00
Calle Wilund	e2d7225df8	rjson: Add print to stream of rjson::value Allows direct stream of object to seastar::stream. While not 100% efficient, it has the advantage of avoiding large allocations (long string) for huge result messages.	2022-01-12 13:34:49 +00:00
Avi Kivity	134601a15e	Merge "Convert input side of mutation compactor to v2" from Botond " With this series the mutation compactor can now consume a v2 stream. On the output side it still uses v1, so it can now act as an online v2->v1 converter. This allows us to push out v2->v1 conversion to as far as the compactor, usually the next to last component in a read pipeline, just before the final consumer. For reads this is as far as we can go, as the intra-node ABI and hence the result-sets built are v1. For compaction we could go further and eliminate conversion altogether, but this requires some further work on both the compactor and the sstable writer and so it is left to be done later. To summarize, this patchset enables a v2 input for the compactor and it updates compaction and single partition reads to use it. " * 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla: table: add make_reader_v2() querier: convert querier_cache and {data,mutation}_querier to v2 compaction: upgrade compaction::make_interposer_consumer() to v2 mutation_reader: remove unecessary stable_flattened_mutations_consumer compaction/compaction_strategy: convert make_interposer_consumer() to v2 mutation_writer: migrate timestamp_based_splitting_writer to v2 mutation_writer: migrate shard_based_splitting_writer to v2 mutation_writer: add v2 clone of feed_writer and bucket_writer flat_mutation_reader_v2: add reader_consumer_v2 typedef mutation_reader: add v2 clone of queue_reader compact_mutation: make start_new_page() independent of mutation_fragment version compact_mutation: add support for consuming a v2 stream compact_mutation: extract range tombstone consumption into own method range_tombstone_assembler: add get_range_tombstone_change() range_tombstone_assembler: add get_current_tombstone()	2022-01-12 14:37:19 +02:00
Avi Kivity	4118f2d8be	treewide: replace deprecated seastar::later() with seastar::yield() seastar::later() was recently deprecated and replaced with two alternatives: a cheap seastar::yield() and an expensive (but more powerful) seastar::check_for_io_immediately(), that corresponds to the original later(). This patch replaces all later() calls with the weaker yield(). In all cases except one, it's unambiguously correct. In one case (test/perf scheduling_latency_measurer::stop()) it's not so unambiguous, since check_for_io_immediately() will additionally force a poll and so will cause more work to be done (but no additional tasks to be executed). However, I think that any measurement that relies on measuring the work on the last tick is so inaccurate (you need thousands of ticks to get any amount of confidence in the measurement) that in the end it doesn't matter what we pick. Tests: unit (dev) Closes #9904	2022-01-12 12:19:19 +01:00
Avi Kivity	0e5d196499	Merge "move storage proxy verbs to the IDL" from Gleb * 'gleb/sp-idl-v1' of github.com:scylladb/scylla-dev: storage_proxy: move all verbs to the IDL idl-compiler: allow const references in send() parameter list idl-compiler: support smart pointers in verb's return value idl-compiler: support multiple return value and optional in a return value idl-compiler: handle :: at the beginning of a type idl-compiler: sending one way message without timeout does not require ret value specialization as well storage_proxy: convert more address vectors to inet_address_vector_replica_set	2022-01-12 12:34:18 +02:00
Nadav Har'El	7a9f69ec38	Merge 'lister cleanup and test' from Benny Halevy Split off of #9835. The series removes extraneous includes of lister.hh from header files and adds a unit test for lister::scan_dir to test throwing an exception from the walker function passed to `scan_dir`. Test: unit(dev) Closes #9885 * github.com:scylladb/scylla: test: add lister_list lister: add more overloads of fs::path operator/ for std::string and string_view resource_manager: remove unnecessary include of lister.hh from header file sstables: sstable_directory: remove unnecessary include of lister.hh from header file	2022-01-12 08:20:07 +01:00
Nadav Har'El	c5f29fe3ea	configure.py: don't use deprecated mktemp() configure.py uses the deprecated Python function tempfile.mktemp(). Because this function is labeled a "security risk" it is also a magnet for automated security scanners... So let's replace it with the recommended tempfile.mkstemp() and avoid future complaints. The actual security implications of this mktemp() call is negligible to non-existent: First it's just the build process (configure.py), not the build product itself. Second, the worst that an attacker (which needs to run in the build machine!) can do is to cause a compilation test in configure.py to fail because it can't write to its output file. Reported by @srikanthprathi Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220111121924.615173-1-nyh@scylladb.com>	2022-01-11 17:06:14 +02:00
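A minimal sketch of the mktemp() to mkstemp() replacement described in the commit above. It is not the code from configure.py; the helper name `write_probe_source` and the sample source text are hypothetical. The point it illustrates is that `tempfile.mkstemp()` creates and opens the file in one step and returns a file descriptor plus a path, whereas `tempfile.mktemp()` only returns a name that another process could claim first.

```python
import os
import tempfile


def write_probe_source(source_text, suffix=".cc"):
    """Write source_text to a freshly created temp file and return its path.

    Hypothetical helper: mkstemp() creates the file itself, so there is no
    window between choosing the name and opening the file.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Wrap the already-open descriptor so it is closed when writing finishes.
    with os.fdopen(fd, "w") as f:
        f.write(source_text)
    return path


path = write_probe_source("int main() { return 0; }\n")
try:
    with open(path) as f:
        content = f.read()
finally:
    # mkstemp() does not delete the file; the caller is responsible for that.
    os.remove(path)
```

Unlike mktemp(), the caller owns an open descriptor from the start and must remove the file explicitly when done.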
Benny Halevy	1e6829e9f1	test: add lister_list Test the lister class. In particular the ability to abort the lister when the walker function throws an exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-11 17:04:16 +02:00
Benny Halevy	8444e50e6a	lister: add more overloads of fs::path operator/ for std::string and string_view To make it easier to append a std::string to a filesystem::path. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-11 17:04:16 +02:00
Benny Halevy	f4cd535e3d	resource_manager: remove unnecessary include of lister.hh from header file But define namespace fs = std::filesystem in the header since many use sites already depend on it and it's a convention throughout scylla's code. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-11 17:04:16 +02:00
Benny Halevy	b9c41dc0fd	sstables: sstable_directory: remove unnecessary include of lister.hh from header file The source file depends on it, not the header. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-11 17:04:16 +02:00
Botond Dénes	97d74de8fc	Merge "flat_mutation_reader: clone evictable_reader & convert some others" from Michael Livshin " The first patch introduces evictable_reader_v2, and the second one further simplifies it. We clone instead of converting because there is at least one downstream (by way of multishard_combining_reader) use that is not itself straightforward to convert at the moment (multishard_mutation_query), and because evictable_reader instances cannot be {up,down}graded (since users also access the underlying buffers). This also means that shard_reader, reader_lifecycle_policy and multishard_combining_reader have to be cloned. " * tag 'clone-evictable-reader-to-v2/v3' of https://github.com/cmm/scylla: convert make_multishard_streaming_reader() to flat_mutation_reader_v2 convert table::make_streaming_reader() to flat_mutation_reader_v2 convert make_flat_multi_range_reader() to flat_mutation_reader_v2 view_update_generator: remove unneeded call to downgrade_to_v1() introduce multishard_combining_reader_v2 introduce shard_reader_v2 introduce the reader_lifecycle_policy_v2 abstract base evictable_reader_v2: further code simplifications introduce evictable_reader_v2 & friends	2022-01-11 17:01:08 +02:00
Botond Dénes	d21803c5d0	Merge "Remove global storage proxy from pagers code" from Pavel Emelyanov " The fix is in keeping shared proxy pointer on query_pager. tests: unit(dev) " * 'br-keep-proxy-on-pager-2' of https://github.com/xemul/scylla: pager: Use local proxy pointer pager: Keep shared pointer to proxy onboard	2022-01-11 17:01:08 +02:00
Nadav Har'El	9d0eaeb90a	test/scylla-gdb: enable test for "scylla fiber" After the rewrite of the test/scylla-gdb, the test for "scylla fiber" was disabled - and this patch brings it back. For the "scylla fiber" operation to do something interesting (and not just print an error message and seem to succeed...) it needs a real task pointer. The old code interrupted Scylla in a breakpoint and used get_local_tasks(), but in the new test framework we attach to Scylla while it's idle, so there are no ready tasks. So in this patch we use the find_vptrs() function to find a continuation from http_server::do_accept_one() - it has an interesting fiber of 5 continuations. After this patch all 33 tests in test/scylla-gdb/test_misc.py pass. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220110211813.581807-1-nyh@scylladb.com>	2022-01-11 17:01:08 +02:00
Avi Kivity	861cc1d304	Update seastar submodule * seastar 28fe4214e5...ae8d1c28a2 (3): > cross-tree: convert deprecated later() to yield() > future: deprecate later(), and add two alternatives > reactor: improve lowres_clock, lowres_system_clock granularity	2022-01-11 17:01:08 +02:00
Nadav Har'El	7f5ca5bf3f	Merge 'replica: move distributed_loader to replica module' from Avi Kivity distributed_loader is replica-side thing, so it belongs in the replica module ("distributed" refers to its ability to load sstables in their correct shards). So move it to the replica module. The change exposes a dependency on the construction order of static variables (which isn't defined), so we remove the dependency in the first two patches. Closes #9891 * github.com:scylladb/scylla: replica: move distributed_loader into replica module tracing: make sure keyspace and table names are available to static constructors auth: make sure keyspace and table names are available to static constructors	2022-01-11 17:01:08 +02:00
Pavel Emelyanov	4dd1c15b7b	Merge v3 of "Deglobalize repair tracker" from Benny This series gets rid of the global repair_tracker and thread-local node_ops_metrics instances. It does so by first making the repair_tracker sharded, with an instance per repair_service shard. Then, exposing the repair_service::repair_tracker and keeping a reference to the repair_service in repair_info. Then the node_ops_metrics instances are moved from thread-local global variables to class repair_service. The motivation for this series is twofold: 1. There is a global effort to get rid of global services and instantiate all services on the stack of main() or cql_test_env. 2. As part of https://github.com/scylladb/scylla/issues/9809, we would like to eventually use a generic job tracer for both repair and compaction, so this would be one of the preliminary steps to get there. Refs #9809 Test: unit(release) (including scylla-gdb) Dtest: repair_additional_test.py::TestRepairAdditional::{test_repair_disjoint_row_2nodes,test_repair_joint_row_3nodes_2_diff_shard_count} replace_address_test.py::TestReplaceAddress::test_serve_writes_during_bootstrap[rbo_enabled] (Still seeing https://github.com/scylladb/scylla/issues/9785 but nothing worse) * github.com:bhalevy/scylla.git deglobalize-repair-tracker-v4 repair: repair_tracker: get rid of _the_tracker repair: repair_service: move free abort_repair_node_ops function to repair_service repair_service: deglobalize node_ops_metrics repair: node_ops_metrics: fixup indentation repair: node_ops_metrics: declare in header file repair: repair_info: add check_in_shutdown method repair: use repair_info to get to the repair tracker repair: move tracker-dependent free functions to repair_service repair: tracker: mark get function const repair_service: add repair_tracker getter repair: make repair_tracker sharded repair: repair_tracker: get rid of unused abort_all_abort_source repair: repair_tracker: get rid of unused shutdown abort source	2022-01-11 17:01:08 +02:00
Nadav Har'El	261c4b80b5	Update tools/java submodule * tools/java 6249bfbe2f...b1e09c8b8f (1): > dist/debian:set either python (>=2.7) or python2	2022-01-11 17:01:08 +02:00
Calle Wilund	706f20442b	query_idl: Make qr_partition::rows/query_result::partitions chunked When doing potentially large (internal) queries, i.e. alternator streams, we can cause large allocations here.	2022-01-11 13:52:40 +00:00
Michael Livshin	1f27e12dc6	convert make_multishard_streaming_reader() to flat_mutation_reader_v2 All changes are mechanical. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	be5118a7c9	convert table::make_streaming_reader() to flat_mutation_reader_v2 All changes are mechanical. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	221cd264db	convert make_flat_multi_range_reader() to flat_mutation_reader_v2 Mechanical changes and a resulting downgrade in one caller (which is itself converted later). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	91d38ef2a9	view_update_generator: remove unneeded call to downgrade_to_v1() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	7f0e228cbb	introduce multishard_combining_reader_v2 All changes are mechanical. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	4bc0deb7e9	introduce shard_reader_v2 Needed for multishard_combining_reader_v2 (see next commit), all changes are mechanical. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	6499361b6a	introduce the reader_lifecycle_policy_v2 abstract base Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	b053716e74	evictable_reader_v2: further code simplifications Almost all mechanical: not passing a `reader` parameter around when we know it's the `_reader` member, folding a short one-use method into its caller. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Michael Livshin	402dbd2ca7	introduce evictable_reader_v2 & friends Cloning instead of converting because there is at least one downstream (via multishard_combining_reader) use that is not straightforward to convert (multishard_mutation_query). The clone is mostly mechanical and much simpler than the original, because it does not have to deal with range tombstones when deciding if it is safe to pause the wrapped reader, and also does not have to trim any range tombstones. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Pavel Solodovnikov	236591be83	service: storage_service: coroutinize `set_gossip_tokens` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:37:47 +03:00
Pavel Solodovnikov	6aeccbb3b8	service: storage_service: coroutinize `leave_ring` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:37:47 +03:00
Pavel Solodovnikov	648c79347a	service: storage_service: coroutinize `handle_state_left` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:37:47 +03:00
Pavel Solodovnikov	b23c19bfb6	service: storage_service: coroutinize `handle_state_leaving` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:37:47 +03:00
Pavel Solodovnikov	99195d637d	service: storage_service: coroutinize `handle_state_removing` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:37:47 +03:00
Pavel Solodovnikov	8052ad12cc	service: storage_service: coroutinize `do_drain` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:37:45 +03:00
Pavel Solodovnikov	1593507f32	service: storage_service: coroutinize `shutdown_protocol_servers` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	0bee6976e3	service: storage_service: coroutinize `excise` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	c7d2a09424	service: storage_service: coroutinize `remove_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	210c482c4f	service: storage_service: coroutinize `handle_state_replacing` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	adfc8f8346	service: storage_service: coroutinize `handle_state_normal` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	ba113439de	service: storage_service: coroutinize `update_peer_info` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	b46ebd4fe5	service: storage_service: coroutinize `do_update_system_peers_table` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	aa363acc4b	service: storage_service: coroutinize `update_table` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	f8dbaa3722	service: storage_service: coroutinize `handle_state_bootstrap` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	f0f4a74817	service: storage_service: futurize `notify_*` functions Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	9edf2182ab	service: storage_service: coroutinize `handle_state_replacing_update_pending_ranges` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	4fcf31f11c	repair: row_level_repair_gossip_helper: coroutinize `remove_row_level_repair` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	badbfd521c	locator: reconnectable_snitch_helper: coroutinize `reconnect` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	5dcfb94d5a	gms: i_endpoint_state_change_subscriber: make callbacks to return futures Coroutinize a few simple callbacks in the process. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	adf7138b3b	utils: atomic_vector: introduce future-returning `for_each` function Introduce a variant of `for_each` function not requiring `seastar::async` context. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	b958e85c54	utils: atomic_vector: rename `for_each` to `thread_for_each` To emphasize that the function requires `seastar::thread` context to function properly. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	445876a125	gms: gossiper: coroutinize `start_gossiping` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	04b3172e6b	gms: gossiper: coroutinize `force_remove_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	a01c900d66	gms: gossiper: coroutinize `do_status_check` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	42ff01eee2	gms: gossiper: coroutinize `remove_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Raphael S. Carvalho	49eeacff37	compaction_manager: make run_with_compaction_disabled() barrier out non-regular compactions run_with_compaction_disabled() is used to temporarily disable compaction for a table T. Not only regular compaction, but all types. Turns out it's stopping all types but it's only preventing new regular compactions from starting. So major for example can start even with compaction temporarily disabled. This is fixed by not allowing compaction of any type if disabled. This wasn't possible before as scrub incorrectly ran entirely with compaction disabled, so it wouldn't be able to start, but now it only disables compaction while retrieving its candidate list. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220107154942.59800-1-raphaelsc@scylladb.com>	2022-01-10 18:57:16 +02:00
Raphael S. Carvalho	1c23d1099a	Make population more resilient when reshape fails Reshape isn't mandatory for correctness, unlike resharding. So we can allow boot to continue even in face of reshape failure. Without this, boot will fail right away due to unhandled exception. This is intended to make population more resilient as any exception, even "benign" ones, may cause boot to fail. It's better to allow boot to continue from where it left off, as if there's an exception like io error, or OOM, population will be unable to complete anyway. This patch was written based on observation that dangling errors in interposer consumer used by compaction can cause a different exception to be triggered, like broken_promise, when user asked reshape to stop. This can no longer happen now, but better safe than sorry. So regular compaction can now pick on backlog once node is online. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220107130539.14899-1-raphaelsc@scylladb.com>	2022-01-10 18:57:16 +02:00
Avi Kivity	4392c20bd3	replica: move distributed_loader into replica module distributed_loader is replica-side thing, so it belongs in the replica module ("distributed" refers to its ability to load sstables in their correct shards). So move it to the replica module.	2022-01-10 15:25:28 +02:00
Avi Kivity	bfa4abaf6b	tracing: make sure keyspace and table names are available to static constructors Static constructors (specifically for the `system_keyspaces` global variable) need their dependencies to be already constructed when their own construction begins. Because tracing uses seastar::sstring, which is not constexpr, we must change it to std::string_view (which is). Change the type and perform the required adjustments. The definition is moved to the header file for simplicity.	2022-01-10 15:24:57 +02:00
Gleb Natapov	1db151bd75	storage_proxy: move all verbs to the IDL Define all verbs in the IDL instead of manually coding them.	2022-01-10 14:58:28 +02:00
Gleb Natapov	c998f77cd2	idl-compiler: allow const references in send() parameter list Currently send function parameters and rpc handler's function parameters have both to be values, but sometimes we want send function to receive a const reference to a value to avoid copying, but a handler still needs to get it by value obviously. Support that by introducing one more type attribute [[ref]]. If present the code generator makes send function argument to look like 'const type&' and handler's argument will be 'type'.	2022-01-10 14:44:20 +02:00
Gleb Natapov	f3d5507f86	idl-compiler: support smart pointers in verb's return value A verb's handler may return a 'foreign_ptr<smart_ptr<type>>' value which is received on a client side as a naked 'type'. Current verb generator code can only support symmetric handler/send helper where return type of a handler matches return type of a send function. Fix that by adding two new attributes that can annotate a return type: unique_ptr, lw_shared_ptr. If unique_ptr attribute is present the return type of a handler will be 'foreign_ptr<unique_ptr<type>>' and the return type of a send function will be just 'type'.	2022-01-10 14:29:37 +02:00
Gleb Natapov	9329234941	idl-compiler: support multiple return value and optional in a return value RPC verbs can be extended to return more than one value and new values are returned as rpc::optional. When adding a return value to a verb its return value becomes rpc::tuple<type1, type2, type3>. In addition new return values may be marked as rpc::optional for backwards compatibility. The patch allows parsing a return expression of the form: -> type1, type2 [[version 1.1.0]] which will be translated into: rpc::tuple<type1, rpc::optional<type2>>	2022-01-10 14:23:51 +02:00
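The translation rule stated in the commit above can be illustrated with a toy sketch. This is not the idl-compiler's actual code: the function name `translate_return` and the regular expression are hypothetical, and the sketch assumes type names contain no commas (so no nested templates with multiple arguments). It only shows the mapping the message describes, where a type carrying a `[[version ...]]` attribute is wrapped in `rpc::optional` and multiple return types are wrapped in `rpc::tuple`.

```python
import re

# Matches a trailing "[[version X.Y.Z]]" attribute on a single return type.
_VERSION_ATTR = re.compile(r"\s*\[\[\s*version\s+[^\]]+\]\]\s*$")


def translate_return(expr):
    """Translate the text after '->' into a C++ return type (toy version)."""
    types = []
    for part in (p.strip() for p in expr.split(",")):
        if _VERSION_ATTR.search(part):
            # Versioned values were added later, so they become optional.
            base = _VERSION_ATTR.sub("", part).strip()
            types.append(f"rpc::optional<{base}>")
        else:
            types.append(part)
    if len(types) == 1:
        # Assumption: a single return type needs no tuple wrapper.
        return types[0]
    return "rpc::tuple<" + ", ".join(types) + ">"


result = translate_return("type1, type2 [[version 1.1.0]]")
```

For the expression quoted in the commit message, `result` is `rpc::tuple<type1, rpc::optional<type2>>`.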
Gleb Natapov	9c88ea2303	idl-compiler: handle :: at the beginning of a type Currently types starting from '::' like '::ns::type' cause parsing errors. Fix it.	2022-01-10 14:22:48 +02:00
Gleb Natapov	cf8c42ee42	idl-compiler: sending one way message without timeout does not require ret value specialization as well	2022-01-10 14:16:20 +02:00
Gleb Natapov	ff6a0fffaf	storage_proxy: convert more address vectors to inet_address_vector_replica_set	2022-01-10 13:48:20 +02:00
Benny Halevy	50a361c280	repair: repair_tracker: get rid of _the_tracker the global _the_tracker pointer is no longer used, remove it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 12:03:57 +02:00
Benny Halevy	ceb08b9302	repair: repair_service: move free abort_repair_node_ops function to repair_service Do not depend on the_repair_tracker(). With that, the_repair_tracker() is no longer used and should be deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:59:22 +02:00
Benny Halevy	6bd78eb9a6	repair_service: deglobalize node_ops_metrics Embed the node_ops_metrics instance in a sharded repair_service member. Test: curl -silent http://127.0.0.1:9180/metrics \| grep node_ops \| grep -v "^#" on a freshly started scylla instance. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:57:54 +02:00
Benny Halevy	a9c30f47fe	repair: node_ops_metrics: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:52:58 +02:00
Benny Halevy	91cee22792	repair: node_ops_metrics: declare in header file For de-globalizing its thread-local instance by placing a node_ops_metrics member in repair_service. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:52:54 +02:00
Benny Halevy	95176098d1	repair: repair_info: add check_in_shutdown method Replacing the free check_in_shutdown function. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:49:40 +02:00
Benny Halevy	abeca95093	repair: use repair_info to get to the repair tracker Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:41:10 +02:00
Benny Halevy	4db57267a6	repair: move tracker-dependent free functions to repair_service These functions are called from the api layer. Continue to hide the repair tracker from the caller but use the repair_service already available at the api layer to invoke the respective high-level methods without requiring `the_repair_tracker()`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:40:09 +02:00
Benny Halevy	6f7acc2029	repair: tracker: mark get function const Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:26:29 +02:00
Benny Halevy	861852214c	repair_service: add repair_tracker getter And rename the global repair_tracker getter to `the_repair_tracker` as the first step to get rid of it. repair_service methods now use the repair_service::repair_tracker method. The global getter was renamed to `the_repair_tracker()` temporarily while eliminating it in this series to help distinguish it from repair_service::repair_tracker(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:25:32 +02:00
Benny Halevy	2f9e701570	repair: make repair_tracker sharded Rather than keeping all shards' semaphore and repair_info:s on the tracker's single-shard instance, instantiate it on all shards, tracking the local repair jobs on its local shard. For now, until it's deglobalized, turn _the_tracker into static thread_local pointer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 11:04:37 +02:00
Benny Halevy	415e67f3c2	repair: repair_tracker: get rid of unused abort_all_abort_source Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 10:57:10 +02:00
Benny Halevy	6650cb543b	repair: repair_tracker: get rid of unused shutdown abort source Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-10 10:54:57 +02:00
Pavel Emelyanov	281ce3cbc6	pager: Use local proxy pointer There are a few places that need storage proxy and that use a global method to achieve it. Since the previous patch there's a pager local non-null pointer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-10 07:58:57 +03:00
Pavel Emelyanov	095d93eaf8	pager: Keep shared pointer to proxy onboard Pagers are created by alternator and select statement, both have the proxy reference at hand. Next, the pager's unique_ptr is put on the lambda of its fetch_page() continuation and thus it survives the fetch_page execution and then gets destroyed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-01-10 07:58:57 +03:00
Avi Kivity	05fa3e07f4	Update seastar submodule * seastar 655078dfdb...28fe4214e5 (2): > program_options: avoid including boost/program_options.hpp when possible > smp: split smp_options out of smp.hh	2022-01-09 19:56:39 +02:00
Nadav Har'El	3cc058d193	sstables: add missing include of seastar/core/metrics.hh sstables/sstables.cc uses seastar::metrics but was missing an include of <seastar/core/metrics.hh>. It probably received this include through some other random included Seastar header (e.g., smp.hh). Now that we're reducing the unnecessary inclusions in Seastar (an ongoing effort of Seastar patches), it is no longer included implicitly, and we need to include it explicitly in sstables.cc. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220109162823.511781-1-nyh@scylladb.com>	2022-01-09 18:30:50 +02:00
Nadav Har'El	63bd0807b4	test/scylla-gdb: skip tests on aarch64 As already noted in commit `eac6fb8`, many of the scylla-gdb tests fail on aarch64 for various reasons. The solution used in that commit was to have test/scylla-gdb/run pretend to succeed - without testing anything - when not running on x86_64. This workaround was accidentally lost when scylla-gdb/run was recently rewritten. This patch brings this workaround back, but in a slightly different form - Instead of the run script not doing anything, the tests do get called, but the "gdb" fixture in test/scylla-gdb/conftest.py causes each individual test to be skipped. The benefit of this approach is that it can easily be improved in the future to only skip (or xfail) specific tests which are known to fail on aarch64, instead of all of them - as half of the tests do pass on aarch64. Fixes #9892. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220109152630.506088-1-nyh@scylladb.com>	2022-01-09 17:34:23 +02:00
Avi Kivity	57188de09e	Merge 'Make dc/rack encryption work for some cases where Nat hides endpoint ips' from Eliran Sinvani This is a consolidation of #9714 and #9709 PRs by @elcallio that were reviewed by @asias The last comment on those was that they should be consolidated in order not to create a security degradation for ec2 setups. For some cases it is impossible to determine dc or rack association for nodes on outgoing connections. One example is when some IPs are hidden behind a NAT layer. In some cases this creates problems where one side of the connection is aware of the rack/dc association where the other isn't. The solution here is a two stage one: 1. First add a gossip reverse lookup that will help us determine the rack/dc association for a broader (hopefully all) range of setups and NAT situations. 2. When this fails - be more strict about downgrading a node which tries to ensure that both sides of the connection will at least downgrade the connection instead of just failing to start when it is not possible for one side to determine rack/dc association. Fixes #9653 /cc @elcallio @asias Closes #9822 * github.com:scylladb/scylla: messaging_service: Add reverse mapping of private ip -> public endpoint production_snitch_base: Do reverse lookup of endpoint for info messaging_service: Make dc/rack encryption check for connection more strict	2022-01-09 16:40:49 +02:00
Nadav Har'El	7b5a8d3bcc	init.hh: add missing include of boost/program_options.hpp init.hh relies on boost::program_options but forgot to include the header file <boost/program_options.hpp> for it. Today, this doesn't matter, because Seastar unnecessarily includes <boost/program_options.hpp> from unrelated header files (such as smp.hh) - so it ends up not being missing. But we plan to clean up Seastar from those unnecessary includes, and then including what we need in init.hh will become important. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220109123152.492466-1-nyh@scylladb.com>	2022-01-09 15:56:58 +02:00
Avi Kivity	7f285965d8	auth: make sure keyspace and table names are available to static constructors Static constructors (specifically for the `system_keyspaces` global variable) need their dependencies to be already constructed when their own construction begins. Enforce that for auth keyspace and table names using the constinit keyword.	2022-01-09 12:51:22 +02:00
Avi Kivity	6c53717a39	replica, atomic_cell: move atomic_cell merge code from replica module to atomic_cell.cc compare_atomic_cell_for_merge() was placed in database.cc, before atomic_cell.cc existed. Move it to its correct place. Closes #9889	2022-01-09 11:08:10 +02:00
Botond Dénes	31777dfec8	repair: make sure there is one permit per repair with count res Repair obtains a permit for each repair-meta instance it creates. This permit is supposed to track all resources consumed by that repair as well as ensure concurrency limit is respected. However when the non-local reader path is used (shard config of master != shard config of follower), a second permit will be obtained -- for the shard reader of the multishard reader. This creates a situation where the repair-meta's permit can block the shard permit, creating a deadlock situation. This patch solves this by dropping the count resource on the repair-meta's permit when a non-local reader path is executed -- that is a multishard reader is created.	2022-01-07 14:06:31 +02:00
Botond Dénes	4762ddec0f	reader_permit: add release_base_resource() Signals base resources to the semaphore and zeros it. This basically undoes admission.	2022-01-07 14:06:31 +02:00
Botond Dénes	1a97f4c355	table: add make_reader_v2() In fact the existing `make_reader()` is renamed to `make_reader_v2()`, dropping the `downgrade_to_v1()` from the returned reader. To ease incremental migration we add a `make_reader()` implementation which downgrades this reader back to v1. `table::as_mutation_source()` is also updated to use the v2 reader factory method.	2022-01-07 13:52:43 +02:00
Botond Dénes	85c42a5d76	querier: convert querier_cache and {data,mutation}_querier to v2 The shard_mutation_querier is left using a v1 reader in its API as the multishard query code is not ready yet. When saving this reader it is upgraded to v2 and on lookup it is downgraded to v1. This should cancel out thanks to upgrade/downgrade unwrapping.	2022-01-07 13:52:26 +02:00
Botond Dénes	15d8ea983e	compaction: upgrade compaction::make_interposer_consumer() to v2 Almost all (except the scrub one) actual interposer consumers are v2.	2022-01-07 13:52:14 +02:00
Botond Dénes	aa3c943f4c	mutation_reader: remove unecessary stable_flattened_mutations_consumer Said wrapper was conceived to make the unmovable `compact_mutation` movable, because readers wanted movable consumers. But `compact_mutation` has been movable for years now, as all its unmovable bits were moved into an `lw_shared_ptr<>` member. So drop this unnecessary wrapper and its unnecessary usages.	2022-01-07 13:52:07 +02:00
Botond Dénes	1ba19c2aa4	compaction/compaction_strategy: convert make_interposer_consumer() to v2 The underlying timestamp-based splitter is v2 already.	2022-01-07 13:51:59 +02:00
Botond Dénes	9826b5d732	mutation_writer: migrate timestamp_based_splitting_writer to v2	2022-01-07 13:51:48 +02:00
Botond Dénes	0601a465a2	mutation_writer: migrate shard_based_splitting_writer to v2	2022-01-07 13:48:53 +02:00
Botond Dénes	92244ae8ec	mutation_writer: add v2 clone of feed_writer and bucket_writer Since we have multiple writers using this that we don't want to migrate all at once, we create a v2 version of said classes so we can migrate them incrementally.	2022-01-07 13:48:43 +02:00
Botond Dénes	2d7625f4b3	flat_mutation_reader_v2: add reader_consumer_v2 typedef v2 version of the reader_consumer typedef.	2022-01-07 13:48:36 +02:00
Botond Dénes	8556cb78cc	mutation_reader: add v2 clone of queue_reader As this reader is used in a wide variety of places, it would be a nightmare to upgrade all such sites in one go. So create a v2 clone and migrate users incrementally.	2022-01-07 13:47:53 +02:00
Botond Dénes	e8a918b25c	compact_mutation: make start_new_page() independent of mutation_fragment version By using partition_region instead of mutation_fragment::kind. This will make incremental migration of users to v2 easier.	2022-01-07 13:47:39 +02:00
Botond Dénes	790e73141f	compact_mutation: add support for consuming a v2 stream Consuming either a v1 or v2 stream is supported now, but compacted fragments are still emitted in the v1 format, thus the compactor acts as an online downgrader when consuming a v2 stream. This allows pushing out downgrade to v1 on the input side all the way into the compactor. This means that reads for example can now use an all v2 reader pipeline, the still mandatory downgrade to v1 happening at the last possible place: just before creating the result-set. Mandatory because our intra-node ABI is still v1. There are consumers who are ready for v2 in principle (e.g. compaction), they have to wait a little bit more.	2022-01-07 13:42:31 +02:00
Botond Dénes	1d842e980a	compact_mutation: extract range tombstone consumption into own method Next patch wants to reuse the same code.	2022-01-07 13:42:17 +02:00
Botond Dénes	172c094388	range_tombstone_assembler: add get_range_tombstone_change()	2022-01-07 13:41:34 +02:00
Botond Dénes	3efb17a661	range_tombstone_assembler: add get_current_tombstone()	2022-01-07 13:41:25 +02:00
Botond Dénes	0f60cc84f4	Merge 'replica: create a replica module' from Avi Kivity Move the ::database, ::keyspace, and ::table classes to a new replica namespace and replica/ directory. This designates objects that only have meaning on a replica and should not be used on a coordinator (but note that not all replica-only classes should be in this module, for example compaction and sstables are lower-level objects that deserve their own modules). The module is imperfect - some additional classes like distributed_loader should also be moved, but there is only one way to untie Gordian knots. Closes #9872 * github.com:scylladb/scylla: replica: move ::database, ::keyspace, and ::table to replica namespace database: Move database, keyspace, table classes to replica/ directory	2022-01-07 13:37:40 +02:00
Botond Dénes	4f4df25687	tools/scylla-sstable: update general description We now have detailed per-operation descriptions, so remove operation-specific parts of the general one and instead add more details on the common options and arguments.	2022-01-07 12:05:49 +02:00
Botond Dénes	c6d61d47b7	tools/scylla-sstable: proper operation-specific --help Add a detailed description to each of the operations. This description replaces the general one when the operation specific help is displayed (scylla sstable {operation} --help). The existing short description of the operations is demoted to a summary and is made even shorter. This will serve as the headline on the operation specific help page, as well as the summary on the operation listing. This allows the specifics of each operation to be detailed at length instead of the terse summary that was available before.	2022-01-07 12:05:48 +02:00
Botond Dénes	51deb051d9	tools/scylla-sstable: proper operation-specific options Operation-specific options are a mess currently. Some of them are in the general options, all individual operations having to check for their presence and warn if unsupported ones are set. These options were general only when scylla-sstable had a single operation (dump). They (most of them) became specific as soon as a second one was added. Other specific options are in the awkward to use (both on the CLI and in code) operation-specific option map. This patch cleans this mess up. Each operation declares the options it supports and these are only added to the command line when the specific operation is chosen. General options now only contain options that are truly universal. As a result scylla-sstable has an operation-specific --help content now. Operation-specific options are only printed when the operation is selected: scylla sstable --help will only print generic options, while: scylla sstable dump-data --help will also print options specific to said operation. The description is the same still, but this will be fixed in the next patch too.	2022-01-07 12:05:48 +02:00
Avi Kivity	bbad8f4677	replica: move ::database, ::keyspace, and ::table to replica namespace Move replica-oriented classes to the replica namespace. The main classes moved are ::database, ::keyspace, and ::table, but a few ancillary classes are also moved. There are certainly classes that should be moved but aren't (like distributed_loader) but we have to start somewhere. References are adjusted treewide. In many cases, it is obvious that a call site should not access the replica (but the data_dictionary instead), but that is left for separate work. scylla-gdb.py is adjusted to look for both the new and old names.	2022-01-07 12:04:38 +02:00
Botond Dénes	9b5fa12c3d	tools/scylla-sstable: s/dump/dump-data/ We now have dump-{component} for all sstable components, so rename dump to dump-data to follow the established naming scheme and to clear any possible confusion about what it dumps.	2022-01-07 11:23:54 +02:00
Botond Dénes	41dec2dd50	tools/utils: remove now unused get_selected_operation() overload	2022-01-07 11:23:54 +02:00
Botond Dénes	6d4b17976f	tools: take operations (commands) as positional arguments Instead of switches. E.g.: scylla sstable dump ... instead of: scylla sstable --dump This is more in line with how most CLI interfaces work nowadays.	2022-01-07 09:38:05 +02:00
Botond Dénes	062ffaa571	tools/utils: add positional-argument based overload of get_selected_operation() As opposed to the current one, which expects the operation to be given with the --operation syntax, this new overload expects it as the first positional argument. If found and valid, it is extracted from the arglist and returned. Otherwise exit() is invoked to simplify error handling.	2022-01-07 09:38:05 +02:00
Botond Dénes	2c16fc8e9b	tools: remove obsolete FIXMEs	2022-01-07 07:21:05 +02:00
Raphael S. Carvalho	07fba4ab5d	compaction_manager: Abort reshape for tables waiting for a chance to run Tables waiting for a chance to run reshape wouldn't trigger stop exception, as the exception was only being triggered for ongoing compactions. Given that stop reshape API must abort all ongoing tasks and all pending ones, let's change run_custom_job() to trigger the exception if it found that the pending task was asked to stop. Tests: dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces unit: dev Fixes #9836. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>	2022-01-06 18:04:16 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Raphael S. Carvalho	4c28c49bc7	compaction_manager: make return of maybe_stop_on_error less confusing maybe_stop_on_error() is confusing because it returns true if the task can be retried which goes in opposite direction of its semantics. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220106143233.459903-1-raphaelsc@scylladb.com>	2022-01-06 16:39:15 +02:00
Avi Kivity	b850b34bcc	build: reduce inline threshold on aarch64 to 300 We see coroutine miscompiles with 600. Fixes #9881. Closes #9883	2022-01-06 15:13:27 +02:00
Nadav Har'El	6e2d29300c	test/scylla-gdb: a rewrite, using pytest This patch is an almost complete rewrite of the test/scylla-gdb framework for testing Scylla's gdb commands. The goals of this rewrite are described in issue #9864. In short, the goals are: 1. Use pytest to define individual test cases instead of one long Python script. This will make it easier to add more tests, to run only individual tests (e.g., test/scylla-gdb/run somefile.py::sometest), to understand which test failed when it fails - and a lot of other pytest conveniences. 2. Instead of an ad-hoc shell script to run Scylla, gdb, and the test, use the same Python code which is used in other test suites (alternator, cql-pytest, redis, and more). The resulting handling of the temporary resources (processes, directories, IP address) is more robust, and interrupting test/scylla-gdb/run will correctly kill its child processes (both Scylla and gdb). All existing gdb tests (except one - more on this below...) were easily rewritten in the new framework. The biggest change in this patch is who starts what. Before this patch, "run" starts gdb, which in turn starts Scylla, stops it on a breakpoint, and then runs various tests. After this patch, "run" starts Scylla on its own (like it does in test/cql-pytest/run, et al.), and then gdb runs pytest - and in a pytest fixture attaches to the running Scylla process. The biggest benefit of this approach is that "run" is aware of both gdb and Scylla, and can kill both abruptly with SIGKILL to end the test. But there's also a downside to this change: One of the tests (of "scylla fiber") needs access to some task object. Before this patch, Scylla was stopped on a breakpoint, and a task was available at that point. After this patch, we attach gdb to an idle Scylla, and the test cannot find any task to use. So the test_fiber() test fails for now. One way we could perhaps fix it is to add a breakpoint and "continue" Scylla a bit more after attaching to it. However, I couldn't find the right breakpoint - and we may also need to send a request to Scylla to get it to reach that breakpoint. I'm still looking for a better way to have access to some "task" object we can test on. Fixes #9864. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220102221534.1096659-1-nyh@scylladb.com>	2022-01-06 11:29:55 +02:00
Nadav Har'El	d9fe6f4c96	Merge: main: improve tool integration This set contains follow-up fixes to folding tools into the scylla executable: * Improve the app description of scylla w.r.t. tools * Add a new --list-tools option * Error out when the first argument is unrecognized Tests: unit(dev) Botond Dénes (3): main: rephrase app description main: add move tool listing to --list-tools main: improve handling of non-matching argv[1] main.cc \| 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-)	2022-01-06 10:06:28 +02:00
Botond Dénes	a37b4bbbaf	main: improve handling of non-matching argv[1] Be silent when argv[1] starts with "-", it is probably an option to scylla (and "server" is missing from the cmd line). Print an error and stop when argv[1] doesn't start with "-" and thus the user presumably meant to start either the server or a tool and mistyped it. Instead of trying to guess what they meant, stop with a clear error message.	2022-01-06 06:59:59 +02:00
Botond Dénes	fe0bfa1d7b	main: add move tool listing to --list-tools And make it the central place listing available tools (to minimize the places to update when adding a new one). The description is edited to point to this command instead of listing the tools itself.	2022-01-06 06:58:44 +02:00
Botond Dénes	ab0e39503b	main: rephrase app description Remove "compatible with Apache Cassandra", scylla is much more than that already. Rephrase the part describing the included tools such that it is clear that the scylla server is the main thing and the tools are the "extra" additions. Also use the term "tool" instead of the term "app".	2022-01-06 06:37:32 +02:00
Botond Dénes	92727ac36c	sstables/partition_index_cache: destroy entry ptr on error The error-handling code removes the cache entry but this leads to an assertion because the entry is still referenced by the entry pointer instance which is returned on the normal path. To avoid this clear the pointer on the error path and make sure there are no additional references kept to it. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220105140859.586234-2-bdenes@scylladb.com>	2022-01-05 19:03:24 +01:00
Nadav Har'El	6ebf32f4d7	types: deinline template throw_with_backtrace<marshal_exception, sstring> When a template is instantiated in a header file which is included by many source files, the compiler needs to compile it again and again. ClangBuildAnalyzer helps find the worst cases of this happening, and one of the worst happens to be seastar::throw_with_backtrace<marshal_exception, sstring> This specific template function takes (according to ClangBuildAnalyzer) 362 milliseconds to instantiate, and this is done 312 (!) times, because it reaches virtually every Scylla source file via either types.hh or compound.hh which use this idiom. Unfortunately, C++ as it exists today does not have a mechanism to avoid compiling a specific template instantiation if this was already done in some other source file. But we can do this manually using the C++11 feature of "extern template": 1. For a specific template instance, in this case seastar::throw_with_backtrace<marshal_exception, sstring>, all source files except one specify it as "extern template". This means that the code for it will NOT be built in this source file, and the compiler assumes the linker will eventually supply it. 2. At the same time, one source file instantiates this template instance once regularly, without "extern". The numbers from ClangBuildAnalyzer suggest that this patch should reduce total build time by 1% (in dev build mode), but this is hard to measure in practice because the very long build time (210 CPU minutes on my laptop) usually fluctuates by more than 1% in consecutive runs. However, we've seen in the past that a good estimate of build time is the total produced object size (du -bc build/dev/*/.o). This patch indeed reduces this total object size (in dev build mode) by exactly 1%. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220105171453.308821-1-nyh@scylladb.com>	2022-01-05 19:23:40 +02:00
Avi Kivity	d01e1a774b	Merge 'Build performance: do not include the entire <seastar/net/ip.hh>' from Nadav Har'El The header file <seastar/net/ip.hh> is a large collection of unrelated stuff, and according to ClangBuildAnalyzer, takes 2 seconds to compile for every source file that included it - and unfortunately virtually all Scylla source files included it - through either "types.hh" or "gms/inet_address.hh". That's 2300 CPU seconds wasted. In this two-patch series we completely eliminate the inclusion of <seastar/net/ip.hh> from Scylla. We still need the ipv4_address, ipv6_address types (e.g., gms/inet_address.hh uses it to hold a node's IP address) so those were split (in a Seastar patch that is already in) from ip.hh into separate small header files that we can include. This patch reduces the entire build time (of build/dev/scylla) by 4% - reducing almost 10 CPU minutes (!) from the build. Closes #9875 github.com:scylladb/scylla: build performance: do not include <seastar/net/ip.hh> build performance: speed up inclusion of <gm/inet_address.hh>	2022-01-05 17:55:07 +02:00
Nadav Har'El	6012f6f2b6	build performance: do not include <seastar/net/ip.hh> In a previous patch, we noticed that the header file <gms/inet_address.hh>, which is included, directly or indirectly, by most source files, includes <seastar/net/ip.hh> which is very slow to compile, and replaced it by the much faster-to-include <seastar/net/ipv[46]_address.hh>. However, we also included <seastar/net/ip.hh> in types.hh - and that too is included by almost every file, so the actual saving from the above patch was minimal. So in this patch we replace this include too. After this patch Scylla does not include <seastar/net/ip.hh> at all. According to ClangBuildAnalyzer, this reduces the average time to include types.hh (multiply this by 312 times!) from 4 seconds to 1.8 seconds, and reduces total build time (dev mode) by about 3%. Some of the source files were now missing some include directives that were previously provided by ip.hh - so we need to add those explicitly. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-01-05 17:29:21 +02:00
Tomasz Grabiec	382797a627	tests: perf: perf_fast_forward: Fix test_large_partition_slicing_clustering_keys for scylla_bench_large_part_ds1 schema The test case assumed int32 partition key, but scylla_bench_large_part_ds1 has int64 partition key. This resulted in no results being returned by the reader. Fix by introducing a partition key factory on the data source level. Message-Id: <20220105150550.67951-1-tgrabiec@scylladb.com>	2022-01-05 17:18:06 +02:00
Nadav Har'El	788b9c7bc0	dbuild: better documentation for how to use with ccache dbuild's README contained some vague and very partial hints on how to use ccache with dbuild. Replace them with more concrete instructions. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211229180433.781906-1-nyh@scylladb.com>	2022-01-05 16:53:08 +02:00
Botond Dénes	015d09a926	tools: utils: add configure_tool_mode() Which configures seastar to act more appropriate to a tool app. I.e. don't act as if it owns the place, taking over all system resources. These tools are often run on a developer machine, or even next to a running scylla instance, we want them to be the least intrusive possible. Also use the new tool mode in the existing tools. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211220143104.132327-1-bdenes@scylladb.com>	2022-01-05 15:33:57 +02:00
Asias He	c5784c1149	repair: Sort follower nodes by proximity Sort follower nodes by the proximity so that in the step where the master node gets missing rows from repair follower nodes, the master node has a chance to get the missing rows from a near node first (e.g., local dc node), avoiding getting rows from a far node. For example: dc1: n1, n2 dc2: n3, n4 dc3: n5, n6 Run repair on n1, with this patch, n1 will get data from n2 which is in the same dc first. [shard 0] repair - Repair 1 out of 1 ranges, id=[id=1, uuid=8b0040bd-5aa5-42e1-bb9f-58c5e7052aec], shard=0, keyspace=ks, table={cf}, range=(-6734413101754081925, -6539883972247625343], peers={127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3}, live_peers={127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3} [shard 0] repair - Before sort = {127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3} [shard 0] repair - After sort = {127.0.39.2, 127.0.39.5, 127.0.39.6, 127.0.39.4, 127.0.39.3} [shard 0] repair - Started Row Level Repair (Master): local=127.0.39.1, peers={127.0.39.2, 127.0.39.5, 127.0.39.6, 127.0.39.4, 127.0.39.3} Closes #9769	2022-01-05 14:09:59 +02:00
Nadav Har'El	e7e9001808	test/alternator: add more tests for GSI "Projection" We already have multiple tests for the unimplemented "Projection" feature of GSI and LSI (see issue #5036). This patch adds seven more test cases, focusing on various types of error conditions (e.g., trying to project the same attribute twice), esoteric corner cases (it's fine to list a key in NonKeyAttributes!), and corner cases that I expect we will have in our implementation (e.g., a projected attribute may either be a real Scylla column or just an element in a map column). All new tests pass on DynamoDB and fail on Alternator (due to #5036), so marked with "xfail". Refs #5036. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211228193748.688060-1-nyh@scylladb.com>	2022-01-05 10:35:36 +02:00
Avi Kivity	53a83c4b1e	Merge "flat_mutation_reader: convert flat_mutation_reader_from_mutations to v2" from Botond " Like flat_mutation_reader_from_fragments, this reader is also heavily used by tests to compose a specific workload for readers above it. So instead of converting it, we add a v2 variant and leave the v1 variant in place. The v2 variant was written from scratch to have built-in support for reading in reverse. It is built-on `mutation::consume()` to avoid duplicating the logic of consuming the contents of the mutation. To avoid stalls, `mutation::consume()` gets support for pausing and resuming consuming a mutation. Tests: unit(dev) " * 'flat_mutation_reader_from_mutations_v2/v2' of https://github.com/denesb/scylla: flat_mutation_reader: convert make_flat_mutation_reader_from_mutation() v2 flat_mutation_reader: extract mutation slicing into a function mutation: consume(): make it pausable/resumable mutation: consume(): restructure clustering iterator initialization test/boost/mutation_test: add rebuild test for mutation::consume()	2022-01-05 10:23:17 +02:00
Avi Kivity	2e958b3555	Merge "Coroutinization of compaction sstable rewrite procedure" from Raphael " Completes coroutinization of rewrite_sstables(). tests: UNIT(debug) " * 'rewrite_sstable_coroutinization' of https://github.com/raphaelsc/scylla: compaction_manager: coroutinize main loop in sstable rewrite procedure compaction_manager: coroutinize exception handling in sstable rewrite procedure compaction_manager: mark task::finish_compaction() as noexcept compaction_manager: make maybe_stop_on_error() more flexible	2022-01-05 10:15:19 +02:00
Raphael S. Carvalho	426450dc04	treewide: remove useless include of database.hh Wrote a script based on cpp-include to find places that needlessly included database.hh, which is expensive to process during build time. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220104204359.168895-1-raphaelsc@scylladb.com>	2022-01-05 10:15:19 +02:00
Nadav Har'El	dcc42d3815	configure.py: re-run configure.py if the build/ directory is gone When you run "configure.py", the result is not only the creation of ./build.ninja - it also creates build/<mode>/seastar/build.ninja and build/<mode>/abseil/build.ninja. After a "rm -r build" (or "ninja clean"), "ninja" will no longer work because those files are missing when Scylla's ninja tries to run ninja in those internal projects. So we need to add a dependency, e.g., that running ninja in Seastar requires build/<mode>/seastar/build.ninja to exist, and also say that the rule that (re)runs "configure.py" generates those files. After this patch, configure.py --with-some-parameters --of-your-choice rm -r build ninja works - "ninja" will re-run configure.py with the same parameters when it needs Seastar's or Abseil's build.ninja. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211230133702.869177-1-nyh@scylladb.com>	2022-01-05 10:15:19 +02:00
Nadav Har'El	5fbeae9016	cql-pytest: add a couple of default-TTL tests This patch adds a new cql-pytest test file - test_ttl.py - with currently just a couple of tests for the "with default_time_to_live" feature. One is a basic test, and second reproduces issue #9842 - that "using ttl 0" should override the default time to live, but doesn't. The test for #9842, test_default_ttl_0_override, fails on Scylla and passes on Cassandra, and is marked "xfail". Refs #9842. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211227091502.553577-1-nyh@scylladb.com>	2022-01-05 10:15:19 +02:00
Benny Halevy	e0a351e0c6	compaction_manager: stop_compaction: disallow specific types We can stop only specific compaction types. Reshard should be excluded since it mustn't be stopped. And other types of compaction types like "VALIDATION" or "INDEX_BUILD" are valid in terms of their syntax but unsupported by scylla so we better return an error rather than appear to support them. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211222133449.2177746-1-bhalevy@scylladb.com>	2022-01-05 09:32:20 +02:00
Botond Dénes	62d82b8b0e	flat_mutation_reader: convert make_flat_mutation_reader_from_mutation() v2 Since this reader is also heavily used by tests to compose a specific workload for readers above it, we just add a v2 variant, instead of changing the existing v1 one. The v2 variant was written from scratch to have built-in support for reading in reverse. It is built on `mutation::consume()` to avoid duplicating the logic of consuming the contents of the mutation. A v2 native unit test is also added.	2022-01-05 09:06:16 +02:00
Botond Dénes	2d1bb90c8e	flat_mutation_reader: extract mutation slicing into a function	2022-01-05 09:06:16 +02:00
Botond Dénes	e8ca07abed	mutation: consume(): make it pausable/resumable To avoid stalls or overconsumption for consumers which have a limit on how much they want to consume in one go, the mutation::consume() is made pausable/resumable. This happens via a cookie which is now returned as part of the returned result, and which can be passed to a later consume call to resume the previous one.	2022-01-05 09:06:16 +02:00
Botond Dénes	f1391d5c27	mutation: consume(): restructure clustering iterator initialization Instead of having a branch per each value of `consume_in_reverse`, have just two ifs with two branches each for clustering rows and range tombstones respectively, to facilitate further patching.	2022-01-05 07:29:36 +02:00
Nadav Har'El	3fbbad7d60	build performance: speed up inclusion of <gm/inet_address.hh> The header file <gms/inet_address.hh> is included, directly or indirectly, from 291 source files in Scylla. It is hard to reduce this number because Scylla relies heavily on IP addresses as keys to different things. So it is important that this header file be fast to include. Unfortunately it wasn't... ClangBuildAnalyzer measurements showed that each inclusion of this header file added a whopping 2 seconds (in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU minutes - were spent just on this header file. It was actually worse because the build also spent additional time on template instantiation (more on this below). So in this patch we: 1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid including it in one place that doesn't need it. This is just cosmetic, and doesn't significantly speed up the build. 2. Move the to_sstring() implementation from the .hh to the .cc. This saves a lot of time on template instantiations - previously every source file instantiated this to_sstring(), which was slow (that "format" thing is slow). 3. Do not include <seastar/net/ip.hh> which is a huge file including half the world. All we need from it is the type "ipv4_address", so instead include just the new <seastar/net/ipv4_address.hh>. This change brings most of the performance improvement. Some source files forgot to include various Seastar header files because the includes-everything ip.hh did it - so we need to add these missing includes in this patch. After this patch, ClangBuildAnalyzer reports that the cost of inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326 seconds. Additionally the format<inet_address> template instantiation 291 times - about half a second each - is also gone. All in all, this patch should reduce around 10 CPU minutes from the build. Refs #1 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-01-04 21:07:23 +02:00
Raphael S. Carvalho	f0b816d8e8	compaction_manager: coroutinize main loop in sstable rewrite procedure with this patch, rewrite_sstables() is now fully coroutinized. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-04 16:03:23 -03:00
Raphael S. Carvalho	c85ba1e694	compaction_manager: coroutinize exception handling in sstable rewrite procedure Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-04 15:39:54 -03:00
Raphael S. Carvalho	59a65742f9	compaction_manager: mark task::finish_compaction() as noexcept As it's intended to be used in a deferred action. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-04 15:30:04 -03:00
Raphael S. Carvalho	3fe4c2e517	compaction_manager: make maybe_stop_on_error() more flexible It's hard to integrate maybe_stop_on_error() with coroutines as it accepts a resolved future, not an exception pointer. Let's adjust its interface, making it more flexible to work with. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-01-04 15:28:30 -03:00
Raphael S. Carvalho	9a1fdb0635	sstables: stop including unused expensive headers database.hh is expensive to include, and turns out it's no longer needed. also stop including other unused ones. build time of sstables.o reduces by ~3% (cleared all caches and set cpu frequency to a fixed value before building sstables.o from scratch) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220104175908.98833-1-raphaelsc@scylladb.com>	2022-01-04 20:14:01 +02:00
Asias He	25b036f35b	repair: Improve memory usage tracking and oom protection Currently, the repair parallelism is calculated from the amount of memory allocated to repair and the memory usage per repair instance. However, the memory usage per repair instance does not take into account the max possible memory usage caused by repair followers. As a result, when repairing a table with a higher replication factor, e.g., 3 DCs, each with 3 replicas, the repair master node would use 9X the repair buffer size in the worst case. This would cause OOM when the system is under pressure. This patch introduces a semaphore to cap the max memory usage. Each repair instance takes the max possible memory usage budget before it starts. This ensures repair would never use more than the memory allocated to repair. Fixes #9817 Closes #9818.	2022-01-04 20:11:36 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent onto the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user, the old gc_grace_seconds-based GC will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Avi Kivity	5eccb42846	Merge "Host tool executables in the scylla main executable" from Botond " A big problem with scylla tool executables is that they include the entire scylla codebase and thus they are just as big as the scylla executable itself, making them impractical to deploy on production machines. We could try to combat this by selectively including only the actually needed dependencies but even ignoring the huge churn of sorting out our dependency hell (which we should do at one point anyway), some tools may genuinely depend on most of the scylla codebase. A better solution is to host the tool executables in the scylla executable itself, switching between the actual main functions to run in some way. The tools themselves don't contain a lot of code so this won't cause any considerable bloat in the size of the scylla executable itself. This series does exactly this, folds all the tool executables into the scylla one, with main() switching between the actual main it will delegate to based on an argv[1] command line argument. If this is a known tool name, the respective tool's main will be invoked. If it is "server", missing or unrecognized, the scylla main is invoked. Originally this series used argv[0] as the means to switch between the main to run. This approach was abandoned for the approach mentioned above for the following reasons: * No launcher script, hard link, soft link or similar games are needed to launch a specific tool. * No packaging needed, all tools are automatically deployed. * Explicit tool selection, no surprises after renaming scylla to something else. * Tools are discoverable via scylla's description. * Follows the trend set by modern command line multi-command or multi-app programs, like git. 
Fixes: #7801 Tests: unit(dev) " * 'tools-in-scylla-exec-v5' of https://github.com/denesb/scylla: main,tools,configure.py: fold tools into scylla exec tools: prepare for inclusion in scylla's main main: add skeleton switching code on argv[1] main: extract scylla specific code into scylla_main()	2022-01-04 17:55:07 +02:00
Calle Wilund	73c4a2f42b	messaging_service: Add reverse mapping of private ip -> public endpoint For quick reverse lookup. (cherry picked from commit `c86296f2a8`)	2022-01-04 15:14:58 +02:00
Botond Dénes	5e547dcc8a	test/boost/mutation_test: add rebuild test for mutation::consume() In the next patches we will refactor mutation::consume(). Before doing that add another test, which rebuilds the consumed mutation, comparing it with the original.	2022-01-04 11:43:46 +02:00
Nadav Har'El	e0ebde0f4f	Update seastar submodule The split of <seastar/net/ip.hh> will be useful for reducing the build time (ip.hh is huge and we don't need to include most of it) Refs #1 * seastar 8d15e8e6...655078df (13): > net: split <seastar/net/ip.hh> > Merge "Rate-limited IO capacity management" from Pavel E > util: closeable/stoppable: Introduce cancel() > loop: Improve concepts to match requirements > Merge "scoped_critical_alloc_section make conditional and volatile" from Benny > Added variadic version of when_any > websocket: define CryptoPP::byte for older cryptopp > tests: fix build (when libfmt >= 8) by adding fmt::runtime() > foreign_ptr: destroy_on: fixup indentation > foreign_ptr: expose async destroy method > when_all: when_all_state::wait_all move scoped_critical_alloc_section to captures > json: json_return_type: provide copy constructor and assignment operator > json: json_element: mark functions noexcept	2022-01-03 22:52:24 +02:00
Calle Wilund	3c02cab2f7	commitlog: Don't allow error_handler to swallow exception Fixes #9798 If an exception in allocate_segment_ex is (sub)type of std::system_error, commit_error_handler might _not_ cause throw (doh), in which case the error handling code would forget the current exception and return an unusable segment. Now only used as an exception pointer replacer. Closes #9870	2022-01-03 22:46:31 +02:00
Nadav Har'El	8774fc83d3	test/rest_api: fix "--ssl" option test/rest_api has a "--ssl" option to use encrypted CQL. It's not clear to me why this is useful (it doesn't actually test encryption of the REST API!), but as long as we have such an option, it should work. And it didn't work because of a typo - we set a "check_cql" variable to the right function, but then forgot to use it and used run.check_cql instead (which is just for unencrypted cql). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220102123202.1052930-1-nyh@scylladb.com>	2022-01-02 15:53:25 +02:00
Benny Halevy	fc729a804b	shard_reader: Continue after read_ahead error If read ahead failed, just issue a log warning and proceed to close the reader. Currently co_await will throw and the evictable reader won't be closed. This is seen occasionally in testing, e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/1010/artifact/logs-all.debug.2/1640918573898_lwt_banking_load_test.py%3A%3ATestLWTBankingLoad%3A%3Atest_bank_with_nemesis/node2.log ``` ERROR 2021-12-31 02:40:56,160 [shard 0] mutation_reader - shard_reader::close(): failed to stop reader on shard 1: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem) ``` Fixes #9865. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220102124636.2791544-1-bhalevy@scylladb.com>	2022-01-02 15:52:09 +02:00
Pavel Emelyanov	36905ce19d	scylla-gdb: Do not try to unpack None-s When 'scylla fiber' calls _walk the latter can validly return back None pointer (see `74ffafc8a7` scylla-gdb.py: scylla fiber: add actual return to early return). This None is not handled by the caller but is unpacked as if it was a valid tuple. fixes: #9860 tests: scylla-gdb(release, failure not reproduced though) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211231094311.2495-1-xemul@scylladb.com>	2021-12-31 22:21:58 +02:00
Pavel Emelyanov	946e03351e	scylla-gdb: Handle rate-limited IO scheduler groups The capacity accounting was changed, scylla-gdb.py should know the new layout. On error -- fall back to current state. tests: scylla-gdb(release, current and patched seastar) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211231073427.32453-1-xemul@scylladb.com>	2021-12-31 22:20:45 +02:00
Pavel Solodovnikov	904de0a094	gms: introduce two gossip features for raft-based cluster management The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features. These features provide a way to organize the automatic switch to raft-based cluster management. The scheme is as follows: 1. Every new node declares support for raft-based cluster ops. 2. At the moment, no nodes in the cluster can actually use raft for cluster management, until the `SUPPORTS` feature is enabled (i.e. understood by every node in the cluster). 3. After the first `SUPPORTS` feature is enabled, the nodes can declare support for the second, `USES*` feature, which means that the node can actually switch to use raft-based cluster ops. The scheme ensures that even if some nodes are down while transitioning to new bootstrap mechanism, they can easily switch to the new procedure, not risking to disrupt the cluster. The features are not actually wired to anything yet, providing a framework for the integration with `raft_group0` code, which is subject for a follow-up series. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>	2021-12-30 11:05:45 +02:00
Tomasz Grabiec	7038dc7003	lsa: Fix segment leak on memory reclamation during alloc_buf alloc_buf() calls new_buf_active() when there is no active segment to allocate a new active segment. new_buf_active() allocates memory (e.g. a new segment) so may cause memory reclamation, which may cause segment compaction, which may call alloc_buf() and re-enter new_buf_active(). The first call to new_buf_active() would then override _buf_active and cause the segment allocated during segment compaction to be leaked. This then causes abort when objects from the leaked segment are freed because the segment is expected to be present in _closed_segments, but isn't. boost::intrusive::list::erase() will fail on assertion that the object being erased is linked. Introduced in `b5ca0eb2a2`. Fixes #9821 Fixes #9192 Fixes #9825 Fixes #9544 Fixes #9508 Refs #9573 Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com>	2021-12-30 11:02:08 +02:00
Piotr Jastrzebski	85f5277a05	max_result_size: Expand the comment Add a description of how the SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT cluster feature is used and note that only coordinators check it. The decision made by a coordinator is immutable for the whole request and can be checked by looking at the page_size field. If it's set to 0 or unset, then we're handling the struct in the old way. Otherwise, the new way is used. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #9855	2021-12-29 17:34:15 +02:00
Avi Kivity	e6f7ade60c	Update tools/java submodule (python2 dependency) * tools/java 8fae618f7f...6249bfbe2f (1): > dist/debian: replace "python (>=2.7)" with "python2" Ref #9498.	2021-12-29 17:31:53 +02:00
Avi Kivity	9e74556413	Merge 'Support reverse reads in the row cache natively' from Tomasz Grabiec This change makes row cache support reverse reads natively so that reversing wrappers are not needed when reading from cache and thus the read can be executed efficiently, with similar cost as the forward-order read. The database is serving reverse reads from cache by default after this. Before, it was bypassing cache by default after `703aed3277`. Refs: #1413 Tests: - unit [dev] - manual query with build/dev/scylla and cache tracing on Closes #9454 * github.com:scylladb/scylla: tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries row_cache: partition_snapshot_row_cursor: Print more details about the current version vector row_cache: Improve trace-level logging config: Use cache for reversed reads by default config: Adjust reversed_reads_auto_bypass_cache description row_cache: Support reverse reads natively mvcc: partition_snapshot: Support slicing range tombstones in reverse test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition row_cache: Log produced range tombstones test: Make produces_range_tombstone() report ck_ranges tests: lib: random_mutation_generator: Extract make_random_range_tombstone() partition_snapshot_row_cursor: Support reverse iteration utils: immutable-collection: Make movable intrusive_btree: Make default-initialized iterator cast to false	2021-12-29 16:53:25 +02:00
Avi Kivity	4a323772c1	Merge 'Use the same page size limit in reverse queries as in forward reads' from Piotr Jastrzębski The default for get_unlimited_query_max_result_size() is 100MB (adjustable through config), whereas query::result_memory_limiter::maximum_result_size is 1MB (hard coded, should be enough for everybody) This limit is then used by the replica to decide when to break pages and, in case of reversed clustering order reads, when to fail the read when accumulated data crosses the threshold. The latter behavior stems from the fact that reversed reads had to accumulate all the data (read in forward order) before they can reverse it and return the result. Reverse reads thus need a higher limit so that they have a higher chance of succeeding. Most readers are now supporting reading in reverse natively, and only reversing wrappers (make_reversing_reader()) inserted on top of ka/la sstable readers need to accumulate all the data. In other cases, we could break pages sooner. This should lead to better stability (less memory usage) and performance (lower page build latency, higher read concurrency due to less memory footprint). Tests: unit(dev) Closes #9815 * github.com:scylladb/scylla: storage_proxy: Send page_size in the read_command gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature result_memory_accounter: use new max_result_size::get_page_size in check_local_limit max_result_size: Add page_size field	2021-12-29 15:04:01 +02:00
Nadav Har'El	4374c73d82	Merge 'Fix bad lowres_clock::duration assumptions' from Avi Kivity Some code assumes that lowres_clock::duration is milliseconds, but public documentation never claimed that. Harden the code for a change in the definition by removing the assumptions. Closes #9850 * github.com:scylladb/scylla: loading_cache: fix mixup of std::chrono::milliseconds and lowres_clock::duration service: storage_proxy: fix lowres_clock::duration assumption service: misc_services: fix lowres_clock::duration assumption gossip: fix lowres_clock::duration assumption	2021-12-28 23:32:26 +02:00
Avi Kivity	d40722d598	loading_cache: fix mixup of std::chrono::milliseconds and lowres_clock::duration lowres_clock uses the two types interchangeably, although they are not defined to be the same. Fix by using only lowres_clock::duration.	2021-12-28 21:19:08 +02:00
Avi Kivity	966bb3c8f0	service: storage_proxy: fix lowres_clock::duration assumption calculate_delay() implicitly converts a lowres_clock::duration to std::chrono::microseconds. This fails if lowres_clock::duration has higher resolution than microseconds. Fix by using an explicit conversion, which always works.	2021-12-28 21:17:14 +02:00
Avi Kivity	e2a3f974d6	service: misc_services: fix lowres_clock::duration assumption recalculate_hitrates() is defined to return future<lowres_clock::duration> but actually returns future<std::chrono::milliseconds>. This fails if the two types are not the same. Fix by returning the declared type.	2021-12-28 21:15:40 +02:00
Avi Kivity	49a603af39	gossip: fix lowres_clock::duration assumption The variable diff is assigned a type of std::chrono::milliseconds but later used to store the difference between two lowres_clock::time_point samples. This works now because the two types are the same, but fails if lowres_clock::duration changes. Remove the assumption by using lowres_clock::duration.	2021-12-28 21:13:59 +02:00
Piotr Jastrzebski	7fa3fa6e65	storage_proxy: Send page_size in the read_command When the whole cluster is already supporting separate_page_size_and_safety_limit, start sending page_size in read_command. This new value will be used for determining the page size instead of hard_limit. Fixes #9487 Fixes #7586 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-12-28 16:38:02 +01:00
Piotr Jastrzebski	02d5997377	gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature This new feature will be used to determine whether the whole cluster is ready to use the additional page_size field in max_result_size. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-12-28 16:38:02 +01:00
Piotr Jastrzebski	1ca39458f2	result_memory_accounter: use new max_result_size::get_page_size in check_local_limit This means when page_size is sent together with read_command it will be used for paged queries instead of the hard_limit. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-12-28 16:38:01 +01:00
Piotr Jastrzebski	ae2c199bcd	max_result_size: Add page_size field With this new field comes a new member function called get_page_size. This new function will be used by the result_memory_accounter to decide when to cut a page. The behaviour of get_page_size depends on whether page_size field is set. This is distinguished by page size being equal to 0 or not. When page_size is equal to 0 then it's not set and hard_limit will be returned from get_page_size. Otherwise, get_page_size will return page_size field. When read_command is received from an old node, page_size will be equal to 0 and hard_limit will be used to determine the page size. This is consistent with the behaviour on the old nodes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-12-28 16:37:49 +01:00
Valerii Ponomarov	12fa68fe67	scylla_util: return boolean calling systemd_unit.available As of now, 'systemd_unit.available' works ok only when provided unit is present. It raises Exception instead of returning boolean when provided systemd unit is absent. So, make it return boolean in both cases. Fixes https://github.com/scylladb/scylla/issues/9848 Closes #9849	2021-12-28 15:14:04 +02:00
Tomasz Grabiec	2a3450dfb7	Merge "db: save supported features after passing gossip feature check" from Pavel Solodovnikov Move saving features to `system.local#supported_features` to the point after passing all remote feature checks in the gossiper, right before joining the ring. This makes `system.local#supported_features` column to store advertised feature set. Leave a comment in the definition of `system.local` schema to reflect that. Since the column value is not actually used anywhere for now, it shouldn't affect any tests or alter the existing behavior. Later, we can optimize the gossip communication between nodes in the cluster, removing the feature check altogether in some cases (since the column value should now be monotonic). * manmanson/save_adv_features_v2: db: save supported features after passing gossip feature check db: add `save_local_supported_features` function	2021-12-28 11:26:11 +02:00
Nadav Har'El	b8786b96f4	commitlog: fix missing wait for semaphore units Commit `dcc73c5d4e` introduced a semaphore for excluding concurrent recalculations - _reserve_recalculation_guard. Unfortunately, the two places in the code which tried to take this guard just called get_units() - which returns a future<units>, not units - and never waited for this future to become available. So this patch adds the missing "co_await" needed to wait for the units to become available. Fixes #9770. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211214122612.1462436-1-nyh@scylladb.com>	2021-12-27 16:56:30 +02:00
Eliran Sinvani	6d9d00ec9c	configure.py: Set seastar scheduling groups count explicitly In order to have stability and also regression control, we set the scheduling groups parameter explicitly. Closes #9847	2021-12-27 15:48:45 +02:00
Takuya ASADA	6a834261fb	scylla_coredump_setup: prevent coredump timeout on systemd-coredump@.service On newer versions of systemd-coredump, coredumps are handled in systemd-coredump@.service, which may cause a timeout while running the systemd unit, like this: systemd[1]: systemd-coredump@xxxx.service: Service reached runtime time limit. Stopping. To prevent that, we need to override TimeoutStartSec=infinity. Fixes #9837 Closes #9841	2021-12-27 13:58:07 +02:00
Takuya ASADA	0d8f932f0b	scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8 On CentOS8, mdmonitor.service does not work correctly when using mdadm-4.1-15.el8.x86_64 and later versions. Until we find a solution, let's pin the package version to an older one which does not cause the issue (4.1-14.el8.x86_64). Fixes #9540 Closes #9782	2021-12-27 12:07:34 +02:00
Takuya ASADA	7064ae3d90	dist: fix scylla-housekeeping uuid file chmod call Should use chmod() on a file, not fchmod() Fixes #9683 Closes #9802	2021-12-27 11:47:06 +02:00
Raphael S. Carvalho	ad82ede5f3	compaction: simplify rewrite_sstables() with coroutine rewrite_sstables() is terribly nested, making it hard to read. as usual, can be nicely simplified with coroutines. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211223135012.56277-1-raphaelsc@scylladb.com>	2021-12-26 14:10:52 +02:00
Piotr Sarna	a36c8990ab	docs: move service_levels.md to design-notes Along the way, our flat structure for docs was changed to categorize the documents, but service_levels.md was forward-ported later and missed the created directory structure, so it was created as a sole document in the top directory. Move it to where the other similar docs live. Message-Id: <68079d9dd511574ee32fce15fec541ca75fca1e2.1640248754.git.sarna@scylladb.com>	2021-12-26 14:10:52 +02:00
Piotr Sarna	483a98aa14	docs: add AssemblyScript example to wasm.md The paragraph about WebAssembly missed a very useful language, AssemblyScript. An example for it is provided in this patch. Message-Id: <8d6ea1038f2944917316de29c7ca5cce88b2a148.1640248754.git.sarna@scylladb.com>	2021-12-26 14:10:52 +02:00
Avi Kivity	9643f84d81	Merge "Eliminate direct storage_proxy usage from cql3 statements" from Pavel E " The token metadata and features should be kept on the query_processor itself, so finally the "storage" API would look like this: 6 .query() 5 .get_max_result_size() 2 .mutate_with_triggers() 2 .cas() 1 .truncate_blocking() The get_max_result_size() is probably also worth moving away from storage, it seems to have nothing to do with it. tests: unit(dev) " * 'br-query-processor-in-cql-statements' of https://github.com/xemul/scylla: cql3: Generalize bounce-to-shard result creation cql3: Get data dictionary directly from query_processor create_keyspace_statement: Do not use proxy.shared_from_this() cas_request: Make read_command() accept query_processor select_statement: Replace all proxy-s with query_processor create_\|alter_table_statement: Make check_restricted_table_properties() accept query_processor create_\|alter_keyspace_statement: Make check_restricted_replication_strategy() accept query_processor role_management_statement: Make validate_cluster_support() accept query_processor drop_index_statement: Make lookup_indexed_table() accept query_processor batch_\|modification_statement: Make get_mutations accept query_processor modification_statement: Replace most of proxy-s with query_processor batch_statement: Replace most of proxy-s with query_processor cql3: Make create_arg_types()/prepare_type() accept query_processor cql3: Make .validate_while_executing() accept query_processor cql3: Make execution stages carry query_processor over cql3: Make .validate() and .check_access() accept query_processor	2021-12-26 14:10:52 +02:00
Nadav Har'El	e4b2dfb54d	alternator ttl: when node is down, secondary node continues to expire The current implementation of the Alternator expiration (TTL) feature has each node scan for expired partitions in its own primary ranges. This means that while a node is down, items in its primary ranges will not get expired. But we note that doesn't have to be this way: If only a single node is down, and RF=3, the items that node owns are still readable with QUORUM - so these items can still be safely read and checked for expiration - and also deleted. This patch implements a fairly simple solution: When a node completes scanning its own primary ranges, also checks whether any of its secondary ranges (ranges where it is the second replica) has its primary owner down. For such ranges, this node will scan them as well. This secondary scan stops if the remote node comes back up, but in that case it may happen that both nodes will work on the same range at the same time. The risks in that are minimal, though, and amount to wasted work and duplicate deletion records in CDC. In the future we could avoid this by using LWT to claim ownership on a range being scanned. We have a new dtest (see a separate patch), alternator_ttl_tests.py:: TestAlternatorTTL::test_expiration_with_down_node, which reproduces this and verifies this fix. The test starts a 5-node cluster, with 1000 items with random tokens which are due to be expired immediately. The test expects to see all items expiring ASAP, but when one of the five nodes is brought down, this doesn't happen: Some of the items are not expired, until this patch is used. Fixes #9787 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211222131933.406148-1-nyh@scylladb.com>	2021-12-26 14:10:52 +02:00
Pavel Solodovnikov	83862d9871	db: save supported features after passing gossip feature check Move saving features to `system.local#supported_features` to the point after passing all remote feature checks in the gossiper, right before joining the ring. This makes `system.local#supported_features` column to store advertised feature set. Leave a comment in the definition of `system.local` schema to reflect that. Since the column value is not actually used anywhere for now, it shouldn't affect any tests or alter the existing behavior. Later, we can optimize the gossip communication between nodes in the cluster, removing the feature check altogether in some cases (since the column value should now be monotonic). Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-23 12:48:37 +03:00
Pavel Emelyanov	d98dd0ff80	cql3: Generalize bounce-to-shard result creation The main intention is actually to free the qp.proxy() from the need to provide the get_stats() method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 11:28:44 +03:00
Pavel Emelyanov	d32de22ee8	cql3: Get data dictionary directly from query_processor After previous patches there's a whole bunch of places that do qp.proxy().data_dictionary() while the data_dictionary is present on the query processor itself and there's a public method to get one. So use it everywhere. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 11:28:44 +03:00
Pavel Emelyanov	ec101e8b56	create_keyspace_statement: Do not use proxy.shared_from_this() The prepare_schema_mutations is not sleeping method, so there's no point in getting call-local shared pointer on proxy. Plain reference is more than enough. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 11:28:44 +03:00
Pavel Emelyanov	b29d3f1758	cas_request: Make read_command() accept query_processor Just replace the argument and patch the callers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	da4c29105d	select_statement: Replace all proxy-s with query_processor This is the largest user of proxy argument. Fix them all and their callers (all sit in the same .cc file). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	70ad1d9933	create_\|alter_table_statement: Make check_restricted_table_properties() accept query_processor Patch check_restricted_table_properties() and its callers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	2ca8a580d9	create_\|alter_keyspace_statement: Make check_restricted_replication_strategy() accept query_processor Patch the check_restricted_replication_strategy() and its callers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	0ea9e2636f	role_management_statement: Make validate_cluster_support() accept query_processor Patch internal role_management_statement's methods to use query_processor Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	4c2343e8dd	drop_index_statement: Make lookup_indexed_table() accept query_processor Patch internal drop_index_statement's methods to use query_processor Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	7a15f1c402	batch_\|modification_statement: Make get_mutations accept query_processor This completes the batch_ and modification_statement rework. Also touch the private batch_statement::read_command while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	b1b230548b	modification_statement: Replace most of proxy-s with query_processor There are some internal methods that use proxy argument. Replace most of them with query_processor, next patch will fix the rest -- those that interact with batch statement. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	3bad767f67	batch_statement: Replace most of proxy-s with query_processor There are some proxy arguments left in the batch_statement internals. Fix most of them to be query_processors. Few remainders will come later as they rely on other statements to be fixed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	83c79b8133	cql3: Make create_arg_types()/prepare_type() accept query_processor Change the methods' argument, then fix compiler errors. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:28 +03:00
Pavel Emelyanov	3d373597eb	cql3: Make .validate_while_executing() accept query_processor The schema_altering_statement declares this pure virtual method. This patch changes its first argument from proxy into query processor and fixes what compiler errors about. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:54:27 +03:00
Pavel Emelyanov	bce2ed9c6c	cql3: Make execution stages carry query_processor over The batch_ , modification_ and select_ statements get proxy from query processor just to push it through execution stage. Simplify that by pushing the query processor itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:53:44 +03:00
Pavel Emelyanov	b990ca5550	cql3: Make .validate() and .check_access() accept query_processor This is mostly a sed script that replaces methods' first argument plus fixes of compiler-generated errors. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-23 10:53:44 +03:00
Benny Halevy	f7b8b809d0	sstables: parse chunked_vector<std::integral Members>: maximize chunk size Currently this parse function reads only 100KB worth of members in each iteration. Since the default max_chunk_capacity is 128KB, 100KB underutilizes the chunk capacity, and it could be safely increased to the max to reduce the number of allocations and corresponding calls to read_exactly for large arrays. Expose utils::chunked_vector::max_chunk_capacity so that the caller wouldn't have to guess this number and use it in parse(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211222103126.1819289-2-bhalevy@scylladb.com>	2021-12-22 15:47:37 +02:00
Benny Halevy	d95f6602a7	sstables: coroutinize parse functions Simplify the implementation using coroutines. This also has the potential to coalesce multiple allocations into one. test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211222103126.1819289-1-bhalevy@scylladb.com>	2021-12-22 15:47:37 +02:00
Benny Halevy	2f2e3b2e84	test: lib: index_reader_assertions: close reader before it is destroyed Otherwise, it may trip an assertion when the underlying file is closed, as seen in e.g.: https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4318/artifact/testlog/x86_64_release/sstable_3_x_test.test_read_rows_only_index.4174.log ``` test/boost/sstable_3_x_test.cc(0): Entering test case "test_read_rows_only_index" sstable_3_x_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed. Aborting on shard 0. Backtrace: 0x22557e8 0x2286842 0x7f2799e99a1f /lib64/libc.so.6+0x3d2a1 /lib64/libc.so.6+0x268a3 /lib64/libc.so.6+0x26788 /lib64/libc.so.6+0x35a15 0x222c53d 0x222c548 0xb929cc 0xc0b23b 0xa84bbf 0x24d0111 ``` Decoded: ``` __GI___assert_fail at :? ~file_data_source_impl at ./build/release/seastar/./seastar/src/core/fstream.cc:205 ~file_data_source_impl at ./build/release/seastar/./seastar/src/core/fstream.cc:202 std::default_delete<seastar::data_source_impl>::operator()(seastar::data_source_impl) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361 (inlined by) ~data_source at ././seastar/include/seastar/core/iostream.hh:55 (inlined by) ~input_stream at ././seastar/include/seastar/core/iostream.hh:254 (inlined by) ~continuous_data_consumer at ././sstables/consumer.hh:484 (inlined by) ~index_consume_entry_context at ././sstables/index_reader.hh:116 (inlined by) std::default_delete<sstables::index_consume_entry_context<sstables::index_consumer> >::operator()(sstables::index_consume_entry_context<sstables::index_consumer>) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85 (inlined by) ~unique_ptr at 
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361 (inlined by) ~index_bound at ././sstables/index_reader.hh:395 (inlined by) ~index_reader at ././sstables/index_reader.hh:435 std::default_delete<sstables::index_reader>::operator()(sstables::index_reader*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85 (inlined by) ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361 (inlined by) ~index_reader_assertions at ././test/lib/index_reader_assertions.hh:31 (inlined by) operator() at ./test/boost/sstable_3_x_test.cc:4630 ``` Test: unit(dev), sstable_3_x_test.test_read_rows_only_index(release X 10000) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211222132858.2155227-1-bhalevy@scylladb.com>	2021-12-22 15:33:22 +02:00
Raphael S. Carvalho	e80cb51b6a	distributed_loader: make shutdown clean by properly handling compaction_stopped exception Today, when resharding is interrupted, shutdown will not be clean because stopped exception interrupts the shutdown process. Let's handle stopped exception properly, to allow shutdown process to run to completion. Refs #9759 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211221175717.62293-1-raphaelsc@scylladb.com>	2021-12-22 15:08:31 +02:00
Botond Dénes	def6d48307	Merge 'gdb: Introduce "scylla lsa-check"' from Tomasz Grabiec Catches inconsistencies in LSA state. Currently: - discrepancy between segment set in _closed_segments and shard's segment descriptors - cross-shard segment references in _closed_segments - discrepancy in _closed_occupancy stats and what's in segment descriptors - segments not present in _closed_segments but present in segment descriptors Refs https://github.com/scylladb/scylla/issues/9544 Closes #9834 * github.com:scylladb/scylla: gdb: Introduce "scylla lsa-check" gdb: Make get_base_class_offset() also see indirect base classes	2021-12-22 15:08:31 +02:00
Pavel Emelyanov	7286374dba	migration_manager: Remove last occurrence of get_local_storage_proxy() The migration manager got local storage proxy reference recently, but one method still uses the global call. Fix it. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211221120034.21824-1-xemul@scylladb.com>	2021-12-22 15:08:31 +02:00
Botond Dénes	aba68c8f83	Merge "reader_concurrency_semaphore: convert to flat_mutation_reader_v2" from Michael " The second patch in this series is a mechanical conversion of reader_concurrency_semaphore to flat_mutation_reader_v2, and caller updates. The first patch is needed to pass the test suite, since without it a real reader version conversion would happen on every entry to and exit from reader_concurrency_semaphore, which is stressful (for example: mutation_reader_test.test_multishard_streaming_reader reaches 8191 conversions for a couple of readers, which somehow causes it to catch SIGSEGV in diverse and seemingly-random places). Note that in a real workload it is unreasonable to expect readers being parked in a reader_concurrency_semaphore to be pristine, so short-circuiting their version conversions will be impossible and this workaround will not really help. " * tag 'rcs-v2-v4' of https://github.com/cmm/scylla: reader_concurrency_semaphore: convert to flat_mutation_reader_v2 short-circuit flat mutation reader upgrades and downgrades	2021-12-22 15:08:31 +02:00
Tomasz Grabiec	3e81318587	gdb: Introduce "scylla lsa-check" Catches inconsistencies in LSA state. Currently: - discrepancy between segment set in _closed_segments and shard's segment descriptors - cross-shard segment references in _closed_segments - discrepancy in _closed_occupancy stats and what's in segment descriptors - segments not present in _closed_segments but present in segment descriptors	2021-12-21 21:18:52 +01:00
Tomasz Grabiec	d754504fa2	gdb: Make get_base_class_offset() also see indirect base classes I need it so that segment_descriptor is seen as inheriting from list_base_hook<>, which it does via log_heap_hook.	2021-12-21 21:18:52 +01:00
Michael Livshin	a1b8ba23d2	reader_concurrency_semaphore: convert to flat_mutation_reader_v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-12-21 11:26:17 +02:00
Michael Livshin	9f656b96ac	short-circuit flat mutation reader upgrades and downgrades When asked to upgrade a reader that itself is a downgrade, try to return the original v2 reader instead, and likewise when downgrading upgraded v1 readers. This is desirable because version transformations can result from, say, entering/leaving a reader concurrency semaphore, and the amount of such transformations is practically unbounded. Such short-circuiting is only done if it is safe, that is: the transforming reader's buffer is empty and its internal range tombstone tracking state is discardable. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-12-21 11:26:17 +02:00
Raphael S. Carvalho	64ec1c6ec6	table: Make sure major compaction doesn't miss data in memtable Make sure that major will compact data in all sstables and memtable, as tombstones sitting in memtable could shadow data in sstables. For example, a tombstone in memtable deleting a large partition could be missed in major, so space wouldn't be saved as expected. Additionally, write amplification is reduced as data in memtable won't have to travel through tiers once flushed. Fixes #9514. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211217160055.96693-2-raphaelsc@scylladb.com>	2021-12-21 07:21:34 +02:00
Raphael S. Carvalho	e1e8e020fe	tests: Allow memtable to be flushed through column_family_for_tests Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211217160055.96693-1-raphaelsc@scylladb.com>	2021-12-21 07:21:26 +02:00
Botond Dénes	bb0874b28b	main,tools,configure.py: fold tools into scylla exec The infrastructure is now in place. Remove the proxy main of the tools, and add appropriate `else if` statements to the executable switch in main.cc. Also remove the tool applications from the `apps` list and add their respective sources as dependencies to the main scylla executable. With this, we now have all tool executables living inside the scylla main one.	2021-12-20 18:27:25 +02:00
Botond Dénes	0761113d8b	tools: prepare for inclusion in scylla's main Rename actual main to `${tool_name}_main` and have a proxy main call it. In the next patch we will get rid of these proxy mains and the tool mains will be invoked from scylla's main, if the `argv[0]` matches their name. The main functions are included in a new `tools/entry_point.hh` header.	2021-12-20 18:27:19 +02:00
Botond Dénes	972d789a27	main: add skeleton switching code on argv[1] To prepare for the scylla executable hosting more than one app, switching between them using argv[1]. This is consistent with how most modern multi-app/multi-command programs work, one prominent example being git. For now only one app is present: scylla itself, called "server". If argv[1] is missing or unrecognized, this is what is used as the default for backward-compatibility. The scylla app also gets a description, which explains that scylla hosts multiple apps and lists all the available ones.	2021-12-20 18:26:38 +02:00
Raphael S. Carvalho	e05859c3f9	compaction: kill unused code for resharding_compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211217162728.114936-2-raphaelsc@scylladb.com>	2021-12-20 18:21:31 +02:00
Raphael S. Carvalho	d1f2fd7f03	compaction: rename compacting_sstable_writer to compacted_fragments_writer the name compacting_sstable_writer is misleading as it doesn't perform any compaction. let's rename it to a name that reflects more what it does. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211217162728.114936-1-raphaelsc@scylladb.com>	2021-12-20 18:21:31 +02:00
Avi Kivity	f190434beb	Merge "table,sstable_set: use v2 readers below the cache" from Botond " Convert sstable_set and table::make_sstable_reader() to v2. With this all readers below cache use the v2 format. Tests: unit(dev) " * 'table-make-sstable-reader-v2/v1' of https://github.com/denesb/scylla: table: upgrade make_sstable_reader() to v2 sstables/sstable_set: create_single_key_sstable_reader() upgrade to v2 sstables/sstable_set: remove unused and undefined make_reader() member	2021-12-20 17:53:44 +02:00
Botond Dénes	18cddd3279	table: upgrade make_sstable_reader() to v2 With this all readers below cache use the v2 format (except kl/la readers).	2021-12-20 17:40:46 +02:00
Botond Dénes	1a4ca831a4	main: extract scylla specific code into scylla_main() main() now contains only generic setup and teardown code and it delegates to scylla_main(). In the next patches we want to wire in tool executables into the scylla one. This will require selecting the main to run at runtime. scylla_main() will be just one of those (the default).	2021-12-20 17:31:46 +02:00
Botond Dénes	9027c6f936	sstables/sstable_set: create_single_key_sstable_reader() upgrade to v2 With this all methods of the sstable set create v2 readers.	2021-12-20 17:17:33 +02:00
Botond Dénes	847eddf19a	sstables/sstable_set: remove unused and undefined make_reader() member	2021-12-20 17:17:31 +02:00
Botond Dénes	55bb70a878	Merge "Make sure TWCS per-window major includes all files" from Raphael " TWCS performs STCS on a window as long as it's the most recent one. From there on, TWCS will compact all files in the past window into a single file. With some moderate write load, it could happen that there's still some compaction activity in that past window, meaning that per-window major may miss some files being currently compacted. As a result, a past window may contain more than 1 file after all compaction activity is done on its behalf, which may increase read amplification. To avoid that, TWCS will now make sure that per-window major is serialized, to make sure no files are missed. Fixes #9553. tests: unit(dev). " * 'fix_twcs_per_window_major_v3' of https://github.com/raphaelsc/scylla: TWCS: Make sure major on past window is done on all its sstables TWCS: remove needless param for STCS options TWCS: kill unused param in newest_bucket() compaction: Implement strategy control and wire it compaction: Add interface to control strategy behavior.	2021-12-20 17:12:50 +02:00
Avi Kivity	e772fcbd57	Merge "Convert combined reader to v2" from Botond " Users are adjusted by sprinkling `upgrade_to_v2()` and `downgrade_to_v1()` where necessary (or removing any of these where possible). No attempt was made to optimize and reduce the amount of v1<->v2 conversions. This is left for follow-up patches to keep this set small. The combined reader is composed of 3 layers: 1. fragment producer - pop fragments from readers, return them in batches (each fragment in a batch having the same type and pos). 2. fragment merger - merge fragment batches into single fragments 3. reader implementation glue-code Converting layers (1) and (3) was mostly mechanical. The logic of merging range tombstone changes is implemented at layer (2), so the two different producer (layer 1) implementations we have share this logic. Tests: unit(dev) " * 'combined-reader-v2/v4' of https://github.com/denesb/scylla: test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging mutation_reader: convert make_clustering_combined_reader() to v2 mutation_reader: convert position_reader_queue to v2 mutation_reader: convert make_combined_reader() overloads to v2 mutation_reader: combined_reader: convert reader_selector to v2 mutation_reader: convert combined reader to v2 mutation_reader: combined_reader: attach stream_id to mutation_fragments flat_mutation_reader_v2: add v2 version of empty reader test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation	2021-12-20 14:01:03 +02:00
Pavel Solodovnikov	96799a72d9	db: add `save_local_supported_features` function This is a utility function for writing the supported feature set to the `system.local` table. Will be used to move the corresponding part from `system_keyspace::setup_version` to the gossiper after passing remote feature check, effectively making `system.local#supported_features` store the advertised features (which already passed the feature check). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-20 13:31:52 +03:00
Botond Dénes	7f331cee01	test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging Stressing the range tombstone change merging logic.	2021-12-20 09:29:05 +02:00
Botond Dénes	e1bbc4a480	mutation_reader: convert make_clustering_combined_reader() to v2 Just sprinkle the right amount of downgrade_to_v1() and upgrade_to_v2() calls at call sites; no attempt at optimization was made.	2021-12-20 09:29:05 +02:00
Botond Dénes	2364144b19	mutation_reader: convert position_reader_queue to v2 By removing the converting (v1->v2) constructor of `reader_and_upper_bound` and adjusting its users.	2021-12-20 09:29:05 +02:00
Botond Dénes	aeddcf50a1	mutation_reader: convert make_combined_reader() overloads to v2 Just sprinkle the right amount of downgrade_to_v1() and upgrade_to_v2() calls at call sites; no attempt at optimization was made.	2021-12-20 09:29:05 +02:00
Botond Dénes	1554b94b78	mutation_reader: combined_reader: convert reader_selector to v2	2021-12-20 09:29:05 +02:00
Botond Dénes	71835bdee1	mutation_reader: convert combined reader to v2 The meat of the change is on the fragment merger level, which is now also responsible for merging range tombstone changes. The fragment producers are just mechanically converted to v2 by appending `_v2` to the appropriate type names. The beauty of this approach is that range tombstone merging happens in a single place, shared by all fragment producers (there are 2 of them). Selectors and factory functions are left as v1 for now, they will be converted incrementally by the next patches.	2021-12-20 09:29:05 +02:00
Calle Wilund	4df008adcc	production_snitch_base: Do reverse lookup of endpoint for info Refs #9709 Refs #9653 If we don't find immediate info about an endpoint, check if we're being asked about a "private" ip for the endpoint. If so, give info for this.	2021-12-20 06:20:46 +02:00
Calle Wilund	4778770814	messaging_service: Make dc/rack encryption check for connection more strict Fixes #9653 When doing an outgoing connection, in an internode_encryption=dc/rack situation we should not use endpoint/local broadcast solely to determine if we can downgrade a connection. If gossip/message_service determines that we will connect to a different address than the "official" endpoint address, we should use this to determine association of target node, and similarly, if we bind outgoing connection to interface != bc we need to use this to decide local one. Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc until such time that gossip can give accurate info on dc/rack for "internal" ip addresses of nodes.	2021-12-20 06:20:46 +02:00
Asias He	eba4a4fba4	repair: Allow ignoring dead nodes for replace operation Consider 1) n1, n2, n3, n4, n5 2) n2 and n3 are both down 3) start n6 to replace n2 4) start n7 to replace n3 We want to replace the dead nodes n2 and n3 to fix the cluster to have 5 running nodes. Replace operation in step 3 will fail because n3 is down. We would see errors like below: replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed for replace operation are down. It is highly recommended to fix the down nodes and try again. In the above example, currently, there is no way to replace any of the dead nodes. Users can either fix one of the dead nodes and run replace or run removenode operation to remove one of the dead nodes then run replace and run bootstrap to add another node. Fixing dead nodes is always the best solution but it might not be possible. Running removenode operation is not better than running replace operation (with best effort by ignoring the other dead node) in terms of data consistency. In addition, users have to run bootstrap operation to add back the removed node. So, allowing replacing in such a case is a clear win. This patch adds the --ignore-dead-nodes-for-replace option to allow running the replace operation in best-effort mode. Please note, use this option only if the dead nodes are completely broken and down, and there is no way to fix the node and bring it back. This also means the user has to make sure the ignored dead nodes specified are really down to avoid any data consistency issue. Fixes #9757 Closes #9758	2021-12-20 00:49:03 +02:00
Avi Kivity	7bdc999bba	service: paxos_state: wean off get_local_storage_proxy() Instead of calling get_local_storage_proxy in paxos_state, get it from the caller (who is, in fact, storage_proxy or one of its components). Some of the callers, although they are storage_proxy components, don't have a storage_proxy reference handy and so they ignominiously call get_local_storage_proxy() themselves. This will be adjusted later. The other callers who are, in fact, storage_proxy, have to take special care not to cross a shard boundary. When they do, smp::submit_to() is converted to sharded::invoke_on() in order to get the correct local instance. Test: unit (dev) Closes #9824	2021-12-20 00:31:13 +02:00
Nadav Har'El	252ce8afd4	Merge 'Extend stop compaction api' from Benny Halevy Allow stopping compaction by type on a given keyspace and list of tables. Also add api unit test suite that tests the existing `stop_compaction` api and the new `stop_keyspace_compaction` api. Fixes #9700 Closes #9746 * github.com:scylladb/scylla: api: storage_service: validate_keyspace: improve exception error message api: compaction_manager: add stop_keyspace_compaction api: storage_service: expose validate_keyspace and parse_tables api: compaction_manager: stop_compaction: fix type description compaction_manager: stop_compaction: expose optional table* test: api: add basic compaction_manager test	2021-12-20 00:18:46 +02:00
Tomasz Grabiec	1c80d7fec4	tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries	2021-12-19 22:43:52 +01:00
Tomasz Grabiec	d678890757	row_cache: partition_snapshot_row_cursor: Print more details about the current version vector Now the format is the same as for the "heap" version vector. Contains positions and continuity flags. Helps in debugging. Before: {cursor: position={position: clustered,ckp{...},-1}, cont=0, rev=1, current=[0], heap=[ ], latest_iterator=[{position: clustered,ckp{...},-1}]} After: {cursor: position={position: clustered,ckp{...},-1}, cont=0, rev=1, current=[{v=0, pos={position: clustered,ckp{...},-1}, cont=false}], heap=[ ], latest_iterator=[{position: clustered,ckp{...},-1}]}	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	5196d450bd	row_cache: Improve trace-level logging Print MVCC snapshot to help distinguish reads which use different snapshots. Also, print the whole cursor, not just its position. This helps in determining which MVCC version the iterator comes from.	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	65a1a0247a	config: Use cache for reversed reads by default	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	9fd1120ad5	config: Adjust reversed_reads_auto_bypass_cache description Bypassing cache is no longer necessary to use native reverse readers.	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	63351483f0	row_cache: Support reverse reads natively Some implementation notes below. When iterating in reverse, _last_row is after the current entry (_next_row) in table schema order, not before like in the forward mode. Since there is no dummy row before all entries, reverse iteration must be now prepared for the fact that advancing _next_row may land not pointing at any row. The partition_snapshot_row_cursor maintains continuity() correctly in this case, and positions the cursor before all rows, so most of the code works unchanged. The only exception is in move_to_next_entry(), which now cannot assume that failure to advance to an entry means it can end a read. maybe_drop_last_entry() is not implemented in reverse mode, which may expose reverse-only workload to the problem of accumulating dummy entries. ensure_population_lower_bound() was not updating _last_row after inserting the entry in latest version. This was not a problem for forward reads because they do not modify the row in the partition snapshot represented by _last_row. They only need the row to be there in the latest version after the call. It's different for reversed reads, which change the continuity of the entry represented by _last_row, hence _last_row needs to have the iterator updated to point to the entry from the latest version, otherwise we'd set the continuity of the previous version entry which would corrupt the continuity.	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	d0c367f44f	mvcc: partition_snapshot: Support slicing range tombstones in reverse	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	87c921dff5	test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition There may be unconsumed but expected fragments in the stream at the time of the call to produces_partition_end(). Call check_rts() sooner to avoid failures.	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	b3618163f8	row_cache: Log produced range tombstones	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	5f45d45c55	test: Make produces_range_tombstone() report ck_ranges	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	26ed0081a4	tests: lib: random_mutation_generator: Extract make_random_range_tombstone()	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	757fc1275f	partition_snapshot_row_cursor: Support reverse iteration	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	86791845ec	utils: immutable-collection: Make movable Classes with reference fields are not movable by default.	2021-12-19 22:41:35 +01:00
Pavel Emelyanov	d88ae7edae	Merge 'migration_manager: retire global storage proxy refs' from Avi Kivity Replace get_storage_proxy() and get_local_storage_proxy() with constructor-provided references. Some unneeded cases were removed. Test: unit (dev) Closes #9816 * github.com:scylladb/scylla: migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference migration_manager: don't keep storage_proxy alive during schema_check verb mm: don't capture storage proxy shared_ptr during background schema merge mm: remove stats on schema version get	2021-12-17 17:53:08 +03:00
Raphael S. Carvalho	f508f54f3e	table: move min_compaction_threshold() and compaction_enforce_min_threshold() into table_state Compaction specific methods can be implemented in table_state only, as they aren't needed elsewhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211214191822.164223-1-raphaelsc@scylladb.com>	2021-12-17 10:00:31 +02:00
Piotr Sarna	f49c20aa24	thrift: drop obtaining incorrect permits The thrift layer started partially having admission control after commit `ef1de114f0`, but code inspection suggests that it might cause use-after-free in a few cases, when a permit is obtained more than once per handling - due to the fact that some functions tail-called other functions, which also obtain a permit. These extraneous permits are not taken anymore. Tests: "please trust me" + cassandra-stress in thrift mode Message-Id: <ac5d711288b22c5fed566937722cceeabc234e16.1639394937.git.sarna@scylladb.com>	2021-12-17 09:35:24 +02:00
Avi Kivity	7c23ed888d	Update tools/jmx submodule (dropping unneeded dependencies) * tools/jmx 2c43d99...53f7f55 (1): > pom.xml: drop unneeded logging dependencies	2021-12-16 21:54:36 +02:00
Avi Kivity	a97731a7e5	migration_manager: replace uses of get_storage_proxy and get_local_storage_proxy with constructor-provided reference A static helper also gained a storage_proxy parameter.	2021-12-16 21:05:47 +02:00
Avi Kivity	aca9029c24	migration_manager: don't keep storage_proxy alive during schema_check verb The schema_check verb doesn't leak tasks, so when the verb is unregistered it will be drained. So protection for storage_proxy lifetime can be removed.	2021-12-16 21:04:27 +02:00
Avi Kivity	26c656f6ed	mm: don't capture storage proxy shared_ptr during background schema merge The definitions_update() verb captures a shared_ptr to storage_proxy to keep it alive while the background task executes. This was introduced in (2016!): commit `1429213b4c` Author: Pekka Enberg <penberg@scylladb.com> Date: Mon Mar 14 17:57:08 2016 +0200 main: Defer migration manager RPC verb registration after commitlog replay Defer registering migration manager RPC verbs after commitlog has been replayed so that our own schema is fully loaded before other nodes start querying it or sending schema updates. Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com> when moving this code from storage_proxy.cc. Later, better protection with a gate was added: commit `14de126ff8` Author: Pavel Emelyanov <xemul@scylladb.com> Date: Mon Mar 16 18:03:48 2020 +0300 migration_manager: Run background schema merge in gate The call for merge_schema_from in some cases is run in the background and thus is not aborted/waited on shutdown. This may result in use-after-free one of which is merge_schema_from -> read_schema_for_keyspace -> db::system_keyspace::query -> storage_proxy::query -> query_partition_key_range_concurrent in the latter function the proxy._token_metadata is accessed, while the respective object can be already free (unlike the storage_proxy itself that's still leaked on shutdown). 
Related bug: #5903, #5999 (cannot reproduce though) Tests: unit(dev), manual start-stop dtest(consistency.TestConsistency, dev) dtest(schema_management, dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Reviewed-by: Pekka Enberg <penberg@scylladb.com> Message-Id: <20200316150348.31118-1-xemul@scylladb.com> Since now the task execution is protected by the gate and therefore migration_manager lifetime (which is contained within that of storage_proxy, as it is constructed afterwards), capturing the shared_ptr is not needed, and we therefore remove it, as it uses the deprecated global storage_proxy accessors.	2021-12-16 21:01:06 +02:00
Botond Dénes	7db31e1bdb	mutation_reader: combined_reader: attach stream_id to mutation_fragments The fragment producer component of the combined reader returns a batch of fragments on each call to operator()(). These fragments are merged into a single one by the fragment merger. This patch adds a stream id to each fragment in the batch which identifies the stream (reader) it originates from. This will be used in the next patches to associate range-tombstone-changes originating from the same stream with each other.	2021-12-16 14:57:49 +02:00
Botond Dénes	c193bbed82	flat_mutation_reader_v2: add v2 version of empty reader Convert the v1 implementation to v2, downgrade to v1 in the existing `make_empty_flat_reader()`.	2021-12-16 14:57:49 +02:00
Botond Dénes	f15f4952be	test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation Currently the test assumes that fragments represent weakly monotonic upper bounds and therefore unconditionally overwrites the upper-bound on receiving each fragment. Range tombstones however violate this as a range tombstone with a smaller position (lower bound) may have a higher upper bound than some or all fragments that follow it in the stream. This causes test failures after converting the combined reader to v2, but not before, no idea why.	2021-12-16 14:57:49 +02:00
Nadav Har'El	9ae98dbe92	Merge 'Reduce boot time for dtest setup' from Asias He This patch helps to speed up node boot up for test setups like dtest. Nadav reported ``` With Asias's two patches to Scylla, and my patch to enable it in dtest: Boot time of 5 nodes is now down to 9 seconds! Remember we started this exercise with 214 seconds? :-) ``` Closes #9808 * github.com:scylladb/scylla: storage_service: Recheck tokens before throw in storage_service::bootstrap gossip: Do not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero	2021-12-16 13:44:42 +02:00
Pavel Emelyanov	b2a62d2b59	Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec Appending an empty range adjacent to an existing range tombstone would not deoverlap (by dropping the empty range tombstone) resulting in different (non-canonical) result depending on the order of appending. Suppose that range tombstone [a, b] covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b] split around position x. Appending [a, x) then [x, b) then [x, x) would give [a, b) Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b) The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical. Fixes #9661 Closes #9764 * github.com:scylladb/scylla: range_tombstone_list: Deoverlap adjacent empty ranges range_tombstone_list: Convert to work in terms of position_in_partition	2021-12-16 10:00:40 +03:00
Avi Kivity	c40043b142	mm: remove stats on schema version get	2021-12-15 18:56:18 +02:00
Nadav Har'El	d323b82cf6	Merge 'Introduce data_dictionary module' from Avi Kivity The full user-defined structure of the database (keyspaces, tables, user-defined types, and similar metadata, often known as the schema in other databases) is needed by much of the front-end code. But in Scylla it is deeply intertwined with the replica data management code - ::database, ::keyspace, and ::table. Not only does the front-end not need data access, it cannot get correct data via these objects since they represent just one replica out of many. This dual-role is a frequent cause of recompilations. It was solved to some degree by forward declarations, but there are still a lot of incidental dependencies. To solve this, we introduce a data_dictionary module (and namespace) to exclusively deal with greater schema metadata. It is an interface, with a backing implementation by the existing code, so it doesn't add a new source of truth. The plan is to allow mock implementations for testing as well. Test: unit (dev, release, debug). Closes #9783 * github.com:scylladb/scylla: cql3, related: switch to data_dictionary test: cql_test_env: provide access to data_dictionary storage_proxy: provide access to data_dictionary database: implement data_dictionary interface data_dictionary: add database/keyspace/table objects data_dictionary: move keyspace_metadata to data_dictionary data_dictionary: move user_types_metadata to new module data_dictionary	2021-12-15 18:29:28 +02:00
Avi Kivity	87917d2536	Merge "gms: gossiper: coroutinize a few small functions" from Pavel S " Start converting small functions in gossiper code from using `seastar::thread` context to coroutines. For now, the changes are quite trivial. Later, larger code fragments will be converted to eliminate uses of `seastar::async` function calls. Moving the code to coroutines makes the code a bit more readable and also makes it immediately evident that a given function is async just by looking at the signature (for example, for void-returning functions, a coroutine will return `future<>` instead of `void` in case of a seastar::thread-using function). Tests: unit(dev) " * 'coro_gossip_v1' of https://github.com/ManManson/scylla: gms: gossiper: coroutinize `maybe_enable_features` gms: gossiper: coroutinize `wait_alive` gms: gossiper: coroutinize `add_saved_endpoint` gms: gossiper: coroutinize `evict_from_membership`	2021-12-15 16:02:18 +02:00
Tomasz Grabiec	87e3552cb8	intrusive_btree: Make default-initialized iterator cast to false This patch makes the following expression true: !bool(iterator_base{}) It's a reasonable expectation upon which subsequent patches will rely.	2021-12-15 13:54:40 +01:00
Avi Kivity	d768e9fac5	cql3, related: switch to data_dictionary Stop using database (and including database.hh) for schema related purposes and use data_dictionary instead. data_dictionary::database::real_database() is called from several places, for these reasons: - calling yet-to-be-converted code - callers with a legitimate need to access data (e.g. system_keyspace) but with the ::database accessor removed from query_processor. We'll need to find another way to supply system_keyspace with data access. - to gain access to the wasm engine for testing whether user-defined functions compile. We'll have to find another way to do this as well. The change is a straightforward replacement. One case in modification_statement had to change a capture, but everything else was just a search-and-replace. Some files that lost "database.hh" gained "mutation.hh", which they previously had access to through "database.hh".	2021-12-15 13:54:23 +02:00
Avi Kivity	399e2895f1	test: cql_test_env: provide access to data_dictionary Allow tests to have access to the data_dictionary.	2021-12-15 13:54:18 +02:00
Avi Kivity	c2da20484d	storage_proxy: provide access to data_dictionary Probably storage_proxy is not the correct place to supply data_dictionary, but it is available to practically all of the coordinator code, so it is convenient.	2021-12-15 13:54:08 +02:00
Avi Kivity	1de0a4b823	database: implement data_dictionary interface Implement the new data_dictionary interface using the existing ::database, ::keyspace, and ::table classes. The implementation is straightforward. This will allow the coordinator code to access the full schema without depending on the gnarly bits that compose ::database, like reader_concurrency_semaphore or the backlog controller.	2021-12-15 13:53:46 +02:00
Avi Kivity	e55a606423	data_dictionary: add database/keyspace/table objects Add metadata-only counterparts to ::database, ::keyspace, and ::table. Apart from being metadata-only objects suitable for the coordinator, the new types are also type-erased and so they can be mocked without being linked to ::database and friends. We use a single abstract class to mediate between data_dictionary objects and the objects they represent (data_dictionary::impl). This makes the data_dictionary objects very lightweight - they only contain a pointer to the impl object (of which only one needs to be instantiated), and a reference to the object that is represented. This allows these objects to be easily passed by value. The abstraction is leaky: in one place it is outright breached with data_dictionary::database::real_database() that returns a ::database reference. This is used so that we can perform the transition incrementally. Another place is that one of the methods returns a secondary_index_manager, which in turn grants access to the real objects. This will be addressed later, probably by type erasing as well. This patch only contains the interface, and no implementation. It is somewhat messy since it mimics the years-old evolution of the real objects, but maybe it will be easier to improve it now.	2021-12-15 13:52:31 +02:00
Avi Kivity	3945acaa2d	data_dictionary: move keyspace_metadata to data_dictionary Like user_types_metadata, keyspace_metadata does not grant data access, just metadata, and so belongs in data_dictionary.	2021-12-15 13:52:21 +02:00
Avi Kivity	021c7593b8	data_dictionary: move user_types_metadata to new module data_dictionary The new module will contain all schema related metadata, detached from actual data access (provided by the database class). User types is the first contents to be moved to the new module.	2021-12-15 13:52:10 +02:00
Asias He	b4eb270e89	storage_service: Recheck tokens before throw in storage_service::bootstrap In case count_normal_token_owners check fails, sleep and retry. This makes test setups like skip_wait_for_gossip_to_settle = 0 and ring_delay_ms = 0 work.	2021-12-15 19:40:43 +08:00
Asias He	78d0cc4ab5	gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero Setting skip_wait_for_gossip_to_settle to 0 means do not wait for gossip to settle at all. This is not respected in gossiper::wait_for_range_setup and in gossiper::wait_for_gossip for the initial sleeps. Setting skip_wait_for_gossip_to_settle to zero is not allowed in a production cluster anyway; it is mostly used by tests like dtest to reduce the cluster boot up time. Respect the skip_wait_for_gossip_to_settle zero setting and avoid any sleep and wait completely.	2021-12-15 19:40:43 +08:00
Tzach Livyatan	d6fbabbf8c	fix typo in repair_based_node_ops.md Fix https://github.com/scylladb/scylla/issues/9786 Closes #9788	2021-12-15 09:56:21 +02:00
Avi Kivity	3ac622bdd8	Merge "Add v2 versions of make_forwadable() and make_flat_mutation_reader_from_fragments()" from Botond " These two readers are crucial for writing tests for any composable reader so we need v2 versions of them before we can convert and test the combined reader (for example). As these two readers are often used in situations where the payload they deliver is specially crafted for the test at hand, we keep their v1 versions too to avoid conversion meddling with the tests. Tests: unit(dev) " * 'forwarding-and-fragment-reader-v2/v1' of https://github.com/denesb/scylla: flat_mutation_reader_v2: add make_flat_mutation_reader_from_fragments() test/lib/mutation_source_test: don't force v1 reader in reverse run mutation_source: add native_version() getter flat_mutation_reader_v2: add make_forwardable() position_in_partition: add after_key(position_in_partition_view) flat_mutation_reader: make_forwardable(): fix indentation flat_mutation_reader: make_forwardable(): coroutinize reader	2021-12-14 20:43:09 +02:00
Raphael S. Carvalho	be6cfa4a83	table: only stop regular compaction when disabling auto compaction disable auto compaction API is about regular compactions, so maintenance operations like cleanup must not be stopped. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211213133541.36015-1-raphaelsc@scylladb.com>	2021-12-14 15:49:50 +02:00
Benny Halevy	f3a4ae1460	database: add debug messages around drop and truncate Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211214104357.2224926-1-bhalevy@scylladb.com>	2021-12-14 14:26:33 +02:00
Benny Halevy	32d61a3d09	test: sstable_directory_test_table_lock_works: verify that truncate is blocked on the the table lock The test in its current form is invalid, as database::remove removes the table's name from its listing as well as from the keyspace metadata, so it won't be found after that. That said, database::drop_column_family then proceeds to truncate and stop the table, after calling await_pending_ops, and the latter should indeed block on the lock taken by the test. This change modifies the test to create some sstables in the table's directory before starting the sstable_directory. Then, when executing "drop table" in the background, wait until the table is not found by db.find_column_family. That would fail the test before this change. See https://jenkins.scylladb.com/job/scylla-enterprise/job/next/1442/artifact/testlog/x86_64_debug/sstable_directory_test.sstable_directory_test_table_lock_works.4720.log ``` INFO 2021-12-13 14:00:17,298 [shard 0] schema_tables - Dropping ks.cf id=00487bc0-5c1d-11ec-9e3b-a44f824027ae version=b10c4994-31c7-3f5a-9591-7fedb0273c82 test/boost/sstable_directory_test.cc(453): fatal error: in "sstable_directory_test_table_lock_works": unexpected exception thrown by table_ok.get() ``` At this point, the test verifies again that the sstables are still on disk (and no truncate happened), and only after drop completed, the table should not exist on disk. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211214104407.2225080-1-bhalevy@scylladb.com>	2021-12-14 14:26:17 +02:00
Nadav Har'El	31eeb44d28	alternator: fix error on UpdateTable for non-existent table When the UpdateTable operation is called for a non-existent table, the appropriate error is ResourceNotFoundException, but before this patch we ran into an exception, which resulted in an ugly "internal server error". In this patch we use the existing get_table() function which most other operations use, and which does all the appropriate verifications and generates the appropriate Alternator api_error instead of letting internal Scylla exceptions escape to the user. This patch also includes a test for UpdateTable on a non-existent table, which used to fail before this patch and pass afterwards. We also add a test for DeleteTable in the same scenario, and see it didn't have this bug. As usual, both tests pass on DynamoDB, which confirms we generate the right error codes. Fixes #9747. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>	2021-12-14 13:09:27 +01:00
Tomasz Grabiec	b228ddabb7	Merge "Move schema altering statement to raft" from Gleb The series is on top of "wire up schema raft state machine". It will apply without, but will not work obviously (when raft is disabled it does nothing anyway). This series does not provide any linearisability just yet though. It only uses raft as a means to distribute schema mutations. To achieve linearisability more work is needed. We need to at least make sure that schema mutations use monotonically increasing timestamps and, since schema altering statements are RMW, that no modifications to the schema were done between schema mutation creation and application. If there were, the operation needs to be restarted. * scylla-dev/gleb/raft-schema-v5: (59 commits) cql3: cleanup mutation creation code in ALTER TYPE cql3: use migration_manager::schema_read_barrier() before accessing a schema in altering statements cql3: bounce schema altering statement to shard 0 migration_manager: add is_raft_enabled() to check if raft is enabled on a cluster migration_manager: add schema_read_barrier() function migration_manager: make announce() raft aware migration_manager: co-routinize announce() function migration_manager: pass raft_gr to the migration manager migration_manager: drop view_ptr array from announce_column_family_update() mm: drop unused announce_ methods cql3: drop schema_altering_statement::announce_migration() cql3: drop has_prepare_schema_mutations() from schema altering statement cql3: drop announce_migration() usage from schema_altering_statement cql3: move DROP AGGREGATE statement to prepare_schema_mutations() api migration_manager: add prepare_aggregate_drop_announcement() function cql3: move DROP FUNCTION statement to prepare_schema_mutations() api migration_manager: add prepare_function_drop_announcement() function cql3: move CREATE AGGREGATE statement to prepare_schema_mutations() api migration_manager: add prepare_new_aggregate_announcement() function cql3: move CREATE FUNCTION statement to prepare_schema_mutations() api ...	2021-12-14 11:05:32 +01:00
Piotr Sarna	feea7cb920	Merge 'cql3: disentangle column_identifier from selectable' from Avi Kivity column_identifier serves two purposes: a value type used to denote an identifier (which may or may not map to a table column), and `selectable` implementation used for selecting table columns. This stands in the way of further refactoring - the unification of the WHERE clause prepare path (prepare_expression()) and the SELECT clause prepare path (prepare_selectable()). Reduce the entanglement by moving the selectable-specific parts to a new type, selectable_column, and leaving column_identifier as a pure value type. Closes #9729 * github.com:scylladb/scylla: cql3: move selectable_column to selectable.cc cql3: column_identifier: split selectable functionality off from column_identifier	2021-12-14 10:37:32 +01:00
Benny Halevy	b28314c2e5	database: find_uuid: update comment To agree with `8cbecb1c21`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211214073228.2159674-1-bhalevy@scylladb.com>	2021-12-14 11:17:50 +02:00
Nadav Har'El	815324713e	test/alternator: add more tests for ADD operand mismatch The "ADD" operator in UpdateItem's AttributeUpdates supports a number of types (numbers, sets and strings), and should result in a ValidationException if the attribute's existing type is different from the type of the operand - e.g., trying to ADD a number to an attribute which has a set as a value. So far we only had partial testing for this (we tested the case where both operands are sets, but of different types) so this patch adds the missing tests. The new tests pass (on both Alternator and DynamoDB) - we don't have a bug there. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211213195023.1415248-1-nyh@scylladb.com>	2021-12-14 11:15:23 +02:00
Botond Dénes	425c0b0394	test/cql-pytest/nodetool.py: fix take_snapshot() for cassandra take_snapshot() contained copypasta from flush() for the nodetool variant. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211208110129.141592-1-bdenes@scylladb.com>	2021-12-14 11:15:23 +02:00
Takuya ASADA	6870938842	scylla_raid_setup: fix typo Closes #9790	2021-12-14 11:15:23 +02:00
Benny Halevy	c89876c975	compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested Currently when scrub/validate is stopped (e.g. via the api), scrub_validate_mode_validate_reader co_return:s without closing the reader passed to it - causing a crash due to internal error check, see #9766. Throwing a compaction_stopped_exception rather than co_return:ing an exception will be handled as any other exception, including closing the reader. Fixes #9766 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>	2021-12-14 11:15:23 +02:00
Benny Halevy	d38206587e	table: enable_auto_compaction: trigger compaction auto_compaction has been disabled so sstables may have already been accumulated and require compaction. Do not wait for new sstables to be written to trigger compaction, trigger compaction right away. Refs #9784 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211212090632.1257829-1-bhalevy@scylladb.com>	2021-12-14 11:15:23 +02:00
Gleb Natapov	1ba9cc8836	cql3: cleanup mutation creation code in ALTER TYPE Now that we have only one user for do_announce_migration() function it can be simplified (and renamed to something more appropriate).	2021-12-14 09:01:42 +02:00
Gleb Natapov	72a55c584e	cql3: use migration_manager::schema_read_barrier() before accessing a schema in altering statements Schema altering statements do read/modify/write on the schema. To make sure a statement accesses the latest schema it needs to execute a raft read barrier before accessing the local schema copy.	2021-12-14 09:01:42 +02:00
Gleb Natapov	31a873c80e	cql3: bounce schema altering statement to shard 0 Since raft's group zero resides on shard 0 only, let's bounce all schema altering statements to shard 0 (if raft is enabled) to make it easier to use it.	2021-12-14 09:01:42 +02:00
Gleb Natapov	6e5061a12d	migration_manager: add is_raft_enabled() to check if raft is enabled on a cluster	2021-12-14 09:01:42 +02:00
Gleb Natapov	955e582fb6	migration_manager: add schema_read_barrier() function The function is responsible for calling raft's group zero read barrier in case it is enabled.	2021-12-14 09:01:42 +02:00
Gleb Natapov	9ee4ba143a	migration_manager: make announce() raft aware If raft is enabled use it to distribute schema change instead of direct RPC calls.	2021-12-14 09:01:40 +02:00
Gleb Natapov	3fd834222a	migration_manager: co-routinize announce() function	2021-12-14 09:00:33 +02:00
Tomasz Grabiec	78a6474982	range_tombstone_list: Deoverlap adjacent empty ranges Appending an empty range adjacent to an existing range tombstone would not deoverlap (by dropping the empty range tombstone) resulting in different (non-canonical) result depending on the order of appending. Suppose that [a, b] covers [x, x) Appending [a, x) then [x, b) then [x, x) would give [a, b) Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b) Fix by dropping empty range tombstones.	2021-12-13 21:31:36 +01:00
Raphael S. Carvalho	8eace8fc49	TWCS: Make sure major on past window is done on all its sstables Once current window is sealed, TWCS is supposed to compact all its sstables into one. If there's ongoing compaction, it can happen that sstables are missed and therefore past windows will contain more than one sstable. Additionally, it could happen that major doesn't happen at all if under heavy load. All these problems are fixed by serializing major on past window and also postponing it if manager refuses to run the job now. Fixes #9553. Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-13 16:10:43 -03:00
Raphael S. Carvalho	2dc890d8e6	TWCS: remove needless param for STCS options STCS option can be retrieved from class member, as newest_bucket() is no longer a static function. let's get rid of it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-13 16:05:40 -03:00
Raphael S. Carvalho	41a5736aaf	TWCS: kill unused param in newest_bucket() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-13 16:05:36 -03:00
Raphael S. Carvalho	49f40c8791	compaction: Implement strategy control and wire it This implements strategy control interface for both manager and tests, and wire it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-13 16:05:23 -03:00
Raphael S. Carvalho	6d9466052e	compaction: Add interface to control strategy behavior. This interface is akin to table_state, but compaction manager's representative instead. It will allow compaction manager to set goals and constraints on compaction strategies. It will start by allowing strategy to know if there's ongoing compaction, which is useful for virtually all strategies. For example, LCS may want to compact L0 in parallel with higher levels, to avoid L0 falling behind. This interface can be easily extended to allow manager to switch to a reclaim mode, if running out of space, etc. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-13 15:55:37 -03:00
Nadav Har'El	41c7b2fb4b	test/cql-pytest run: fix inaccurate comment The code in test/cql-pytest/run.py can start Scylla (or Cassandra, or Redis, etc.) in a random IP address in 127.*.*.*. We explained in a comment that 127.0.0.* is used by CCM so we avoid it in case someone runs both dtest and test.py in parallel on the same machine. But this observation was not accurate: Although the original CCM did use only 127.0.0.*, in Scylla's CCM we added in 2017, in commit 00d3ba5562567ab83190dd4580654232f4590962, the ability to run multiple copies of CCM in parallel; CCM now uses 127.0.*.*, not just 127.0.0.*. So we need to correct this in the comment. Luckily, the code doesn't need to change! We already avoided the entire 127.0.*.* for simplicity, not just 127.0.0.*. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211212151339.1361451-1-nyh@scylladb.com>	2021-12-13 18:12:11 +02:00
Avi Kivity	e44a28dce4	Merge "compaction: Allow data from different buckets (e.g. windows) to be compacted together" from Raphael " Today, data from different buckets (e.g. windows) cannot be compacted together because mutation compaction happens inside each consumer, where each consumer is done on behalf of a particular bucket. To solve this problem, the mutation compaction process is being moved from consumer into producer, such that the interposer consumer, which is responsible for segregation, will be fed with compacted data and forward it into the owner bucket. Fixes #9662. tests: unit(debug). " * 'compact_across_buckets_v2' of github.com:raphaelsc/scylla: tests: sstable_compaction_test: add test_twcs_compaction_across_buckets compaction: Move mutation compaction into producer for TWCS compaction: make enable_garbage_collected_sstable_writer() more precise	2021-12-12 15:07:15 +02:00
Gleb Natapov	e9fafea5c1	migration_manager: pass raft_gr to the migration manager Migration manager will use raft group zero to distribute schema changes.	2021-12-11 12:31:07 +02:00
Gleb Natapov	38e1f85959	migration_manager: drop view_ptr array from announce_column_family_update() No users pass it any longer.	2021-12-11 12:31:07 +02:00
Gleb Natapov	a13ebe13c9	mm: drop unused announce_ methods	2021-12-11 12:31:07 +02:00
Gleb Natapov	730171f4df	cql3: drop schema_altering_statement::announce_migration() It is no longer used.	2021-12-11 12:31:07 +02:00
Gleb Natapov	837a153b34	cql3: drop has_prepare_schema_mutations() from schema altering statement It is no longer used.	2021-12-11 12:31:07 +02:00
Gleb Natapov	f5e10b23dd	cql3: drop announce_migration() usage from schema_altering_statement Now that all schema altering statement support prepare_schema_mutations() we can drop announce_migration() usage.	2021-12-11 12:31:07 +02:00
Gleb Natapov	e632ded782	cql3: move DROP AGGREGATE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	07103d915e	migration_manager: add prepare_aggregate_drop_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	e904236cd4	cql3: move DROP FUNCTION statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	25ae8a6376	migration_manager: add prepare_function_drop_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	9d1d54bc93	cql3: move CREATE AGGREGATE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	7430750674	migration_manager: add prepare_new_aggregate_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	156f234996	cql3: move CREATE FUNCTION statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	10c14cd044	migration_manager: add prepare_new_function_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	9ec0db660c	cql: get rid of mutable members in DROP/CREATE FUNCTION Instead of using a mutable member as a way to pass data between functions just return the data directly to a caller.	2021-12-11 12:31:07 +02:00
Gleb Natapov	661651a836	cql3: move statement validation to execute time for function related statements To be able to confine raft to the execution time of a statement we need to move all schema access to the execution time as well. Since the validation code accesses the schema, let's run it during execution.	2021-12-11 12:31:07 +02:00
Gleb Natapov	1d448f59a0	cql3: move DROP INDEX statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	05801b99c6	cql3: factor our mutation creation code into a separate function for DROP INDEX The function will be used in the next patch.	2021-12-11 12:31:07 +02:00
Gleb Natapov	09128719dc	cql3: move DROP VIEW statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	25294e4460	migration_manager: add prepare_view_drop_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	4528273e54	cql3: move DROP TYPE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	87b52c30e7	migration_manager: add prepare_type_drop_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	36745b6b73	cql3: move statement validation to execute time for DROP TYPE To be able to confine raft to the execution time of a statement we need to move all schema access to the execution time as well. Since the validation code accesses the schema, let's run it during execution.	2021-12-11 12:31:07 +02:00
Gleb Natapov	d438a3285e	cql3: move DROP TABLE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	471d48d277	migration_manager: add prepare_column_family_drop_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	68cf743554	cql3: move DROP KEYSPACE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	f1cc1fb96e	migration_manager: add prepare_keyspace_drop_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	4574981f9e	cql3: move CREATE INDEX statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	2d1b318d36	cql3: CREATE INDEX do not re-create targets array twice The validation and execution code recreate the same array twice. Avoid it by returning the array from verification function.	2021-12-11 12:31:07 +02:00
Gleb Natapov	6029ba6f5b	cql3: factor our mutation creation code into a separate function for CREATE INDEX The function will be used in the next patch.	2021-12-11 12:31:07 +02:00
Gleb Natapov	a3e1cb932a	cql3: move statement validation to execute time for CREATE INDEX To be able to confine raft to the execution time of a statement we need to move all schema access to the execution time as well. Since the validation code accesses the schema, let's run it during execution.	2021-12-11 12:31:07 +02:00
Gleb Natapov	5f30b5802c	cql3: move ALTER KEYSPACE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	d79e426fb6	migration_manager: add prepare_keyspace_update_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	456d2e28c3	cql3: move CREATE KEYSPACE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	011f38a2f1	migration_manager: add prepare_new_keyspace_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	edff0cf4db	cql3: move ALTER TYPE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	db1c9cec20	cql3: move ALTER VIEW statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	94bc34bb20	cql3: factor our mutation creation code into a separate function for ALTER VIEW The function will be used in the next patch.	2021-12-11 12:31:07 +02:00
Gleb Natapov	a4afc69b87	migration_manager: add prepare_view_update_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	82acc9aa05	cql3: move CREATE VIEW statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	c294d7b1ca	cql3: factor our mutation creation code into a separate function for CREATE VIEW statement The function will be used in the next patch.	2021-12-11 12:31:07 +02:00
Gleb Natapov	3f47210374	migration_manager: add prepare_new_view_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	af6b3d985d	cql3: move ALTER TABLE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	688efff6b5	cql3: factor our mutation creation code into a separate function for ALTER TABLE The function will be used in the next patch.	2021-12-11 12:31:07 +02:00
Gleb Natapov	7cc629980b	migration_manager: add prepare_column_family_update_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	5af2c342a3	migration_manager: add prepare_update_type_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	67661b6e66	cql3: move CREATE TYPE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	5649daf76a	migration_manager: add prepare_new_type_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	5e9af3c414	cql3: move CREATE TABLE statement to prepare_schema_mutations() api	2021-12-11 12:31:07 +02:00
Gleb Natapov	20dbd717ff	migration_manager: add prepare_new_column_family_announcement() function The function only generates mutations for the announcement, but does not send them out. Will be used by the later patches.	2021-12-11 12:31:07 +02:00
Gleb Natapov	b2af64ec5e	cql3: introduce schema_altering_statement::prepare_schema_mutations() as announce_migration() alternative Instead of announcing schema mutations the new function will return them. The caller is responsible for announcing them. To ease the transition make the API optional. Statements that do not have it will use the old announce_migration() method.	2021-12-11 12:31:07 +02:00
Gleb Natapov	2f95a29209	migration_manager: add include_keyspace() function Currently a keyspace mutation is included into schema mutation list just before announcement. Move the inclusion to a separate function. It will be used later when instead of announcing new schema the mutation array will be returned.	2021-12-11 12:31:07 +02:00
Pavel Solodovnikov	47533bca65	gms: gossiper: coroutinize `maybe_enable_features` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:39:48 +03:00
Pavel Solodovnikov	3993c6a9fb	gms: gossiper: coroutinize `wait_alive` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:30:32 +03:00
Pavel Solodovnikov	a6ff04dd24	gms: gossiper: coroutinize `add_saved_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:23:35 +03:00
Pavel Solodovnikov	23dd8b66c5	gms: gossiper: coroutinize `evict_from_membership` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:15:03 +03:00
Raphael S. Carvalho	7c90088152	tests: sstable_compaction_test: add test_twcs_compaction_across_buckets Verify that TWCS compaction can now compact data across time windows, like a tombstone which will cause all shadowed data to be purged once they're all compacted together. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-10 17:14:45 -03:00
Raphael S. Carvalho	9b8aa1e9ae	compaction: Move mutation compaction into producer for TWCS If interposer is enabled, like the timestamp-based one for TWCS, data from different buckets (e.g. windows) cannot be compacted together because mutation compaction happens inside each consumer, where each consumer will belong to a different bucket. To remove this limitation, let's move the mutation compactor from consumer into producer, such that compacted data will be fed into the interposer, before it segregates data. We're short-circuiting this logic if TWCS isn't in use as compacting reader adds overhead to compaction, given that this reader will pop fragments from combined sstable reader, compact them using mutation_compactor and finally push them out to the underlying reader. without compacting reader (e.g. STCS + no interposer): 228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops) 224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops) 224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops) with compacting reader (e.g. TWCS + interposer): 221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops) 216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops) 215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops) So the cost of compacting data across buckets is ~3.5%, which happens only with interposer enabled and GC writer disabled. Fixes #9662. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-10 17:14:44 -03:00
Pavel Emelyanov	b0a8c153f7	select_statement: Remove unused proxy args and captures The generate_view_paging_state_from_base_query_results() has unused proxy argument that's carried over quite a long stack for nothing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211210175203.26197-1-xemul@scylladb.com>	2021-12-10 20:39:55 +02:00
Raphael S. Carvalho	484269cd8f	compaction: make enable_garbage_collected_sstable_writer() more precise we only want to enable GC writer if incremental compaction is required. let's make it more precise by checking that size limit for sstable isn't disabled, so GC writer will only be enabled for compaction strategies that really need it. So strategies that don't need it won't pay the penalty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-10 15:22:08 -03:00
Avi Kivity	3f862f9ece	cql3: move selectable_column to selectable.cc Move selectable_column to selectable.cc (and to the cql3::selection namespace). This cleans up column_identifier.hh so it is now a pure vocabulary header.	2021-12-10 19:51:57 +02:00
Avi Kivity	3305d1d514	cql3: column_identifier: split selectable functionality off from column_identifier column_identifier serves two purposes: one is as a general value type that names a value, for example in column_specification. The other is as a `selectable` derived class specializing in selecting columns from a base table. Obviously, to select a column from a table you need to know its name, but to name some value (which might not be a table column!) you don't need the ability to select it from a table. The mix-up stands in the way of unifying the select-clause (`selectable`) and where-clause (previously known as `term`) expression prepare paths. This is because the already existing where-clause result, `expr::column_value`, is implemented as `column_definition*`, while the select clause equivalent, `column_identifier`, can't contain a column_definition because not all uses of column_identifier name a schema column. To fix this, split column_identifier into two: column_identifier retains the original use case of naming a value, while a new class `selectable_column` has the additional ability of selecting a column in a select clause. It still doesn't use column_definition, that will be adjusted later.	2021-12-10 19:51:55 +02:00
Botond Dénes	04306d762f	tools/scylla-sstables: remove unused variables and captures Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211210142949.527545-1-bdenes@scylladb.com>	2021-12-10 18:24:08 +03:00
Juliusz Stasiewicz	351f142791	cdc/check_and_repair_cdc_streams: ignore LEFT endpoints When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla would throw. This behavior is fixed so that LEFT nodes are simply ignored. Fixes #9771 Closes #9778	2021-12-10 15:28:14 +01:00
Raphael S. Carvalho	e0758fded1	compaction_manager: make get_compaction_state() private internal method that should never be directly used by the outside world. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211210120806.19233-1-raphaelsc@scylladb.com>	2021-12-10 17:19:24 +03:00
Botond Dénes	39426b1aa3	flat_mutation_reader_v2: add make_flat_mutation_reader_from_fragments() The main difference compared to v1 (apart from having _v2 suffix at relevant places) is how slicing and reversing works. The v2 variant has native reverse support built-in because the reversing reader is not something we want to convert to v2. A native v2 mutation-source test is also added.	2021-12-10 15:48:49 +02:00
Botond Dénes	20e45987b5	test/lib/mutation_source_test: don't force v1 reader in reverse run Currently in the reverse run we wrap the test-provided mutation-source and create a v1 reader with it, forcing a conversion if the mutation-source has a v2 factory. Worse still, if the test is v2 native, there will be a double conversion. This patch fixes this by creating a wrapper mutation-source appropriate to the version of the underlying factory of the wrapped mutation-source.	2021-12-10 15:48:49 +02:00
Botond Dénes	d8870d2fe1	mutation_source: add native_version() getter So tests can determine the native version of the factory function and create the native reader if needed, to avoid unnecessary conversions.	2021-12-10 15:48:49 +02:00
Botond Dénes	76ee3f029c	flat_mutation_reader_v2: add make_forwardable() Not a completely straightforward conversion as the v2 version has to make sure to emit the current range tombstone change after fast_forward_to() (if it changes compared to the current one before fast forwarding). Changes are around the two new members `_tombstone_to_emit` and `maybe_emit_tombstone()`.	2021-12-10 15:48:49 +02:00
Botond Dénes	a7866f783f	position_in_partition: add after_key(position_in_partition_view)	2021-12-10 15:48:49 +02:00
Botond Dénes	7306f53be1	flat_mutation_reader: make_forwardable(): fix indentation	2021-12-10 15:19:18 +02:00
Botond Dénes	2468a0602b	flat_mutation_reader: make_forwardable(): coroutinize reader For improved readability and to facilitate further patching.	2021-12-10 15:19:18 +02:00
Tomasz Grabiec	4d302dfa1a	Merge "Fix exception safety of rows insertion" from Pavel Emelyanov There are several places that (still) use throwing b-tree .insert_before() method and don't manage the inserted object lifetime. Some of those places also leave the leaked rows_entry on the LRU delaying the assertion failure by the time those entries get evicted (#9728) To prevent such surprises in the future, the set removes the non-safe inserters from the B-tree code. Actually most of this set is that removal plus preparations for reviewability. * xemul/br-rows-insertion-exception-safety-2: btree: Earnestly discourage from insertion of plain references row-cache: Handle exception (un)safety of rows_entry insertion partition_snapshot_row_cursor: Shuffle ensure_result creation mutation_partition: Use B-tree insertion sugar tests: Make B-tree tests use unique-ptrs for insertion	2021-12-10 13:55:18 +01:00
Pavel Emelyanov	6b4b170025	btree: Earnestly discourage from insertion of plain references Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Pavel Emelyanov	ee103636ac	row-cache: Handle exception (un)safety of rows_entry insertion The B-tree's insert_before() is a throwing operation, its caller must account for that. When the rows_entry's collection was switched to B-tree all the risky places were fixed by `ee9e1045`, but a few places went under the radar. In the cache_flat_mutation_reader there's a place where a C-pointer is inserted into the tree, thus potentially leaking the entry. In the partition_snapshot_row_cursor there are two places that not only leak the entry, but also leave it in the LRU list. The latter is quite nasty, because those entries can be evicted, eviction code tries to get rows_entry iterator from "this", but the hook happens to be unattached (because insertion threw) and fails the assert. fixes: #9728 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Pavel Emelyanov	9fd8db318d	partition_snapshot_row_cursor: Shuffle ensure_result creation Both places get the C-pointer on the freshly allocated rows_entry, insert it where needed and return back the dereferenced pointer. The C-pointer is going to become smart-pointer that would go out of scope before return. This change prepares for that by constructing the ensure_result from the iterator, that's returned from insertion of the entry. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Pavel Emelyanov	e03f7191d9	mutation_partition: Use B-tree insertion sugar The B-tree insertion methods accept smart pointers and automatically release the ownership after exception-risky part is passed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Pavel Emelyanov	5a405a4273	tests: Make B-tree tests use unique-ptrs for insertion The non-smart-pointers overloads are going away, prepare tests for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-10 12:35:12 +03:00
Nadav Har'El	03d67440ef	alternator: test additional metrics and fix another broken counter In issue #9406 we noticed that a counter for BatchGetItem operations was missing. When we fixed it, we added a test which checked this counter - but only this counter. It was left as a TODO to test the rest of the Alternator metrics, and this is what this patch does. Here we add a comprehensive test for all of the operations supported by Scylla and how they increase the appropriate operation counter. With this test we discovered a new bug: the DescribeTimeToLive operation incremented the UpdateTimeToLiveCounter :-( So in this patch we also include a fix for that bug, and the new test verifies that it is fixed. In addition to the operation counters, Alternator also has additional metrics and we also added tests for some of them - but not all. The remaining untested metrics are listed in a TODO comment. Message-Id: <20211206154727.1170112-1-nyh@scylladb.com>	2021-12-10 08:08:54 +02:00
Benny Halevy	cca956bce2	database_test: snapshot_with_quarantine_works: get the list of sstables from table object Rather than the filesystem, to reduce flakiness. Also, add some test logging. Fixes #9763 Test: database_test(debug, release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211209175144.854896-1-bhalevy@scylladb.com>	2021-12-09 21:01:25 +02:00
Nadav Har'El	006fa588a3	alternator ttl: correct misleading typo in error message Alternator's support for the DynamoDB API TTL features is experimental, so if a user attempts to use one of the TTL API requests, an error message is returned that the experimental feature must be turned on first. The message incorrectly said that the name of the experimental flag to turn on is "alternator_ttl", with an underscore. But that's a typo - it should be "alternator-ttl" with a hyphen. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211209183428.1336526-1-nyh@scylladb.com>	2021-12-09 20:47:05 +02:00
Benny Halevy	8728fd480d	database_test: do_with_some_data: get the return func future do_with_some_data runs a function in a seastar thread. It needs to get() the future func returns rather than propagating it. This solves a secondary failure due to abandoned future when the test case fails, as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4254/artifact/testlog/x86_64_debug/database_test.snapshot_with_quarantine_works.381.log ``` test/boost/database_test.cc(903): fatal error: in "snapshot_with_quarantine_works": critical check expected.empty() has failed WARN 2021-12-08 00:35:16,300 [shard 0] seastar - Exceptional future ignored: boost::execution_aborted, backtrace: 0x10935e50 0x16ff2d8d 0x16ff2a4d 0x16ff5033 0x16ff5ec2 0x162d4ce9 0x10a2bdb5 0x10a2bd24 0x10a54ca4 0x10a27cf3 0x10a22151 0x10a67c9d 0x10a67a78 0x163ac37e 0x163b29e9 0x163b7690 0x163b51c1 0x17c212df 0x17c1f097 0x17bf8b4c 0x17bf83f2 0x17bf82a2 0x17bf7d52 0x10f8bf5a 0x166db84b /lib64/libpthread.so.0+0x9298 /lib64/libc.so.6+0x100352 ... *** 1 abandoned failed future(s) detected Failing the test because fail was requested by --fail-on-abandoned-failed-futures ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211209174512.851945-1-bhalevy@scylladb.com>	2021-12-09 21:11:56 +03:00
Nadav Har'El	c6f2afb93d	Merge 'cql3: Allow to skip EQ restricted columns in ORDER BY' from Jan Ciołek In queries like: ```cql SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c1 ASC, c2 ASC) ``` we can skip the requirement to specify ordering for `c1` column. The `c1` column is restricted by an `EQ` restriction, so it can have at most one value anyway, there is no need to sort. This commit makes it possible to write just: ```cql SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c2 ASC) ``` I reorganized the ordering code, I feel that it's now clearer and easier to understand. It's possible to only introduce a small change to the existing code, but I feel like it becomes a bit too messy. I tried it out on the [`orderby_disorder_small`](https://github.com/cvybhu/scylla/commits/orderby_disorder_small) branch. The diff is a bit messy because I moved all ordering functions to one place, it's better to read [select_statement.cc](https://github.com/cvybhu/scylla/blob/orderby_disorder/cql3/statements/select_statement.cc#L1495-L1658) lines 1495-1658 directly. In the new code it would also be trivial to allow specifying columns in any order, we would just have to sort them. For now I commented out the code needed to do that, because the point of this PR was to fix #2247. Allowing this would require some more work changing the existing tests. Fixes: #2247 Closes #9518 * github.com:scylladb/scylla: cql-pytest: Enable test for skipping eq restricted columns in order by cql3: Allow to skip EQ restricted columns in ORDER BY cql3: Add has_eq_restriction_on_column function cql3: Reorganize orderings code	2021-12-09 21:11:56 +03:00
Nadav Har'El	36c3b92b19	alternator, schema_loader: get rid of deprecation warnings Seastar moved the read_entire_stream(), read_entire_stream_contiguous() and skip_entire_stream() from the "httpd" namespace to the "util" namespace. Using them with their old names causes deprecation warnings when compiling alternator/server.cc. This patch fixes the namespace (and adds the new include) to get rid of the deprecation warnings. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211209132759.1319420-1-nyh@scylladb.com>	2021-12-09 21:11:56 +03:00
Avi Kivity	242e19195f	Merge "table: Prevent resurrecting data from memtable on compaction" from Mikołaj " Mutations are not guaranteed to come in the order of their timestamps. If there is an expired tombstone in the sstable and a repair inserts old data into memtable, the compaction would not consider memtable data and purge the tombstone leading to data resurrection. The solution is to disallow purging tombstones newer than min memtable timestamp. If there are no memtables, max timestamp is used. " * 'check-memtable-at-compact-tombstone-discard/v2' of github.com:mikolajsieluzycki/scylla: table: Prevent resurrecting data from memtable on compaction table: Add min_memtable_timestamp function to table	2021-12-09 21:11:56 +03:00
Piotr Sarna	2ec36a6c53	alternator,ttl: limit parallelism to 1 page Right now we do not really have any parallelism in the alternator TTL service, but in order to be future-proof, a semaphore is instantiated to ensure that we only handle 1 page of a scan at a time, regardless of how many tables are served. This commit also removes the FIXME regarding the service permit - using an empty permit is a conscious decision, because the parallelism is limited by other means (see above). Tests: unit(release) Message-Id: <b5f0c94f1afbead1f940a210911cc05f70900dcd.1638990637.git.sarna@scylladb.com>	2021-12-09 21:11:55 +03:00
Asias He	9859c76de1	storage_service: Wait for seastar::get_units in node_ops The seastar::get_units returns a future, we have to wait for it. Fixes #9767 Closes #9768	2021-12-09 21:11:55 +03:00
Jan Ciolek	13d367dada	cql-pytest: Enable test for skipping eq restricted columns in order by This test was marked as xfail, but now the functionality it tests has been implemented. In my opinion the expected error message makes no sense, the message was: "Order by currently only supports the ordering of columns following their declared order in the PRIMARY KEY" In cases where there was missing restriction on one column. This has been changed to: "Unsupported order by relation - column {} doesn't have an ordering or EQ relation." Because of that I had to modify the test to accept messages from both Scylla and Cassandra. The expected error message pattern is now "rder by", because that's the largest common part. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-09 14:59:47 +01:00
Benny Halevy	85f10138f0	api: storage_service: validate_keyspace: improve exception error message Generate the error message using the no_such_keyspace(ks_name) exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-09 14:40:21 +02:00
Benny Halevy	6805ce5bd9	api: compaction_manager: add stop_keyspace_compaction Allow stopping compaction by type on a given keyspace and list of tables. Add respective rest_api test. Fixes #9700 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-09 14:40:13 +02:00
Benny Halevy	522a32f19f	api: storage_service: expose validate_keyspace and parse_tables To be used by the compaction_manager api in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-09 14:25:53 +02:00
Mikołaj Sielużycki	504efe0607	table: Prevent resurrecting data from memtable on compaction Mutations are not guaranteed to come in the order of their timestamps. If there is an expired tombstone in the sstable and a repair inserts old data into memtable, the compaction would not consider memtable data and purge the tombstone leading to data resurrection. The solution is to disallow purging tombstones newer than min memtable timestamp.	2021-12-09 13:22:14 +01:00
Benny Halevy	71c95faeee	api: compaction_manager: stop_compaction: fix type description List only the compaction types we support stopping. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-09 14:17:38 +02:00
Benny Halevy	fed7319698	compaction_manager: stop_compaction: expose optional table* To be used by api layer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-09 14:14:49 +02:00
Mikołaj Sielużycki	7ce0ca040d	table: Add min_memtable_timestamp function to table	2021-12-09 13:14:38 +01:00
Benny Halevy	4535cb5cb3	test: api: add basic compaction_manager test Test compaction_manager/stop_compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-09 13:59:06 +02:00
Jan Ciolek	a548c2dac4	cql3: Allow to skip EQ restricted columns in ORDER BY In queries like: SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c1 ASC, c2 ASC) we can skip the requirement to specify ordering for c1 column. The c1 column is restricted by an EQ restriction, so it can have only one value anyway, there is no need to sort. This commit makes it possible to write just: SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c2 ASC) Fixes: #2247 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-09 12:07:02 +01:00
Jan Ciolek	7bbfa48bc5	cql3: Add has_eq_restriction_on_column function Adds a function that checks whether a given expression has an EQ restriction on the specified column. It finds restrictions like col = ... or (col, col2) = ... IN restrictions don't count, they aren't EQ restrictions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-09 12:06:43 +01:00
Jan Ciolek	f76a1cd4bf	cql3: Reorganize orderings code Reorganized the code that handles column ordering (ASC or DESC). I feel that it's now clearer and easier to understand. Added an enum that describes column ordering. It has two possible values: ascending or descending. It used to be a bool that was sometimes called 'reversed', which could mean multiple things. Instead of column.type->is_reversed() != <ordering bool> there is now a function called are_column_select_results_reversed. Split checking if ordering is reversed and verifying whether it's correct into two functions. Before all of this was done by is_reversed() This is a preparation to later allow skipping ORDER BY restrictions on some columns. Adding this to the existing code caused it to get quite complex, but this new version is better suited for the task. The diff is a bit messy because I moved all ordering functions to one place, it's better to read select_statement.cc lines 1495-1651 directly. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-09 12:06:42 +01:00
Nadav Har'El	f9673309aa	docs: protocols.md - add information on Redis listening address The description in protocols.md of the Redis protocol server in Scylla explains how its port can be configured, but not how the listening IP address can be configured. It turns out that the same "rpc_address" that controls CQL's and Thrift's IP address also applies to Redis. So let's document that. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211208160206.1290916-1-nyh@scylladb.com>	2021-12-08 20:14:52 +01:00
Nadav Har'El	e032f92c5c	Merge 'api/storage service: validate table names' from Benny Halevy This series fixes a couple issues around generating and handling of no_such_keyspace and no_such_column_family exceptions. First, it removes std::throw_with_nested around their throw sites in the respective database::find_* functions. Fixes #9753 And then, it introduces a `validate_tables` helper in api/storage_service.cc that generates a `bad_param_exception` in order to set the correct http response status if a non-existing table name is provided in the `cf` http request parameter. Fixes #9754 The series also adds a test for the REST API under test/rest_api that verifies the storage_service enable/disable auto_compaction api and checks the error codes for non-existing keyspace or table. Test: unit(dev) Closes #9755 * github.com:scylladb/scylla: api: storage_service: add parse_tables database: un-nest no_such_keyspace and no_such_column_family exceptions database: throw internal error when failing uuid returned by find_uuid database: find_uuid: throw no_such_column_family exception if ks/cf were not found test: rest_api: add storage_service test test: add basic rest api test test: cql-pytest: wait for rest api when starting scylla	2021-12-08 16:54:48 +02:00
Benny Halevy	ff63ad9f6e	api: storage_service: add parse_tables Splits and validates the cf parameter, containing an optional comma-separated list of table names. If any table is not found and a no_such_column_family exception is thrown, wrap it in a `bad_param_exception` so it will translate to `reply::status_type::bad_request` rather than `reply::status_type::internal_server_error`. With that, hide the split_cf function from api/api.hh since it was used only from api/storage_service and new use sites should use validate_tables instead. Fixes #9754 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:42:40 +02:00
Benny Halevy	a3bd7806e7	database: un-nest no_such_keyspace and no_such_column_family exceptions These were thrown in the respective database::find_* function as nested exceptions since `d3fe0c5182`. Wrapping them in nested exceptions just makes them harder to figure out and work with and apparently serves no purpose. Without these nested exceptions we can correctly detect internal errors when synchronously failing to find a uuid returned by find_uuid(ks, cf). Fixes #9753 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:35:38 +02:00
Benny Halevy	ac49e5fff1	database: throw internal error when failing uuid returned by find_uuid find_uuid returns a uuid found for ks_name.table_name. In some cases, we immediately and synchronously use that uuid to look up other information like the table& or the schema. Failing to find that uuid indicates an internal error when no preemption is possible. Note that yielding could allow deletion of the table to sneak in and invalidate the uuid. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:35:38 +02:00
Benny Halevy	8cbecb1c21	database: find_uuid: throw no_such_column_family exception if ks/cf were not found Rather than masquerading all errors as std::out_of_range(""), convert only the std::out_of_range error from _ks_cf_to_uuid.at() to no_such_column_family(ks, cf). That relieves all callers of find_uuid from doing that conversion themselves. For example, get_uuid in api/column_family now only deals with converting no_such_column_family to bad_param_exception, as it needs to do at the api level, rather than generating a similar error from scratch. Other call sites required no intervention. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:35:38 +02:00
Benny Halevy	5eb32aa57c	test: rest_api: add storage_service test FIXME: negative tests for not-found tables should result in a requests.codes.bad_request but currently result in requests.codes.internal_server_error. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:35:36 +02:00
Tomasz Grabiec	fe2fa3f20d	range_tombstone_list: Convert to work in terms of position_in_partition This makes it comprehensible, and a bit simpler.	2021-12-08 15:16:18 +01:00
Piotr Sarna	d486b496a6	alternator,ttl: start scans from a random token range This patch addresses yet another FIXME from alternator/ttl.cc. Namely, scans are now started from a random, owned token range instead of always starting with the first range. This mechanism is expected to reduce the probability of some ranges being starved when the scanning process is often restarted, e.g. due to nodes failing. Should the mechanism prove insufficient for some users, a more complete solution is to regularly persist the state of the scanning process in a table (distributed if we want to allow other nodes to pick up from where a dead node left off), but that induces overhead. Tests: unit(release) (including a long loop over the ttl pytest) Message-Id: <7fc3f6525ceb69725c41de10d0fb6b16188349e3.1638387924.git.sarna@scylladb.com> Message-Id: <db198e743ca9ed1e5cc659e73da342fbce2c882a.1638473143.git.sarna@scylladb.com>	2021-12-08 16:15:53 +02:00
Benny Halevy	26257cfa6d	test: add basic rest api test Test system/uptime_ms to start with. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:05:33 +02:00
Benny Halevy	01f2e8b391	test: cql-pytest: wait for rest api when starting scylla Some of the tests, like nodetool.py, use the scylla REST API. Add a check_rest_api function that queries http://<node_addr>:10000/ that is served once scylla starts listening on the API port and call it via run.wait_for_services. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-08 16:05:32 +02:00
Piotr Sarna	26288c1a86	test,alternator: make TTL tests less prone to false negatives On my local machine, a 3 second deadline proved to cause flakiness of test_ttl_expiration case, because its execution time is just around 3 seconds. This patch addresses the problem by bumping the local timeout to 10 (and 15 for test_ttl_expiration_long, since it's dangerously near the 10 second deadline on my machine as well). Moreover, some test cases short-circuited once they detected that all needed items expired, but other ones lacked it and always used their full time slot. Since 10 seconds is a little too long for a single test case, even one marked with --veryslow, this patch also adds a couple of other short-circuits. One exception is test_ttl_expiration_hash_wrong_type, which actually depends on the fact that we should wait for the whole loop to finish. Since this case was never flaky for me with the 3 second timeout, it's left as is. Theoretically, test_ttl_expiration also kind of depends on checking the condition more than once (because the TTL of one of the values is bumped on each iteration), but empirical evidence shows that multiple iterations always occur in this test case anyway - for me, it always spun at least 3 times. Tests: unit(release) Message-Id: <a0a479929dac37daace744e0a970567a8aa3b518.1638431933.git.sarna@scylladb.com>	2021-12-08 16:02:45 +02:00
Raphael S. Carvalho	c3c23dd1e5	multishard_mutation_query: make multi_range_reader::fill_buffer() work even after EOS If fill_buffer() is called after EOS, the underlying reader will be fast forwarded to a range pointed to by an invalid iterator, producing incorrect results. fill_buffer() is changed to return early if EOS was found, meaning that the underlying reader was already fast forwarded to all ranges managed by multi_range_reader. Usually, consume facilities check for EOS before calling fill_buffer(), but most reader implementations check for EOS to avoid correctness issues. Let's do the same here. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211208131423.31612-1-raphaelsc@scylladb.com>	2021-12-08 15:39:11 +02:00
Avi Kivity	f28552016f	Update seastar submodule * seastar f8a038a0a2...8d15e8e67a (21): > core/program_options: preserve defaultness of CLI arguments > log: Silence logger when logging > Include the core/loop.hh header inside when_all.hh header > http: Fix deprecated wrappers > foreign_ptr: Add concept > util: file: add read_entire_file > short_streams: move to util > Revert "Merge: file: util: add read_entire_file utilities" > foreign_ptr: declare destroy as a static method > Merge: file: util: add read_entire_file utilities > Merge "output_stream: handle close failure" from Benny > net: bring local_address() to seastar::connected_socket. > Merge "Allow programatically configuring seastar" from Botond > Merge 'core: clean up memory metric definitions' from John Spray > Add PopOS to debian list in install-dependencies.sh > Merge "make shared_mutex functions exception safe and noexcept" from Benny > on_internal_error: set_abort_on_internal_error: return current state > Implementation of iterator-range version of when_any > net: mark functions returning ethernet_address noexcept > net: ethernet_address: mark functions noexcept > shared_mutex: mark wake and unlock methods noexcept Contains patch from Botond Dénes <bdenes@scylladb.com>: db/config: configure logging based on app_template::seastar_options Scylla has its own config file which supports configuring aspects of logging, in addition to the built-in CLI logging options. When applying this configuration, the CLI provided option values have priority over the ones coming from the option file. To implement this scylla currently reads CLI options belonging to seastar from the boost program options variable map. The internal representation of CLI options however do not constitute an API of seastar and are thus subject to change (even if unlikely). This patch moves away from this practice and uses the new shiny C++ api: `app_template::seastar_options` to obtain the current logging options.	2021-12-08 14:21:11 +02:00
Tomasz Grabiec	5eaca85e4b	Merge "wire up schema raft state machine" from Gleb This series wires up the schema state machine to process raft commands and transfer snapshots. The series assumes that raft group zero is used for schema transfer only and that single raft command contains single schema change in a form of canonical_mutation array. Both assumptions may change in which case the code will be changed accordingly, but we need to start somewhere. * scylla-dev/gleb/schema-raft-sm-v2: schema raft sm: request schema sync on schema_state_machine snapshot transfer raft service: delegate snapshot transfer to a state machine implementation schema raft sm: pass migration manager to schema_raft_state_machine and merge schema on apply()	2021-12-08 13:14:28 +01:00
Nadav Har'El	92e7fbe657	test/alternator: check correct error for unknown operation Add a short test verifying that Alternator responds with the correct error code (UnknownOperationException) when receiving an unknown or unsupported operation. The test passes on both AWS and Alternator, confirming that the behavior is the same. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211206125710.1153008-1-nyh@scylladb.com>	2021-12-08 13:56:38 +02:00
Gleb Natapov	f25424edcd	storage_service: remove unused function. is_auto_bootstrap() function is no longer used. Message-Id: <YbCVXPI4hE8wgT4T@scylladb.com>	2021-12-08 13:55:32 +02:00
Botond Dénes	0aa4e5e726	test/cql-pytest: mv virtual_tables.py -> test_virtual_tables.py For consistency with the other tests. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211208102108.126492-1-bdenes@scylladb.com>	2021-12-08 12:23:22 +02:00
Gavin Howell	c6e0a807b4	Update wasm.md Grammar correction, sentence re-write. Closes #9760	2021-12-08 10:24:53 +01:00
Tomasz Grabiec	2a36377bb3	Merge "test: raft: randomized_nemesis_test: introduce server stop/crash nemesis" from Kamil We begin by preparing the `persistence` class so that the storage can be reused across different Raft server instances: the test keeps a shared pointer to the storage so that when a server stops, a new server with the same ID can be reconstructed with this storage. We then modify `environment` so that server instances can be removed and replaced in the middle of operations. Finally we prepare a nemesis operation which gracefully stops or immediately crashes a randomly picked server and run this operation periodically in `basic_generator_test`. One important change that changes the API of `raft::server` is included: the metrics are no longer automatically registered in `start()`. This is because metric registration modifies global data structures, which cannot be done twice with the same set of metrics (and we would do it when we restart a server with the same ID). Instead, `register_metrics()` is exposed in the `raft::server` interface to be called when running servers in production. 
* kbr/crashes-v3: raft: server: print the ID of aborted server test: raft: randomized_nemesis_test: run stop_crash nemesis in `basic_generator_test` test: raft: randomized_nemesis_test: introduce `stop_crash` operation test: raft: randomized_nemesis_test: environment: implement server `stop` and `crash` raft: server: don't register metrics in `start()` test: raft: randomized_nemesis_test: raft_server: return `stopped_error` when called during abort test: raft: randomized_nemesis_test: handle `raft::stopped_error` test: raft: randomized_nemesis_test: handle missing servers in `environment` call functions test: raft: randomized_nemesis_test: environment: split `new_server` into `new_node` and `start_server` test: raft: randomized_nemesis_test: remove `environment::get_server` test: raft: randomized_nemesis_test: construct `persistence_proxy` outside `raft_server<M>::create` test: raft: randomized_nemesis_test: persistence_proxy: store a shared pointer to `persistence` test: raft: randomized_nemesis_test: persistence: split into two classes test: raft: logical_timer: introduce `sleep_until`	2021-12-07 22:16:23 +01:00
Nadav Har'El	bb0f8c3cdf	Merge 'build: disable superword-level parallism (slp) on clang' from Avi Kivity Clang (and gcc) can combine loads and stores of independent variables into wider operations, often using vector registers. This reduces instruction count and execution unit occupancy. However, clang is too aggressive and generates loads that break the store-to-load forwarding rules: a load must be the same size or smaller than the corresponding store, or it will execute with a large penalty. Disabling slp results in larger but faster code. Comparing before and after on Zen 3: slp: 226766.49 tps ( 75.1 allocs/op, 12.1 tasks/op, 45073 insns/op) 226679.57 tps ( 75.1 allocs/op, 12.1 tasks/op, 45074 insns/op) 226168.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 45061 insns/op) 225884.34 tps ( 75.1 allocs/op, 12.1 tasks/op, 45068 insns/op) 225998.16 tps ( 75.1 allocs/op, 12.1 tasks/op, 45056 insns/op) median 226168.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 45061 insns/op) median absolute deviation: 284.45 maximum: 226766.49 minimum: 225884.34 no slp: 228195.33 tps ( 75.1 allocs/op, 12.1 tasks/op, 45109 insns/op) 227773.76 tps ( 75.1 allocs/op, 12.1 tasks/op, 45123 insns/op) 228088.98 tps ( 75.1 allocs/op, 12.1 tasks/op, 45117 insns/op) 228157.43 tps ( 75.1 allocs/op, 12.1 tasks/op, 45129 insns/op) 228072.29 tps ( 75.1 allocs/op, 12.1 tasks/op, 45128 insns/op) median 228088.98 tps ( 75.1 allocs/op, 12.1 tasks/op, 45117 insns/op) median absolute deviation: 68.45 maximum: 228195.33 minimum: 227773.76 Disabling slp increases the instruction count by ~60 instructions per op (0.13%) but increases throughput by 0.85%. This shows the impact of the violation is quite high. 
It can also be observed by the effect on stalled cycles: slp: 44,932.70 msec task-clock # 0.993 CPUs utilized 13,618 context-switches # 303.075 /sec 33 cpu-migrations # 0.734 /sec 1,695 page-faults # 37.723 /sec 211,997,160,633 cycles # 4.718 GHz (71.67%) 1,118,855,786 stalled-cycles-frontend # 0.53% frontend cycles idle (71.67%) 1,258,837,025 stalled-cycles-backend # 0.59% backend cycles idle (71.66%) 454,445,559,376 instructions # 2.14 insn per cycle # 0.00 stalled cycles per insn (71.66%) 83,557,588,477 branches # 1.860 G/sec (71.67%) 174,313,252 branch-misses # 0.21% of all branches (71.67%) no-slp: 44,579.83 msec task-clock # 0.986 CPUs utilized 13,435 context-switches # 301.369 /sec 33 cpu-migrations # 0.740 /sec 1,691 page-faults # 37.932 /sec 210,070,080,283 cycles # 4.712 GHz (71.68%) 1,066,774,628 stalled-cycles-frontend # 0.51% frontend cycles idle (71.68%) 1,082,255,966 stalled-cycles-backend # 0.52% backend cycles idle (71.66%) 455,067,924,891 instructions # 2.17 insn per cycle # 0.00 stalled cycles per insn (71.68%) 83,597,450,748 branches # 1.875 G/sec (71.65%) 151,897,866 branch-misses # 0.18% of all branches (71.68%) Note the differences in "backend cycles idle" and "stalled cycles per insn". I also observed the same pattern on a much older generation Intel (although the baseline instructions per clock there are around 0.56). 
slp: 42232.64 tps ( 75.1 allocs/op, 12.1 tasks/op, 44818 insns/op) 42318.87 tps ( 75.1 allocs/op, 12.1 tasks/op, 44849 insns/op) 42331.33 tps ( 75.1 allocs/op, 12.1 tasks/op, 44857 insns/op) 42315.89 tps ( 75.1 allocs/op, 12.1 tasks/op, 44875 insns/op) 42410.19 tps ( 75.1 allocs/op, 12.1 tasks/op, 44818 insns/op) median 42318.87 tps ( 75.1 allocs/op, 12.1 tasks/op, 44849 insns/op) median absolute deviation: 12.46 maximum: 42410.19 minimum: 42232.64 no-slp: 42464.18 tps ( 75.1 allocs/op, 12.1 tasks/op, 44886 insns/op) 42631.88 tps ( 75.1 allocs/op, 12.1 tasks/op, 44939 insns/op) 42783.95 tps ( 75.1 allocs/op, 12.1 tasks/op, 44961 insns/op) 42671.23 tps ( 75.1 allocs/op, 12.1 tasks/op, 44947 insns/op) 42487.82 tps ( 75.1 allocs/op, 12.1 tasks/op, 44875 insns/op) median 42631.88 tps ( 75.1 allocs/op, 12.1 tasks/op, 44939 insns/op) median absolute deviation: 144.06 maximum: 42783.95 minimum: 42464.18 slp: 26,877.01 msec task-clock # 0.989 CPUs utilized 15,621 context-switches # 0.581 K/sec 9 cpu-migrations # 0.000 K/sec 55,322 page-faults # 0.002 M/sec 96,084,360,190 cycles # 3.575 GHz (72.55%) 71,435,545,235 stalled-cycles-frontend # 74.35% frontend cycles idle (72.57%) 59,531,573,539 stalled-cycles-backend # 61.96% backend cycles idle (70.96%) 53,273,420,083 instructions # 0.55 insn per cycle # 1.34 stalled cycles per insn (72.55%) 10,240,844,987 branches # 381.026 M/sec (72.57%) 94,348,150 branch-misses # 0.92% of all branches (72.57%) no-slp: 26,381.66 msec task-clock # 0.971 CPUs utilized 15,586 context-switches # 0.591 K/sec 9 cpu-migrations # 0.000 K/sec 55,318 page-faults # 0.002 M/sec 94,317,505,691 cycles # 3.575 GHz (72.59%) 69,693,601,709 stalled-cycles-frontend # 73.89% frontend cycles idle (72.59%) 57,579,078,046 stalled-cycles-backend # 61.05% backend cycles idle (58.08%) 53,260,417,953 instructions # 0.56 insn per cycle # 1.31 stalled cycles per insn (72.60%) 10,235,123,948 branches # 387.964 M/sec (72.60%) 96,002,988 branch-misses # 0.94% of all 
branches (72.62%) Closes #9752 * github.com:scylladb/scylla: build: rearrange -O3 and -f<optimization-option> options build: disable superword-level parallism (slp) on clang	2021-12-07 18:01:26 +02:00
Avi Kivity	c519857beb	build: rearrange -O3 and -f<optimization-option> options It turns out that -O3 enables -fslp-vectorize even if it was disabled earlier on the command line, before -O3. Rearrange the code so that -O3 comes before the more specific optimization options.	2021-12-07 17:52:32 +02:00
Juliusz Stasiewicz	5a8741a1ca	cdc: Throw when ALTERing cdc options without "enabled":"..." The problem was that such a command: ``` alter table ks.cf with cdc={'ttl': 120}; ``` would assume that "enabled" parameter is the default ("false") and, in effect, disable CDC on that table. This commit forces the user to specify that key. Fixes #6475 Closes #9720	2021-12-07 17:37:44 +02:00
Avi Kivity	04ad07b072	build: disable superword-level parallism (slp) on clang Clang (and gcc) can combine loads and stores of independent variables into wider operations, often using vector registers. This reduces instruction count and execution unit occupancy. However, clang is too aggressive and generates loads that break the store-to-load forwarding rules: a load must be the same size or smaller than the corresponding store, or it will execute with a large penalty. Disabling slp results in larger but faster code. Comparing before and after on Zen 3: slp: 226766.49 tps ( 75.1 allocs/op, 12.1 tasks/op, 45073 insns/op) 226679.57 tps ( 75.1 allocs/op, 12.1 tasks/op, 45074 insns/op) 226168.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 45061 insns/op) 225884.34 tps ( 75.1 allocs/op, 12.1 tasks/op, 45068 insns/op) 225998.16 tps ( 75.1 allocs/op, 12.1 tasks/op, 45056 insns/op) median 226168.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 45061 insns/op) median absolute deviation: 284.45 maximum: 226766.49 minimum: 225884.34 no slp: 228195.33 tps ( 75.1 allocs/op, 12.1 tasks/op, 45109 insns/op) 227773.76 tps ( 75.1 allocs/op, 12.1 tasks/op, 45123 insns/op) 228088.98 tps ( 75.1 allocs/op, 12.1 tasks/op, 45117 insns/op) 228157.43 tps ( 75.1 allocs/op, 12.1 tasks/op, 45129 insns/op) 228072.29 tps ( 75.1 allocs/op, 12.1 tasks/op, 45128 insns/op) median 228088.98 tps ( 75.1 allocs/op, 12.1 tasks/op, 45117 insns/op) median absolute deviation: 68.45 maximum: 228195.33 minimum: 227773.76 Disabling slp increases the instruction count by ~60 instructions per op (0.13%) but increases throughput by 0.85%. This shows the impact of the violation is quite high. 
It can also be observed by the effect on stalled cycles: slp: 44,932.70 msec task-clock # 0.993 CPUs utilized 13,618 context-switches # 303.075 /sec 33 cpu-migrations # 0.734 /sec 1,695 page-faults # 37.723 /sec 211,997,160,633 cycles # 4.718 GHz (71.67%) 1,118,855,786 stalled-cycles-frontend # 0.53% frontend cycles idle (71.67%) 1,258,837,025 stalled-cycles-backend # 0.59% backend cycles idle (71.66%) 454,445,559,376 instructions # 2.14 insn per cycle # 0.00 stalled cycles per insn (71.66%) 83,557,588,477 branches # 1.860 G/sec (71.67%) 174,313,252 branch-misses # 0.21% of all branches (71.67%) no-slp: 44,579.83 msec task-clock # 0.986 CPUs utilized 13,435 context-switches # 301.369 /sec 33 cpu-migrations # 0.740 /sec 1,691 page-faults # 37.932 /sec 210,070,080,283 cycles # 4.712 GHz (71.68%) 1,066,774,628 stalled-cycles-frontend # 0.51% frontend cycles idle (71.68%) 1,082,255,966 stalled-cycles-backend # 0.52% backend cycles idle (71.66%) 455,067,924,891 instructions # 2.17 insn per cycle # 0.00 stalled cycles per insn (71.68%) 83,597,450,748 branches # 1.875 G/sec (71.65%) 151,897,866 branch-misses # 0.18% of all branches (71.68%) Note the differences in "backend cycles idle" and "stalled cycles per insn". I also observed the same pattern on a much older generation Intel (although the baseline instructions per clock there are around 0.56). 
slp: 42232.64 tps ( 75.1 allocs/op, 12.1 tasks/op, 44818 insns/op) 42318.87 tps ( 75.1 allocs/op, 12.1 tasks/op, 44849 insns/op) 42331.33 tps ( 75.1 allocs/op, 12.1 tasks/op, 44857 insns/op) 42315.89 tps ( 75.1 allocs/op, 12.1 tasks/op, 44875 insns/op) 42410.19 tps ( 75.1 allocs/op, 12.1 tasks/op, 44818 insns/op) median 42318.87 tps ( 75.1 allocs/op, 12.1 tasks/op, 44849 insns/op) median absolute deviation: 12.46 maximum: 42410.19 minimum: 42232.64 no-slp: 42464.18 tps ( 75.1 allocs/op, 12.1 tasks/op, 44886 insns/op) 42631.88 tps ( 75.1 allocs/op, 12.1 tasks/op, 44939 insns/op) 42783.95 tps ( 75.1 allocs/op, 12.1 tasks/op, 44961 insns/op) 42671.23 tps ( 75.1 allocs/op, 12.1 tasks/op, 44947 insns/op) 42487.82 tps ( 75.1 allocs/op, 12.1 tasks/op, 44875 insns/op) median 42631.88 tps ( 75.1 allocs/op, 12.1 tasks/op, 44939 insns/op) median absolute deviation: 144.06 maximum: 42783.95 minimum: 42464.18 slp: 26,877.01 msec task-clock # 0.989 CPUs utilized 15,621 context-switches # 0.581 K/sec 9 cpu-migrations # 0.000 K/sec 55,322 page-faults # 0.002 M/sec 96,084,360,190 cycles # 3.575 GHz (72.55%) 71,435,545,235 stalled-cycles-frontend # 74.35% frontend cycles idle (72.57%) 59,531,573,539 stalled-cycles-backend # 61.96% backend cycles idle (70.96%) 53,273,420,083 instructions # 0.55 insn per cycle # 1.34 stalled cycles per insn (72.55%) 10,240,844,987 branches # 381.026 M/sec (72.57%) 94,348,150 branch-misses # 0.92% of all branches (72.57%) no-slp: 26,381.66 msec task-clock # 0.971 CPUs utilized 15,586 context-switches # 0.591 K/sec 9 cpu-migrations # 0.000 K/sec 55,318 page-faults # 0.002 M/sec 94,317,505,691 cycles # 3.575 GHz (72.59%) 69,693,601,709 stalled-cycles-frontend # 73.89% frontend cycles idle (72.59%) 57,579,078,046 stalled-cycles-backend # 61.05% backend cycles idle (58.08%) 53,260,417,953 instructions # 0.56 insn per cycle # 1.31 stalled cycles per insn (72.60%) 10,235,123,948 branches # 387.964 M/sec (72.60%) 96,002,988 branch-misses # 0.94% of all 
branches (72.62%)	2021-12-07 17:08:38 +02:00
Raphael S. Carvalho	648c921af2	cql3: statements: Fix UB when getting memory consumption limit for unpaged query get_max_result_size() is called on a slice that was moved in the previous argument. This results in use-after-move with clang, whose evaluation order is left-to-right. For paged queries, max_result_size is later overridden by query_pager, but for unpaged and/or reversed queries it can happen that max result size incorrectly contains the 1MB limit for paged, non-reversed queries. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211207145133.69764-1-raphaelsc@scylladb.com>	2021-12-07 16:57:01 +02:00
Avi Kivity	edaa0c468d	cql3: expr: standardize on struct tag for expression components Expression components are pure data, so emphasize this by using the struct tag consistently. This is just a cosmetic change. Closes #9740	2021-12-07 15:46:25 +02:00
Botond Dénes	2e5440bdf2	Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too. There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2. tests: unit(debug). Closes #9725 * github.com:scylladb/scylla: compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2 sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2 sstable_set: update incremental_reader_selector to flat_mutation_reader_v2	2021-12-07 15:17:38 +02:00
Raphael S. Carvalho	2435bd14c6	compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:57 -03:00
Raphael S. Carvalho	c6399005a3	sstable_set: update make_crawling_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:55 -03:00
Raphael S. Carvalho	aebbe68239	sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:53 -03:00
Raphael S. Carvalho	c3c070a5ca	sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:51 -03:00
Raphael S. Carvalho	6b664067dd	sstable_set: update incremental_reader_selector to flat_mutation_reader_v2 Cannot be fully converted to flat_mutation_reader_v2 yet, as the selector is built on combined_reader interface which is still not converted. So only updated wherever possible. Subsequent work will update sstable_set readers, which uses the selector, to flat_mutation_reader_v2. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:49 -03:00
Kamil Braun	75bab2beec	raft: server: print the ID of aborted server	2021-12-07 11:23:34 +01:00
Kamil Braun	45fe0d015d	test: raft: randomized_nemesis_test: run stop_crash nemesis in `basic_generator_test` There is a separate thread that periodically stops/crashes and restarts a randomly chosen server, so the nemesis runs concurrently with reconfigurations and network partitions.	2021-12-07 11:23:34 +01:00
Kamil Braun	f9073b864f	test: raft: randomized_nemesis_test: introduce `stop_crash` operation An operation which chooses a server randomly, randomly chooses whether to crash or gracefully stop it, performs the chosen operation, and restarts the server after a selected delay.	2021-12-07 11:23:34 +01:00
Kamil Braun	168390d4bb	test: raft: randomized_nemesis_test: environment: implement server `stop` and `crash` `stop` gracefully stops a running server, `crash` immediately "removes" it (from the point of view of the rest of the environment). We cannot simply destroy a running server. Read the comments in `crash` to understand how it's implemented.	2021-12-07 11:23:34 +01:00
Kamil Braun	485c0b1819	raft: server: don't register metrics in `start()` Instead, expose `register_metrics()` at the `server` interface (previously it was a private method of `server_impl`). Metrics are global, so `register_metrics()` cannot be called on two servers that have the same ID. Not registering metrics in `start()` makes it possible to create such servers, which is useful e.g. in tests when we want to simulate server stops and restarts.	2021-12-07 11:23:33 +01:00
Kamil Braun	429f87160b	test: raft: randomized_nemesis_test: raft_server: return `stopped_error` when called during abort Don't return `gate_closed_exception` which is an internal implementation detail and which callers don't expect.	2021-12-07 11:22:52 +01:00
Kamil Braun	c79dacc028	test: raft: randomized_nemesis_test: handle `raft::stopped_error` Include it in possible call result types. It will start appearing when we enable server aborts in the middle of the test.	2021-12-07 11:22:52 +01:00
Kamil Braun	25a8772306	test: raft: randomized_nemesis_test: handle missing servers in `environment` call functions `environment` functions for performing operations on Raft servers: `is_leader`, `call`, `reconfigure`, `get_configuration`, currently assume that a server is running on each node at all times and that it never changes. Prepare these functions for missing/restarting servers.	2021-12-07 11:22:51 +01:00
Kamil Braun	d281b2c0ea	test: raft: randomized_nemesis_test: environment: split `new_server` into `new_node` and `start_server` Soon it will be possible to stop a server and then start a completely new `raft::server` instance but which uses the same ID and persistence, simulating a server restart. For this we introduce the concept of a "node" which keeps the persistence alive (through a shared pointer). To start a server - using `start_server` - we must first create a node on which it will be running through `new_node`. `new_server` is now a short function which does these two things.	2021-12-07 11:22:51 +01:00
Kamil Braun	5c803ae1d0	test: raft: randomized_nemesis_test: remove `environment::get_server` To perform calls to servers in a Raft cluster, the test code would first obtain a reference to a server through `get_server` and then call the server directly. This will not be safe when we implement server crashes and restarts, as servers will disappear in the middle of operations; we don't want the test code to keep references to no-longer-existing servers. In the new API the test will call the `environment` to perform operations, giving it the server ID. `environment` will handle disappearing servers underneath.	2021-12-07 11:22:51 +01:00
Kamil Braun	0d64fbc39d	test: raft: randomized_nemesis_test: construct `persistence_proxy` outside `raft_server<M>::create`	2021-12-07 11:22:51 +01:00
Kamil Braun	4e8a86c6a1	test: raft: randomized_nemesis_test: persistence_proxy: store a shared pointer to `persistence` We want the test to be able to reuse `persistence` even after `persistence_proxy` is destroyed for simulating server restarts. We'll do it by having the test keep a shared pointer to `persistence`. To do that, instead of storing `persistence` by value and constructing it inside `persistence_proxy`, store it by `lw_shared_ptr` which is taken through the constructor (so `persistence` itself is now constructed outside of `persistence_proxy`).	2021-12-07 11:22:51 +01:00
Kamil Braun	16b1d2abcc	test: raft: randomized_nemesis_test: persistence: split into two classes The previous `persistence` implemented the `raft::persistence` interface and had two different responsibilities: - representing "persistent storage", with the ability to store and load stuff to/from it, - accessing in-memory state shared with a corresponding instance of `impure_state_machine` that is running along `persistence` inside a `raft::server`. For example, `persistence::store_snapshot_descriptor` would persist not only the snapshot descriptor, but also the corresponding snapshot. The descriptor was provided through a parameter but the snapshot wasn't. To obtain the snapshot we use a data structure (`snapshots_t`) that both `persistence` and `impure_state_machine` had a reference to. We split `persistence` into two classes: - `persistence` which handles only the first responsibility, i.e. storing and loading stuff; everything to store is provided through function parameters (e.g. now we have a `store_snapshot` function which takes both the snapshot and its descriptor through the parameters) and everything to load is returned directly by functions (e.g. `load_snapshot` returns a pair containing both the descriptor and corresponding snapshot) - `persistence_proxy` (for lack of a better name) which implements `raft::persistence`, contains the above `persistence` inside and shares a data structure with `impure_state_machine` (so `persistence_proxy` corresponds to the old `persistence`). The goal is to prepare for reusing the persisted stuff between different instances of `raft::server` running in a single test when simulating server shutdowns/crashes and restarts. 
When destroying a `raft::server`, we destroy its `impure_state_machine` and `persistence_proxy` (we are forced to because constructing a `raft::server` requires a `unique_ptr` to `raft::persistence`), but we will be able to keep the underlying `persistence` for the next instance (if we simulate a restart) - after a slight modification made in the next commit.	2021-12-07 11:22:51 +01:00
Kamil Braun	c1db77fa61	test: raft: logical_timer: introduce `sleep_until` Allows sleeping until a given time point arrives.	2021-12-07 11:22:51 +01:00
Avi Kivity	79bcdc104e	Merge "Fix stateful multi-range scans" from Botond " Currently stateful (readers being saved and resumed on page boundaries) multi-range scans are broken in multiple ways. Trying to use them can result in anything from use-after-free (#6716) or getting corrupt data (#9718). Luckily no-one is doing such queries today, but this started to change recently as code such as Alternator TTL and distributed aggregate reads started using this. This series fixes both problems and adds a unit test too exercising this previously completely unused code-path. Fixes: #6716 Fixes: #9718 Tests: unit(dev, release, debug) " * 'fix-stateful-multi-range-scans/v1' of https://github.com/denesb/scylla: test/boost/multishard_mutation_query_test: add multi-range test test/boost/multishard_mutation_query_test: add multi-range support multishard_mutation_query: don't drop data during stateful multi-range reads multishard_combining_reader: reader_lifecycle_policy: allow saving read range on fast-forward	2021-12-07 12:19:56 +02:00
Nadav Har'El	ca46c3ba8f	test/redis: replace run script with shorter Python script In the past, we had very similar shell scripts for test/alternator/run, test/cql-pytest/run and test/redis/run. Most of the code of all three scripts was identical - dealing with starting Scylla in a temporary directory, running pytest, and so on. The code duplication meant that every time we fixed a bug in one of those scripts, or added an important boot-time parameter to Scylla, we needed to fix all three scripts. The solution was to convert the run scripts to Python, and to use a common library, test/cql-pytest/run.py, for the main features shared by all scripts - starting Scylla, waiting for protocols to be available, and running pytest. However, we only did this conversion for alternator and cql-pytest - redis remained the old shell scripts. This patch completes the conversion also for redis. As expected, no change was needed to the run.py library code, which was already strong enough for the needs of the redis tests. Fixes #9748. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211207081423.1187847-1-nyh@scylladb.com>	2021-12-07 12:18:07 +02:00
Avi Kivity	395b30bca8	mutation_reader: update make_filtering_reader() to flat_mutation_reader_v2 As part of the drive to move over to flat_mutation_reader_v2, update make_filtering_reader(). Since it doesn't examine range tombstones (only the partition_start, to filter the key) the entire patch is just glue code upgrading and downgrading users in the pipeline (or removing a conversion, in one case). Test: unit (dev) Closes #9723	2021-12-07 12:18:07 +02:00
Raphael S. Carvalho	6737c88045	compaction_manager: use single semaphore for serialization of maintenance compactions We have three semaphores for serialization of maintenance ops. 1) _rewrite_sstables_sem: for scrub, cleanup and upgrade. 2) _major_compaction_sem: for major. 3) _custom_job_sem: for reshape, resharding and offstrategy. Scrub, cleanup and upgrade should be serialized with major, so the rewrite semaphore should be merged into the major one. Offstrategy is also a maintenance op that should be serialized with the others, to reduce compaction aggressiveness and space requirement. Resharding is a one-off operation, so it can be merged there too. The same applies to reshape, which can take a long time; not serializing it with other maintenance activity can lead to exhaustion of resources and a high space requirement. Let's have a single semaphore to guarantee their serialization. Deadlock isn't an issue because locks are always taken in the same order. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211201182046.100942-1-raphaelsc@scylladb.com>	2021-12-07 12:18:07 +02:00
Eliran Sinvani	426fc8db3a	Repair: add a stringify function for node_ops_cmd Adding a stringify function for the node_ops_cmd enum will make the log output more readable and will (hopefully) make it possible to do initial analysis without consulting the source code. Refs #9629 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #9745	2021-12-07 12:18:07 +02:00
Nadav Har'El	d3abff9ea1	test/alternator: validate that TagResource needs a Tags parameter A short new test to verify that in the TagResource operation, the Tags parameter - specifying which tags to set - is required. The test passes on both AWS and Alternator - they both produce a ValidationException in this case (the specific human-readable error message is different, though, so we don't check it). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211206140541.1157574-1-nyh@scylladb.com>	2021-12-06 15:08:16 +01:00
Avi Kivity	f907205b92	utils: logalloc: correct and adjust timing unit in stall report The stall report uses the millisecond unit, but actually reports nanoseconds. Switch to microseconds (milliseconds are a bit too coarse) and use the safer "duration / 1us" style rather than "duration::count()" that leads to unit confusion. Fixes #9733. Closes #9734	2021-12-06 09:51:57 +02:00
Botond Dénes	ade4cdf0e7	Merge "compaction: quarantine invalid sstables" from Benny Halevy " This series adds an optional "quarantine" subdirectory to the table data directory that may contain sstables that are fenced-off from regular compaction. The motivation, as discussed in https://github.com/scylladb/scylla/issues/7658 and https://github.com/scylladb/scylla/issues/9537#issuecomment-953635973, is to prevent regular compaction from spreading sstable corruption further to other sstables, and allow investigating the invalid sstables using the scylla-sstable tool, or scrubbing them in segregate mode. When sstables are found to be invalid in scrub::mode::validate they are moved to the quarantine directory, where they will still be available for reading, but will not be considered for regular or major compaction. By default scrub, in all other modes, will consider all sstables, including the quarantined ones. To make it more efficient, a new option was added and exposed via the storage_service/keyspace_scrub api - quarantine_mode. When set to quarantine_mode::only, scrub will read only the quarantined sstables, so that the user can start with validate mode to detect invalid sstables and quarantine them, then scrub/segregate only the quarantined sstables. 
Test: unit(dev), database_test(debug) DTest: nodetool_additional_test.py:TestNodetool.{scrub_ks_sstable_with_invalid_fragment_test,scrub_segregate_sstable_with_invalid_fragment_test,scrub_segregate_ks_sstable_with_invalid_fragment_test,scrub_sstable_with_invalid_fragment_test,scrub_with_multi_nodes_expect_data_rebuild_test,scrub_with_one_node_expect_data_loss_test,validate_ks_sstable_with_invalid_fragment_test,validate_with_one_node_expect_data_loss_test,validate_sstable_with_invalid_fragment_test} " * tag 'quarantine-invalid-sstables-v6' of github.com:bhalevy/scylla: test: sstable_compaction_test: add sstable_scrub_quarantine_mode_test compaction: scrub: add quarantine_mode option compaction_manager: perform_sstable_scrub: get the whole compaction_type_options::scrub compaction: scrub_sstables_validate_mode: quarantine invalid sstables test: database_test: add snapshot_with_quarantine_works test: database_test: add populate_from_quarantine_works distributed_loader: populate_keyspace: populate also from the quarantine dir distributed_loader: populate_column_family: add must_exist param sstables: add is_quarantined sstables: add is_eligible_for_compaction sstables: define symbolic names for table subdirectories	2021-12-06 08:58:43 +02:00
Takuya ASADA	ea20f89c56	dist: allow running scylla-housekeeping with strict umask setting To avoid failing scylla-housekeeping in strict umask environment, we need to chmod a+r on repository file and housekeeping.uuid. Fixes #9683 Closes #9739	2021-12-05 20:46:46 +02:00
Benny Halevy	044e4a6b72	token_metadata: delete private constructor It is not used. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211205174306.450536-1-bhalevy@scylladb.com>	2021-12-05 19:49:29 +02:00
Avi Kivity	32beb9e7e4	Merge "Keep proxy reference from thrift" from Pavel E " Thrift is one of the users of global storage proxy instance. This set removes all such calls from the thrift/ code. tests: unit(dev) " * 'br-thrift-reference-storage-proxy' of https://github.com/xemul/scylla: thrift: Use local proxy reference in do_paged_slice thrift: Use local proxy reference in handler methods thrift: Keep sharded proxy reference on thrift_handler	2021-12-05 19:22:33 +02:00
Benny Halevy	9ed72cac95	test: sstable_compaction_test: add sstable_scrub_quarantine_mode_test For each quarantine mode: Validate sstables to quarantine one of them and then scrub with the given quarantine mode and verify from the output whether the quarantined sstable was scrubbed or not. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:29:58 +02:00
Benny Halevy	cc122984d6	compaction: scrub: add quarantine_mode option Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:29:04 +02:00
Benny Halevy	60ff28932c	compaction_manager: perform_sstable_scrub: get the whole compaction_type_options::scrub So we can pass additional options on top of the scrub mode. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:21:37 +02:00
Benny Halevy	bbe275f37d	compaction: scrub_sstables_validate_mode: quarantine invalid sstables When invalid sstables are detected, move them to the quarantine subdirectory so they won't be selected for regular compaction. Refs #7658 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:14:16 +02:00
Benny Halevy	3eabfad9fc	test: database_test: add snapshot_with_quarantine_works Test that snapshot includes quarantined sstables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	11b54d44d9	test: database_test: add populate_from_quarantine_works Test that we load quarantined sstables by creating a dataset, moving a sstable to the quarantine dir, and then reload the table and verify the dataset. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	075962b45a	distributed_loader: populate_keyspace: populate also from the quarantine dir sstables in the quarantine subdirectory are part of the table. They're just not eligible for non-scrub compaction. Call populate_column_family also for the quarantine subdirectory, allowing it to not exist. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	f643dc90a9	distributed_loader: populate_column_family: add must_exist param Check if the directory to be loaded exists. Currently must_exist=true in all cases, but it may be set to false when loading directories that may not exist. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	13e7b00f2e	sstables: add is_quarantined Quarantined sstables will reside in a "quarantine" subdirectory and are also not eligible for regular compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	07c5ddf182	sstables: add is_eligible_for_compaction Currently compaction_manager tracks sstables based on !requires_view_building() and similarly, table::in_strategy_sstables picks up only sstables that are not in staging. is_eligible_for_compaction() generalizes this condition in preparation for adding a quarantine subdirectory for invalid sstables that should not be compacted as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	bdc53880d4	sstables: define symbolic names for table subdirectories Define the "staging", "upload", and "snapshots" subdirectory names as named const expressions in the sstables namespace rather than relying on their string representation, that could lead to typo mistakes. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Avi Kivity	bfdab1e92e	alternator: ttl: don't initialize vector from initializer_list in coroutine Initializing a vector from an initializer_list defeats move construction (since initializer_list is const). Moreover it is suspected to cause a crash due to a miscompile. In any case, this patch fixes the crash. Fixes #9735. Closes #9736	2021-12-05 17:51:05 +02:00
Avi Kivity	8d724835eb	Merge 'select_statement: Calculate _restrictions->need_filtering() only once' from Jan Ciołek Originally mentioned in: https://github.com/scylladb/scylla/pull/9481#issuecomment-982698208 Currently we call `_restrictions->need_filtering()` each time a prepared select is executed. This is not super efficient - `need_filtering` has to scan through the whole AST and analyze it. This PR calculates value of `_restrictions->need_filtering()` only once and then uses this precomputed value. I ran `perf_simple_query` on my laptop throttled to 1GHz and it looks like this saves ~1000 instructions/op. ```bash median 38459.09 tps ( 75.1 allocs/op, 12.1 tasks/op, 46099 insns/op) median 38743.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 46091 insns/op) median 38489.52 tps ( 75.1 allocs/op, 12.1 tasks/op, 46097 insns/op) median 38492.10 tps ( 75.1 allocs/op, 12.1 tasks/op, 46102 insns/op) median 38478.65 tps ( 75.1 allocs/op, 12.1 tasks/op, 46098 insns/op) median 38930.07 tps ( 75.1 allocs/op, 12.1 tasks/op, 44922 insns/op) median 38777.52 tps ( 75.1 allocs/op, 12.1 tasks/op, 44904 insns/op) median 39325.41 tps ( 75.1 allocs/op, 12.1 tasks/op, 44925 insns/op) median 38640.51 tps ( 75.1 allocs/op, 12.1 tasks/op, 44907 insns/op) median 39075.89 tps ( 75.1 allocs/op, 12.1 tasks/op, 44920 insns/op) ./build/release/test/perf/perf_simple_query --cpuset 1 -m 1G --random-seed 0 --task-quota-ms 10 --operations-per-shard 1000000 ``` Closes #9727 * github.com:scylladb/scylla: select_statement: Use precomputed value of _restrictions->need_filtering() select_statement: Store whether restrictions need filtering in a variable	2021-12-05 13:38:51 +02:00
Takuya ASADA	097a6ee245	dist: add support im4gn/is4gen instance on AWS Add support next-generation, storage-optimized ARM64 instance types. Fixes #9711 Closes #9730	2021-12-05 13:20:01 +02:00
Nadav Har'El	de21455dfe	Rename one logger which had a space in its name We had a logger called "query result log", with spaces, which made it impossible to enable it with the REST API due to missing percent decoding support in our HTTP server (see #9614). Although that HTTP server bug should be fixed as well (in Seastar - see scylladb/seastar#725), there is no good reason to have a logger name with a space in it. This is the only logger whose name has a space: We have 77 other loggers using underscores (_) in their name, and only 9 using hyphens (-). So in this patch we choose the more popular alternative - an underscore. Fixes #9614. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211205093732.1092553-1-nyh@scylladb.com>	2021-12-05 12:18:21 +02:00
Pavel Emelyanov	db5678bb7f	Merge "Kill unused code in compaction" from Raphael tests: unit(dev). * github.com/raphaelsc/scylla.git cleanups_for_compaction_12_03 compaction_strategy: kill unused compaction_strategy_type::major compaction: Log skip of fully expired sstables compaction_strategy: kill unused can_compact_partial_runs() compaction: kill useless on_skipped_expired_sstable() compaction: merge _total_input_sstables and _ancestors	2021-12-03 19:22:08 +03:00
Jan Ciolek	22c3e00c44	select_statement: Use precomputed value of _restrictions->need_filtering() Instead of calculating _restrictions->need_filtering() each time, we can now use the value that has been already calculated. This used to happen during query execution, so we get an increase in performance. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-03 17:03:53 +01:00
Jan Ciolek	075b3a45fd	select_statement: Store whether restrictions need filtering in a variable Instead of calculating _restrictions->need_filtering() we can calculate it only once and then use this computed variable. It turns out that _restrictions->need_filtering() is called during execution of prepared statements and it has to scan through the whole AST, so doing it only once gives us a performance gain. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-03 17:01:09 +01:00
Raphael S. Carvalho	2f9f089eda	compaction_strategy: kill unused compaction_strategy_type::major Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:27:10 -03:00
Raphael S. Carvalho	0e3d388ebb	compaction: Log skip of fully expired sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:25:48 -03:00
Raphael S. Carvalho	9725e5efa9	compaction_strategy: kill unused can_compact_partial_runs() This strategy method was introduced unnecessarily. We assumed it was going to be needed, but it turns out it was never needed, not even for ICS. Also it's built on a wrong assumption as an output sstable run being generated can never be compacted in parallel as the non-overlapping requirement can be easily broken. LCS for example can allow parallel compaction on different runs (levels) but correctness cannot be guaranteed when the same runs are compacted in parallel. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:20:51 -03:00
Raphael S. Carvalho	7a7a2467fa	compaction: kill useless on_skipped_expired_sstable() It was introduced by commit `5206a97915` because a fully expired sstable wouldn't be registered and therefore could never be removed from backlog tracker. This is no longer possible as table is now responsible for removing all input sstables. So let's kill on_skipped_expired_sstable() as it's now only boilerplate we don't need. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:19:29 -03:00
Raphael S. Carvalho	32c2534e91	compaction: merge _total_input_sstables and _ancestors Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:19:23 -03:00
Pavel Emelyanov	d86b35f474	thrift: Use local proxy reference in do_paged_slice This place needs some more care than simple replacement Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-03 17:56:04 +03:00
Pavel Emelyanov	35c35602ae	thrift: Use local proxy reference in handler methods Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-03 17:56:04 +03:00
Pavel Emelyanov	2d8272dc03	thrift: Keep sharded proxy reference on thrift_handler Carried via main -> controller -> server -> factory -> handler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-03 17:48:19 +03:00
Piotr Sarna	0bd139e81c	Merge 'cql3: expr: detemplate and deinline find_in_expression() ... and count_if()' from Avi Kivity The expression code provides some utilities to examine and manipulate expressions at prepare time. These are not (or should not be) in the fast path and so should be optimized for compile time and code footprint rather than run time. This series does so by detemplating and deinlining find_in_expression() and count_if(). Closes #9712 * github.com:scylladb/scylla: cql3: expr: adjust indentation in recurse_until() cql3: expr: detemplate count_if() cql3: expr: detemplate count_if() cql3: expr: rewrite count_if() in terms of recurse_until() cql3: expr: deinline recurse_until() cql3: expr: detemplate find_in_expression	2021-12-03 15:41:07 +01:00
Piotr Sarna	3867ca2fd6	Merge 'cql3: Don't allow unset values inside UDT' from Jan Ciołek Scylla doesn't support unset values inside UDT. The old code used to convert `unset` to `null`, which seems incorrect. There is an extra space in the error message to retain compatibility with Cassandra. Fixes: #9671 Closes #9724 * github.com:scylladb/scylla: cql-pytest: Enable test for UDT with unset values cql3: Don't allow unset values inside UDT	2021-12-03 15:36:55 +01:00
Jan Ciolek	3ae8752812	cql-pytest: Enable test for UDT with unset values The test testUDTWithUnsetValues was marked as xfail, but now the issue has been fixed and we can enable it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-03 14:46:21 +01:00
Jan Ciolek	be14904416	cql3: Don't allow unset values inside UDT Scylla doesn't support unset values inside UDT. The old code used to convert unset to null, which seems incorrect. There is an extra space in the error message to retain compatibility with Cassandra. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-12-03 14:46:21 +01:00
Gleb Natapov	b954d91b4f	migration_manager: co-routinize announce_column_family_drop function. Message-Id: <20211202150531.1277448-30-gleb@scylladb.com>	2021-12-03 13:29:34 +02:00
Botond Dénes	fb8d268251	test/boost/multishard_mutation_query_test: add multi-range test	2021-12-03 10:51:45 +02:00
Botond Dénes	2b613a13a5	test/boost/multishard_mutation_query_test: add multi-range support In the test infrastructure code, so we can add tests passing multiple ranges to the tested `multishard_{mutation,data}_query()`, exercising multi-range functionality.	2021-12-03 10:51:45 +02:00
Botond Dénes	5380cb0102	multishard_mutation_query: don't drop data during stateful multi-range reads When multiple ranges are passed to `multishard_{mutation,data}_query()`, it wraps the multishard reader with a multi-range one. This interferes with the disassembly of the multishard reader's buffer at the end of the page, because the multi-range reader becomes the top-level reader, denying direct access to the multishard reader itself, whose buffer is then dropped. This confuses the reading logic, causing data corruption on the next page(s). A further complication is that the multi-range reader can include data from more than one range in its buffer when filling it. To solve this, a special-purpose multi-range reader is introduced and used instead of the generic one, which solves both these problems by guaranteeing that: * Upon calling fill_buffer(), the entire content of the underlying multishard reader is moved to that of the top-level multi-range reader. So calling `detach_buffer()` guarantees to remove all unconsumed fragments from the top-level readers. * fill_buffer() will never mix data from more than one range. It will always stop on range boundaries and will only cross if the last range was consumed entirely. With this, multi-range reads finally work with reader-saving.	2021-12-03 10:45:06 +02:00
Botond Dénes	953603199e	multishard_combining_reader: reader_lifecycle_policy: allow saving read range on fast-forward The reader_lifecycle_policy API was created around the idea of shard readers (optionally) being saved and reused on the next page. To do this, the lifecycle policy has to also be able to control the lifecycle of by-reference parameters of readers: the slice and the range. This was possible from day 1, as the readers are created through the lifecycle policy, which can intercept and replace the said parameters with copies that are created in stable storage. There was one hole in the design though: fast-forwarding, which can change the range of the read, without the lifecycle policy knowing about this. In practice this results in fast-forwarded readers being saved together with the wrong range, their range reference becoming stale. The only lifecycle implementation prone to this is the one in `multishard_mutation_query.cc`, as it is the only one actually saving readers. It will fast-forward its reader when the query happens over multiple ranges. There were no problems related to this so far because no one passes more than one range to said functions, but this is incidental. This patch solves this by adding an `update_read_range()` method to the lifecycle policy, allowing the shard reader to update the read range when being fast forwarded. To allow the shard reader to also have control over the lifecycle of this range, a shared pointer is used. This control is required because when an `evictable_reader` is the top-level reader on the shard, it can invoke `create_reader()` with an edited range after `update_read_range()`, replacing the fast-forwarded-to range with a new one, yanking it out from under the feet of the evictable reader itself. By using a shared pointer here, we can ensure the range stays alive while it is the current one.	2021-12-03 10:27:44 +02:00
Raphael S. Carvalho	4a02e312f6	compaction: increase disjoint tolerance in TWCS reshape When reshaping TWCS table in relaxed mode, which is the case for offstrategy and boot, disjoint tolerance is too strict, which can lead those processes to do more work than needed. Let's increase the tolerance to max threshold, which will limit the amount of sstables opened in compaction to a reasonable amount. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211130132538.56285-1-raphaelsc@scylladb.com>	2021-12-03 06:38:42 +02:00
Raphael S. Carvalho	6ad630c095	scylla-gdb.py: fix unique ptr on newer libstdc++ unfortunately, correctness of std_unique_ptr and similar depends on their implementation in libstdc++. let's support unique ptr on newer systems while maintaining backward compatibility. ./test.py --mode=release scylla-gdb now passes for me, also verified `scylla compaction-tasks` produces correct info. Fixes #9677. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211202173534.359672-1-raphaelsc@scylladb.com>	2021-12-03 06:33:54 +02:00
Avi Kivity	3b82ef854d	Merge "Some compaction manager cleanups" from Raphael " couple of preparatory changes for coroutinization of manager " * 'some_compaction_manager_cleanups_v5' of github.com:raphaelsc/scylla: compaction_manager: move check_for_cleanup into perform_cleanup() compaction_manager: replace get_total_size by one liner compaction_manager: make consistent usage of type and name table compaction_manager: simplify rewrite_sstables() compaction_manager: restore indentation	2021-12-02 19:53:13 +02:00
Pavel Emelyanov	5cfeac0c90	paxos: Drop forward declarations of seastar pointers They will break compilation after next seastar update, but the good news is that scylla compiles even without them. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211202173643.1070-1-xemul@scylladb.com>	2021-12-02 19:49:03 +02:00
Konstantin Osipov	bdb924cdac	cql3: co-routinize create_table_statement::announce_migration() Message-Id: <20211202150531.1277448-4-gleb@scylladb.com>	2021-12-02 19:43:30 +02:00
Pavel Emelyanov	e4f35e2139	migration_manager: Eliminate storage service from passive announcing Currently storage service acts as a glue between database schema value and the migration manager "passive_announce" call. This interposing is not required, migration manager can do all the management itself, and the linkage can be done in main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-02 19:43:30 +02:00
Pavel Emelyanov	a751a1117a	migration_manager: Coroutinize drain() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-02 19:43:30 +02:00
Pavel Emelyanov	eb8e30f696	migration_manager: Rename stop to drain then bring it back Because today's migration_manager::stop is called drain-time. Keep the .stop for next patch, but since it's called when the whole migration_manager stops, guard it against re-entrances. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-02 19:43:30 +02:00
Pavel Emelyanov	798f4b0e3f	migration_manager: Sanitize (maybe_)schedule_schema_pull Both calls are now private. Also the non-maybe one can become void and handle pull exceptions by itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-02 19:43:30 +02:00
Pavel Emelyanov	421679e428	migration_manager: Schedule schema pulls upon gossip events Move the calls from respective storage service notification callbacks. One non-move change is that token metadata that was available on the storage service should be taken from storage proxy, but this change is aligned with future changes -- migration manager depends on proxy and will get a local proxy reference some day. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-02 19:43:30 +02:00
Pavel Emelyanov	d4d0bd147e	migration_manager: Subscribe on gossiper events This is to start schema pulls upon on_join, on_alive and on_change ones in the next patch. Migration manager already has gossiper reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-12-02 19:43:30 +02:00
Botond Dénes	259649c779	sstables/index_reader: improved diagnostics on missing index entry Add the summary index and the bound's address to the error message, so it can be correlated with other trace level logging when investigating a problem. Refs: #9446 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211202124955.542293-2-bdenes@scylladb.com>	2021-12-02 19:43:30 +02:00
Botond Dénes	f0b9519999	test/lib/exception_utils: add message_matches() predicate Which checks the message against the given regex. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211202124955.542293-1-bdenes@scylladb.com>	2021-12-02 19:43:30 +02:00
Nadav Har'El	605a2de398	config: change default prometheus_address handling, again In the very recent commit `3c0e703` fixing issue #8757, we changed the default prometheus_address setting in scylla.yaml to "localhost", to match the default listen_address in the same file. We explained in that commit how this helped developers who use an unchanged scylla.yaml, and how it didn't hurt pre-existing users who already had their own scylla.yaml. However, it was quickly noted by Tzach and Amnon that there is one use case that was hurt by that fix: Our existing documentation, such as the installation guide https://www.scylladb.com/download/?platform=centos asks the user to take our initial scylla.yaml, and modify listen_address, rpc_address, seeds, and cluster_name - and that's it. That document - and others - don't tell the user to also override prometheus_address, so users will likely forget to do so - and monitoring will not work for them. So this patch includes a different solution to #8757. What it does is: 1. The setting of prometheus_address in scylla.yaml is commented out. 2. In config.cc, prometheus_address defaults to empty. 3. In main.cc, if prometheus_address is empty (i.e., was not explicitly set by the user), the value of listen_address is used instead. In other words, the idea is that prometheus_address, if not explicitly set by the user, should default to listen_address - which is the address used to listen to the internal Scylla inter-node protocol. Because the documentation already tells the user to set listen_address and to not leave it set to localhost, setting it will also open up prometheus, thereby solving #9701. Meanwhile, developers who leave the default listen_address=localhost will also get prometheus_address=localhost, so the original #8757 is solved as well. Finally, for users who had an old scylla.yaml where prometheus_address was explicitly set to something, this setting will continue to be used. This was also a requirement of issue #8757. 
Fixes #9701. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211129155201.1000893-1-nyh@scylladb.com>	2021-12-02 19:43:30 +02:00
Avi Kivity	7cfd278c32	db: size_estimates_virtual_reader: convert to flat_mutation_reader_v2 As part of changing the codebase to flat_mutation_reader_v2, change size_estimates_virtual_reader. Since the bulk of the work is done by make_flat_mutation_reader_from_mutations() (which is unchanged), only glue code is affected. It is also not performance sensitive, so the extra conversions are unimportant. Test: unit (dev) Closes #9707	2021-12-02 19:43:30 +02:00
Avi Kivity	b920f2500d	db: virtual_table: convert chained_delegating_reader to v2 As part of changing the codebase to flat_mutation_reader_v2, change chained_delegating_reader and its user virtual_table. Since the reader does not process fragments (only forwarding things around), only glue code is affected. It is also not performance sensitive, so the extra conversions are unimportant. Test: unit (dev) Closes #9706	2021-12-02 19:43:30 +02:00
Raphael S. Carvalho	6d750d4f59	compaction_manager: move check_for_cleanup into perform_cleanup() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-02 14:39:31 -03:00
Raphael S. Carvalho	9aed7e9d67	compaction_manager: replace get_total_size by one liner Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-02 14:39:31 -03:00
Raphael S. Carvalho	760cfd93fb	compaction_manager: make consistent usage of type and name table new code in manager adopted name and type table, whereas historical code still uses name and type column family. let's make it consistent for newcomers to not get confused. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-02 14:39:27 -03:00
Gleb Natapov	cab1a1c403	schema raft sm: request schema sync on schema_state_machine snapshot transfer If the schema state machine requests snapshot transfer it means that it missed some schema mutations and needs a full sync. We already have a function that does it: migration_manager::submit_migration_task(), so call it on a snapshot transfer.	2021-12-02 14:55:29 +02:00
Raphael S. Carvalho	e460f72250	compaction_manager: simplify rewrite_sstables() as rewrite_sstables() switched to coroutine, it can be simplified by not using smart pointers to handle lifetime issues. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-02 08:15:41 -03:00
Raphael S. Carvalho	48124fc15a	compaction_manager: restore indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-02 08:15:38 -03:00
Gleb Natapov	431f931d2a	raft service: delegate snapshot transfer to a state machine implementation We want raft service to support all kinds of state machines and most services provided by it may indeed be shared. But snapshot transfer is very state machine specific and thus cannot be put into the raft service. This patch delegates snapshot transfer implementation to a state machine implementation.	2021-12-02 10:54:44 +02:00
Gleb Natapov	fd109ecff1	schema raft sm: pass migration manager to schema_raft_state_machine and merge schema on apply() This patch wires up schema_raft_state_machine::apply() function. For now it assumes that a raft command contains single schema change in the form of a schema mutation array. It may change later (we may add more info to a schema), but for now this will do.	2021-12-02 10:46:32 +02:00
Piotr Sarna	761c691149	alternator,ttl: simplify getting primary key column values Key column values fetched during the TTL scan have a well-defined order - primary columns come first. This assumption is now used to simplify getting the values from rows during scans without having to consult result metadata first. Tests: unit(release) Message-Id: <dcb19b8bab0dd02838693fe06d5a835ea2f378ff.1638357005.git.sarna@scylladb.com>	2021-12-02 10:29:41 +02:00
Piotr Sarna	337906bc1c	alternator: precompute scan range parameters in a function This commit addresses a very simple FIXME left in alternator TTL implementation - it reduces the number of parameters passed to scan_table_ranges() by enclosing the parameters in a separate object. Tests: unit(release) Message-Id: <214afcd9d5c1968182ad98550105f82add216c80.1638354094.git.sarna@scylladb.com>	2021-12-02 10:04:05 +02:00
Raphael S. Carvalho	de165b864c	repair: Enable off-strategy compaction for rebuild Let's enable offstrategy for repair based rebuild, for it to take advantage of offstrategy benefits, one of the most important being compaction not acting aggressively, which is important for both reducing operation time and delivering good latency while the operation is running. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211130115957.13779-1-raphaelsc@scylladb.com>	2021-12-02 09:58:58 +02:00
David Garcia	954d5d5d63	Fix cql docs error Closes #9613	2021-12-02 09:58:58 +02:00
Avi Kivity	ef3edcf848	test: refine test suite names exposed via xunit format The test suite names seen by Jenkins are suboptimal: there is no distinction between modes, and the ".cc" suffix of file names is interpreted as a class name, which is converted to a tree node that must be clicked to expand. Massage the names to remove unnecessary information and add the mode. Closes #9696	2021-12-02 09:58:58 +02:00
Avi Kivity	9edd86362a	test: sstable_test: don't read compressed file size from closed file We read the compressed file size from a file that was already closed, resulting in EBADF on my machine. Not sure why it works for everyone else. Fix by reading the size using the path. Closes #9675	2021-12-01 16:28:46 +02:00
Raphael S. Carvalho	f23e0d7f2d	compaction_manager: Disconsider inactive tasks when filtering sstables After commit `1f5b17f`, overlapping can be introduced in level 1 because procedure that filters out sstables from partial runs is considering inactive tasks, so L1 sstables can be incorrectly filtered out from next compaction attempt. When L0 is merged into L1, overlapping is then introduced in L1 because old L1 sstables weren't considered in L0 -> L1 compaction. From now on, compaction_manager::get_candidates() will only consider active tasks, to make sure actual partial runs are filtered out. Fixes #9693. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211129180459.125847-1-raphaelsc@scylladb.com>	2021-12-01 16:11:44 +02:00
Raphael S. Carvalho	9de7abdc80	compaction: LCS: Fix inefficiency when pushing SSTables to higher levels To satisfy backlog controller, commit `28382cb25c` changed LCS to incrementally push sstables to highest level when there's nothing else to be done. That's overkill because controller will be satisfied with level L being fanout times larger than L-1. No need to push everything to last level as it's even worse than a major, because any file being promoted will overlap with ~10 files in next level. At least, the cost is amortized by multiple iterations, but terrible write amplification is still there. Consequently, this reduces overall efficiency. For example, it might happen that LCS in table A start pushing everything to highest level, when table B needs resources for compaction to reduce its backlog. Increased write amplification in A may prevent other tables from reducing their backlog in a timely manner. It's clear that LCS should stop promoting as soon as level L is 10x larger than L-1, so strategy will still be satisfied while fixing the inefficiency problem. Now layout will look like as follow: SSTables in each level: [0, 2, 15, 121] Previously, it looked like once table stopped being written to: SSTables in each level: [0, 0, 0, 138] It's always good to have everything in a single run, but that comes with a high write amplification cost which we cannot afford in steady state. With this change, the layout will still be good enough to make everybody happy. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211129143606.71257-1-raphaelsc@scylladb.com>	2021-12-01 16:10:25 +02:00
Gleb Natapov	f2ab5f4e60	raft service: insert a new raft instance into the servers' list only after it is started RPC module starts dispatching calls to a server the moment it is in the servers' list, but until raft::server::start() completes the instance is not fully created yet and is not ready to accept anything. Fix the code that initializes a new raft group to insert the new raft instance into the list only after it is started. Message-Id: <YZTFFW9v0NlV7spR@scylladb.com>	2021-12-01 13:11:49 +01:00
Nadav Har'El	94eb5c55c8	Merge 'Loading cache improve eviction use policy' from Vladislav Zolotarov This series introduces a new version of a loading_cache class. The old implementation was susceptible to a "pollution" phenomenon, in which a frequently used entry can get evicted by an intensive burst of "used once" entries pushed into the cache. The new version is going to have a privileged and unprivileged cache sections and there's a new loading_cache template parameter - SectionHitThreshold. The new cache algorithm goes as follows: * We define 2 dynamic cache sections whose total size should not exceed the maximum cache size. * New cache entry is always added to the "unprivileged" section. * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section. * Both sections' entries obey expiration and reload rules in the same way as before this patch. * When cache entries need to be evicted due to a size restriction "unprivileged" section's least recently used entries are evicted first. More details may be found in #8674. In addition, during testing, another issue was found in the authorized_prepared_statements_cache: #9590. There is a patch that fixes it as well. Closes #9708 * github.com:scylladb/scylla: loading_cache: account unprivileged section evictions loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed loading_cache::timestamped::lru_entry: refactoring loading_cache.hh: rearrange the code (no functional change) loading_cache: use std::pmr::polymorphic_allocator	2021-12-01 13:13:53 +02:00
Calle Wilund	3e21fea2b6	test_streamts: test_streams_starting_sequence_number fix 'LastEvaluatedShardId' usage It is not part of raw response, but of the 'StreamDescription' object. Test fails intermittently depending on PK randomization. Closes #9710	2021-12-01 11:05:40 +02:00
Avi Kivity	03755b362a	Merge 'compaction_manager api: stop ongoing compactions' from Benny Halevy This series extends `compaction_manager::stop_ongoing_compaction` so it can be used from the api layer for: - table::disable_auto_compaction - compaction_manager::stop_compaction Fixes #9313 Fixes #9695 Test: unit(dev) Closes #9699 * github.com:scylladb/scylla: compaction_manager: stop_compaction: wait for ongoing compactions to stop compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level compaction_manager: unify stop_ongoing_compactions implementations compaction_manager: stop_ongoing_compactions: add compaction_type option compaction_manager: get_compactions: get a table* parameter table: disable_auto_compaction: stop ongoing compactions compaction_manager: make stop_ongoing_compactions public table: futurize disable_auto_compactions	2021-11-30 19:08:14 +02:00
Avi Kivity	2c613b027d	cql3: expr: adjust indentation in recurse_until() Whitespace changes only.	2021-11-30 17:57:53 +02:00
Avi Kivity	f7f77df143	cql3: expr: deinline count_if() No functional changes. This prepare-path function does not need to be inlined.	2021-11-30 17:52:15 +02:00
Avi Kivity	3a96b74e49	cql3: expr: detemplate count_if() count_if() is a prepare-path function and does not need to be a template. Type-erase it with noncopyable_function.	2021-11-30 17:50:34 +02:00
Avi Kivity	6f9e56e678	cql3: expr: rewrite count_if() in terms of recurse_until() Counting is just recursing without early termination, and counting as a side effect.	2021-11-30 17:49:00 +02:00
Avi Kivity	c01188c414	cql3: expr: deinline recurse_until() As a prepare-path function, it has no business being inline.	2021-11-30 17:41:16 +02:00
Avi Kivity	d0177d4b85	cql3: expr: detemplate find_in_expression find_in_expression() is not in a fast path but is quite large and inlined due to being a template. Detemplate it into a recurse_until() utility function, and keep only the minimal code in a template. The recurse_until is still inline to simplify review, but will be deinlined in the next patch.	2021-11-30 17:37:24 +02:00
Avi Kivity	595cc328b1	Merge 'cql3: Remove term, replace with expression' from Jan Ciołek This PR finally removes the `term` class and replaces it with `expression`. * There was some trouble with `lwt_cache_id` in `expr::function_call`. The current code works the following way: * for each `function_call` inside a `term` that describes a pk restriction, `prepare_context::add_pk_function_call` is called. * `add_pk_function_call` takes a `::shared_ptr<cql3::functions::function_call>`, sets its `cache_id` and pushes this shared pointer onto a vector of all collected function calls * Later when some condition is met we want to clear cache ids of all those collected function calls. To do this we iterate through shared pointers collected in `prepare_context` and clear cache id for each of them. This doesn't work with `expr::function_call` because it isn't kept inside a shared pointer. To solve this I put the `lwt_cache_id` inside a shared pointer and then `prepare_context` collects these shared pointers to cache ids. I also experimented with doing this without any shared pointers, maybe we could just walk through the expression and clear the cache ids ourselves. But the problem is that expressions are copied all the time, we could clear the cache in one place, but forget about a copy. Doing it using shared pointers more closely matches the original behaviour. The experiment is on the [term2-pr3-backup-altcache](https://github.com/cvybhu/scylla/tree/term2-pr3-backup-altcache) branch * `shared_ptr<term>` being `nullptr` could mean: * It represents a cql value `null` * That there is no value, like `std::nullopt` (for example in `attributes.hh`) * That it's a mistake, it shouldn't be possible A good way to distinguish between optional and mistake is to look for `my_term->bind_and_get()`, we then know that it's not an optional value. 
* On the other hand `raw_value` cast to bool means: * `false` - null or unset * `true` - some value, maybe empty I ran a simple benchmark on my laptop to see how performance is affected: ``` build/release/test/perf/perf_simple_query --smp 1 -m 1G --operations-per-shard 1000000 --task-quota-ms 10 ``` * On master (`a21b1fbb2f`) I get: ``` 176506.60 tps ( 77.0 allocs/op, 12.0 tasks/op, 45831 insns/op) median 176506.60 tps ( 77.0 allocs/op, 12.0 tasks/op, 45831 insns/op) median absolute deviation: 0.00 maximum: 176506.60 minimum: 176506.60 ``` * On this branch I get: ``` 172225.30 tps ( 75.1 allocs/op, 12.1 tasks/op, 46106 insns/op) median 172225.30 tps ( 75.1 allocs/op, 12.1 tasks/op, 46106 insns/op) median absolute deviation: 0.00 maximum: 172225.30 minimum: 172225.30 ``` Closes #9481 * github.com:scylladb/scylla: cql3: Remove remaining mentions of term cql3: Remove term cql3: Rename prepare_term to prepare_expression cql3: Make prepare_term return an expression instead of term cql3: expr: Add size check to evaluate_set cql3: expr: Add expr::contains_bind_marker cql3: expr: Rename find_atom to find_binop cql3: expr: Add find_in_expression cql3: Remove term in operations cql3: Remove term in relations cql3: Remove term in multi_column_restrictions cql3: Remove term in term_slice, rename to bounds_slice cql3: expr: Remove term in expression cql3: expr: Add evaluate_IN_list(expression, options) cql3: Remove term in column_condition cql3: Remove term in select_statement cql3: Remove term in update_statement cql3: Use internal cql format in insert_prepared_json_statement cache types: Add map_type_impl::serialize(range of <bytes, bytes>) cql3: Remove term in cql3/attributes cql3: expr: Add constant::view() method cql3: expr: Implement fill_prepare_context(expression) cql3: expr: add expr::visit that takes a mutable expression cql3: expr: Add receiver to expr::bind_variable	2021-11-30 16:39:39 +02:00
Avi Kivity	078f69c133	Merge "raft: (service) implement group 0 as a service" from Kostja " To ensure consistency of schema and topology changes, Scylla needs a linearizable storage for this data available at every member of the database cluster. The series introduces such storage as a service, available to all Scylla subsystems. Using this service, any other internal service such as gossip or migrations (schema) could persist changes to cluster metadata and expect this to be done in a consistent, linearizable way. The series uses the built-in Raft library to implement a dedicated Raft group, running on shard 0, which includes all members of the cluster (group 0), adds hooks to topology change events, such as adding or removing nodes of the cluster, to update group 0 membership, and ensures the group is started when the server boots. The state machine for the group, i.e. the actual storage for cluster-wide information still remains a stub. Extending it to actually persist changes of schema or token ring is subject to a subsequent series. Another Raft related service was implemented earlier: Raft Group Registry. The purpose of the registry is to allow Scylla to have an arbitrary number of groups, each with its own subset of cluster members and a relevant state machine, sharing a common transport. Group 0 is one (the first) group among many. 
" * 'raft-group-0-v12' of github.com:scylladb/scylla-dev: raft: (server) improve tracing raft: (metrics) fix spelling of waiters_awaken raft: make forwarding optional raft: (service) manage Raft configuration during topology changes raft: (service) break a dependency loop raft: (discovery) introduce leader discovery state machine system_keyspace: mark scylla_local table as always-sync commitlog system_keyspace: persistence for Raft Group 0 id and Raft Server Id raft: add a test case for adding entries on follower raft: (server) allow adding entries/modify config on a follower raft: (test) replace virtual with override in derived class raft: (server) fix a typo in exception message raft: (server) implement id() helper raft: (server) remove apply_dummy_entry() raft: (test) fix missing initialization in generator.hh	2021-11-30 16:24:51 +02:00
Raphael S. Carvalho	0d5ac845e1	compaction: Make cleanup withstand better disk pressure scenario It's not uncommon for cleanup to be issued against an entire keyspace, which may be composed of tons of tables. To increase chances of success if low on space, cleanup will now start from smaller tables first, such that bigger tables will have more space available, once they're reached, to satisfy their space requirement. parallel_for_each() is dropped; it wasn't needed given that manager performs per-shard serialization of cleanup jobs. Refs #9504. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211130133712.64517-1-raphaelsc@scylladb.com>	2021-11-30 16:15:24 +02:00
Benny Halevy	957003e73f	compaction_manager: stop_compaction: wait for ongoing compactions to stop Similar to #9313, stop_compaction should also reuse the stop_ongoing_compactions() infrastructure and wait on ongoing compactions of the given type to stop. Fixes #9695 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:09:11 +02:00
Benny Halevy	b9ba181d3c	compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level Normally, "Stopping 0 tasks for 0 ongoing compactions for table ..." is not very interesting so demote its log_level to debug. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:09:11 +02:00
Benny Halevy	03e969dbef	compaction_manager: unify stop_ongoing_compactions implementations Now stop_ongoing_compactions(reason) is equivalent to stop_ongoing_compactions(reason, nullptr, std::nullopt) so share the code of the latter for the former entry point. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:09:07 +02:00
Benny Halevy	94011bdcca	compaction_manager: stop_ongoing_compactions: add compaction_type option And make the table optional as well, so it can be used by stop_compaction() to stop a particular compaction type on all tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:07:47 +02:00
Benny Halevy	a419759835	compaction_manager: get_compactions: get a table* parameter Optionally get running compaction on the provided table. This is required for stop_ongoing_compactions on a given table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:06:34 +02:00
Benny Halevy	4affa801a5	table: disable_auto_compaction: stop ongoing compactions The api call disables new regular compaction jobs from starting but it doesn't wait for ongoing compactions to stop and so it's much less useful. Returning after stopping regular compaction jobs and waiting for them to stop guarantees that no regular compaction jobs are running when nodetool disableautocompaction returns successfully. Fixes #9313 Test: sstable_compaction_test,sstable_directory_test(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:06:34 +02:00
Benny Halevy	3c721eb228	compaction_manager: make stop_ongoing_compactions public So it can be used directly by table code in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 16:06:29 +02:00
Raphael S. Carvalho	3006394312	compaction: Allow incremental compaction with interposer consumer Until commit `c94e6f8567`, interposer consumer wouldn't work with our GC writer, needed for incremental compaction correctness. Now that the technical debt is gone, let's allow incremental compaction with interposer consumer. The only change needed is serialization of replacer, as two consumers cannot step on each other's toes, like when we have concurrent bucket writers with TWCS. sstable_compaction_test.test_bug_6472 passes with this change, which was added when #6472 was fixed by not allowing incremental compaction with interposer consumer. Refs #6472. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211126191000.43292-1-raphaelsc@scylladb.com>	2021-11-30 15:24:17 +02:00
Eliran Sinvani	ddd7248b3b	testlib: close index_reader to avoid race condition In order to avoid the race condition introduced in `9dce1e4` the index_reader should be closed prior to its destruction. This only exposes 4.4 and earlier releases to this specific race. However, it is always a good idea to first close the index reader and only then destroy it, since this is most likely to be assumed by all developers who will change the index reader in the future. Ref #9704 (because only 4.4 and earlier releases are vulnerable). Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #9705	2021-11-30 13:05:24 +01:00
Benny Halevy	b60d697084	table: futurize disable_auto_compactions So it can stop ongoing compactions and wait for them to complete. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-30 08:33:04 +02:00
Vlad Zolotarov	4cb245fe3c	loading_cache: account unprivileged section evictions Add a template parameter that supplies a static callbacks object used to increment a counter of evictions from the unprivileged section. Entries being evicted from the cache while still in the unprivileged section indicate inefficient usage of the cache and should be investigated. This patch instruments the authorized_prepared_statements_cache and prepared_statements_cache objects to provide non-empty callbacks. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2021-11-29 21:45:53 -05:00
Vlad Zolotarov	1a9c6d9fd3	loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy This patch implements a simple variation of LFRU eviction policy: * We define 2 dynamic cache sections whose total size should not exceed the maximum cache size. * New cache entry is always added to the "unprivileged" section. * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section. * Both sections' entries obey expiration and reload rules in the same way as before this patch. * When cache entries need to be evicted due to a size restriction "unprivileged" section's least recently used entries are evicted first. Note: With a 2 sections cache it's not enough for a new entry to have the latest timestamp in order not to be evicted right after insertion: e.g. if all other entries are from the privileged section. And obviously we want to allow new cache entries to be added to a cache. Therefore we can no longer first add a new entry and then shrink the cache. Switching the order of these two operations resolves the problem. Fixes #8674 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2021-11-29 21:45:21 -05:00
Pavel Solodovnikov	e3f922c48b	raft: write raft log in user memory System dirty memory space is limited by 10MB capacity. This means that memtables cannot accumulate more than 5MB before they are flushed to sstables. This can impact performance under load. Move the `system.raft` table to the regular dirty memory space. Fixes: #9692 Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20211129200044.1144961-1-pa.solodovnikov@scylladb.com>	2021-11-29 23:51:24 +01:00
Vlad Zolotarov	66c150769b	authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed Always "touch" a prepared_statements_cache entry when it's accessed via authorized_prepared_statements_cache. If we don't do this it may turn out that the most recently used prepared statement doesn't have the newest last_read timestamp and can get evicted before the not-so-recently-read statement if we need to create space in the prepared statements cache for a new entry. And this is going to trigger an eviction of the corresponding entry from the authorized_prepared_cache breaking the LRU paradigm of these caches. Fixes #9590 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2021-11-29 17:37:25 -05:00
Nadav Har'El	d9c5c4eab6	test/alternator: tests for Select parameter in GSI and LSI We already have tests for the behavior of the "Select" parameter when querying a base table, but this patch adds additional tests for its behavior when querying a GSI or a LSI. There are some differences: Select=ALL_PROJECTED_ATTRIBUTES is not allowed for base tables, but is allowed - and in fact is the default - for GSI and LSI. Also, GSI may not allow ALL_ATTRIBUTES (which is the default for base tables) if only a subset of the attributes were projected. The new tests xfail because the Select and Projection features have not yet been implemented in Alternator. They pass in DynamoDB. After this patch we have (hopefully) complete test coverage of the Select feature, which will be helpful when we start implementing it. Refs #5058 (Select) Refs #5036 (Projection) Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211125100443.746917-1-nyh@scylladb.com>	2021-11-29 20:28:43 +01:00
Nadav Har'El	1c279118f4	test/alternator: more test cases for Select parameter Add to the existing tests for the Select parameter of the Query and Scan operations another check: That when Select is ALL_ATTRIBUTES or COUNT, specifying AttributesToGet or ProjectionExpression is forbidden - because the combination doesn't make sense. The expanded test continues to xfail on Alternator (because the Select parameter isn't yet implemented), and passes on DynamoDB. Strengthening the tests for this feature will be helpful when we decide to implement it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211125074128.741677-1-nyh@scylladb.com>	2021-11-29 20:28:25 +01:00
Vlad Zolotarov	cbabde9622	loading_cache::timestamped::lru_entry: refactoring * Store a reference to a parent (loading_cache) object instead of holding references to separate fields. * Access loading_cache fields via accessors. * Move the LRU "touch" logic to the loading_cache. * Keep only a plain "list entry" logic in the lru_entry class. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2021-11-29 14:24:56 -05:00
Vlad Zolotarov	9125b4545e	loading_cache.hh: rearrange the code (no functional change) Hide internal classes inside the loading_cache class: * Simpler calls - no need for a tricky back-referencing to access loading_cache fields. * Cleaner interface. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2021-11-29 14:24:56 -05:00
Vlad Zolotarov	fd92718f48	loading_cache: use std::pmr::polymorphic_allocator Use std::pmr::polymorphic_allocator instead of std::allocator - the former does not require the allocated object to be defined at template specification time. As a result we won't have to have lru_entry defined before loading_cache, which in turn would allow us to rearrange classes making all classes internal to loading_cache and hence simplifying the interface. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2021-11-29 14:24:56 -05:00
Raphael S. Carvalho	1f3135abb4	sstable_set: use for_each_sstable() in make_crawling_reader() sstable_set_impl::all() may have to copy all sstables from multiple sets, if compound. Let's avoid this overhead by using sstable_set_impl::for_each_sstable(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211127181037.56542-1-raphaelsc@scylladb.com>	2021-11-29 19:59:39 +02:00
Michael Livshin	f0e2ada748	fix mutation_source::operator bool() for v2 factories A mutation source is valid when it has either a v1 or v2 flat mutation reader factory, but `operator bool()` only checks for the former. Fixes #9697 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9698	2021-11-29 19:50:37 +02:00
Nadav Har'El	8618346331	config: automate experimental_features_t::all() The experimental_features_t has an all() method, supposedly returning all values of the enum - but it's easy to forget to update it when adding a new experimental feature - and it's currently out-of-sync (it's missing the ALTERNATOR_TTL option). We already have another method, map(), where a new experimental feature must be listed otherwise it can't be used, so let's just take all()'s values from map(), automatically, instead of forcing developers to keep both lists up-to-date. Note that using the all() function to enable all experimental features is not recommended - the best practice is to enable specific experimental features, not all of them. Nevertheless, this all() function is still used in one place - in the cql_repl tool - which uses it to enable all experimental features. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211108135601.78460-1-nyh@scylladb.com>	2021-11-29 18:44:23 +02:00
Tomasz Grabiec	3226c5bf9d	Merge 'sstables: mx: enable position fast-forwarding in reverse mode' from Kamil Braun Most of the machinery was already implemented since it was used when jumping between clustering ranges of a query slice. We need only perform one additional thing when performing an index skip during fast-forwarding: reset the stored range tombstone in the consumer (which may only be stored in fast-forwarding mode, so it didn't matter that it wasn't reset earlier). Comments were added to explain the details. As a preparation for the change, we extend the sstable reversing reader random schema test with a fast-forwarding test and include some minor fixes. Fixes #9427. Closes #9484 * github.com:scylladb/scylla: query-request: add comment about clustering ranges with non-full prefix key bounds sstables: mx: enable position fast-forwarding in reverse mode test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call test: mutation_source_test: extract `forwardable_reader_to_mutation` function test: random_schema: fix clustering column printing in `random_schema::cql`	2021-11-29 16:01:53 +01:00
Raphael S. Carvalho	80a1ebf0f3	compaction_manager: Fix race when selecting sstables for rewrite operations Rewrite operations are scrub, cleanup and upgrade. Race can happen because 'selection of sstables' and 'mark sstables as compacting' are decoupled. So any deferring point in between can lead to a parallel compaction picking the same files. After commit `2cf0c4bbf`, files are marked as compacting before rewrite starts, but it didn't take into account the commit `c84217ad` which moved retrieval of candidates to a deferring thread, before rewrite_sstables() is even called. Scrub isn't affected by this because it uses a coarse grained approach where whole operation is run with compaction disabled, which isn't good because regular compaction cannot run until its completion. From now on, selection of files and marking them as compacting will be serialized by running them with compaction disabled. Now cleanup will also retrieve sstables with compaction disabled, meaning it will no longer leave uncleaned files behind, which is important to avoid data resurrection if node regains ownership of data in uncleaned files. Fixes #8168. Refs #8155. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com>	2021-11-29 16:27:29 +02:00
Avi Kivity	bcadd8229b	Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj " When memtable contains both mutations and tombstones that delete them, the output flushed to sstables contains both the mutations and the tombstones. Inserting a compacting reader results in writing smaller sstables and saves compaction work later. There are mixed performance implications of this change: - If no rows are removed, there is a ~12% penalty on writing. Read times are not affected. A heuristic is implemented to avoid this problem - compaction is executed only if there are tombstones. - Read and write performance linearly improves with percentage of rows removed. At ~15% of rows removed, writes become faster than without compaction. In the tables below in columns 4 and 7, values below 100% denote improvement and values over 100% denote regression. The tests were performed on a table with 5 columns and the exact percentages will vary across different table schemas.
1. percentage removed 2. write duration/row no compaction 3. write duration/row with compaction 4. write performance new/old 5. read duration/row no compaction 6. read duration/row with compaction 7. read performance new/old
1    2         3         4        5         6         7
5    8.91E-07  9.64E-07  108.25%  6.05E-07  5.76E-07  95.23%
10   9.28E-07  9.94E-07  107.15%  6.14E-07  5.56E-07  90.55%
15   9.27E-07  9.21E-07  99.43%   6.24E-07  5.27E-07  84.39%
20   9.28E-07  9.03E-07  97.31%   6.19E-07  4.83E-07  78.03%
25   9.49E-07  8.58E-07  90.40%   6.40E-07  4.59E-07  71.76%
30   9.68E-07  8.28E-07  85.61%   6.35E-07  4.20E-07  66.07%
35   9.81E-07  8.07E-07  82.26%   6.38E-07  3.88E-07  60.85%
40   9.97E-07  7.81E-07  78.35%   6.43E-07  3.59E-07  55.91%
45   1.01E-06  7.59E-07  75.28%   6.45E-07  3.34E-07  51.75%
50   1.02E-06  7.30E-07  71.52%   6.55E-07  3.00E-07  45.78%
55   1.06E-06  7.08E-07  66.97%   6.65E-07  2.70E-07  40.56%
60   1.04E-06  6.87E-07  66.20%   6.62E-07  2.40E-07  36.22%
65   1.05E-06  6.56E-07  62.49%   6.60E-07  2.12E-07  32.04%
70   1.06E-06  6.34E-07  59.58%   6.66E-07  1.80E-07  27.07%
75   1.07E-06  6.09E-07  56.90%   6.69E-07  1.50E-07  22.38%
80   1.09E-06  5.84E-07  53.58%   6.80E-07  1.20E-07  17.62%
85   1.10E-06  5.56E-07  50.49%   6.83E-07  9.00E-08  13.18%
90   1.11E-06  5.33E-07  47.92%   6.90E-07  5.97E-08  8.66%
95   1.12E-06  5.07E-07  45.10%   6.93E-07  3.04E-08  4.39%
100  1.14E-06  4.87E-07  42.77%   6.97E-07  6.56E-12  0.00%
1. percentage removed 2. write instructions retired/row no compaction 3. write instructions retired/row with compaction 4. write performance new/old 5. read instructions retired/row no compaction 6. read instructions retired/row with compaction 7. read performance new/old
1    2      3      4        5     6     7
5    10276  11188  108.88%  7735  7297  94.34%
10   10463  10891  104.09%  7797  6913  88.66%
15   10633  10596  99.65%   7852  6529  83.15%
20   10811  10300  95.27%   7910  6145  77.69%
25   10997  9998   90.92%   7976  5755  72.15%
30   11177  9707   86.85%   8033  5376  66.92%
35   11353  9412   82.90%   8092  4992  61.69%
40   11522  9111   79.07%   8143  4604  56.54%
45   11708  8819   75.32%   8208  4224  51.46%
50   11877  8520   71.74%   8259  3836  46.45%
55   12064  8228   68.20%   8325  3456  41.51%
60   12240  7928   64.77%   8382  3069  36.61%
65   12419  7635   61.48%   8440  2688  31.85%
70   12598  7339   58.26%   8499  2304  27.11%
75   12768  7043   55.16%   8549  1920  22.46%
80   12977  6747   51.99%   8616  1536  17.83%
85   13131  6451   49.13%   8673  1152  13.28%
90   13311  6155   46.24%   8731  767   8.78%
95   13487  5858   43.43%   8790  383   4.36%
100  13657  5562   40.73%   8841  0     0.00%
" * 'add-compacting-reader-when-flushing-memtable-v6' of github.com:mikolajsieluzycki/scylla: memtable-sstable: Add compacting reader when flushing memtable. memtable-sstable: Track existence of tombstones in memtable.	2021-11-29 15:15:59 +02:00
Mikołaj Sielużycki	a88f7df195	memtable-sstable: Add compacting reader when flushing memtable. When memtable contains both mutations and tombstones that delete them, the output flushed to sstables contains both the mutations and the tombstones. Inserting a compacting reader results in writing smaller sstables and saves compaction work later. Performance tests of this change have shown a regression in a common case where there are no deletes. A heuristic is employed to skip compaction unless there are tombstones in the memtable to minimise the impact of that issue.	2021-11-29 13:19:42 +01:00
Mikołaj Sielużycki	6dd9f63f3b	memtable-sstable: Track existence of tombstones in memtable. Add flags that record whether the memtable contains tombstones. They can be used as a heuristic to determine if a memtable should be compacted on flush. It's an intermediate step until we can compact during applying mutations on a memtable.	2021-11-29 13:06:12 +01:00
Kamil Braun	b2b242d0ad	query-request: add comment about clustering ranges with non-full prefix key bounds	2021-11-29 11:10:49 +01:00
Kamil Braun	8722e0d23c	sstables: mx: enable position fast-forwarding in reverse mode Most of the machinery was already implemented since it was used when jumping between clustering ranges of a query slice. We need only perform one additional thing when performing an index skip during fast-forwarding: reset the stored range tombstone in the consumer (which may only be stored in fast-forwarding mode, so it didn't matter that it wasn't reset earlier). Comments were added to explain the details.	2021-11-29 11:10:49 +01:00
Kamil Braun	ea6310961c	test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding The test would check whether the forward and reverse readers returned consistent results when created in non-forwarding mode with slicing. Do the same but using fast-forwarding instead of slicing. To do this we require a vector of `position_range`s. We also need a vector of `clustering_range`s for the existing test. We modify the existing `random_ranges` function to return `position_range`s instead of `clustering_range`s since `position_range`s are easier to reason about, especially when we consider non-full clustering key prefixes. A function is introduced to convert a `position_range` to a `clustering_range` for the existing test.	2021-11-29 11:10:46 +01:00
Benny Halevy	cf528d7df9	database: shutdown: don't shutdown keyspaces yet Don't shutdown the keyspaces just yet, since they are needed during shutdown. FIXME: restore when #8995 is fixed and no queries are issued after the database shuts down. Refs #8995 Fixes #9684 Test: unit(dev) - scylla-gdb test fails locally with #9677 DTest: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_adding_info_{1,2}_test(dev) - running now into #8995. dtest fails with unexpected error: "storage_proxy - Exception when communicating with 127.0.62.4, to read from system_distributed.service_levels: seastar::gate_closed_exception (gate closed)" Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211127083348.146649-2-bhalevy@scylladb.com>	2021-11-29 11:59:45 +02:00
Benny Halevy	93367ba55f	effective_replication_map_factory: temporarily unregister outstanding maps when destroyed The next patch will disable stopping the keyspaces in database shutdown due to #9684. This will leave outstanding e_r_m:s when the factory is destroyed. They must be unregistered from the factory so they won't try to submit_background_work() to gently clear their contents. Support that temporarily until shutdown is fixed to ensure there are no outstanding e_r_m:s when the factory is destroyed, at which point this can turn into an internal error. Refs #8995 Refs #9684 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211127083348.146649-1-bhalevy@scylladb.com>	2021-11-29 11:59:44 +02:00
Nadav Har'El	1e2ecd282a	Merge 'Harden compaction manager remove' from Benny Halevy This series hardens compaction_manager::remove by: - add debug logging around task execution and stopping. - access compaction_state as lw_shared_ptr rather than via a raw pointer. - with that, detach it from `_compaction_state` in `compaction_manager::remove` right away, to prevent further use of it while compactions are stopped. - added write_lock in `remove` to make sure the lock is not held by any stray task. Test: unit(dev), sstable_compaction_test(debug) Dtest: alternator_tests.py:AlternatorTest.test_slow_query_logging (debug) Closes #9636 * github.com:scylladb/scylla: compaction_manager: add compaction_state when table is constructed compaction_manager: remove: fixup indentation compaction_manager: remove: detach compaction_state before stopping ongoing compactions compaction_manager: remove: serialize stop_ongoing_compactions and gate.close compaction_manager: task: keep a reference on compaction_state test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed. test: sstable_utils: compact_sstables: deregister compaction also on error path test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregiser_compaction also on error path test: compaction_manager_test: add debug logging to register/deregister compaction test: compaction_manager_test: deregister_compaction: erase by iterator test: compaction_manager_test: move methods out of line compaction_manager: compaction_state: use counter for compaction_disabled compaction_manager: task: delete move and copy constructors compaction_manager: add per-task debug log messages compaction_manager: stop_ongoing_compactions: log number of tasks to stop	2021-11-28 22:12:52 +02:00
Avi Kivity	b23af15432	tests: consolidate boost xunit result files The recent parallelization of boost unit tests caused an increase in xml result files. This is challenging to Jenkins, since it appears to use rpc-over-ssh to read the result files, and as a result it takes more than an hour to read all result files when the Jenkins main node is not on the same continent as the agent. To fix this, merge the result files in test.py and leave one result file per mode. Later we can leave one result file overall (integrating the mode into the testsuite name), but that can wait. Tested on a local Jenkins instance (just reading the result files, not the entire build). Closes #9668	2021-11-28 22:12:52 +02:00
Piotr Sarna	ecd122a1b0	Merge 'alternator: rudimentary implementation of TTL expiration service' from Nadav Har'El In this patch series we add an implementation of an expiration service to Alternator, which periodically scans the data in the table, looking for expired items and deleting them. We also continue to improve the TTL test suite to cover additional corner cases discovered during the development of the code. This implementation is good enough to make all existing tests but one, plus a few new ones, pass, but is still a very partial and inefficient implementation littered with FIXMEs throughout the code. Among other things, this initial implementation doesn't do anything reasonable about pacing of the scan or about multiple tables, it scans entire items instead of only the needed parts, and because each shard "owns" a different subset of the token ranges, if a node goes down, partitions which it "owns" will not get expired. The current tests cannot expose these problems, so we will need to develop additional tests for them. Because this implementation is very partial, the Alternator TTL continues to remain "experimental", cannot be used without explicitly enabling this experimental feature, and must not be used for any important deployment. Refs #5060 but doesn't close the issue (let's not close it until we have a reasonably complete implementation - not this partial one). Closes #9624 * github.com:scylladb/scylla: alternator: fix TTL expiration scanner's handling of floating point test/alternator: add TTL test for more data test/alternator: remove "xfail" tag from passing tests in test_ttl.py test/alternator: make test_ttl.py tests fast on Alternator alternator: initial implmentation of TTL expiration service alternator: add another unwrap_number() variant alternator: add find_tag() function test/alternator: test another corner case of TTL setting test/alternator: test TTL expiration for table with sort key test/alternator: improve basic test for TTL expiration test/alternator: extract is_aws() function	2021-11-28 22:12:52 +02:00
Avi Kivity	25bd945a2c	Merge "reverse range scans: use the correct schema for result building" from Botond " Reverse queries have to use the reverse schema (query schema) for the read itself but the table schema for the result building, according to the established interface with the coordinator (half-reverse format). Range scans were using the query schema for both, which produced un-parseable reconcilable results for mutation range scans. This series fixes this and adds unit tests to cover this previously uncovered area. " Fixes #9673. * 'reverse-range-scan-test/v1' of https://github.com/denesb/scylla: test/boost/multishard_mutation_query_test: add reverse read test test/boost/multishard_mutation_query_test: add test for combinations of limits, paging and stateful test/boost/multishard_mutation_query_test: generalize read_partitions_with_paged_scan() test/boost/multishard_mutation_query_test: add read_all_partitions_one_by_one() overload with slice multishard_mutation_query: fix reverse scans partition_slice: init all fields in copy ctor partition_slice: operator<<: print the entire partition row limit partition_slice_builder: add with_partition_row_limit()	2021-11-28 14:18:28 +02:00
Avi Kivity	ec775ba292	Merge "Remove more gms::get(_local)?_gossiper() calls" from Pavel E " This set covers simple but diverse cases: - cache hitrate calculator - repair - system keyspace (virtual table) - dht code - transport event notifier All the places just require straightforward argument passing. And a preparation in transport -- event notifier needs a backref to the owning server. Remaining after this set is the snitch<->gossiper interaction and the cache hitrate app state update from table code. tests: unit(dev) " * 'br-unglobal-gossiper-cont' of https://github.com/xemul/scylla: transport: Use server gossiper in event notifier transport: Keep backreference from event_notifier transport: Keep gossiper on server dht: Pass gossiper to range_streamer::add_ranges dht: Pass gossiper argument to bootstrap system_keyspace: Keep gossiper on cluster_status_table code: Carry gossiper down to virtual tables creation repair: Use local gossiper reference cache_hitrate_calculator: Keep reference on gossiper	2021-11-28 14:18:28 +02:00
Tomasz Grabiec	0df14a48cf	Merge "gms: features should keep 'enabled' state" from Pavel Solodovnikov This patchset implements part of the solution to the problem described in https://github.com/scylladb/scylla/issues/4458. Introduce a new key `enabled_features` in the `system.scylla_local` table, update it when each gms feature is enabled, then read the features from the table on node startup, validate them, and re-enable them early. The solution provides a way to prevent prohibited node downgrades, that is: when a node does not understand some features that were enabled previously, it means it's doing a prohibited downgrade procedure. Also, enabling features early shortens the time frame during which a feature is not enabled on a node, which can also affect cluster liveness (until a node contacts others to discover features state in the cluster and re-enable them again). Features should be enabled before commitlog starts replaying since some features affect storage (for example, when determining used sstable format). * manmanson/persist_enabled_features_v8: gms: feature_service: re-enable features on node startup gms: gossiper: maybe_enable_features() should enable features in seastar::async context gms: feature_service: expose registered features map gms: feature_service: persist enabled features gms: move `to_feature_set()` function from gossiper to feature_service	2021-11-28 14:18:28 +02:00
Pavel Solodovnikov	1365e2f13e	gms: feature_service: re-enable features on node startup Re-enable previously persisted enabled features on node startup. The features list to be enabled is read from `system.local#enabled_features`. In case an unknown feature is encountered, the node fails to boot with an exception, because that means the node is doing a prohibited downgrade procedure. Features should be enabled before commitlog starts replaying since some features affect storage (for example, when determining used sstable format). This patch implements a part of solution proposed by Tomek in https://github.com/scylladb/scylla/issues/4458. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:24 +02:00
Pavel Solodovnikov	777985b64d	gms: gossiper: maybe_enable_features() should enable features in seastar::async context Since `gms::feature::enable()` requires `seastar::async` context to be present. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	5b5fbb4b33	gms: feature_service: expose registered features map This will be used for re-enabling previously enabled cluster features, which will be introduced in later patches. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	a2f5ad432f	gms: feature_service: persist enabled features Save each feature enabled through the feature_service instance in the `system.scylla_local` under the 'enabled_features' key. The features would be persisted only if the underlying query context used by `db::system_keyspace` is initialized. Since `system.scylla_local` table is essentially a string->string map, use an ad-hoc method for serializing enabled features set: the same as used in gossiper for translating supported features set via gossip. The entry should be saved before we enable the feature so that crash-after-enable is safe. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	e891f874df	gms: move `to_feature_set()` function from gossiper to feature_service This utility will also be used for de-serialization of persisted enabled features, which will be introduced in a later patch. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Nadav Har'El	f1997be989	alternator: fix TTL expiration scanner's handling of floating point The expiration-time attribute used by Alternator's TTL feature has a numeric type, meaning that it may be a floating point number - not just an integer, and implemented as big_decimal which has a separate integer mantissa and exponent. Our code which checked expiration incorrectly looked only at the mantissa - resulting in incorrect handling of expiration times which have a fractional part - 123.4 was treated as 1234 instead of 123. This patch fixes the big_decimal handling in the expiration checking, and also extends the test test_ttl.py::test_ttl_expiration to check a non-integer floating point value as well as one with an exponent. The new tests pass on DynamoDB, and failed on Alternator before this patch - and pass with it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	84e0004ff6	test/alternator: add TTL test for more data The existing TTL tests use only tiny tables, so don't exercise the expiration-time scanner's use of paging. So in this patch we add another test with a much larger table (with 40,000 items). To verify that this test indeed checks paging, I stopped the scanner's iteration after one page, and saw that this test starts failing (but the smaller tests all pass). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	baea76c33b	test/alternator: remove "xfail" tag from passing tests in test_ttl.py Most tests in test_ttl.py now pass, so remove their "xfail" tag. The only remaining failing test is test_ttl_expiration_streams - which cannot yet pass because the expiration event is not yet marked. Note that the fact that almost all tests for Alternator's TTL feature now pass does not mean the feature is complete. The current implementation is very partial and inefficient, and only works reasonably in tests on a single node. The current tests cannot expose these problems, so we will need to develop additional tests for them. The tests will of course remain useful to see that as the implementation continues to improve, none of the tests that already work will break. The Alternator TTL continues to remain "experimental", cannot be used without explicitly enabling this experimental feature, and must not be used for any important deployment. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	0b97da5f46	test/alternator: make test_ttl.py tests fast on Alternator The tests for the TTL feature in test/alternator/test_ttl.py take a huge amount of time on DynamoDB - 10 to 30 minutes (!) - because DynamoDB delays expiration of items a long time after their intended expiration times. We intend Scylla's implementation to have a configurable delay for the expiration scanner, which we will be able to configure to very short delays for tests, so these tests can be made much faster on Scylla. So in this patch we change all of the tests to finish much more quickly on Scylla. Many of the tests still fail, because the TTL feature is not implemented yet. Although after this change all the tests in test_ttl.py complete in a reasonable amount of time (around 3 seconds each), we still mark them as "veryslow" and the "--runveryslow" flag is needed to run them. We should consider changing this in the future, so that these tests will run as part of our default test suite. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	13a3aca460	alternator: initial implmentation of TTL expiration service In this patch we add an incomplete implementation of an expiration service to Alternator, which periodically scans the data in the table, looking for expired items and deleting them. This implementation involves a new "expiration service" which runs a background scan in each shard. Each shard "owns" a subset of the token ranges - the intersection of the node's primary ranges with this shard's token ranges - and scans those ranges over and over, deleting any items which are found expired. This implementation is good enough to make all existing tests but one pass, but is still a partial and inefficient implementation littered with FIXMEs throughout the code. Among other things, this implementation doesn't do anything reasonable about pacing of the scan or about multiple tables, it scans entire items instead of only the needed parts, and if a node goes down, the part of the token range which it "owns" will not be scanned for expiration (we need living nodes to take over the background expiration work for dead nodes). The current tests cannot expose these problems, so we will need to develop additional tests for them. Because this implementation is very partial, the Alternator TTL continues to remain "experimental", cannot be used without explicitly enabling this experimental feature, and must not be used for any important deployment. The new TTL expiration service will only run (at the moment) in the background if the Alternator TTL experimental feature is enabled and if Alternator is enabled as well. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	f7e984110d	alternator: add another unwrap_number() variant We have an unwrap_number() function which in case of data errors (such as the value not being a number) throws an exception with a given string used in the message. In this patch we add a variant of unwrap_number() - try_unwrap_number() - which doesn't take a message, and doesn't throw exceptions - instead it returns an empty std::optional if the given value is not a number. This function is useful in places where we need to know whether we got a number or not, and both outcomes are fine rather than errors. We'll use it in a following patch to parse expiration times for the TTL feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	be969ff995	alternator: add find_tag() function find_tag() returns the value of a specific tag on a table, or nothing if it doesn't exist. Unlike the existing get_tags_of_table() above, if the table is missing the tags extension (e.g., is not an Alternator table) it's not an error - we return nothing, as in the case that tags exist but not this tag. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:36 +02:00
Nadav Har'El	88f175d0a8	test/alternator: test another corner case of TTL setting Although it isn't terribly useful, an Alternator user can enable TTL with an expiration-time attribute set to a key attribute. Because expiration times should be numeric - not other types like strings - DynamoDB could warn the user when a chosen key attribute has a non-numeric type (since key attributes do have fixed types!). But DynamoDB doesn't warn about this - it simply expires nothing. This test verifies that it indeed does this. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:36 +02:00
Nadav Har'El	a982d161ad	test/alternator: test TTL expiration for table with sort key The basic test for TTL expiration, test_ttl.py::test_ttl_expiration, uses a table with only a partition key. Most of the item expiration logic is exactly the same for tables that also have a sort key, but the step of deleting the item is different, so let's add a test that verifies that also in this case, the expired item is properly deleted. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:36 +02:00
Nadav Har'El	69b4f53aa9	test/alternator: improve basic test for TTL expiration This patch improves test_ttl.py::test_ttl_expiration in two ways: First, it checks yet another case - that items that have the wrong type for the expiration-time column (e.g., a string) never get expired - even if that string happens to contain a number that looks like an expiration time. Second, instead of the huge 15-minute duration for this test, the test now has a configurable duration; We still need to use a very long duration on AWS, but in Scylla we expect to be able to configure the TTL scan frequency, and can finish this test in just a few seconds! We already have experimental code which makes this test pass in just 3 seconds. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:36 +02:00
Nadav Har'El	fd9a6cf851	test/alternator: extract is_aws() function Extract a boolean function is_aws() out of the "scylla_only" fixture, so it can be used in tests for other purposes. For example, in the next patch the TTL tests will use it to pick different timeouts on AWS (where TTL expiration has huge many-minute delays) and on Scylla (which can be configured to have very short delays). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:36 +02:00
Konstantin Osipov	eea82f1262	raft: (server) improve tracing	2021-11-25 12:35:43 +03:00
Konstantin Osipov	0d830d4c11	raft: (metrics) fix spelling of waiters_awaken The usage of awake and awaken is quite messy, but awoken is more common for passive voice, so use waiters_awoken.	2021-11-25 12:35:43 +03:00
Konstantin Osipov	6d28927550	raft: make forwarding optional In absence of abort_source or timeouts in Raft API, automatic bouncing can create too much noise during testing, especially during network failures. Add an option to disable follower bouncing feature, since randomized_nemesis_test has its own bouncing which handles timeouts correctly. Optionally disable forwarding in basic_generator_test.	2021-11-25 12:35:43 +03:00
Konstantin Osipov	c22f945f11	raft: (service) manage Raft configuration during topology changes Operations of adding or removing a node to Raft configuration are made idempotent: they do nothing if already done, and they are safe to resume after a failure. However, since topology changes are not transactional, if a bootstrap or removal procedure fails midway, Raft group 0 configuration may go out of sync with topology state as seen by gossip. In future we must change gossip to avoid making any persistent changes to the cluster: all changes to persistent topology state will be done exclusively through Raft Group 0. Specifically, instead of persisting the tokens by advertising them through gossip, the bootstrap will commit a change to a system table using Raft group 0. nodetool will switch from looking at gossip-managed tables to consulting with Raft Group 0 configuration or Raft-managed tables. Once this transformation is done, naturally, adding a node to Raft configuration (perhaps as a non-voting member at first) will become the first persistent change to ring state applied when a node joins; removing a node from the Raft Group 0 configuration will become the last action when removing a node. Until this is done, do our best to avoid a cluster state in which a removed node or a node whose addition failed is stuck in Raft configuration, but the node is no longer present in gossip-managed system tables. In other words, keep gossip the primary source of truth. For this purpose, carefully choose the timing when we join and leave Raft group 0: Join the Raft group 0 only after we've advertised our tokens, so the cluster is aware of this node, it's visible in nodetool status, but before node state jumps to "normal", i.e. before it accepts queries. Since the operation is idempotent, invoke it on each restart. Remove the node from Group 0 before its tokens are removed from gossip-managed system tables. This guarantees that if removal from Raft group 0 fails for whatever reason, the node stays in the ring, so nodetool removenode and friends are re-tried. Add tracing.	2021-11-25 12:35:42 +03:00
Konstantin Osipov	96e2594207	raft: (service) break a dependency loop Break a dependency loop raft_rpc <-> raft_group_registry via raft_address_map. Pass raft_address_map to raft_rpc and raft_gossip_failure_detector explicitly, not entire raft_group_registry. Extract server_for_group into a helper class. It's going to be used by raft_group0 so make it easier to reference.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	8ee88a9d8a	raft: (discovery) introduce leader discovery state machine Introduce a special state machine used to find a leader of an existing Raft cluster or create a new cluster. This state machine should be used when a new Scylla node has no persisted Raft Group 0 configuration. The algorithm is initialized with a list of seed IP addresses, the IP address of this server, and this server's Raft server id. The IP addresses are used to construct an initial list of peers. Then, the algorithm tries to contact each peer (excluding self) from its peer list and share the peer list with this peer, as well as get the peer's peer list. If this peer is already part of some Raft cluster, this information is also shared. On a response from a peer, the current peer's peer list is updated. The algorithm stops when all peers have exchanged peer information or one of the peers responds with id of a Raft group and Raft server address of the group leader. (If any of the peers fails to respond, the algorithm re-tries ad infinitum with a timeout). More formally, the algorithm stops when one of the following is true: - it finds an instance with initialized Raft Group 0, with a leader - all the peers have been contacted, and this server's Raft server id is the smallest among all contacted peers.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	30e3227e0b	system_keyspace: mark scylla_local table as always-sync commitlog It is infrequently updated (typically once at start) but stores critical state for this instance survival (Raft Group 0 id, Raft server id, sstables format), so always write it to commit log in sync mode.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	fd295850fe	system_keyspace: persistence for Raft Group 0 id and Raft Server Id Implement system_keyspace helpers to persist Raft Group 0 id and Raft Server id. Do not use coroutines in a template function to work around https://bugs.llvm.org/show_bug.cgi?id=50345	2021-11-25 11:50:38 +03:00
Konstantin Osipov	65e549946f	raft: add a test case for adding entries on follower	2021-11-25 11:50:38 +03:00
Konstantin Osipov	e3751068fe	raft: (server) allow adding entries/modify config on a follower Implement an RPC to forward add_entry calls from the follower to leader. Bounce & retry in case of not_a_leader. Do not retry in case of uncertainty - this can lead to adding duplicate entries. The feature is added to core Raft since it's needed by all current clients - both topology and schema changes. When forwarding an entry to a remote leader we may get back a term/index pair that conflicts (has the same index, but is with a higher term) with a local entry we're still waiting on. This can happen, e.g. because there was a leader change and the log was truncated, but we still haven't got the append_entries RPC from the new leader, still haven't truncated the log locally, still haven't aborted all the local waits for truncated entries. Only remove the offending entry from the wait list and abort it. There may be entries labeled with an older term to the right (with higher commit index) of the conflicting entry. However, finding them, would require a linear scan. If we allow it, we may end up doing this linear scan for every conflicting entry during the transition period, which brings us to N^2 complexity of this step. At the same time, as soon as append_entries that commits a higher-term entry with the same index reaches the follower, the waits for the respective truncated entry will be aborted anyway (see notify_waiters() which sets dropped_entry exception), so the scan is unnecessary. Similarly to being able to add entries, allow to modify Raft group configuration on a follower. The implementation works the same way as adding entries - forwards the command to the leader. Now that add_entry() or modify_config never throws not_a_leader, it's more likely to throw timed_out_error, e.g. in case the network is partitioned. Previously it was only possible due to a semaphore wait timeout, and this scenario was not tested. Handle timed_out_error on RPC level to let the existing tests (specifically the randomized nemesis test) pass.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	ae5dc8e980	raft: (test) replace virtual with override in derived class Clang 12 complains if use of override is inconsistent, so stick to it everywhere.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	8f303844df	raft: (server) fix a typo in exception message	2021-11-25 11:50:38 +03:00
Konstantin Osipov	9cde1cdf71	raft: (server) implement id() helper There is no easy way to get server id otherwise.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	b9faf41513	raft: (server) remove apply_dummy_entry() It's currently unused, and going forward we'd like to make it work on the follower, which requires a new implementation.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	2763fdd3b7	raft: (test) fix missing initialization in generator.hh A missing initialization in poll_timeout of class interpreter could manifest itself as a sporadically failing randomized_nemesis_test. The test would prematurely run out of allowed limit of virtual clock ticks.	2021-11-25 11:50:38 +03:00
Pavel Emelyanov	c04ddc5aa9	transport: Use server gossiper in event notifier The notifier is automatic friend of server and can access its private fields without additional wrappers/decorations. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:56:05 +03:00
Pavel Emelyanov	2cb18c2404	transport: Keep backreference from event_notifier The event_notifier is private server subclass that's created once per server to handle events from storage_service. The notifier needs gossiper that already sits on the server, and to get it the simplest way is to equip notifier with the server backreference. Since these two objects are in strict 1:1 relation this reference is safe. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:55:41 +03:00
Pavel Emelyanov	43951318c8	transport: Keep gossiper on server The gossiper is needed by the transport::event_notifier. There's already gossiper reference on the transport controller, but it's a local reference, because controller doesn't need more. This patch upgrades controller reference to sharded<> and propagates it further up to the server. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:54:45 +03:00
Pavel Emelyanov	831f18e392	dht: Pass gossiper to range_streamer::add_ranges A continuation of the previous patch. The range_streamer needs gossiper too, and is called from boot_strapper and storage_service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:54:16 +03:00
Pavel Emelyanov	6a2f6068cb	dht: Pass gossiper argument to bootstrap The boot_strapper::bootstrap needs gossiper and is called only from the storage_service code that has it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:53:56 +03:00
Pavel Emelyanov	aaf268ae58	system_keyspace: Keep gossiper on cluster_status_table This table gets endpoint states map from global gossiper. Now there's a local reference nearby. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:53:18 +03:00
Pavel Emelyanov	ef1960d034	code: Carry gossiper down to virtual tables creation One of the tables needs gossiper and uses global one. This patch prepares the fix by patching the main -> register_virtual_tables stack with the gossiper reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:52:55 +03:00
Pavel Emelyanov	1168c4154b	repair: Use local gossiper reference There are two places in repair that call for global gossiper instance. However, the repair_service already has sharded gossiper on board, and it can use it directly in the first place. The second place is called from inside repair_info method. This place is fixed by keeping the gossiper reference on the info, just like it's done for other services that info needs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:52:37 +03:00
Pavel Emelyanov	770d34796b	cache_hitrate_calculator: Keep reference on gossiper The calculator needs to update its app-state on gossiper. Keeping a reference is safe -- gossiper starts early, the calculator -- at the very very end, stop in reverse. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-25 10:52:27 +03:00
Nadav Har'El	2bdc31f8a3	test/alternator: two more tests for unimplemented Select=COUNT This patch adds two more tests for the unimplemented Select=COUNT feature (which asks to only count queried items and not return the actual items). Because this feature has not yet been implemented in Alternator (Refs #5058), the new tests xfail. They pass on DynamoDB. The two tests added here are for the interaction of the Select=COUNT feature with filters - in one of the two supported syntaxes (QueryFilter and FilterExpression). We want to verify that even though the user doesn't need the content of the items (since only the counts were requested), they are still retrieved from disk as needed for doing proper filtering - but not returned. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211124225429.739744-1-nyh@scylladb.com>	2021-11-25 08:47:14 +01:00
Mikołaj Sielużycki	44f4ea38c5	test: Future-proof reader conversions tests. Query time must be fetched after populate. If compaction is executed during populate it may be executed with timestamp later than query_time. This would cause the test expected compaction and compaction during populate to be executed at different time points producing different results. The result would be sporadic test failures depending on relative timing of those operations. If no other mutations happen after populate, and query_time is later than the compaction time during population, we're guaranteed to have the same results. Message-Id: <20211123134808.105068-1-mikolaj.sieluzycki@scylladb.com>	2021-11-24 21:01:57 +01:00
Michał Chojnowski	08f7b81b36	dist: scylla_io_setup: run iotune for supported but not preconfigured AWS instance types Currently, for AWS instances in `is_supported_instance_class()` other than i3* and *gd (for example: m5d), scylla_io_setup neither provides preconfigured values for io_properties.yaml nor runs iotune nor fails. This silently results in a broken io_properties.yaml, like so: disks: - mountpoint: /var/lib/scylla Fix that. Closes #9660	2021-11-24 18:28:13 +02:00
Avi Kivity	f3faa48f8b	Merge "Unglobal stream manager" from Pavel E " There's a nest of globals in streaming/ code. The stream_manager itself and a whole lot of its dependencies (database, sys_dist_ks, view_update_generator and messaging). Also streaming code gets gossiper instance via global call. The fix is, as usual, in keeping the sharded<stream_manager> in the main() code and pushing its reference everywhere. Somewhere in the middle the global pointers go away being replaced with respective references pushed to the stream_manager ctor. This reveals an implicit dependency: storage_service -> stream_manager tests: unit(dev), dtest.cdc_tests.cluster_reduction_with_cdc(dev) v1: dtest.bootstrap_test.add_node(dev) v1: dtest.bootstrap_test.simple_bootstrap(dev) " * 'br-unglobal-stream-manager-3-rebase' of https://github.com/xemul/scylla: (26 commits) streaming, main: Remove global stream_manager stream_transfer_task: Get manager from session (result-future) stream_transfer_task: Keep Updater fn onboard stream_transfer_task: Remove unused database reference stream_session: Use manager reference from result-future stream_session: Capture container() in message handler stream_session: Keep stream_manager reference stream_session: Remove unused default constructor stream_result_future: Use local manager reference stream_result_future: Keep stream_manager reference stream_plan: Keep stream_manager onboard dht: Keep stream_manager on board streaming, api: Use captured manager in handlers streaming, api: Standardize the API start/stop storage_service: Sanitize streaming shutdown storage_service: Keep streaming_manager reference stream_manager: Use container() in notification code streaming: Move get_session into stream_manager streaming: Use container.invoke_on in rpc handlers streaming: Fix interaction with gossiper ...	2021-11-24 12:23:18 +02:00
Pavel Emelyanov	4a34226aa6	streaming, main: Remove global stream_manager Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	50e6d334a9	stream_transfer_task: Get manager from session (result-future) When the task starts it needs the stream_manager to get messaging service and database from. There's a session at hands and this session is properly initialized thus it has the result-future. Voila -- we have the manager! Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	95d26bc420	stream_transfer_task: Keep Updater fn onboard The helper function called send_mutation_fragments needs the manager to update stats about stream_transfer_task as it goes on. Carrying the manager over its stack is quite boring, but there's a helper send_info object that lives there. Equip the guy with the updating function and capture the manager by it early to kill one more usage of the global stream_manager call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	9ee208de8d	stream_transfer_task: Remove unused database reference The send_info helper keeps it, but doesn't use. Remove. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	a3b4d4d3cf	stream_session: Use manager reference from result-future When the stream_session initializes it's being equipped with the shared-pointer on the stream_result_future very early. In all the places where stream_session needs the manager this pointer is alive and session can get manager from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	56f5327450	stream_session: Capture container() in message handler The stream_mutation_fragments handler needs to access the manager. Since the handler is registered by the manager itself, it can capture the local manager reference and use container() where appropriate. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	db33607eb2	stream_session: Keep stream_manager reference The manager is needed to get messaging service and database from. Actually, the database can be pushed through arguments in all the places, so effectively session only needs the messaging. However, the stream-tasks need the manager badly and there's no other place to get it from other than the session. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	f2ae080c63	stream_session: Remove unused default constructor Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	307a2583ee	stream_result_future: Use local manager reference The reference is present in all the required places already. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	5b748a72de	stream_result_future: Keep stream_manager reference The stream_result_future needs manager to register on it and to unregister from it. Also the result-future is referenced from stream_session that also needs the manager (see next patches). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	3087422d4d	stream_plan: Keep stream_manager onboard The plan itself doesn't need it, but it creates some lower level objects that do. Next patches will use this reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	c593f8624d	dht: Keep stream_manager on board This is the preparation for the future patching. The stream_plan creation will need the manager reference, so keep one on dht object in advance. These are only created from the storage service bootstrap code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	5166a98ce4	streaming, api: Use captured manager in handlers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	fd920e2420	streaming, api: Standardize the API start/stop Today's idea of API reg/unreg is to carry the target service via lambda captures down to the route handlers and unregister those handlers before the target is about to stop. This patch makes it so for the streaming API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	390a971bd8	storage_service: Sanitize streaming shutdown Use local reference and don't use 'is_stopped' boolean as the whole stop_transport is guarded with its own lock. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:37 +03:00
Pavel Emelyanov	aaa58b7b89	storage_service: Keep streaming_manager reference The manager is drained() on drain/decommission/isolate. Since now it's storage_service who orchestrates all of the above, it needs an explicit reference on the target. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:17:35 +03:00
Pavel Emelyanov	3a9eb6af28	stream_manager: Use container() in notification code Continuation of the previous patch -- some native stream_manager methods can enjoy using container() call. One nit -- the [] access to the map of statistics now runs in const context and cannot create elements, so switch this place into .at() method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:59 +03:00
Pavel Emelyanov	8ab96a8362	streaming: Move get_session into stream_manager This makes the code a bit shorter and helps removing one more call for global stream manager. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:59 +03:00
Pavel Emelyanov	228b4520a6	streaming: Use container.invoke_on in rpc handlers This will help to reduce the usage of global manager instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:59 +03:00
Pavel Emelyanov	c2c676784a	streaming: Fix interaction with gossiper Streaming manager registers itself in gossiper, so it needs an explicit dependency reference. Also it forgets to unregister itself, so do it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:59 +03:00
Pavel Emelyanov	73e10c7aed	streaming: Move start/stop onto common rails In case of streaming this mostly means dropping the global init/uninit calls and replacing them with sharded<stream_manager> instance. It's still global, but it's being fixed atm. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:58 +03:00
Pavel Emelyanov	08818ffe75	streaming: Rename .stop() into .shutdown() The start/stop standard is becoming like sharded<foo> foo; foo.start(); defer([] { foo.stop() }); foo.invoke_on_all(&foo::start); ... defer([] { foo.shutdown() }); wait_for_stop_signal(); /* quit making the above defers self-unroll */ where .shutdown() for a service would mean "do whatever is appropriate to start stopping, the real synchronous .stop() will come some time later". According to that, rename .stop() as it's really the mentioned preparation, not real stopping. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:58 +03:00
Pavel Emelyanov	ba298bd5c6	streaming: Remove global dependency pointers Now they are not needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:58 +03:00
Pavel Emelyanov	6d7eb76fad	streaming: Use get_stream_manager to get dependencies Currently streaming uses global pointers to save and get a dependency. Now all the dependencies live on the manager, this patch changes all the places in streaming/ to get the needed dependencies from it, not from global pointer (next patch will remove those globals). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:58 +03:00
Pavel Emelyanov	e448774588	streaming: Move rpc verbs reg/unreg into manager As a part of streaming start/stop unification. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:58 +03:00
Pavel Emelyanov	165971fb7f	streaming: Initialize stream manager with proper deps The stream manager is going to become central point of control for the streaming subsys. This patch makes its dependencies explicit and prepares the ground for further patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-24 12:15:58 +03:00
Nadav Har'El	e71131091a	cql-pytest: translate Cassandra's tests for user-defined types This is a translation of Cassandra's CQL unit test source file validation/entities/UserTypesTest.java into our cql-pytest framework. This test file includes 26 tests for various features and corners of the user-defined type feature. Two additional tests which were more involved to translate were dropped with a comment explaining why. All 26 tests pass on Cassandra, and all but one pass on Scylla: The test testUDTWithUnsetValues fails on Scylla so marked xfail. It reproduces a previously-unknown Scylla bug: Refs #9671: In some cases, trying to assign an UNSET value into part of a UDT is not detected Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211124074001.708183-1-nyh@scylladb.com>	2021-11-24 10:37:15 +02:00
Botond Dénes	05ae2f88b8	test/boost/multishard_mutation_query_test: add reverse read test Which also tests combinations of limits, paging and statefulness. Fixes: #9328 This patch fixes the above issue by providing the test said issue was asking for to be considered fixed. The bug described therein was already fixed by an earlier patch.	2021-11-23 18:32:32 +02:00
Avi Kivity	965ea4a3fa	Merge "tools/scylla-sstable: add dumpers for all components" from Botond " Except for TOC, Filter, Digest and CRC32, these are trivial to read with any text/binary editor. " * 'scylla-sstable-dump-components' of https://github.com/denesb/scylla: tools/scylla-sstable: add --dump-scylla-metadata tools/scylla-sstable: add --dump-statistics tools/scylla-sstable: add --dump-summary tools/scylla-sstable: add --dump-compression-info tools/scylla-sstable: extract unsupported flag checking into function sstables/sstable: add scylla metadata getter sstables/sstable: add statistics accessor	2021-11-23 16:13:02 +02:00
Michał Sala	27ff3e7de7	storage_proxy: check partition ranges contiguity storage_proxy::query_partition_key_range_concurrent() iterates through vnodes produced by its argument query_ranges_to_vnodes_generator&& ranges_to_vnodes and tries to merge them. This commit introduces checking if subsequent vnodes are contiguous with each other, before merging them. Fixes #9167 Closes #9175	2021-11-23 15:48:55 +02:00
Botond Dénes	9746dbe20d	Merge "Add --cpus option to test.py" from Pavel Emelyanov " When provided all the tests start from under the 'taskset -c $value'. This is _not_ the same as just doing 'taskset -c ... ./test.py ...' because in the latter case test.py will compete with all the tests for the provided cpuset and may not be able to run at desired speed. With this option it's possible to isolate the tests themselves on a cpuset without affecting the test.py performance. One of the examples when test.py speed can be critical is catching flaky tests that reveal their buggy nature only when run in a tight environment. The combination of --cpus, --repeat and --jobs creates nice pressure on the cpu, and keeping the test.py out of the mincer lets it fork and exec (and wait) the tests really fast. tests: unit(dev, with and without --cpus) " * 'br-test-taskset-2' of https://github.com/xemul/scylla: test.py: Add --cpus option test.py: Lazily calculate args.jobs	2021-11-23 15:06:59 +02:00
Botond Dénes	a5b5171a73	test/boost/multishard_mutation_query_test: add test for combinations of limits, paging and stateful	2021-11-23 14:23:35 +02:00
Botond Dénes	25713b1d62	test/boost/multishard_mutation_query_test: generalize read_partitions_with_paged_scan() Extract all logic related to issuing the actual read and building the combined result. This is now done by a ResultBuilder template object, which allows reusing the paging logic for both mutation and data scans. ResultBuilder implementations for which are also provided by this patch. The paging logic is also fixed to work correctly with per-partition-row-limit.	2021-11-23 14:23:35 +02:00
Botond Dénes	810cc8bd1c	test/boost/multishard_mutation_query_test: add read_all_partitions_one_by_one() overload with slice	2021-11-23 14:23:35 +02:00
Botond Dénes	3210dee4a6	multishard_mutation_query: fix reverse scans The read itself has to be done with the reversed schema (query schema) but the result building has to be done with the table schema. For data queries this doesn't matter, but replicate the distinction for consistency (and because this might change).	2021-11-23 14:22:01 +02:00
Botond Dénes	15af80800a	partition_slice: init all fields in copy ctor _partition_row_limit_high_bits was left out for some reason, corrupting the per-partition row limit.	2021-11-23 14:21:50 +02:00
Botond Dénes	c372b9676d	partition_slice: operator<<: print the entire partition row limit Not just the low bits.	2021-11-23 14:21:50 +02:00
Botond Dénes	3881de6353	partition_slice_builder: add with_partition_row_limit()	2021-11-23 14:21:50 +02:00
Pavel Emelyanov	bd24c1eecf	Merge "Deglobalize batchlog_manager" from Benny This series gets rid of the global batchlog_manager instance. It does so by first, allowing to set a global pointer and instantiating stack-local instances in main and cql_test_env. Expose the cql_test_env batchlog_manager to tests so they won't need the global `get_batchlog_manager()` as used in batchlog_manager_test.test_execute_batch. Then we pass a reference to the `sharded<db::batchlog_manager>` to storage_service so it can be used instead of the global one. Derive batchlog_manager from peering_sharded_service so it gets its `container()` rather than relying on the global `get_batchlog_manager()`. And finally, handle a circular dependency between the batchlog_manager, that relies on the query_processor that, in turn, relies on the storage_proxy, and the storage_proxy itself that depends on the batchlog_manager for `mutate_atomically`. Moved `endpoint_filter` to gossiper so `storage_proxy::mutate_atomically` can call it via the `_gossiper` member it already has. The function requires a gossiper object rather than a batchlog_manager object. Also moved `get_batch_log_mutation_for` to storage_proxy so it can be called from `sync_write_to_batchlog` (also from the mutate_atomically path) Test: unit(dev) DTest: batch_test.py:TestBatch.test_batchlog_manager_issue(dev) * git@github.com:bhalevy/scylla.git deglobalize-batchlog_manager-v2 get rid of the global batchlog_manager batchlog_manager: get_batch_log_mutation_for: move to storage_proxy batchlog_manager: endpoint_filter: move to gossiper batchlog_manager: do_batch_log_replay: use lambda coroutine batchlog_manager: derive from peering_sharded_service storage_service: keep a reference to the batchlog_manager test: cql_test_env: expose batchlog_manager main: allow setting the global batchlog_manager	2021-11-23 15:10:50 +03:00
Benny Halevy	1740833324	test: sstable_compaction_test: autocompaction_control_test: use deferred_stop To auto-stop the table and the compaction_manager, making the test case exception-safe. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211122204340.1020932-2-bhalevy@scylladb.com>	2021-11-23 12:10:12 +02:00
Benny Halevy	dfa6a494c2	test: sstable_compaction_test: require smp::count==1 where needed These test cases may crash if running with more shards. This is not required for test.py runs, but rather when running the test manually using the command line. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211122204340.1020932-1-bhalevy@scylladb.com>	2021-11-23 12:10:12 +02:00
Kamil Braun	a33b0649b1	Merge 'Block creation of MV on CDC Log' from Piotr Jastrzębski Add a restriction in create_view_statement to disallow creation of MV for CDC Log table. Also add a CQL test that checks the new restriction works. Test: unit(dev) Fixes #9233 Closes #9663 * 'fix9233' of https://github.com/haaawk/scylla: tests: Add cql test to verify it's impossible to create MV for CDC Log cql3: Make it impossible to create MV on CDC log	2021-11-23 10:51:02 +01:00
Nadav Har'El	3c0e7037be	conf/scylla.yaml: change default Prometheus listen address Developers often run Scylla with the default conf/scylla.yaml provided with the source distribution. The existing default listens for all ports but one (19042, 10000, 9042, 7000) on the localhost IP address (127.0.0.1). But just one port - 9180 (Prometheus metrics) - is listened on 0.0.0.0. This patch changes the default to be 127.0.0.1 for port 9180 as well. Note that this just changes the default scylla.yaml - users can still choose whatever listening address they want by changing scylla.yaml and/or passing command line parameters. The benefits of this patch are: 1. More consistent. 2. Better security for developers (don't open ports on external addresses while testing). 3. Allow test/cql-pytest/run to run in parallel with a default run of Scylla (currently, it fails to run Scylla on a random IP address, because the default run of Scylla already took port 9180 on all IP addresses). The third benefit is what led me to write this patch. Fixes #8757. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210530130307.906051-1-nyh@scylladb.com>	2021-11-23 11:45:35 +02:00
Benny Halevy	ff18c0c14c	messaging_service: remove unused include of db/system_keyspace.hh As a followup to `eba20c7e5d` "messaging_service: init_local_preferred_ip_cache: get preferred ips from caller". Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211123080457.1247970-1-bhalevy@scylladb.com>	2021-11-23 11:12:36 +03:00
Pavel Emelyanov	dcefe98fbb	test.py: Add --cpus option The option accepts taskset-style cpulist and limits the launched tests respectively. When specified, the default number of jobs is adjusted accordingly, if --jobs is given it overrides this "default" as expected. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-23 11:08:41 +03:00
Pavel Emelyanov	0246841c5e	test.py: Lazily calculate args.jobs Next patch will need to know if the --jobs option was specified or the caller is OK with the default. One way to achieve it is to keep 0 as the default and set the default value afterwards. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-23 11:05:56 +03:00
Nadav Har'El	253387ea07	alternator: implement AttributeUpdates DELETE operation with Value In the DynamoDB API, UpdateItem's AttributeUpdates parameter (the older syntax, which was superseded by UpdateExpression) has a DELETE operation that can do two different things: It can delete an attribute, or it can delete elements from a set. Before this patch we only implemented the first feature, and this patch implements the second. Note that unlike the ordinary delete, the second feature - set subtraction - is a read-modify-write operation. This is not only because of Alternator's serialization (as JSON strings, not CRDTs) - but also fundamentally because of the API's guarantees - e.g., the operation is supposed to fail if the attribute's existing value is not a set of the correct type, so it needs to read the old value. The test for this feature begins to pass, so its "xfail" mark is removed. After this, all tests in test/alternator/test_item.py pass :-) Fixes #5864. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211103151206.157184-1-nyh@scylladb.com>	2021-11-23 08:51:06 +01:00
Benny Halevy	0a33762fb1	compaction_manager: add compaction_state when table is constructed With that, it is always expected that _compaction_state[cf] exists when compaction jobs are submitted. Otherwise, throw std::out_of_range exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:40:06 +02:00
Benny Halevy	29dd24ab46	compaction_manager: remove: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:40:06 +02:00
Benny Halevy	46ac139490	compaction_manager: remove: detach compaction_state before stopping ongoing compactions So that the compaction_state won't be found from this point on, while stopping the ongoing compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:40:06 +02:00
Benny Halevy	75a2509b07	compaction_manager: remove: serialize stop_ongoing_compactions and gate.close Now that compaction tasks enter the compaction_state gate there is no point in stopping ongoing compaction in parallel to closing the gate. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:40:06 +02:00
Benny Halevy	3940ffb085	compaction_manager: task: keep a reference on compaction_state And hold its gate to make sure the compaction_state outlives the task and can be used to wait on all tasks and functions using it. With that, don't access _compaction_state[cf] to acquire shared/exclusive locks but rather get to it via task->compaction_state so it can be detached from _compaction_state while task is running, if needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:40:06 +02:00
Benny Halevy	f482d8f377	test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed. It must remove itself from the compaction_manager, that will stop_ongoing_compactions. Without that we're hitting ``` sstable_compaction_test: ./seastar/include/seastar/core/gate.hh:56: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed. ``` when destroying the compaction_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:40:06 +02:00
Benny Halevy	3955829286	test: sstable_utils: compact_sstables: deregister compaction also on error path We need to call deregister_compaction(cdata) also if compact_sstables failed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 09:39:10 +02:00
Benny Halevy	d344765ec6	get rid of the global batchlog_manager Now that it's unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	744275df73	batchlog_manager: get_batch_log_mutation_for: move to storage_proxy And rename to get_batchlog_mutation_for while at it, as it's about the batchlog, not batch_log. This resolves a circular dependency between the batchlog_manager and the storage_proxy that required it in the case. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	55967a8597	batchlog_manager: endpoint_filter: move to gossiper There's nothing in this function that actually requires the batchlog manager instance. It uses a random number engine that's moved along with it to class gossiper. This resolves a circular dependency between the batchlog_manager and storage_proxy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	85d0bbb4fc	batchlog_manager: do_batch_log_replay: use lambda coroutine Simplify the function implementation and error handling by invoking a lambda coroutine on shard 0 that keeps a gate holder and semaphore units on its stack, for RAII-style unwinding. It then may invoke a function on another shard, using the peered service container() to do the replay on the destination shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	691afe1c4d	batchlog_manager: derive from peering_sharded_service So that do_batch_log_replay can get the sharded batchlog_manager as container(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	9cde52c58f	storage_service: keep a reference to the batchlog_manager Rather than accessing the global batchlog_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	c6d82891cc	test: cql_test_env: expose batchlog_manager And use in batchlog_manager_test.test_execute_batch to help deglobalize the batchlog_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	03039e8f8a	main: allow setting the global batchlog_manager As a prerequisite to deglobalizing the batchlog_manager, allow setting a global pointer to it and instantiate the sharded<db::batchlog_manager> on the main/cql_test_env stack. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Benny Halevy	5fb66ecd03	test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregister_compaction also on error path Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:09:40 +02:00
Benny Halevy	8d7909de83	test: compaction_manager_test: add debug logging to register/deregister compaction Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:09:40 +02:00
Benny Halevy	ca97c919eb	test: compaction_manager_test: deregister_compaction: erase by iterator No need to search for the task again in the list. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:09:40 +02:00
Benny Halevy	5d6ea651d7	test: compaction_manager_test: move methods out of line No need for them to be inlined in the sstable_utils.hh. While at it, mark constructor noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:09:40 +02:00
Benny Halevy	e7ab1f8581	compaction_manager: compaction_state: use counter for compaction_disabled We'd like to use compaction_state::gate both for functions running with compaction disabled and for tasks referring to the compaction_state so that stop_ongoing_compactions could wait on all functions referring to the state structure. This is also cleaner with respect to not relying on gate::use_count() when re-submitting regular compaction when compaction is re-enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:08:42 +02:00
Benny Halevy	3268c94e72	compaction_manager: task: delete move and copy constructors We use a lw_shared_ptr<task> everywhere. So prevent moving or copying task objects. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:00:18 +02:00
Benny Halevy	0cc6060552	compaction_manager: add per-task debug log messages Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:00:18 +02:00
Benny Halevy	1d8d472028	compaction_manager: stop_ongoing_compactions: log number of tasks to stop get_compactions().size() may return 0 while there are non-zero tasks to stop. Some tasks may not be marked as `compaction_running` since they are either: - postponed (due to compaction manager throttling of regular compaction) - sleeping before retry. In both cases we still want to stop them so the log message should reflect both the number of ongoing compactions and the actual number of tasks we're stopping. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-22 22:00:18 +02:00
Tomasz Grabiec	1d84bc6c3b	sstables: partition_index_cache: Avoid abort due to benign bad_alloc inside allocating section shared_promise::get_shared_future() is marked noexcept, but can allocate memory. It is invoked by sstable partition index cache inside an allocating section, which means that allocations can throw bad_alloc even though there is memory to reclaim, so under normal conditions. Fix by allocating the shared_promise in a stable memory, in the standard allocator via lw_shared_ptr<>, so that it can be accessed outside allocating section. Fixes #9666 Tests: - build/dev/test/boost/sstable_partition_index_cache_test Message-Id: <20211122165100.1606854-1-tgrabiec@scylladb.com>	2021-11-22 19:07:51 +02:00
Tomasz Grabiec	1e4da2dcce	cql: Fix missing data in indexed queries with base table short reads Indexed queries are using paging over the materialized view table. Results of the view read are then used to issue reads of the base table. If base table reads are short reads, the page is returned to the user and paging state is adjusted accordingly so that when paging is resumed it will query the view starting from the row corresponding to the next row in the base which was not yet returned. However, paging state's "remaining" count was not reset, so if the view read was exhausted the reading will stop even though the base table read was short. Fix by restoring the "remaining" count when adjusting the paging state on short read. Tests: - index_with_paging_test - secondary_index_test Fixes #9198 Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>	2021-11-22 17:42:49 +02:00
Benny Halevy	6b6cf73b48	test: manual: gossip: stop services on exit All sharded services that were started must be stopped before being destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211122081305.789375-3-bhalevy@scylladb.com>	2021-11-22 16:15:43 +02:00
Benny Halevy	d2703eace7	test: remove gossip_test First, it doesn't test the gossiper so it's unclear why have it at all. And it doesn't test anything more than what we test using the cql_test_env either. For testing gossip there is test/manual/gossip. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211122081305.789375-2-bhalevy@scylladb.com>	2021-11-22 16:15:41 +02:00
Tomasz Grabiec	0d080d19fb	Merge "raft: improve handling of non voting members" from Gleb This series contains fixes for non voting members handling for stepdown and stable leader check. * scylla-dev/raft-stepdown-fixes-v2: raft: handle non voting members correctly in stepdown procedure raft: exclude non voting nodes from the stable leader check raft: fix configuration::can_vote() to work correctly with joint config	2021-11-22 12:00:44 +01:00
Benny Halevy	ce9836e2fd	messaging_service: init_local_preferred_ip_cache: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211119143523.3424773-2-bhalevy@scylladb.com>	2021-11-22 13:29:21 +03:00
Benny Halevy	eba20c7e5d	messaging_service: init_local_preferred_ip_cache: get preferred ips from caller To avoid back-calling the system_keyspace from the messaging layer let the system_keyspace get the preferred ips vector and pass it down to the messaging_service. This is part of the effort to deglobalize the system keyspace and query context. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211119143523.3424773-1-bhalevy@scylladb.com>	2021-11-22 13:29:17 +03:00
Gleb Natapov	e56022a8ba	migration_manager: co-routinize announce_column_family_update The patch also removes the usage of map_reduce() because it is no longer needed after `6191fd7701` that drops futures from the view mutation building path. The patch preserves yielding point that map_reduce() provides though by calling to coroutine::maybe_yield() explicitly. Message-Id: <YZoV3GzJsxR9AZfl@scylladb.com>	2021-11-22 10:48:25 +02:00
Benny Halevy	599ed69023	repair_service: do_decommission_removenode_with_repair: maybe yield everywhere_replication_strategy::calculate_natural_endpoints is synchronous and doesn't yield, so add maybe_yield() calls when looping over many token ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211121090339.3955278-1-bhalevy@scylladb.com> Message-Id: <20211121102606.76700-2-bhalevy@scylladb.com>	2021-11-22 10:48:25 +02:00
Benny Halevy	9d2631daaf	token_metadata: calculate_pending_ranges_for_leaving: maybe yield We see long stalls as reported in https://github.com/scylladb/scylla/issues/8030#issuecomment-974783526 everywhere_replication_strategy::calculate_natural_endpoints is synchronous and doesn't yield, so add maybe_yield() calls when looping over many token ranges. Refs #8030 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211121090339.3955278-1-bhalevy@scylladb.com> Message-Id: <20211121102606.76700-1-bhalevy@scylladb.com>	2021-11-22 10:48:25 +02:00
Benny Halevy	df5ccb8884	storage_service: get_changed_ranges_for_leaving: maybe yield We see long stalls as reported in https://github.com/scylladb/scylla/issues/8030#issuecomment-974647167 Even before the change to use erm->get_natural_endpoints, everywhere_replication_strategy::calculate_natural_endpoints is synchronous and doesn't yield, so add maybe_yield() calls when looping over all token ranges. Refs #8030 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211121090339.3955278-1-bhalevy@scylladb.com>	2021-11-21 11:31:56 +02:00
Raphael S. Carvalho	2b2f0eae05	compaction: STCS: kill needless include of database.hh This is part of work for reducing compilation time and removing layer violation in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211120042727.114909-1-raphaelsc@scylladb.com>	2021-11-21 11:28:29 +02:00
Raphael S. Carvalho	8d9704c030	compaction: LCS: kill needless include of database.hh This is part of work for reducing compilation time and removing layer violation in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211120042232.106651-1-raphaelsc@scylladb.com>	2021-11-20 18:28:55 +02:00
Avi Kivity	96e9c3951c	Merge "Finally stop including database.hh in compaction.cc" from Raphael " After this series, compaction will finally stop including database.hh. tests: unit(debug). " * 'stop_including_database_hh_for_compaction' of github.com:raphaelsc/scylla: compaction: stop including database.hh compaction: switch to table_state in get_fully_expired_sstables() compaction: switch to table_state compaction: table_state: Add missing methods required by compaction	2021-11-20 18:28:05 +02:00
Raphael S. Carvalho	06405729ce	compaction: stop including database.hh after switching to table_state, compaction code can finally stop including database.hh Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:06:03 -03:00
Raphael S. Carvalho	69ab5c9dff	compaction: switch to table_state in get_fully_expired_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:06:02 -03:00
Raphael S. Carvalho	d89edad9fb	compaction: switch to table_state Make compaction procedure switch to table_state. Only function in compaction.cc still directly using table is get_fully_expired_sstables(T,...), but subsequently we'll make it switch to table_state and then we can finally stop including database.hh in the compaction code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:06:01 -03:00
Raphael S. Carvalho	12137bca73	compaction: table_state: Add missing methods required by compaction These are the only methods left for compaction to switch to table_state, so compaction can finally stop including database.hh Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:05:59 -03:00
Piotr Jastrzebski	16de68aba5	tests: Add cql test to verify it's impossible to create MV for CDC Log Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-11-19 17:34:09 +01:00
Piotr Jastrzebski	e12ee2d9cc	cql3: Make it impossible to create MV on CDC log Fixes #9233 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-11-19 17:33:10 +01:00
Avi Kivity	f3d5b2b2b0	Merge "Add effective_replication_map factory" from Benny " Add a sharded locator::effective_replication_map_factory that holds shared effective_replication_maps. To search for e_r_m in the factory, we use a compound `factory_key`: <replication_strategy type, replication_strategy options, token_metadata ring version>. Start the sharded factory in main (plus cql_test_env and tools/schema_loader) and pass a reference to it to storage_proxy and storage_server. For each keyspace, use the registry to create the effective_replication_map. When registered, effective_replication_map objects erase themselves from the factory when destroyed. effective_replication_map then schedules a background task to clear_gently its contents, protected by the e_r_m_f::stop() function. Note that for non-shard 0 instances, if the map is not found in the registry, we construct it by cloning the precalculated replication_map from shard 0 to save the cpu cycles of re-calculating it time and again on every shard. 
Test: unit(dev), schema_loader_test(debug) DTest: bootstrap_test.py:TestBootstrap.decommissioned_wiped_node_can_join_test update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_with_repair_test (dev) " * tag 'effective_replication_map_factory-v7' of https://github.com/bhalevy/scylla: effective_replication_map: clear_gently when destroyed database: shutdown keyspaces test: cql_test_env: stop view_update_generator before database shuts down effective_replication_map_factory: try cloning replication map from shard 0 tools: schema_loader: start a sharded erm_factory storage_service: use erm_factory to create effective_replication_map keyspace: use erm_factory to create effective_replication_map effective_replication_map: erase from factory when destroyed effective_replication_map_factory: add create_effective_replication_map effective_replication_map: enable_lw_shared_from_this effective_replication_map: define factory_key keyspace: get a reference to the erm_factory main: pass erm_factory to storage_service main: pass erm_factory to storage_proxy locator: add effective_replication_map_factory	2021-11-19 18:19:38 +02:00
Botond Dénes	f8a6857987	tools/scylla-sstable: add --dump-scylla-metadata Dumps the scylla component.	2021-11-19 15:52:41 +02:00
Botond Dénes	a0d1c0948c	tools/scylla-sstable: add --dump-statistics Dumps the statistics component.	2021-11-19 15:52:41 +02:00
Botond Dénes	d3dbf1b0e4	tools/scylla-sstable: add --dump-summary Dumps the summary component.	2021-11-19 15:52:41 +02:00
Botond Dénes	5f59aabc1b	tools/scylla-sstable: add --dump-compression-info Dump the compression-info component.	2021-11-19 15:52:41 +02:00
Botond Dénes	25e9e1f2d4	tools/scylla-sstable: extract unsupported flag checking into function Some of the common flags are unsupported for dumping components other than the data one. Currently this is checked in the only non-data dumper: dump-index. Move this into a separate function in preparation of adding dumpers for other components as well.	2021-11-19 15:52:41 +02:00
Botond Dénes	16e105c8e1	sstables/sstable: add scylla metadata getter	2021-11-19 15:52:41 +02:00
Botond Dénes	78a57c34f9	sstables/sstable: add statistics accessor	2021-11-19 15:52:38 +02:00
Raphael S. Carvalho	c94e6f8567	compaction: Merge GC writer into regular compaction writer Turns out most of regular writer can be reused by GC writer, so let's merge the latter into the former. We gain a lot of simplification, lots of duplication is removed, and additionally, GC writer can now be enabled with interposer as it can be created on demand by each interposer consumer (will be done in a later patch). Refs #6472. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211119120841.164317-1-raphaelsc@scylladb.com>	2021-11-19 14:19:50 +02:00
GavinJE	f8c91bdd1e	Update debugging.md Line 7 does not display correctly in reality. "crashed" appears as "chrashed" on the website. Bug needs to be fixed. Closes #9652	2021-11-19 14:21:53 +03:00
GavinJE	22fa7ecf99	Update compaction_controller.md Line 15. "ee" changed to "they" Closes #9651	2021-11-19 14:19:20 +03:00
Benny Halevy	eed3e95704	effective_replication_map: clear_gently when destroyed Prevent reactor stalls by gently clearing the replication_map and token_metadata_ptr when the effective_replication_map is destroyed. This is done in the background, protected by the effective_replication_map_factory::stop() method. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	cd0061dcb5	database: shutdown keyspaces release the keyspace effective_replication_map during shutdown so that effective_replication_map_factory can be stopped cleanly with no outstanding e_r_m:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	1e259665fe	test: cql_test_env: stop view_update_generator before database shuts down We can't have view updates happening after the database shuts down. In particular, mutateMV depends on the keyspace effective_replication_map and it is going to be released when all keyspaces shut down, in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	866e1b8479	effective_replication_map_factory: try cloning replication map from shard 0 Calculating a new effective_replication_map on each shard is expensive. To try to save that, use the factory key to look up an e_r_m on shard 0 and, if found, clone its replication map and use that to make the shard-local e_r_m copy. In the future, we may want to improve that in 2 ways: - instead of always going to shard 0, use hash(key) % smp::count to create the first copy. - make full copies only on NUMA nodes and keep a shared pointer on all other shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	0a3d66839a	tools: schema_loader: start a sharded erm_factory This is required for an upcoming change to create effective_replication_map on all shards in storage_service::replication_to_all_cores. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	23e1344b72	storage_service: use erm_factory to create effective_replication_map Instead of calculating the effective_replication_map in replicate_to_all_cores, use effective_replication_map_factory:: create_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	cb240ffbae	keyspace: use erm_factory to create effective_replication_map Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:41 +02:00
Benny Halevy	6754e6ca2b	effective_replication_map: erase from factory when destroyed The effective_replication_map_factory keeps naked pointers to outstanding effective_replication_map:s. These are kept valid using a shared effective_replication_map_ptr. When the last shared ptr reference is dropped the effective_replication_map object is destroyed, therefore the raw pointer to it in the factory must be erased. This now happens in ~effective_replication_map when the object is marked as registered. Registration happens when effective_replication_map_factory inserts the newly created effective_replication_map to its _replication_maps map, and the factory calls effective_replication_map::set_factory. Note that effective_replication_map may be created temporarily and not be inserted to the factory's map, therefore erase is called only when required. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:52:20 +02:00
Benny Halevy	8a6fbe800f	effective_replication_map_factory: add create_effective_replication_map Make a factory key using the replication_strategy type and config options, plus the token_metadata ring version and use it to search for an already-registered effective_replication_map. If not found, calculate a new effective_replication_map and register it using the above key. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	ecba37dbfd	effective_replication_map: enable_lw_shared_from_this So an effective_replication_map_ptr can be generated using a raw pointer by effective_replication_map_factory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	f4f41e2908	effective_replication_map: define factory_key To be used to locate the effective_replication_map in the to-be-introduced effective_replication_map_factory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	5947de7674	keyspace: get a reference to the erm_factory To be used for creating effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	1d7556d099	main: pass erm_factory to storage_service To be used for creating effective_replication_map when token_metadata changes, and update all keyspaces with it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	242043368e	main: pass erm_factory to storage_proxy To be used for creating the effective_replication_map per keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	3fed73e7c2	locator: add effective_replication_map_factory It will be used further to create shared copies of effective_replication_map based on replication_strategy type and config options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-19 10:46:51 +02:00
Benny Halevy	3c0fec6b17	storage_proxy: paxos_response_handler::prune: demote write timeout error printout to debug level Similar to other timeout handling paths, there is no need to print an ERROR for timeout as the error is not returned anyhow. Eventually the error will be reported at the query level when the query times out or fails in any other way. Also, similar to `storage_proxy::mutate_end`, traces were added also for the error cases. FWIW, these extraneous timeout errors cause dtest failures. E.g. alternator_tests:AlternatorTest.test_slow_query_logging Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211118153603.2975509-1-bhalevy@scylladb.com>	2021-11-19 11:09:09 +03:00
Raphael S. Carvalho	5f7ee2e135	test: sstable_compaction_test: fix twcs_reshape_with_disjoint_set_test by using a non-coarse timestamp resolution We're using a coarse resolution when rounding clock time for sstables to be evenly distributed across time buckets. Let's use a better resolution, to make sure sstables won't fall into the edges. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211118172126.34545-1-raphaelsc@scylladb.com>	2021-11-19 11:09:09 +03:00
Pavel Emelyanov	1dd08e367e	test, cross-shard-barrier: Increase stall detector period The test checks every 100 * smp::count milliseconds that a shard had been able to make at least one step. Shards, in turn, take up to 100 ms sleeping breaks between steps. It seems like on heavily loaded nodes the checking period is too small and the test stuck-detector shoots false-positives. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211118154932.25859-1-xemul@scylladb.com>	2021-11-19 11:09:09 +03:00
Mikołaj Sielużycki	87a212fa56	memtable-sstable: Fix indentation in table::try_flush_memtable_to_sstable. Message-Id: <20211118131441.215628-3-mikolaj.sieluzycki@scylladb.com>	2021-11-19 11:09:09 +03:00
Mikołaj Sielużycki	6df07f7ff7	memtable-sstable: Convert table::try_flush_memtable_to_sstable to coroutines. I intentionally store lambdas in variables and pass them to with_scheduling_group using std::ref. Coroutines don't put variables captured by lambdas on stack frame. If the lambda containing them is not stored, the captured variables will be lost, resulting in stack/heap use after free errors. An alternative is to capture variables, then create local variables inside lambda bodies that contain a copy/moved version of the captured ones. For example, if the post_flush lambda wasn't stored in a dedicated variable, then it wouldn't be put on the coroutine frame. At the first co_await inside of it, the lambda object along with variables captured by it (old and &newtabs created inside square brackets) would go away. The underlying objects (e.g. newtabs created in the outer scope) would still be valid, but the reference to it would be gone, causing most of the tests to fail. Message-Id: <20211118131441.215628-2-mikolaj.sieluzycki@scylladb.com>	2021-11-19 11:09:09 +03:00
Kamil Braun	0f404c727e	test: raft: randomized_nemesis_test: better RPC message receiving implementation The previous implementation based on `delivery_queue` had a serious defect: if receiving a message (`rpc::receive`) blocked, other messages in the queue had to wait. This would cause, for example, `vote_request` messages to stop being handled by a server if the server was in the middle of applying a snapshot. Now `rpc::receive` returns `void`, not `future<>`. Thus we no longer need `delivery_queue`: the network message delivery function can simply call `rpc::receive` directly. Messages which require asynchronous work to be performed (such as snapshot application) are handled in `rpc::receive` by spawning a background task. The number of such background tasks is limited separately for each message type; now if we exceed that limit, we drop other messages of this type (previously they would queue up indefinitely and block not only other messages of this type but different types as well). Message-Id: <20211116163316.129970-1-kbraun@scylladb.com>	2021-11-19 11:09:09 +03:00
Botond Dénes	a51529dd15	protocol_servers: strengthen guarantees of listen_addresses() In early versions of the series which proposed protocol servers, the interface had two methods answering pretty much the same question of whether the server is running or not: * listen_addresses(): empty list -> server not running * is_server_running() To reduce redundancy and to avoid possible inconsistencies between the two methods, `is_server_running()` was scrapped, but re-added by a follow-up patch because `listen_addresses()` proved to be unreliable as a source for whether the server is running or not. This patch restores the previous state of having only `listen_addresses()` with two additional changes: * rephrase the comment on `listen_addresses()` to make it clear that implementations must return empty list when the server is not running; * those implementations that have a reliable source of whether the server is running or not, use it to force-return an empty list when the server is not running Tests: dtest(nodetool_additional_test.py) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211117062539.16932-1-bdenes@scylladb.com>	2021-11-19 11:09:09 +03:00
Gleb Natapov	814dea3600	raft: handle non voting members correctly in stepdown procedure For leader stepdown purposes a non voting member is not different from a node outside of the config. The patch makes relevant code paths to check for both conditions.	2021-11-18 11:35:29 +02:00
Gleb Natapov	6a9b3cdb49	raft: exclude non voting nodes from the stable leader check If a node is a non voting member it cannot be a leader, so the stable leader rule should not be applied to it. This patch aligns non voting node behaviour with a node that was removed from the cluster. Both of them stepdown from leader position if they happen to be a leader when the state change occurred.	2021-11-18 11:18:13 +02:00
Raphael S. Carvalho	4b1bb26d5a	compaction: Make maybe_replace_exhausted_sstables_by_sst() more robust Make it more robust by tracking both partial and sealed sstables. This way, maybe_r__e__s__by_sst() won't pick partial sstables as part of incremental compaction. It works today because interposer consumer isn't enabled with incremental compaction, so there's a single consumer which will have sealed the sstable before the function for early replacement is called, but the story is different if both are enabled. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211117135817.16274-1-raphaelsc@scylladb.com>	2021-11-17 17:21:53 +02:00
Avi Kivity	bc75e2c1d1	treewide: wrap runtime formats with fmt::runtime for fmt 8 fmt 8 checks format strings at compile time, and requires that non-compile-time format strings be wrapped with fmt::runtime(). Do that, and to allow coexistence with fmt 7, supply our own do-nothing version of fmt::runtime() if needed. Strictly speaking we shouldn't be introducing names into the fmt namespace, but this is transitional only. Closes #9640	2021-11-17 15:21:36 +02:00
Gleb Natapov	6744b466e4	cql3: co-routinize alter_type_statement::announce_migration Message-Id: <YZUAlx3fHdVRSlqX@scylladb.com>	2021-11-17 15:20:37 +02:00
Gavin Howell	28f8c3987e	docs/alternator: copyedit alternator.md Line 41. Grammar correction needed. Unclear meaning in sentence. word "message" added after "error". Comma added after "message". Closes #9648	2021-11-17 15:06:21 +02:00
Gavin Howell	7b0a5cdeb2	docs/alternator: typo in compatibility.md Line 170. "PoinInTime" changed to "PointInTime" Closes #9650	2021-11-17 15:03:40 +02:00
Calle Wilund	a8bb4dcd28	tls: Add certficate_revocation_list option for client/server encryption options Fixes #9630 Adds support for importing a CRL (certificate revocation list). This will be monitored and reloaded like certs/keys. Allows blacklisting individual certs. Closes #9655	2021-11-17 14:24:22 +02:00
Nadav Har'El	82bcc2cbd2	Merge: redis: get controller in line Merged patch series from Botond Dénes: Redis's controller, unlike all other protocol's controllers is called service and is not even in the redis namespace. This is made even worse by the redis directory also having a server.{hh,cc}, making one always second guessing on which is what. This series applies to the redis controller the convention used by (almost) all other service controller classes: * They are called controller * They are in a file called ${protocol}/controller.{hh,cc} * They are in a namespace ${protocol} (Thrift is not perfectly following this either). Botond Dénes (3): redis: redis_service: move in redis namespace redis: redis::service -> redis::controller redis: mv service.* -> controller.* configure.py \| 2 +- main.cc \| 10 ++++----- redis/{service.cc => controller.cc} \| 32 ++++++++++++++++------------- redis/{service.hh => controller.hh} \| 10 ++++----- 4 files changed, 29 insertions(+), 25 deletions(-) rename redis/{service.cc => controller.cc} (87%) rename redis/{service.hh => controller.hh} (93%)	2021-11-17 14:19:36 +02:00
Botond Dénes	d4d4c0ace7	redis: mv service.* -> controller.*	2021-11-17 13:58:49 +02:00
Botond Dénes	618adeddd8	redis: redis::service -> redis::controller Follow the naming scheme for the controller class/instance used by all other protocol controllers: * rename class: service -> controller; * rename variable in main.cc: redis -> redis_ctl;	2021-11-17 13:47:44 +02:00
Botond Dénes	95510c6f92	redis: redis_service: move in redis namespace	2021-11-17 13:44:41 +02:00
Piotr Jastrzebski	033a75ff96	cdc: Don't support "on" and "off" values for preimage any more This is an undocumented feature that causes confusion so let's get rid of it. tests: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #9639	2021-11-17 11:54:11 +01:00
Tzach Livyatan	0b6c49b03e	docs website: update latest branch to 4.5 Closes #9638	2021-11-17 12:33:22 +02:00
Avi Kivity	12d29b28ab	raft: generator: correct constraints on members A member variable is a reference, not a pure value, so std::same_as<> needs to be given a reference (and clang 13 insists). However, clang 12 doesn't accept the correct constraint, so use std::convertible_to<> as a compromise. Closes #9642	2021-11-17 11:27:52 +02:00
Gleb Natapov	8f64a6d2d2	raft: fix configuration::can_vote() to work correctly with joint config Fix configuration::can_vote() to return true if a node is a voting member in any of the configs.	2021-11-17 11:06:42 +02:00
Benny Halevy	9548220b70	compaction_manager: submit_offstrategy: remove task in finally clause Now, when the offstrategy task is stopped, it exits the repeat loop if (!can_proceed(task)) without going through _tasks.remove(task) - causing the assert in compaction_manager::remove to trip, as stop_ongoing_compactions will be resolved while the task is still listed in _tasks. Fixes #9634 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-17 09:53:59 +02:00
Avi Kivity	720e9521f0	utils: build_id: correct fmt include fmt::print(std::ostream&) is in <fmt/ostream.h> Closes #9641	2021-11-17 09:02:57 +02:00
Avi Kivity	edcdbc16d3	db: heat weighted load balancing: remove unused variable total_deficit The variable is write-only. Closes #9647	2021-11-17 09:02:23 +02:00
Avi Kivity	e51fcc22f3	sstable_loader: add missing include <cfloat> Needed for FLT_EPSILON Closes #9646	2021-11-17 09:01:49 +02:00
Avi Kivity	2c1e30a12a	test: network_topology_strategy_test: remove unused variable total_rf It is write-only. Closes #9645	2021-11-17 09:01:24 +02:00
Avi Kivity	cba07a3145	test: perf: fix format string for scheduling_latency_measurer Need a colon to introduce the format after the default argument specifier. Found by fmt 8. Closes #9644	2021-11-17 09:00:56 +02:00
Avi Kivity	6ece375fc8	repair: add missing include <cfloat> Needed for FLT_EPSILON Closes #9643	2021-11-17 09:00:11 +02:00
Amos Kong	32e62252e1	debian/build_offline_installer.sh: config apt to keep downloaded packages The downloaded packages might be deleted automatically after installation, in which case we would provide an incomplete installer to the user. This patch configures apt to keep the downloaded packages before installation. Signed-off-by: Amos Kong <kongjianjun@gmail.com> Closes #9592	2021-11-16 17:47:01 +02:00
Avi Kivity	e2c27ee743	Merge 'commitlog: recalculate disk footprint on delete_segment exceptions' from Calle Wilund If we get errors/exceptions in delete_segments we can (and probably will) lose track of disk footprint counters. This can in turn, if using hard limits, cause us to block indefinitely on segment allocation since we might think we have a larger footprint than we actually do. Of course, if we actually fail deleting a segment, it is 100% true that we still technically hold this disk footprint (now unreachable), but for cases where for example outside forces (or wacky tests) delete a file behind our backs, this might not be true. One could also argue that our footprint is the segments and file names we keep track of, and the rest is exterior sludge. In any case, if we have any exceptions in delete_segments, we should recalculate disk footprint based on current state, and restart all new_segment paths etc. Fixes #9348 (Note: this is based on previous PR #9344 - so shows these commits as well. Actual changes are only the latter two). Closes #9349 * github.com:scylladb/scylla: commitlog: Recalculate footprint on delete_segment exceptions commitlog_test: Add test for exception in alloc w. deleted underlying file commitlog: Ensure failed-to-create-segment is re-deleted commitlog::allocate_segment_ex: Don't re-throw out of function	2021-11-16 17:44:56 +02:00
Tomasz Grabiec	bf6898a5a0	lsa: Add sanity checks around lsa_buffer operations We've been observing hard-to-explain crashes recently around lsa_buffer destruction, where the containing segment is absent in _segment_descs which causes log_heap::adjust_up to abort. Add more checks to catch certain impossible scenarios which can lead to this sooner. Refs #9192. Message-Id: <20211116122346.814437-1-tgrabiec@scylladb.com>	2021-11-16 14:25:02 +02:00
Tomasz Grabiec	4d627affc3	lsa: Mark compact_segment_locked() as noexcept We cannot recover from a failure in this method. The implementation makes sure it never happens. Invariants will be broken if this throws. Detect violations early by marking as noexcept. We could make it exception safe and try to leave the data structures in a consistent state but the reclaimer cannot make progress if this throws, so it's pointless. Refs #9192 Message-Id: <20211116122019.813418-1-tgrabiec@scylladb.com>	2021-11-16 14:23:10 +02:00
Pavel Emelyanov	a62631d441	config: Enable developer-mode by default in dev/debug modes Other than looking sane, this change continues the tradition, founded by the --workdir option, of freeing the developer from the annoying necessity to type too many options when scylla is started by hand for devel purposes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211116104815.31822-1-xemul@scylladb.com>	2021-11-16 12:53:33 +02:00
Pavel Emelyanov	ba16318457	generic_server: Keep server alive during conn background processing There's at least one tiny race in generic_server code. The trailing .handle_exception after the conn->process() captures this, but since the whole continuation chain happens in the background, that this can be released thus causing the whole lambda to execute on freed generic_server instance. This, in turn, is not nice because captured this is used to get a _logger from. The fix is based on the observation that all connections pin the server in memory until all of them (connections) are destructed. Said that, to keep the server alive in the aforementioned lambda it's enough to make sure the conn variable (it's lw_shared_ptr on the connection) is alive in it. Not to generate a bunch of tiny continuations with identical set of captures -- tail the single .then_wrapped() one and do whatever is needed to wrap up the connection processing in it. tests: unit(dev) fixes: #9316 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211115105818.11348-1-xemul@scylladb.com>	2021-11-16 11:10:39 +02:00
Pavel Emelyanov	6131aea203	scylla-gdb: Handle new fair_queue::priority_class_data representation In the full-duplex capable scheduler the _handles list contains direct pointers on pclass data, not lw_shared_ptr's. Most of the time this container is empty so this bug is not triggerable right at once. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211116084250.21399-1-xemul@scylladb.com>	2021-11-16 11:07:21 +02:00
Botond Dénes	b136746040	mutation_partition: deletable_row::apply(shadowable_tombstone): remove redundant maybe_shadow() Shadowing is already checked by the underlying row_tombstone::apply(). This redundant check was introduced by a previous fix to #9483 (`6a76e12768`). The rest of that patch is good. Refs: #9483 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211115091513.181233-1-bdenes@scylladb.com>	2021-11-15 17:50:41 +01:00
Kamil Braun	e48e0ff7db	test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call ... in `test_sstable_reversing_reader_random_schema`. The call was missing an end iterator.	2021-11-15 17:32:22 +01:00
Kamil Braun	3abcbf6875	test: mutation_source_test: extract `forwardable_reader_to_mutation` function The function shall be used in other places as well.	2021-11-15 17:32:17 +01:00
Kamil Braun	9f0e13dd0b	test: random_schema: fix clustering column printing in `random_schema::cql` Also leave a FIXME to include the key ordering in the string as well.	2021-11-15 17:30:59 +01:00
Botond Dénes	64bb48855c	flat_mutation_reader: revamp flat_mutation_reader_from_mutations() Add schema parameter so that: * Caller has better control over schema -- especially relevant for reverse reads where it is not possible to follow the convention of passing the query schema which is reversed compared to that of the mutations. * Now that we don't depend on the mutations for the schema, we can lift the restriction on mutations not being empty: this leads to safer code. When the mutations parameter is empty, an empty reader is created. Add "make_" prefix to follow convention of similar reader factory functions. Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>	2021-11-15 17:58:46 +02:00
Nadav Har'El	6e1344eb4f	alternator: better error handling for wrongly-encoded numbers In the DynamoDB API, a number is encoded in JSON requests as something like: {"N": "123"} - the type is "N" and the value "123". Note that the value of the number is encoded as a string, because the floating-point range and accuracy of DynamoDB differs from what various JSON libraries may support. We have a function unwrap_number() which supported the value of the number being encoded as an actual number, not a string. But we should NOT support this case - DynamoDB doesn't. In this patch we add a test that confirms that DynamoDB doesn't, and remove the unnecessary case from unwrap_number(). The unnecessary case also had a FIXME, so it's a good opportunity to get rid of a FIXME. When writing the test, I noticed that the error which DynamoDB returns in this case is SerializationException instead of the more usual ValidationException. I don't know why, but let's also change the error type in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211115125738.197099-1-nyh@scylladb.com>	2021-11-15 14:47:49 +01:00
Botond Dénes	802e3642a0	Update tools/java submodule * tools/java cb6c1d07a7...8fae618f7f (2): > removeNode: support ignoreNodes options > build: replace yum with dnf	2021-11-15 15:41:27 +02:00
Botond Dénes	f313706d80	Update tools/jmx submodule * tools/jmx d6225c5...2c43d99 (2): > removeNode: support ignoreNodes options > build: replace yum with dnf	2021-11-15 15:41:27 +02:00
Avi Kivity	a19d00ef9b	dist: scylla_raid_setup: mount XFS with online discard Online discard asks the disk to erase flash memory cells as soon as files are deleted. This gives the disk more freedom to choose where to place new files, so it improves performance. On older kernel versions, and on really bad disks, this can reduce performance so we add an option to disable it. Since fstrim is pointless when online discard is enabled, we don't configure it if online discard is selected. I tested it on an AWS i3.large instance, the flag showed up in `mount` after configuration. Closes #9608	2021-11-15 14:16:08 +02:00
Avi Kivity	c17101604f	Merge 'Revert "scylla_util.py: return bool value on systemd_unit.is_active()"' from Takuya ASADA In scylla_util.py, we provide `systemd_unit.is_active()` to return the `systemctl is-active` output. When we introduced the systemd_unit class, we just returned the `systemctl is-active` output as a string, but we later changed the return value to bool (`2545d7fd43`). This was done to avoid a scripting bug: `if unit.is_active():` always evaluates to True, even when it returns "failed" or "inactive". However, this was probably a mistake, because a systemd unit does not have just two states like "start" / "stop"; there are many states. And we are already using multiple unit states ("activating", "failed", "inactive", "active") in our Cloud image login prompt: https://github.com/scylladb/scylla-machine-image/blob/next/common/scylla_login#L135 After we merged `2545d7fd43`, the login prompt broke, because it no longer returns the string the script expects (https://github.com/scylladb/scylla-machine-image/issues/241). I think we should revert `2545d7fd43`; it should return exactly the same value as `systemctl is-active` reports. Fixes #9627 Fixes scylladb/scylla-machine-image#241 Closes #9628 * github.com:scylladb/scylla: scylla_ntp_setup: use string in systemd_unit.is_active() Revert "scylla_util.py: return bool value on systemd_unit.is_active()"	2021-11-15 13:56:28 +02:00
Takuya ASADA	279fabe9b4	scylla_ntp_setup: use string in systemd_unit.is_active() Since we reverted `2545d7fd43`, we need to use string instead of bool value.	2021-11-15 19:50:31 +09:00
Takuya ASADA	d646673705	Revert "scylla_util.py: return bool value on systemd_unit.is_active()" This reverts commit `2545d7fd43`. Fixes #9627 Fixes scylladb/scylla-machine-image#241	2021-11-15 19:50:31 +09:00
Pavel Emelyanov	4e86936850	redis: Remove stop_server deferred action from main Commit `3f56c49a9e` put redis into protocol_servers list of storage service. Since then there's no need in explicit stop_server call on shutdown -- the protocol_servers thing will do it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211109154259.1196-1-xemul@scylladb.com>	2021-11-15 11:58:44 +02:00
Avi Kivity	4d7a013e94	sstables: mx: writer: make large partition stats accounting branch-free It is bad form to introduce branches just for statistics, since branches can be expensive (even when perfectly predictable, they consume branch history resources). Switch to simple addition instead; this should not cause any cache misses since we already touch other statistics earlier. The inputs are already boolean, but cast them to boolean just so it is clear we're adding 0/1, not a count. Closes #9626	2021-11-15 11:28:48 +02:00
Benny Halevy	9d4262e264	protocol_server: add per-protocol is_server_running method Change `b0a2a9771f` broke the generic api implementation of is_native_transport_running that relied on the addresses list being empty after the server is stopped. To fix that, this change introduces a pure virtual method: protocol_server::is_server_running that can be implemented by each derived class. Test: unit(dev) DTest: nodetool_additional_test.py:TestNodetool.binary_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211114135248.588798-1-bhalevy@scylladb.com>	2021-11-14 16:01:31 +02:00
Avi Kivity	c9b8b84411	build: replace yum with dnf dnf has replaced yum on Fedora and CentOS. On modern versions of Fedora, you have to install an extra package to get the old name working, so avoid that inconvenience and use dnf directly. Closes #9622	2021-11-14 14:41:47 +02:00
Michael Livshin	a7511cf600	system keyspace: record partitions with too many rows Add "rows" field to system.large_partitions. Add partitions to the table when they are too large or have too many rows. Fixes #9506 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9577	2021-11-14 14:25:18 +02:00
Avi Kivity	98ec98ba36	Update seastar submodule * seastar 04c6787b35...f8a038a0a2 (1): > http: disable Nagle's algorithm for the http server Fixes #9619.	2021-11-14 13:21:06 +02:00
Avi Kivity	6cb3caaf39	Update seastar submodule * seastar a189cdc45...04c6787b3 (12): > Convert std::result_of to std::invoke_result > Merge "IO queue full-duplex mode" from Pavel E > Merge "Report bytes/ops for R and W separately" from Pavel E > websocket: override std::exception::what() correctly > tests: websocket_test: remove unused lambda capture > Merge "Improve IO classes preemption" from Pavel E > Revert "Merge "Improve IO classes preemption" from Pavel E" > Merge "Add skeleton implementation of a WebSocket server" from Piotr S > Merge "Improve IO classes preemption" from Pavel E > io_queue: Add starvation time metrics (per-class) > Revert "Merge "Add skeleton implementation of a WebSocket server" from Piotr S" > Merge "Add skeleton implementation of a WebSocket server" from Piotr S	2021-11-13 11:56:28 +02:00
Piotr Sarna	cc544ba117	service: coroutinize client_state.cc No functional changes, but makes the code shorter and gets rid of a few allocations. Coroutinizing has_column_family_access is deliberately skipped and commented, since some callers expect this function to throw instead of returning an exceptional future. Message-Id: <958848a1eeeef490b162d2d2b805c8a14fc9082b.1636704996.git.sarna@scylladb.com>	2021-11-12 21:52:29 +02:00
Tomasz Grabiec	4e3b54d9fe	Merge "Teach scylla-gdb.py duplex IO queues" from Pavel Emelyanov Fresh seastar has duplex IO queues (and some more goodies). The former one needs respective changes in scylla-gdb.py * xemul/br-gdb-duplex-ioqueues: scylla-gdb: Support new fair_{queue|group}s layout scylla-gdb: Add boost::container::small_vector wrapper scylla-gdb: Fix indentation aft^w before next patch	2021-11-12 19:43:22 +01:00
Pavel Emelyanov	123286d5cd	database: Remove infinite_bound_range_deletion bits Have been unused for quite a while already Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211112150837.24125-1-xemul@scylladb.com>	2021-11-12 19:40:17 +01:00
Pavel Emelyanov	5877b84a1a	range_streamer: Remove stream_plan from The streamer creates stream_plan "on demand" and doesn't use the on-board one Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211112180335.27831-1-xemul@scylladb.com>	2021-11-12 19:38:45 +01:00
Pavel Emelyanov	29892af828	scylla-gdb: Support new fair_{queue|group}s layout In the recent seastar io_queues carry several fair_queues on board, so do the io_groups. The queues are in boost small_vector, the groups are in a vector of unique_ptrs. This patch adds this knowledge to scylla-gdb script. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-12 16:16:25 +03:00
Pavel Emelyanov	c032794556	scylla-gdb: Add boost::container::small_vector wrapper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-12 16:15:51 +03:00
Pavel Emelyanov	b321cccaad	scylla-gdb: Fix indentation aft^w before next patch The upcoming seastar update will have fair_groups and fair_queues to become arrays. Thus scylla-gdb will need to iterate over them with some sort of loop. This patch makes the queue/group printing indentation to match this future loop body and prepares the loop variables while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-12 16:11:59 +03:00
Gleb Natapov	123ece611b	lwt: co-routinize accept_proposal Message-Id: <20211111163942.121827-4-gleb@scylladb.com>	2021-11-11 22:13:26 +02:00
Gleb Natapov	588768f4af	lwt: co-routinize prepare_ballot Message-Id: <20211111163942.121827-3-gleb@scylladb.com>	2021-11-11 22:13:26 +02:00
Gleb Natapov	61b2e41a23	lwt: co-routinize begin_and_repair_paxos Message-Id: <20211111163942.121827-2-gleb@scylladb.com>	2021-11-11 22:13:26 +02:00
Avi Kivity	f74b258928	Merge "Add the system.config virtual table (updateable)" from Pavel E " Scylla can be configured via a bunch of config files plus a bunch of commandline options. Collecting these altogether can be challenging. The proposed table solves a big portion of this by dumping the db::config contents as a table. For convenience (and, maybe, to facilitate Benny's CLI) it's possible to update the 'value' column of the table with CQL request. There exists a PR with a table that exports loglevels in a form of a table. The updating technique used in this set is applicable to that table as well. tests: compilation(dev, release, debug), unit(debug) " * 'br-db-config-virtual-table-3' of https://github.com/xemul/scylla: tests: Unit test for system.config virtual table system_keyspace: Table with config options code: Push db::config down to virtual tables storage_proxy: Propagate virtual table exceptions messages table: Virtual writer hook (mutation applier) table: Rewrap table::apply() table: Mark virtual reader branch with unlikely utils: Add config_src::source_name() method utils: Ability to set_value(sstring) for an option utils: Internal change of config option utils: Mark some config_file methods noexcept	2021-11-11 22:13:26 +02:00
Yaron Kaikov	060a91431d	dist/docker/debian/build_docker.sh: debian version fix for rc releases When building a docker we rely on `VERSION` value from `SCYLLA-VERSION-GEN` . For `rc` releases only there is a difference between the configured version (X.X.rcX) and the actual debian package we generate (X.X~rcX) Using a similar solution as I did in `dcb10374a5` Fixes: #9616 Closes #9617	2021-11-11 22:13:26 +02:00
Pavel Emelyanov	e6ef5e7e43	tests: Unit test for system.config virtual table Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 16:39:34 +03:00
Pavel Emelyanov	4a70e0aa57	system_keyspace: Table with config options A config option value is reported as 'text' type and contains a string as it would look in json config. The table is UPDATE-able. Only the 'value' column can be set and the value accepted must be string. It will be converted into the option type automatically, however in current implementation it's not 100% precise -- conversion is lexicographical cast which only works for simple types. However, liveupdate-able values are only of those types, so it works in supported cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 16:39:34 +03:00
Pavel Emelyanov	947e4c9a10	code: Push db::config down to virtual tables The db::config reference is available on the database, which can be obtained from the virtual_table itself. The problem is that it's a const reference, while system.config will be updateable and will need non-const reference. Adding non-const get_config() on the database looks wrong. The database shouldn't be used as config provider, even the const one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 16:39:34 +03:00
Pavel Emelyanov	1ea301ad07	storage_proxy: Propagate virtual table exceptions messages The intention is to return some meaningful info to the CQL caller if a virtual table update fails. Unfortunately the "generic" error reporting in CQL is not extremely flexible, so the best option seems to report regular write failure with custom message in it. For now this only works for virtual table errors. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 16:39:34 +03:00
Pavel Emelyanov	5aefc48e28	table: Virtual writer hook (mutation applier) Symmetrically to virtual reader one, add the virtual writer callback on a table that will be in charge of applying the provided mutation. If a virtual table doesn't override this apply method the dedicated exception is thrown. Next patch will catch it and propagate back to caller, so it's a new exception type, not existing/std one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 16:39:34 +03:00
Pavel Emelyanov	80460f66fc	table: Rewrap table::apply() The main motivation is to have future returning apply (to be used by next patches). As a side effect -- indentation fix and private dirty_memory_region_group() method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 16:39:34 +03:00
Pavel Emelyanov	c3d15c3e18	table: Mark virtual reader branch with unlikely Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 15:15:05 +03:00
Pavel Emelyanov	b3fee616ea	utils: Add config_src::source_name() method To get a human-readable string from abstract source type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 15:15:05 +03:00
Pavel Emelyanov	d513034ca4	utils: Ability to set_value(sstring) for an option There soon will appear an updateable system.config table that will push sstrings into named_value-s. Prepare for this change by adding the respective .set_value() call. Since the update only works for LiveUpdate-able options, and inability to do it can be propagated back to the caller, make this method return true/false whether the update took place or not. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 15:15:05 +03:00
Pavel Emelyanov	c226c0a149	utils: Internal change of config option When a named_value is .set_value()-d the caller may specify the reason for this change. If not specified it's set to None, but None means "it was there by default and wasn't changed" so it's a bit of a lie. Add an explicit Internal reason. It's actually used by the directories thing that update all directories according to --workdir option. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 15:15:05 +03:00
Pavel Emelyanov	2959ebf393	utils: Mark some config_file methods noexcept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 15:15:05 +03:00
Botond Dénes	b58403fb63	Merge "Flatten database drain" from Pavel E " Draining the database is now scattered across the do_drain() method of the storage_service. Also it tells shutdown drain from API drain. This set packs this logic into the database::drain() method. tests: unit(dev), start-stop-drain(dev) " * 'br-database-drain' of https://github.com/xemul/scylla: database, storage_service: Pack database::drain() method storage_service: Shuffle drain sequence storage_service, database: Move flush-on-drain code storage_service: Remove bool from do_drain	2021-11-11 08:19:35 +02:00
Tomasz Grabiec	a084c8c10f	Merge "raft fixes for bugs found by randomized nemesis testing" from Gleb The series fixes issues: server may use the wrong configuration after applying a remote snapshot, causing a split-brain situation; assertion in raft::server_impl::notify_waiters(); snapshot transfer to a server removed from the configuration should be aborted; cluster may become stuck when a follower takes a snapshot after an accepted entry that the leader didn't learn about * scylla-dev/random-test-fixes-v2: raft: rename rpc_configuration to configuration in fsm output raft: test: test case for the issue #9552 raft: fix matching of a snapshotted log on a follower raft: abort snapshot transfer to a server that was removed from the configuration raft: fix race between snapshot application and committing of new entries raft: test: add test for correct last configuration index calculation during snapshot application raft: do not maintain _last_conf_idx and _prev_conf_idx past snapshot index raft: correctly truncate the log in a persistence module during snapshot application	2021-11-10 20:36:53 +01:00
Avi Kivity	d949202615	Update tools/java submodule (PyYAML dependency removal) * tools/java fd10821045...cb6c1d07a7 (1): > dist: remove unneeded dependency to PyYAML	2021-11-10 14:16:01 +02:00
Raphael S. Carvalho	49863ab11c	tests: sstable_compaction_test: Fix test compaction_with_fully_expired_table column_family_for_tests was missing the schema which contained the gc_grace_seconds used by the test. Fixes #8872. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211109163440.75592-1-raphaelsc@scylladb.com>	2021-11-09 19:21:57 +02:00
Michał Radwański	eff392073c	memtable: fix gcc function argument evaluation order induced use after move clang evaluates function arguments from left to right, while gcc does so in reverse. Therefore, this code can be correct on clang and incorrect on gcc: ``` f(x.sth(), std::move(x)) ``` This patch fixes one such instance of this bug, in memtable.cc. Fixes #9605. Closes #9606	2021-11-09 19:21:57 +02:00
Avi Kivity	d2e02ea7aa	Merge " Abstract table for compaction layer with table_state" from Raphael " table_state is being introduced for compaction subsystem, to remove table dependency from compaction interface, fix layer violations, and also make unit testing easier as table_state is an abstraction that can be implemented even with no actual table backing it. In this series, compaction strategy interfaces are switching to table_state, and eventually, we'll make compact_sstables() switch to it too. The idea is that no compaction code will directly reference a table object, but only work with the abstraction instead. So compaction subdirectory can stop including database.hh altogether, which is a great step forward. " * 'table_state_v5' of https://github.com/raphaelsc/scylla: sstable_compaction_test: switch to table_state compaction: stop including database.hh for compaction_strategy compaction: switch to table_state in estimated_pending_compactions() compaction: switch to table_state in compaction_strategy::get_major_compaction_job() compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction() DTCS: reduce table dependency for task estimation LCS: reduce table dependency for task estimation table: Implement table_state compaction: make table param of get_fully_expired_sstables() const compaction_manager: make table param of has_table_ongoing_compaction() const Introduce table_state	2021-11-09 19:21:57 +02:00
Pavel Emelyanov	2005b4c330	Merge branch 'move_disable_compaction_to_manager/v6' from Raphael S. Carvalho Move run_with_compaction_disabled() into compaction manager run_with_compaction_disabled() living in table is a layer violation as the logic of disabling compaction for a table T clearly belongs to manager and table shouldn't be aware of such implementation details. This makes things less error prone too as there's no longer a need for coordination between table and manager. Manager now takes all the responsibility. * 'move_disable_compaction_to_manager/v6' of https://github.com/raphaelsc/scylla: compaction: move run_with_compaction_disabled() from table into compaction_manager compaction_manager: switch to coroutine in compaction_manager::remove() compaction_manager: add struct for per table compaction state compaction_manager: wire stop_ongoing_compactions() into remove() compaction_manager: introduce stop_ongoing_compactions() for a table compaction_manager: prevent compaction from being postponed when stopping tasks compaction_manager: extract "stop tasks" from stop_ongoing_compactions() into new function	2021-11-09 19:21:56 +02:00
Pavel Emelyanov	43f6a13a30	database, storage_service: Pack database::drain() method The storage_service::do_drain() now ends up with shutting down compaction manager, flushing CFs and shutting down commitlog. All three belong to the database and deserve being packed into a single database::drain() method. A note -- these steps are cross-shard synchronized, but database already has a barrier for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-09 19:17:38 +03:00
Pavel Emelyanov	906cac0f86	storage_service: Shuffle drain sequence Right now the draining sequence is - stop transport (protocol servers, gossiper, streaming) - shutdown tracing - shutdown compaction manager - flush CFs - drain batchlog manager - stop migration manager - shutdown commitlog This violates the layering -- both batchlog and migration managers are higher-level services than the database, so they should be shutdown/drained before it, i.e. -- before shutting down compaction manager and flushing all CFs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-09 19:13:56 +03:00
Pavel Emelyanov	82509c9e74	storage_service, database: Move flush-on-drain code Flushing all CFs on shutdown is now fully managed in storage service and it looks weird. Some better place for it seems to be the database itself. Moving the flushing code also implies moving the drain_progress thing and patching the relevant API call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-09 19:11:49 +03:00
Pavel Emelyanov	aba475fe1d	storage_service: Remove bool from do_drain The do_drain() today tells shutdown drain from API drain. The reason is that compaction manager subscribes on the main's abort signal and drains itself early. Thus, on regular drain it needs this extra kick that would crash if called from shutdown drain. This differentiation should sit in the compaction manager itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-09 19:10:13 +03:00
Raphael S. Carvalho	df4bce03ae	sstable_compaction_test: switch to table_state Let's make compaction tests switch to table_state. All disabled ones can now be reenabled. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 11:35:45 -03:00
Raphael S. Carvalho	bb5a8682f3	compaction: stop including database.hh for compaction_strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 11:29:47 -03:00
Raphael S. Carvalho	e2f6a47999	compaction: switch to table_state in estimated_pending_compactions() Last method in compaction_strategy using table. From now on, compaction strategy no longer works directly with table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 11:25:28 -03:00
Raphael S. Carvalho	93ae9225f7	compaction: switch to table_state in compaction_strategy::get_major_compaction_job() From now on, get_major_compaction_job() will use table_state instead of a plain reference to table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 11:25:22 -03:00
Raphael S. Carvalho	d881310b52	compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction() From now on, get_sstables_for_compaction() will use table_state. With table_state, we avoid layer violations like strategy using manager and also makes testing easier. Compaction unit tests were temporarily disabled to avoid a giant commit which is hard to parse. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:52:14 -03:00
Raphael S. Carvalho	9f2d2eee98	DTCS: reduce table dependency for task estimation Similar to LCS, let's reduce table dependency in DTCS, to make it easier to switch to table_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:50:29 -03:00
Raphael S. Carvalho	83fc59402f	LCS: reduce table dependency for task estimation let's reduce table dependency from LCS task estimation, to make it easier to switch to table_state. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:50:27 -03:00
Raphael S. Carvalho	03c819b8f5	table: Implement table_state This is the first implementation of table_state, intended to be used within compaction. It contains everything needed for compaction strategies. Subsequently, compaction strategy interface will replace table by table_state, and later all compaction procedures. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:45:40 -03:00
Raphael S. Carvalho	29df862f57	compaction: make table param of get_fully_expired_sstables() const Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:41:54 -03:00
Raphael S. Carvalho	ff4953206b	compaction_manager: make table param of has_table_ongoing_compaction() const Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:41:52 -03:00
Raphael S. Carvalho	ccb87a6b24	Introduce table_state This abstraction is intended to be used within compaction layer, to replace direct usage of table. This will simplify interfaces, and also simplify testing as an actual table is no longer strictly required. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:41:44 -03:00
Gleb Natapov	7aac6c2086	raft: rename rpc_configuration to configuration in fsm output The field is generic and used not only for rpc configuration now.	2021-11-09 15:16:57 +02:00
Gleb Natapov	db25f1dbb8	raft: test: test case for the issue #9552 test that if a leader tries to append an entry that falls inside a follower's snapshot the protocol stays alive.	2021-11-09 14:51:40 +02:00
Gleb Natapov	a59779155f	raft: fix matching of a snapshotted log on a follower There can be a situation where a leader will send to a follower entries that the latter already snapshotted. Currently a follower considers those to be outdated appends and it rejects them, but it may cause the follower progress to be stuck: - A is a leader, B is a follower, there are other followers which A used to commit entries - A remembers that the last matched entry for B is 10, so the next entry to send is 11. A managed to commit the 11 entry using other followers - A sends entry 11 to B - B receives it, accepts, and updates its commit index to 11. It sends a success reply to A, but it never reaches A due to a network partition - B takes a snapshot at index 11 - A sends entry 11 to B again - B rejects it since it is inside the snapshot - A receives the reject and retries from the same entry - The same thing happens again We should not reject such outdated entries since if they fall inside a snapshot it means they match (according to log matching property). Accepting them will make the case above alive. Fixes #9552	2021-11-09 14:51:40 +02:00
Gleb Natapov	9d505c48de	raft: abort snapshot transfer to a server that was removed from the configuration If a node is removed from a config we should stop transferring snapshot to it. Do exactly that. Fixes #9547	2021-11-09 14:51:40 +02:00
Gleb Natapov	88a6e2446d	raft: fix race between snapshot application and committing of new entries Completion notification code assumes that previous snapshot is applied before new entries are committed, otherwise it asserts that some notifications were missing. But currently commit notifications and snapshot application run in different fibers, so there can be a race between those. Fix that by moving commit notification into applier fiber as well. Fixes #9550	2021-11-09 14:51:40 +02:00
Gleb Natapov	3a88fa5f70	raft: test: add test for correct last configuration index calculation during snapshot application	2021-11-09 14:51:40 +02:00
Gleb Natapov	a04eb2d51f	raft: do not maintain _last_conf_idx and _prev_conf_idx past snapshot index The log maintains _last_conf_idx and _prev_conf_idx indexes into the log to point to where the latest and previous configuration can be found. If they are zero it means that the latest config is in the snapshot. When snapshot with a trailing is applied we can safely reset those indexes that are smaller than the snapshot one to zero because the snapshot will have the latest config anyway. This simplifies maintenance of those indexes since their value will not depend on user configured snapshot_trailing parameter.	2021-11-09 14:03:36 +02:00
Calle Wilund	3929b7da1f	commitlog: Add explicit track var for "wasted space" to avoid double counting Refs #9331 In segment::close() we add space to manager's "wasted" counter. In destructor, if we can cleanly delete/recycle the file we remove it. However, if we never went through close (shutdown - ok, exception in batch_cycle - not ok), we can end up subtracting numbers that were never added in the first place. Just keep track of the bytes added in a var. Observed behaviour in above issue is timeouts in batch_cycle, where we declare the segment closed early (because we cannot add anything more safely - chunks could get partial/misplaced). Exception will propagate to caller(s), but the segment will not go through actual close() call -> destructor should not assume such. Closes #9598	2021-11-09 09:15:44 +02:00
Botond Dénes	4b6c0fe592	mutation_writer/feed_writer: don't drop readers with small amount of content Due to an error in transforming the above routine, readers who have <= a buffer worth of content are dropped without consuming them. This is due to the outer consume loop being conditioned on `is_end_of_stream()`, which will be set for readers that eagerly pre-fill their buffer and also have no more data than what is in their buffer. Change the condition to also check for `is_buffer_empty()` and only drop the reader if both of these are true. Fixes: #9594 Tests: unit(mutation_writer_test --repeat=200, dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211108092923.104504-1-bdenes@scylladb.com>	2021-11-09 09:15:44 +02:00
Avi Kivity	9e2b6176a2	Merge "Run gossiper message handlers in a gate" from Pavel E " When gossiper processes its messages in the background some of the continuations may pop up after the gossiper is shutdown. This, in turn, may result in unwanted code to be executed when it doesn't expect. In particular, storage_service notification hooks may try to update system keyspace (with "fresh" peer info/state/tokens/etc). This update doesn't work after drain because drain shuts down commitlog. The intention was that gossiper did _not_ notify anyone after drain, because it's shut down during drain too. But since there are background continuations left, it's not working as expected. refs: #9567 tests: unit(dev), dtest.concurrent_schema_changes.snapshot(dev) " * 'br-gossiper-background-messages-2' of https://github.com/xemul/scylla: gossiper: Guard background processing with gate gossiper: Helper for background messaging processing	2021-11-09 09:15:44 +02:00
Avi Kivity	b0a2a9771f	Merge "Sanitize hostnames resolving on start" from Pavel E " On start scylla resolves several hostnames into addresses. Different places use different hostname selection logic, e.g. the API address can be the listen one if the dedicated option not set. Failure to resolve a hostname is reported with an exception that (sometimes) contains the hostname, but it doesn't look very convenient -- better to know the config option name. Also resolving of different hostnames has different decoration around, e.g. prometheus carries a main-local lambda just to nicely wrap the try/catch block. This set unifies this zoo and makes main() shorter and less hairy: 1. All failures to resolve a hostname are reported with an exception containing the relevant config option 2. The \|\| operator for named_value's is introduced to make the option selection look as short as resolve(cfg->some_address() \|\| cfg->another_address()) 3. All sanity checks are explicit and happen early in main 4. No dangling local variables carrying the cfg->...() value 5. Use resolved IP when logging a "... is listening on ..." message after a service start tests: unit(dev) " * 'br-ip-resolve-on-start' of https://github.com/xemul/scylla: main: Move fb-utilities initialization up the main code: Use utils::resolve instead of inet_address::lookup main: Remove unused variable main: Sanitize resolving of listen address main: Sanitize resolving of broadcast address main: Sanitize resolving of broadcast RPC address main: Sanitize resolving of API address main: Sanitize resolving of prometheus address utils: Introduce \|\| operator for named_values db.config: Verbose address resolver helper main: Remove api-port and prometheus-port variables alternator: Resolve address with the help of inet_address redis, thrift: Remove unused captures	2021-11-09 09:15:40 +02:00
Michael Livshin	f6bbc7fc9b	tests: remove remaining uses of sstable_assertions::make_reader_v1() * flat_reader_assertions::produces_range_tombstone() does not actually check range tombstones beyond the fact that they are in fact range tombstones (unless non-empty ck_ranges is passed). Fix the immediate problem, change assertion logic to take split and overlapping range tombstones into account properly, and also fix several accidentally-incorrect tests. Fixes #9470 * Convert the remaining sstable_3_x reader tests to v2, now that they are more correct and only the actual conversion remains. This deals with the sstable reader tests that involve range tombstones. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-11-09 09:13:51 +02:00
Botond Dénes	5b3ac3147b	db/schema_tables: merge_tables_and_views(): match old/new view with old/new base table For altered tables, the above function creates schema objects representing before/after (old/new) table states. In case of views, there is a matching mechanism to set the base table field of the view to the appropriate base table object. This works by iterating over the list of altered tables and selecting the "new_schema" field of the first instance matching the keyspace/name of the base-table. This ends up pairing the after/new version of the base table to both the before and after version of the view. This means the base attached to the view is possibly incompatible with the view it is attached to. This patch fixes this by passing the schema generation (before/after) to the function responsible for this matching, so it can select the appropriate version of the base table. For example, given the following input to `merge_tables_and_views()`: tables_before = { t1_before } tables_after = { t1_after } views_before = { v1_before } views_after = { v1_after } Before this patch, the `base_schema` field of `v1_before` would be `t1_after`, while it obviously should be `t1_before`. This sounds scary but has no practical implications currently as `v1_before` is only computed and then discarded without being used. Tests: unit(dev) Fixes: #9586 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211108124806.151268-1-bdenes@scylladb.com>	2021-11-09 09:13:51 +02:00
Raphael S. Carvalho	33b39a2bfc	compaction: move run_with_compaction_disabled() from table into compaction_manager That's intended to fix a bad layer violation as table was given the responsibility of disabling compaction for a given table T, but that logic clearly belongs to compaction_manager instead. Additionally, a gate will be used instead of a counter, as the former provides the manager with a way to synchronize with functions running under run_with_compaction_disabled, so remove() can wait for their termination. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 15:12:46 -03:00
Raphael S. Carvalho	52feb41468	compaction_manager: switch to coroutine in compaction_manager::remove() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 14:24:39 -03:00
Raphael S. Carvalho	aa9b1c1fa3	compaction_manager: add struct for per table compaction state This will make it easier to pack all state data for a given table T. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 14:24:33 -03:00
Raphael S. Carvalho	7876bd4331	compaction_manager: wire stop_ongoing_compactions() into remove() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 14:24:24 -03:00
Raphael S. Carvalho	c0047bb9c0	compaction_manager: introduce stop_ongoing_compactions() for a table New variant of stop_ongoing_compactions() which will stop all compactions for a given table. Will be reused both in remove() and by run_with_compaction_disabled(), which will soon be moved into the compaction_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 14:24:14 -03:00
Raphael S. Carvalho	2f293fa09c	compaction_manager: prevent compaction from being postponed when stopping tasks stop_tasks() must make sure that no ongoing task will postpone compaction when asked to stop. Therefore, let's set all tasks as stopping before any deferring point, such that no task will postpone compaction for a table which is being stopped. compaction_manager::remove() already handles this race with the same method, and given that remove() will later switch to stop_tasks(), let's do the same in stop_tasks(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 14:23:57 -03:00
Raphael S. Carvalho	0643faafd7	compaction_manager: extract "stop tasks" from stop_ongoing_compactions() into new function Procedure will be reused to stop a list of tasks Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-08 14:23:37 -03:00
Pavel Emelyanov	92e8e217b7	main: Move fb-utilities initialization up the main Setting up the fb_utilities addresses sits in the middle of starting/stopping the real services. It's a bit cleaner to make it earlier. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	2f9c21644b	code: Use utils::resolve instead of inet_address::lookup There are some users of the latter call left. They all suffer from the same problem -- the lack of verbosity on resolving errors. While at it also get rid of useless local variables that are only there to carry the cfg->...() option over. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	7cf4e848ec	main: Remove unused variable This one left hanging after the previous patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	3a9b0d83fc	main: Sanitize resolving of listen address Nothing special here, just get rid of the one-shot local variable and use utils::resolve to improve the verbosity of the exception thrown on error. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	f190d99998	main: Sanitize resolving of broadcast address To resolve this one main selects between the config option of the same name or picks the listen address. Similarly to the broadcast RPC address, on error the thrown exception is very generic and doesn't tell which option contained the faulty address. The utils::resolve helper, the \|\| operator and a dedicated explicit sanity check make this place look better. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	a1b6600e7f	main: Sanitize resolving of broadcast RPC address The broadcast RPC address is taken from either the config option of the same name or from the rpc_address one. Also there's a sanity check on the latter. On resolution failure it's impossible to find out which option caused this, just the seastar-level exception is printed. Using recently added utils helper and \|\| for named values makes things shorter. The sanity check for INADDR_ANY is moved upper the main() to where other options sanity checks sit. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	98ab8d9827	main: Sanitize resolving of API address To find out the API address there's a main-local lambda to make the verbose exception as well as an ?:-selection of which option to use as the API address. Using the utils::resolve and recently introduced \|\| for named values makes things much nicer and shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	3188161f93	main: Sanitize resolving of prometheus address Right now there's a main-local lambda to resolve the address and throw some meaningful exception. Using recently introduced utils::resolve() helper makes things look nicer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	b3a4f9e194	utils: Introduce \|\| operator for named_values Those named_values that support .empty() check can be "selected" like this auto& v = option_a() \|\| option_b() \|\| option_c(); This code will put into v a reference to the first non-empty named_value out of a/b/c. This "selection" is actually used on start when scylla decides which config options to use as listen/broadcast/rpc/etc. addresses. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	71ce7c6e87	db.config: Verbose address resolver helper The helper works on named_value() and throws an exception containing the option name for convenient error reporting. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Pavel Emelyanov	df08fb3025	main: Remove api-port and prometheus-port variables Those variables just pollute the main's scope for no gain. It's simpler and more friendly to the next patches to use cfg-> stuff directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:04:07 +03:00
Pavel Emelyanov	acb7068ab5	alternator: Resolve address with the help of inet_address Alternator needs to look up its address without preferring ipv4 or ipv6. To do so it calls a seastar method, but the same effect is achieved by calling inet_address::lookup. This change makes all places in scylla resolve addresses in a similar way, makes this code line shorter and removes the need to specifically explain the alternator hunks from next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:04:07 +03:00
Pavel Emelyanov	7f6fbaf3c6	redis, thrift: Remove unused captures Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 16:39:30 +03:00
Jenkins	158f47dfc7	release: prepare for 4.7.dev	2021-11-08 09:46:13 +02:00
Pavel Emelyanov	9fccf7f3af	gossiper: Guard background processing with gate When shut down, the gossiper may have some messages being processed in the background. This brings two problems. First, the gossiper itself is about to disappear soon and messages might step on the freed instance (however, this one is not real now, gossiper is not freed for real, just ::stop() is called). Second, message processing may notify other subsystems which, in turn, do not expect this after the gossiper is shut down. The common solution to this is to run background code through a gate that gets closed at some point, the ::shutdown() in gossiper case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 10:25:03 +03:00
Pavel Emelyanov	42f44adb98	gossiper: Helper for background messaging processing Some messages are processed by gossiper on shard0 in the no-wait manner. Add a generic helper for that to facilitate next patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 10:24:44 +03:00
Takuya ASADA	76519751bc	install.sh: add fix_system_distributed_tables.py to the package Related with #4601 Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2021-11-08 08:07:49 +02:00
Michael Livshin	806b5310fd	tests: remove remaining uses of sstable_assertions::make_reader_v1() This deals with the sstable reader tests that involve range tombstones. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-11-08 00:56:39 +02:00
Michael Livshin	4941e2ec41	tests: fix range tombstone checking and deal with the fallout flat_reader_assertions::produces_range_tombstone() does not actually check range tombstones beyond the fact that they are in fact range tombstones (unless non-empty ck_ranges is passed). Fixing the immediate problem reveals that: * The assertion logic is not flexible enough to deal with creatively-split or creatively-overlapping range tombstones. * Some existing tests involving range tombstones are in fact wrong: some assertions may (at least with some readers) refer to wrong tombstones entirely, while others assert wrong things about right tombstones. * Range tombstones in pre-made sstables (such as those read by sstable_3_x_test) have deletion time drift, and that now has to be somehow dealt with. This patch (which is not split into smaller ones because that would either generate unreasonable amount of work towards ensuring bisectability or entail "temporarily" disabling problematic tests, which is cheating) contains the following changes: * flat_reader_assertions check range tombstones more carefully, by accumulating both expected and actually-read range tombstones into lists and comparing those lists when a partition ends (or when the assertion object is destroyed). * flat_reader_assertions::may_produce_tombstones() can take constraining ck_ranges. * Both flat_reader_assertions and flat_reader_assertions_v2 can be instructed to ignore tombstone deletion times, to help with tests that read pre-made sstables. * Affected tests are changed to reflect reality. Most changes to tests make sense; the only one I am not completely sure about is in test_uncompressed_filtering_and_forwarding_range_tombstones_read. Fixes #9470 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-11-08 00:56:39 +02:00
Avi Kivity	247f2b69d5	Merge "system tables: create the schema more efficiently" from Botond " System tables currently almost uniformly use a pattern like this to create their schema: return schema_builder(make_shared_schema(...)) // [...] .with_version(...) .build(...); This pattern is very wasteful because it first creates a schema, then dismantles it just to recreate it again. This series abolishes this pattern without much churn by simply adding a constructor to schema builder that takes identical parameters to `make_shared_schema()`, then simply removing `make_shared_schema()` from these users, who now build a schema builder object directly and build the schema only once. Tests: unit(dev) " * 'schema-builder-make-shared-schema-ctor/v1' of https://github.com/denesb/scylla: treewide: system tables: don't use make_shared_schema() for creating schemas schema_builder: add a constructor providing make_shared_schema semantics schema_builder: without_column(): don't assume column_specification exists schema: add static variant of column_name_type()	2021-11-07 18:23:22 +02:00
Takuya ASADA	546e4adf9e	dist/docker: configure default locale correctly Since cqlsh requires a UTF-8 locale, we should configure the default locale correctly, both for a directly executed shell with docker and via SSH. (Directly executed shell means "docker exec -ti <image> /bin/bash") For SSH, we need to set the correct parameter in /etc/default/locale, which can be set by the update-locale command. However, a directly executed shell won't load this parameter, because it is configured via PAM but we skip login in this case. To fix this issue, we also need to set locale variables in the container image configuration (ENV in Dockerfile, --env in buildah). Fixes #9570 Closes #9587	2021-11-07 17:03:12 +02:00
Takuya ASADA	201a97e4a4	dist/docker: fix bashrc filename for Ubuntu For Debian variants, correct filename is /etc/bash.bashrc. Fixes #9588 Closes #9589	2021-11-07 17:01:13 +02:00
Avi Kivity	7a3930f7cf	Merge 'More nodetool-replacing virtual tables' from Botond Dénes This PR introduces 4 new virtual tables aimed at replacing nodetool commands, working towards the long-term goal of replacing nodetool completely at least for cluster information retrieval purposes. As you may have noticed, most of these replacements are not exact matches. This is on purpose. I feel that the nodetool commands are somewhat chaotic: they might have had a clear plan on what command prints what but after years of organic development they are a mess of fields that feel like they don't belong. In addition to this, they are centered on C* terminology which often sounds strange or doesn't make any sense for scylla (off-heap memory, counter cache, etc.). So in this PR I tried to do a few things: * Drop all fields that don't make sense for scylla; * Rename/reformat/rephrase fields that have a corresponding concept in scylla, so that it uses the scylla terminology; * Group information in tables based on some common theme; With these guidelines in mind let's look at the virtual tables introduced in this PR: * `system.snapshots` - replacement for `nodetool listsnapshots`; * `system.protocol_servers`- replacement for `nodetool statusbinary` as well as `Thrift active` and `Native Transport active` from `nodetool info`; * `system.runtime_info` - replacement for `nodetool info`, not an exact match: some fields were removed, some were refactored to make sense for scylla; * `system.versions` - replacement for `nodetool version`, prints all versions, including build-id; Closes #9517 * github.com:scylladb/scylla: test/cql-pytest: add virtual_tables.py test/cql-pytest: nodetool.py: add take_snapshot() db/system_keyspace: add versions table configure.py: move release.cc and build_id.cc to scylla_core db/system_keyspace: add runtime_info table db/system_keyspace: add protocol_servers table service: storage_service: s/client_shutdown_hooks/protocol_servers/ service: storage_service: remove unused unregister_client_shutdown_hook redis: redis_service: implement the protocol_server interface alternator: controller: implement the protocol_server interface transport: controller: implement the protocol_server interface thrift: controller: implement the protocol_server interface Add protocol_server interface db/system_keyspace: add snapshots virtual table db/virtual_table: remove _db member db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables() docs/design-notes/system_keyspace.md: add listing of existing virtual tables docs/guides: add virtual-tables.md	2021-11-07 16:55:31 +02:00
Avi Kivity	c6ac1462c2	build, submodules: use utc for build datestamp This helps packages built on different machines have the same datestamp, if started at the same time. * tools/java 05ec511bbb...fd10821045 (1): > build: use utc for build datestamp * tools/jmx 48d37f3...d6225c5 (1): > build: use utc for build datestamp * tools/python3 c51db54...8a77e76 (1): > build: use utc for build datestamp [avi: commit own patches as this one requires excessive coordination across submodules, for something quite innocuous] Ref #9563 (doesn't really fix it, but helps a little)	2021-11-07 15:58:48 +02:00
Avi Kivity	1d4f6498c8	Update tools/python3 submodule for .orig cleanup * tools/python3 279aae1...c51db54 (1): > reloc: clean up '.orig' temporary directory before building deb package	2021-11-07 15:55:49 +02:00
Botond Dénes	6993a55ff3	test/cql-pytest: add virtual_tables.py Presence and column check for virtual tables. Where possible (and simple) more is checked.	2021-11-05 16:26:21 +02:00
Botond Dénes	18f9d329ed	test/cql-pytest: nodetool.py: add take_snapshot()	2021-11-05 16:26:01 +02:00
Botond Dénes	d51aa66a8a	db/system_keyspace: add versions table Contains all version related information (`nodetool version` and more). Example printout: (cqlsh) select * from system.versions; key \| build_id \| build_mode \| version -------+------------------------------------------+------------+------------------------------- local \| aaecce2f5068b0160efd04a09b0e28e100b9cd9e \| dev \| 4.6.dev-0.20211021.0d744fd3fa	2021-11-05 15:42:42 +02:00
Botond Dénes	5c87263ff8	configure.py: move release.cc and build_id.cc to scylla_core These two files were only added to the scylla executable and some specific unit tests. As we are about to use the symbols defined in these files in some scylla_core code move them there.	2021-11-05 15:42:42 +02:00
Botond Dénes	89cc016f07	db/system_keyspace: add runtime_info table Loosely contains the equivalent of the `nodetool info` command, with some notable differences: * Protocol server related information is in `system.protocol_servers`; * Information about memory, memtable and cache is reformatted to be tailored to scylla: C* specific terminology and metrics are dropped; * Information that doesn't change and is already in `system.local` is not contained; * Added trace-probability too (`nodetool gettraceprobability`); TODO(follow-up): exceptions.	2021-11-05 15:42:42 +02:00
Botond Dénes	78adda197f	db/system_keyspace: add protocol_servers table Lists all the client protocol servers and their status. Example output: (cqlsh) select * from system.protocol_servers; name \| is_running \| listen_addresses \| protocol \| protocol_version ------------------+------------+---------------------------------------+----------+------------------ native transport \| True \| ['127.0.0.1:9042', '127.0.0.1:19042'] \| cql \| 3.3.1 alternator \| False \| [] \| dynamodb \| rpc \| False \| [] \| thrift \| 20.1.0 redis \| False \| [] \| redis \| This prints the equivalent of `nodetool statusbinary` and the "Thrift active" and "Native Transport active" fields from the `nodetool info` output with some additional information: * It contains alternator and redis status; * It contains the protocol version; * It contains the listen addresses (if respective server is running);	2021-11-05 15:42:42 +02:00
Botond Dénes	3f56c49a9e	service: storage_service: s/client_shutdown_hooks/protocol_servers/ Replace the simple client shutdown hook registry mechanism with a more powerful registry of the protocol servers themselves. This allows enumerating the protocol servers at runtime, checking whether they are running or not and starting/stopping them.	2021-11-05 15:42:42 +02:00
Botond Dénes	e9c9a39c06	service: storage_service: remove unused unregister_client_shutdown_hook Nobody seems to unregister client shutdown hooks ever. We are about to refactor the client shutdown hook machinery so remove this unused code to make this easier.	2021-11-05 15:42:41 +02:00
Botond Dénes	f56f4ade22	redis: redis_service: implement the protocol_server interface In the process de-globalize redis service and pass dependencies in the constructor.	2021-11-05 15:42:41 +02:00
Botond Dénes	8ddfdd8aa9	alternator: controller: implement the protocol_server interface	2021-11-05 15:42:41 +02:00
Botond Dénes	134fa98ff4	transport: controller: implement the protocol_server interface	2021-11-05 15:42:41 +02:00
Botond Dénes	bda0d0ccba	thrift: controller: implement the protocol_server interface	2021-11-05 15:42:41 +02:00
Botond Dénes	3ff8ba9146	Add protocol_server interface We want to replace the current `storage_service::register_client_shutdown_hook()` machinery with something more powerful. We want to register all running client protocol servers with the storage service, allowing enumerating these at runtime, checking whether they are running or not and starting/stopping them. As the first step towards this, we introduce an abstract interface that we are going to implement at the controllers of the various protocol servers we have. Then we will switch storage service to collect pointers to this interface instead of simple stop functors.	2021-11-05 15:42:41 +02:00
Botond Dénes	64f658aea4	db/system_keyspace: add snapshots virtual table Lists the equivalent of the `nodetool listsnapshots` command.	2021-11-05 15:42:41 +02:00
Botond Dénes	f0281eaa98	db/virtual_table: remove _db member This member is potentially dangerous as it only becomes non-null some time after the virtual table object is constructed. This is asking for a nullptr dereference. Instead, remove this member and have virtual table implementations that need a db ask for it in the constructor; it is available in `register_virtual_tables()` now.	2021-11-05 15:42:41 +02:00
Botond Dénes	200e2fad4d	db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables() As some virtual tables will need the distributed versions of these.	2021-11-05 15:42:41 +02:00
Botond Dénes	185c5f1f5b	docs/design-notes/system_keyspace.md: add listing of existing virtual tables As well as a link to the newly added docs/guides/virtual-tables.md	2021-11-05 15:42:39 +02:00
Botond Dénes	b8c156d4f7	docs/guides: add virtual-tables.md Explaining what virtual tables are, what are good candidates for virtual tables and how you can write one.	2021-11-05 11:49:27 +02:00
Botond Dénes	ccf5c31776	treewide: system tables: don't use make_shared_schema() for creating schemas `make_shared_schema()` is a convenience method for creating a schema in a single function call, however it doesn't have all the advanced capabilities as `schema_builder`. So most users (which all happen to be system tables) pass the schema created by it to schema builder immediately to do some further tweaking, effectively building the schema twice. This is wasteful. This patch changes all these users to use the newly added `schema_builder()` constructor which has the same signature (and therefore ease-of-use) as `make_shared_schema()`.	2021-11-05 11:41:04 +02:00
Botond Dénes	4dea339e0c	schema_builder: add a constructor providing make_shared_schema semantics make_shared_schema() is often used to create a schema that is then passed to schema_builder to modify it further. This is wasteful as the schema is built just to be disassembled and rebuilt again. To replace this wasteful pattern we provide a schema_builder constructor that has the same signature as `make_shared_schema()`, allowing follow-up modifications on the schema before it is fully built.	2021-11-05 11:41:04 +02:00
Botond Dénes	476f49c693	schema_builder: without_column(): don't assume column_specification exists It is only created for column definitions in the schema constructor, so it will only exist for schema builders created from schema instances. It is not guaranteed that `without_column()` will only be called on such builder instances so ensure the implementation doesn't depend on it.	2021-11-05 11:41:04 +02:00
Botond Dénes	d3833c5978	schema: add static variant of column_name_type() So schema_builder can use it too (without a schema instance at hand).	2021-11-05 11:41:04 +02:00
Jan Ciolek	51a8a1f89b	cql3: Remove remaining mentions of term There were a few places where term was still mentioned. Removed/replaced term with expression. search_and_replace is still done only on LHS of binary_operator because the existing code would break otherwise. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:57:00 +01:00
Jan Ciolek	e458340821	cql3: Remove term term isn't used anywhere now. We can remove it and all classes that derive from it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	dcd3199037	cql3: Rename prepare_term to prepare_expression prepare_term now takes an expression and returns a prepared expression. It should be renamed to prepare_expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	219f1a4359	cql3: Make prepare_term return an expression instead of term prepare_term is now the only function that uses terms. Change it so that it returns expression instead of term and remove all occurrences of expr::to_expression(prepare_term(...)) Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	c84e941df9	cql3: expr: Add size check to evaluate_set In old code sets::delayed_value::bind() contained a check that each serialized value is less than certain size. I missed this when implementing evaluate(), so it's brought back to ensure identical behaviour. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	7bc65868eb	cql3: expr: Add expr::contains_bind_marker Add a function that checks whether there is a bind marker somewhere inside an expression. It's important to note, that even when there are no bind markers, there can be other things that prevent immediate evaluation of an expression. For example an expression can contain calls to nonpure functions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	080286cb96	cql3: expr: Rename find_atom to find_binop Soon there will be other functions that also search in expression, find_atom would be confusing then. find_binop is a more descriptive name. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	7cabed9ebf	cql3: expr: Add find_in_expression find_in_expression is a function that looks into the expression and finds the given expression variant for which the predicate function returns true. If nothing is found returns nullptr. For example: find_in_expression<binary_operator>(e, [](const binary_operator&) {return true;}) Will return the first binary operator found in the expression. It is now used in find_atom, and soon will be used in other similar functions. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	890c8f4026	cql3: Remove term in operations Replace term with expression in cql3/operation and its children. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	3b4dc39eb8	cql3: Remove term in relations Replace uses of term with expression in cql3/*relation.hh Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	7f2ecf1aa2	cql3: Remove term in multi_column_restrictions Replace all uses of term with expression in cql3/multi_column_restrictions Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	a59683b929	cql3: Remove term in term_slice, rename to bounds_slice term_slice is an interval from one term to the other. [term1, term2] Replaced terms with expressions. Because the name has 'term' in it, it was changed to bounds_slice. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:45 +01:00
Jan Ciolek	e37906ae34	cql3: expr: Remove term in expression Some struct inside the expression variant still contained term. Replace those terms with expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-11-04 15:56:44 +01:00
Gleb Natapov	bdf7d1a411	raft: correctly truncate the log in a persistence module during snapshot application When remote snapshot is applied the log is completely cleared because snapshot transfer happens only when common log prefix cannot be found, so we cannot be sure that existing entries in the log are correct. But currently it only happens for in memory log by calling apply_snapshot with trailing set to zero, but when persistence module is called to store the snapshot _config.snapshot_trailing is used which can be non zero. This may cause the log to contain incorrect entries after restart. The patch fixes this by using zero trailing for non local snapshots. Fixes #9551	2021-11-04 15:11:19 +02:00
Jan Ciolek	fd1596171e	cql3: expr: Add evaluate_IN_list(expression, options) evaluate_IN_list was only defined for a term, but now we are removing term so it should be also defined for an expression. The internal code is the same - this function used to convert the term to expression and then did all operations on expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 20:55:09 +02:00
Jan Ciolek	805ba145d7	cql3: Remove term in column_condition Replace all uses of term with expression in cql3/column_condition Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 20:55:09 +02:00
Jan Ciolek	a24d06c195	cql3: Remove term in select_statement Replace all uses of term with expression in cql3/statements/select_statement Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 20:55:09 +02:00
Jan Ciolek	d36847801b	cql3: Remove term in update_statement Replace all uses of term with expression in cql3/statements/update_statement There was some trouble with extracting values from json. The original code worked this way on a map example: > There is a json string to parse: {'b': 1, 'a': 2, 'b': 3} > The code parses the json and creates bytes where this map is serialized but without removing duplicates, sorting etc. > Then a maps::delayed_value is created from these bytes. During creation map elements are extracted, sorted and duplicates are removed. This map value is then used in setter Now when maps::delayed_value is changed to expr::constant the step where elements are sorted is lost. Because of this we need to do this earlier, the best place is during original json parsing. Additionally I suspect that removing duplicated elements used to work only on the first level, in case of map of maps it wouldn't work. Now it will work no matter how many layers of maps there are. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 20:55:03 +02:00
Jan Ciolek	ba202cd8bd	cql3: Use internal cql format in insert_prepared_json_statement cache expr::constant is always serialized using the internal cql serialization format, but currently the code keeps values in the cache in other format. As preparation for moving from term to expression change so that values kept in the cache are serialized using the internal format. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 20:42:43 +02:00
Jan Ciolek	e5391f1eed	types: Add map_type_impl::serialize(range of <bytes, bytes>) Adds two functions that take a range over pairs of serialized values and return a serialized map value. There are 2 functions - one operating on bytes and one operating on managed_bytes. The version with managed_bytes is used in expression.cc, used to be a local static function. The bytes version will be used in type_json.cc in the next commit. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 15:14:52 +02:00
Jan Ciolek	1502abaca1	cql3: Remove term in cql3/attributes Replace all uses of term with expression in cql3/attributes Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 15:14:52 +02:00
Jan Ciolek	a82351dc79	cql3: expr: Add constant::view() method Add a method that returns raw_value_view to expr::constant. It's added for convenience - without it in many places we would have to write my_value.value.to_view(). Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 15:14:52 +02:00
Jan Ciolek	c2eb3a58b8	cql3: expr: Implement fill_prepare_context(expression) Adds a new function - expr::fill_prepare_context. This function has the same functionality as term::fill_prepare_context, which will be removed soon. fill_prepare_context used to take its argument with a const qualifier, but it turns out that the argume> It sets the cache ids of function calls corresponding to partition key restrictions. New function doesn't have const to make this clear and avoid surprises. Added expr::visit that takes an argument without const qualifier. There were some problems with cache_ids in function_call. prepare_context used to collect ::shared_ptr<functions::function_call> of some function call, and then this allowed it to clear cache ids of all involved functions on demand. To replicate this prepare_context now collects shared pointers to expr::function_call cache ids. It currently collects both, but functions::function_call will be removed soon. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 15:14:52 +02:00
Jan Ciolek	edaa3b5dc2	cql3: expr: add expr::visit that takes a mutable expression Currently expr::visit can only take a const expression as an argument. For cases where we want to visit the expression and modify it a new function is needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 15:14:52 +02:00
Jan Ciolek	9c40516071	cql3: expr: Add receiver to expr::bind_variable bind_variable used to have only the type of bound value. Now this type is replaced with receiver, which describes information about column corresponding to this value. A receiver contains type, column name, etc. Receiver is needed in order to implement fill_prepare_context in the next commit. It's an argument of prepare_context::add_variable_specification. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-10-28 15:14:52 +02:00
Calle Wilund	dcc73c5d4e	commitlog: Recalculate footprint on delete_segment exceptions Fixes #9348 If we get exceptions in delete_segments, we can, and probably will, lose track of footprint counters. We need to recompute the used disk footprint, otherwise we will flush too often, and even block indefinitely on new_seg iff using hard limits.	2021-09-15 11:53:03 +00:00
Calle Wilund	8a326638af	commitlog_test: Add test for exception in alloc w. deleted underlying file Tests that we can handle exception-in-alloc cleanup if the file actually does not exist. This however uncovers another weakness (addressed in next patch) - that we can lose track of disk footprint here, and w. hard limits end up waiting for disk space that never comes. Thus test does not use hard limit.	2021-09-15 11:51:05 +00:00
Calle Wilund	21152a2f5a	commitlog: Ensure failed-to-create-segment is re-deleted Fixes #9343 If we fail in allocate_segment_ex, we should push the file opened/created to the delete set to ensure we reclaim the disk space. We should also ensure that if we did not recycle a file in delete_segments, we still wake up any recycle waiters iff we made a file delete instead. Included a small unit test.	2021-09-15 11:40:34 +00:00
Calle Wilund	f3a9f361b9	commitlog::allocate_segment_ex: Don't re-throw out of function Fixes #9342 commitlog_error_handler rethrows. But we want to not. And run post-handler cleanup (co_await)	2021-09-15 11:40:34 +00:00
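
The find_in_expression commit above (7cabed9ebf) describes a search that walks an expression and returns the first node of a given variant type for which a predicate holds, or nullptr if none is found. The following is a minimal sketch of that idea only. The types constant, column_value, binary_operator, conjunction and expression, and the helpers make_binary and example_tree, are hypothetical stand-ins invented for illustration; they are not the real cql3::expr definitions, and the real function may differ in signature and traversal order.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <variant>
#include <vector>

// Toy expression tree. These types only mimic the general shape of an
// expression variant; they are hypothetical, not Scylla's definitions.
struct expression;

struct constant { int value; };
struct column_value { int column_id; };
struct binary_operator {
    std::shared_ptr<expression> lhs;
    std::shared_ptr<expression> rhs;
    char op;
};
struct conjunction {
    std::vector<expression> children;
};

struct expression {
    std::variant<constant, column_value, binary_operator, conjunction> v;
};

// Depth-first, pre-order search: returns a pointer to the first node of
// type T for which pred returns true, or nullptr if there is none.
template <typename T, typename Pred>
const T* find_in_expression(const expression& e, Pred pred) {
    if (const T* hit = std::get_if<T>(&e.v); hit && pred(*hit)) {
        return hit;
    }
    if (const auto* op = std::get_if<binary_operator>(&e.v)) {
        // Assumption of this sketch: both operands are always present.
        assert(op->lhs && op->rhs);
        if (const T* hit = find_in_expression<T>(*op->lhs, pred)) {
            return hit;
        }
        return find_in_expression<T>(*op->rhs, pred);
    }
    if (const auto* conj = std::get_if<conjunction>(&e.v)) {
        for (const expression& child : conj->children) {
            if (const T* hit = find_in_expression<T>(child, pred)) {
                return hit;
            }
        }
    }
    return nullptr;
}

// Hypothetical helper that builds "lhs op rhs".
expression make_binary(expression lhs, char op, expression rhs) {
    return expression{binary_operator{
        std::make_shared<expression>(std::move(lhs)),
        std::make_shared<expression>(std::move(rhs)),
        op}};
}

// Hypothetical sample tree: column 1 AND (column 2 = 5).
expression example_tree() {
    return expression{conjunction{{
        expression{column_value{1}},
        make_binary(expression{column_value{2}}, '=', expression{constant{5}}),
    }}};
}
```

With this sketch, find_in_expression<binary_operator>(example_tree(), [](const binary_operator&) { return true; }) mirrors the example in the commit message: it yields the first binary operator in the tree. The returned pointer refers into the tree that was searched, so that tree has to outlive any use of the result.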

3132 changed files with 211878 additions and 95905 deletions

1

.gitattributes vendored

@@ -1,2 +1,3 @@
 *.cc diff=cpp
 *.hh diff=cpp
 *.svg binary

42

.github/CODEOWNERS vendored

@@ -2,17 +2,17 @@
 auth/* @elcallio @vladzcloudius
 # CACHE
 row_cache* @tgrabiec @haaawk
 *mutation* @tgrabiec @haaawk
 test/boost/mvcc* @tgrabiec @haaawk
 row_cache* @tgrabiec
 *mutation* @tgrabiec
 test/boost/mvcc* @tgrabiec
 # CDC
 cdc/* @haaawk @kbr- @elcallio @piodul @jul-stas
 test/cql/cdc_* @haaawk @kbr- @elcallio @piodul @jul-stas
 test/boost/cdc_* @haaawk @kbr- @elcallio @piodul @jul-stas
 cdc/* @kbr- @elcallio @piodul @jul-stas
 test/cql/cdc_* @kbr- @elcallio @piodul @jul-stas
 test/boost/cdc_* @kbr- @elcallio @piodul @jul-stas
 # COMMITLOG / BATCHLOG
 db/commitlog/* @elcallio
 db/commitlog/* @elcallio @eliransin
 db/batch* @elcallio
 # COORDINATOR
@@ -25,11 +25,15 @@ compaction/* @raphaelsc @nyh
 transport/*
 # CQL QUERY LANGUAGE
 cql3/* @tgrabiec @psarna @cvybhu
 cql3/* @tgrabiec @cvybhu @nyh
 # COUNTERS
 counters* @haaawk @jul-stas
 tests/counter_test* @haaawk @jul-stas
 counters* @jul-stas
 tests/counter_test* @jul-stas
 # DOCS
 docs/* @annastuchlik @tzach
 docs/alternator @annastuchlik @tzach @nyh @havaker @nuivall
 # GOSSIP
 gms/* @tgrabiec @asias
@@ -41,9 +45,9 @@ dist/docker/*
 utils/logalloc* @tgrabiec
 # MATERIALIZED VIEWS
 db/view/* @nyh @psarna
 cql3/statements/*view* @nyh @psarna
 test/boost/view_* @nyh @psarna
 db/view/* @nyh @cvybhu @piodul
 cql3/statements/*view* @nyh @cvybhu @piodul
 test/boost/view_* @nyh @cvybhu @piodul
 # PACKAGING
 dist/* @syuu1228
@@ -58,9 +62,9 @@ service/migration* @tgrabiec @nyh
 schema* @tgrabiec @nyh
 # SECONDARY INDEXES
 db/index/* @nyh @psarna
 cql3/statements/*index* @nyh @psarna
 test/boost/*index* @nyh @psarna
 index/* @nyh @cvybhu @piodul
 cql3/statements/*index* @nyh @cvybhu @piodul
 test/boost/*index* @nyh @cvybhu @piodul
 # SSTABLES
 sstables/* @tgrabiec @raphaelsc @nyh
@@ -70,11 +74,11 @@ streaming/* @tgrabiec @asias
 service/storage_service.* @tgrabiec @asias
 # ALTERNATOR
 alternator/* @nyh @psarna
 test/alternator/* @nyh @psarna
 alternator/* @nyh @havaker @nuivall
 test/alternator/* @nyh @havaker @nuivall
 # HINTED HANDOFF
 db/hints/* @haaawk @piodul @vladzcloudius
 db/hints/* @piodul @vladzcloudius @eliransin
 # REDIS
 redis/* @nyh @syuu1228

									
										17

.github/workflows/docs-amplify-enhanced.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,17 @@

				name: "Docs / Amplify enhanced"

				on: issue_comment

				jobs:

				  build:

				    runs-on: ubuntu-latest

				    if: ${{ github.event.issue.pull_request }}

				    steps:

				      - name: Checkout

				        uses: actions/checkout@v3

				        with:

				          fetch-depth: 0

				      - name: Amplify enhanced

				        env:

				          TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        uses: scylladb/sphinx-scylladb-theme/.github/actions/amplify-enhanced@master

									
										35

.github/workflows/docs-pages.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,35 @@

				name: "Docs / Publish"

				# For more information,

				# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows

				on:

				  push:

				    branches:

				      - master

				    paths:

				      - "docs/**"

				  workflow_dispatch:

				jobs:

				  release:

				    runs-on: ubuntu-latest

				    steps:

				      - name: Checkout

				        uses: actions/checkout@v3

				        with:

				          persist-credentials: false

				          fetch-depth: 0

				      - name: Set up Python

				        uses: actions/setup-python@v3

				        with:

				          python-version: 3.7

				      - name: Set up env

				        run: make -C docs setupenv

				      - name: Build docs

				        run: make -C docs multiversion

				      - name: Build redirects

				        run: make -C docs redirects

				      - name: Deploy docs to GitHub Pages

				        run: ./docs/_utils/deploy.sh

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

									
										29

.github/workflows/docs-pages@v2.yaml
									
										vendored
									
				@@ -1,29 +0,0 @@

				name: "Docs / Publish"

				on:

				  push:

				    branches:

				    - master

				    paths:

				    - "docs/**"

				  workflow_dispatch:

				jobs:

				  release:

				    runs-on: ubuntu-latest

				    steps:

				    - name: Checkout

				      uses: actions/checkout@v2

				      with:

				        persist-credentials: false

				        fetch-depth: 0

				    - name: Set up Python

				      uses: actions/setup-python@v1

				      with:

				        python-version: 3.7

				    - name: Build docs

				      run: make -C docs multiversion

				    - name: Deploy

				      run: ./docs/_utils/deploy.sh

				      env:

				        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

									
										28

.github/workflows/docs-pr.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,28 @@

				name: "Docs / Build PR"

				# For more information,

				# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows

				on:

				  pull_request:

				    branches:

				      - master

				    paths:

				      - "docs/**"

				jobs:

				  build:

				    runs-on: ubuntu-latest

				    steps:

				      - name: Checkout

				        uses: actions/checkout@v3

				        with:

				          persist-credentials: false

				          fetch-depth: 0

				      - name: Set up Python

				        uses: actions/setup-python@v3

				        with:

				          python-version: 3.7

				      - name: Set up env

				        run: make -C docs setupenv

				      - name: Build docs

				        run: make -C docs test

									
										25

.github/workflows/docs-pr@v1.yaml
									
										vendored
									
				@@ -1,25 +0,0 @@

				name: "Docs / Build PR"

				on:

				  pull_request:

				    branches:

				    - master

				    paths:

				    - "docs/**"

				jobs:

				  build:

				    name: Build

				    runs-on: ubuntu-latest

				    steps:

				    - name: Checkout

				      uses: actions/checkout@v2

				      with:

				        persist-credentials: false

				        fetch-depth: 0

				    - name: Set up Python

				      uses: actions/setup-python@v1

				      with:

				        python-version: 3.7

				    - name: Build docs

				      run: make -C docs test

4

.gitignore vendored

@@ -22,6 +22,7 @@ resources
 .pytest_cache
 /expressions.tokens
 tags
 !db/tags/
 testlog
 test/*/*.reject
 .vscode
@@ -29,3 +30,6 @@ docs/_build
 docs/poetry.lock
 compile_commands.json
 .ccls-cache/
 .mypy_cache
 .envrc
 rust/Cargo.lock

6

.gitmodules vendored

@@ -6,12 +6,6 @@
 	path = swagger-ui
 	url = ../scylla-swagger-ui
 	ignore = dirty
 [submodule "libdeflate"]
 	path = libdeflate
 	url = ../libdeflate
 [submodule "abseil"]
 	path = abseil
 	url = ../abseil-cpp
 [submodule "scylla-jmx"]
 	path = tools/jmx
 	url = ../scylla-jmx

3

.mailmap Normal file

@@ -0,0 +1,3 @@
 Avi Kivity <avi@scylladb.com> Avi Kivity' via ScyllaDB development <scylladb-dev@googlegroups.com>
 Raphael S. Carvalho <raphaelsc@scylladb.com> Raphael S. Carvalho' via ScyllaDB development <scylladb-dev@googlegroups.com>
 Pavel Emelyanov <xemul@scylladb.com> Pavel Emelyanov' via ScyllaDB development <scylladb-dev@googlegroups.com>

									
										51

CMakeLists.txt
									
				@@ -42,22 +42,13 @@ set(Seastar_CXX_FLAGS ${cxx_coro_flag} ${target_arch_flag} CACHE INTERNAL "" FOR

				set(Seastar_CXX_DIALECT gnu++20 CACHE INTERNAL "" FORCE)

				add_subdirectory(seastar)

				add_subdirectory(abseil)

				# Exclude absl::strerror from the default "all" target since it's not

				# used in Scylla build and, moreover, makes use of deprecated glibc APIs,

				# such as sys_nerr, which are not exposed from "stdio.h" since glibc 2.32,

				# which happens to be the case for recent Fedora distribution versions.

				#

				# Need to use the internal "absl_strerror" target name instead of namespaced

				# variant because `set_target_properties` does not understand the latter form,

				# unfortunately.

				set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)

				# System libraries dependencies

				find_package(Boost COMPONENTS filesystem program_options system thread regex REQUIRED)

				find_package(Lua REQUIRED)

				find_package(ZLIB REQUIRED)

				find_package(ICU COMPONENTS uc REQUIRED)

				find_package(Abseil REQUIRED)

				set(scylla_build_dir "${CMAKE_BINARY_DIR}/build/${BUILD_TYPE}")

				set(scylla_gen_build_dir "${scylla_build_dir}/gen")

				@@ -189,6 +180,8 @@ set(swagger_files

				    api/api-doc/storage_service.json

				    api/api-doc/stream_manager.json

				    api/api-doc/system.json

				    api/api-doc/task_manager.json

				    api/api-doc/task_manager_test.json

				    api/api-doc/utils.json)

				set(swagger_gen_files)

				@@ -301,6 +294,8 @@ set(scylla_sources

				    api/storage_service.cc

				    api/stream_manager.cc

				    api/system.cc

				    api/task_manager.cc

				    api/task_manager_test.cc

				    atomic_cell.cc

				    auth/allow_all_authenticator.cc

				    auth/allow_all_authorizer.cc

				@@ -337,7 +332,6 @@ set(scylla_sources

				    compaction/size_tiered_compaction_strategy.cc

				    compaction/time_window_compaction_strategy.cc

				    compress.cc

				    connection_notifier.cc

				    converting_mutation_partition_applier.cc

				    counters.cc

				    cql3/abstract_marker.cc

				@@ -349,7 +343,8 @@ set(scylla_sources

				    cql3/constants.cc

				    cql3/cql3_type.cc

				    cql3/expr/expression.cc

				    cql3/expr/term_expr.cc

				    cql3/expr/prepare_expr.cc

				    cql3/expr/restrictions.cc

				    cql3/functions/aggregate_fcts.cc

				    cql3/functions/castas_fcts.cc

				    cql3/functions/error_injection_fcts.cc

				@@ -363,7 +358,6 @@ set(scylla_sources

				    cql3/prepare_context.cc

				    cql3/query_options.cc

				    cql3/query_processor.cc

				    cql3/relation.cc

				    cql3/restrictions/statement_restrictions.cc

				    cql3/result_set.cc

				    cql3/role_name.cc

				@@ -374,7 +368,6 @@ set(scylla_sources

				    cql3/selection/selector_factories.cc

				    cql3/selection/simple_selector.cc

				    cql3/sets.cc

				    cql3/single_column_relation.cc

				    cql3/statements/alter_keyspace_statement.cc

				    cql3/statements/alter_service_level_statement.cc

				    cql3/statements/alter_table_statement.cc

				@@ -426,9 +419,9 @@ set(scylla_sources

				    cql3/statements/sl_prop_defs.cc

				    cql3/statements/truncate_statement.cc

				    cql3/statements/update_statement.cc

				    cql3/statements/strongly_consistent_modification_statement.cc

				    cql3/statements/strongly_consistent_select_statement.cc

				    cql3/statements/use_statement.cc

				    cql3/token_relation.cc

				    cql3/tuples.cc

				    cql3/type_json.cc

				    cql3/untyped_result_set.cc

				    cql3/update_parameters.cc

				@@ -436,7 +429,7 @@ set(scylla_sources

				    cql3/util.cc

				    cql3/ut_name.cc

				    cql3/values.cc

				    database.cc

				    data_dictionary/data_dictionary.cc

				    db/batchlog_manager.cc

				    db/commitlog/commitlog.cc

				    db/commitlog/commitlog_entry.cc

				@@ -454,6 +447,7 @@ set(scylla_sources

				    db/large_data_handler.cc

				    db/legacy_schema_migrator.cc

				    db/marshal/type_parser.cc

				    db/rate_limiter.cc

				    db/schema_tables.cc

				    db/size_estimates_virtual_reader.cc

				    db/snapshot-ctl.cc

				@@ -469,10 +463,10 @@ set(scylla_sources

				    dht/murmur3_partitioner.cc

				    dht/range_streamer.cc

				    dht/token.cc

				    distributed_loader.cc

				    replica/distributed_loader.cc

				    duration.cc

				    exceptions/exceptions.cc

				    flat_mutation_reader.cc

				    readers/mutation_readers.cc

				    frozen_mutation.cc

				    frozen_schema.cc

				    generic_server.cc

				@@ -492,7 +486,7 @@ set(scylla_sources

				    index/secondary_index_manager.cc

				    init.cc

				    keys.cc

				    lister.cc

				    utils/lister.cc

				    locator/abstract_replication_strategy.cc

				    locator/azure_snitch.cc

				    locator/ec2_multi_region_snitch.cc

				@@ -510,7 +504,7 @@ set(scylla_sources

				    locator/token_metadata.cc

				    lang/lua.cc

				    main.cc

				    memtable.cc

				    replica/memtable.cc

				    message/messaging_service.cc

				    multishard_mutation_query.cc

				    mutation.cc

				@@ -519,7 +513,7 @@ set(scylla_sources

				    mutation_partition_serializer.cc

				    mutation_partition_view.cc

				    mutation_query.cc

				    mutation_reader.cc

				    readers/mutation_reader.cc

				    mutation_writer/feed_writers.cc

				    mutation_writer/multishard_writer.cc

				    mutation_writer/partition_based_splitting_writer.cc

				@@ -529,14 +523,18 @@ set(scylla_sources

				    partition_version.cc

				    querier.cc

				    query.cc

				    query_ranges_to_vnodes.cc

				    query-result-set.cc

				    raft/fsm.cc

				    raft/log.cc

				    raft/raft.cc

				    raft/server.cc

				    raft/tracker.cc

				    service/broadcast_tables/experimental/lang.cc

				    range_tombstone.cc

				    range_tombstone_list.cc

				    tombstone_gc_options.cc

				    tombstone_gc.cc

				    reader_concurrency_semaphore.cc

				    redis/abstract_command.cc

				    redis/command_factory.cc

				@@ -553,12 +551,15 @@ set(scylla_sources

				    release.cc

				    repair/repair.cc

				    repair/row_level.cc

				    replica/database.cc

				    replica/table.cc

				    row_cache.cc

				    schema.cc

				    schema_mutations.cc

				    schema_registry.cc

				    serializer.cc

				    service/client_state.cc

				    service/forward_service.cc

				    service/migration_manager.cc

				    service/misc_services.cc

				    service/pager/paging_state.cc

				@@ -571,11 +572,10 @@ set(scylla_sources

				    service/qos/qos_common.cc

				    service/qos/service_level_controller.cc

				    service/qos/standard_service_level_distributed_data_accessor.cc

				    service/raft/raft_gossip_failure_detector.cc

				    service/raft/raft_group_registry.cc

				    service/raft/raft_rpc.cc

				    service/raft/raft_sys_table_storage.cc

				    service/raft/schema_raft_state_machine.cc

				    service/raft/group0_state_machine.cc

				    service/storage_proxy.cc

				    service/storage_service.cc

				    sstables/compress.cc

				@@ -609,8 +609,8 @@ set(scylla_sources

				    streaming/stream_summary.cc

				    streaming/stream_task.cc

				    streaming/stream_transfer_task.cc

				    table.cc

				    table_helper.cc

				    tasks/task_manager.cc

				    thrift/controller.cc

				    thrift/handler.cc

				    thrift/server.cc

				@@ -737,7 +737,6 @@ target_compile_definitions(scylla PRIVATE XXH_PRIVATE_API HAVE_LZ4_COMPRESS_DEFA

				target_include_directories(scylla PRIVATE

				    "${CMAKE_CURRENT_SOURCE_DIR}"

				    libdeflate

				    abseil

				    "${scylla_gen_build_dir}")

				###

									
										2

CONTRIBUTING.md
									
				@@ -18,3 +18,5 @@ If you need help formatting or sending patches, [check out these instructions](h

				The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.

				Header files in Scylla must be self-contained, i.e., each can be included without having to include specific other headers first. To verify that your change did not break this property, run `ninja dev-headers`. If you added or removed header files, you must `touch configure.py` first - this will cause `configure.py` to be automatically re-run to generate a fresh list of header files.

				For more criteria on what reviewers consider good code, see the [review checklist](https://github.com/scylladb/scylla/blob/master/docs/dev/review-checklist.md).

									
										36

HACKING.md
									
				@@ -383,6 +383,40 @@ Open the link printed at the end. Be horrified. Go and write more tests.

				For more details see `./scripts/coverage.py --help`.

				### Resolving stack backtraces

				Scylla may print stack backtraces to the log for several reasons.

				For example:

				- When aborting (e.g. due to assertion failure, internal error, or segfault)

				- When detecting seastar reactor stalls (where a seastar task runs for a long time without yielding the cpu to other tasks on that shard)

				The backtraces contain code pointers so they are not very helpful without resolving into code locations.

				To resolve the backtraces, one needs the scylla relocatable package that contains the scylla binary (with debug information),

				as well as the dynamic libraries it is linked against.

				Builds from our automated build system are uploaded to the cloud

				and can be searched on http://backtrace.scylladb.com/

				Make sure you have the scylla server exact `build-id` to locate

				its respective relocatable package, required for decoding backtraces it prints.

				The build-id is printed to the system log when scylla starts.

				It can also be found by executing `scylla --build-id`, or

				by using the `file` utility, for example:

				```

				$ scylla --build-id

				4cba12e6eb290a406bfa4930918db23941fd4be3

				$ file scylla

				scylla: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=4cba12e6eb290a406bfa4930918db23941fd4be3, with debug_info, not stripped, too many notes (256)

				```

				To find the build-id of a coredump, use the `eu-unstrip` utility as follows:

				```

				$ eu-unstrip -n --core <coredump> | awk '/scylla$/ { s=$2; sub(/@.*$/, "", s); print s; exit(0); }'

				4cba12e6eb290a406bfa4930918db23941fd4be3

				```

				### Core dump debugging

				See [debugging.md](debugging.md).

				See [debugging.md](docs/dev/debugging.md).

									
										6

README.md
									
				@@ -42,7 +42,7 @@ For further information, please see:

				* [Docker image build documentation] for information on how to build Docker images.

				[developer documentation]: HACKING.md

				[build documentation]: docs/guides/building.md

				[build documentation]: docs/dev/building.md

				[docker image build documentation]: dist/docker/debian/README.md

				## Running Scylla

				@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help

				## Testing

				See [test.py manual](docs/guides/testing.md).

				See [test.py manual](docs/dev/testing.md).

				## Scylla APIs and compatibility

				By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and

				@@ -78,7 +78,7 @@ and the current compatibility of this feature as well as Scylla-specific extensi

				## Documentation

				Documentation can be found [here](https://scylla.docs.scylladb.com).

				Documentation can be found [here](docs/dev/README.md).

				Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).

				User documentation can be found [here](https://docs.scylladb.com/).

39

SCYLLA-VERSION-GEN

@@ -1,11 +1,12 @@
 #!/bin/sh
 USAGE=$(cat <<-END
 Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] -- generate Scylla version and build information files.
 Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] [--date-stamp DATE] -- generate Scylla version and build information files.
 Options:
   -h|--help show this help message.
   -o|--output-dir PATH specify destination path at which the version files are to be created.
   -d|--date-stamp DATE manually set date for release parameter
 By default, the script will attempt to parse 'version' file
 in the current directory, which should contain a string of
@@ -31,7 +32,9 @@ using '-o PATH' option.
 END
 )
 while [[ $# -gt 0 ]]; do
 DATE=""
 while [ $# -gt 0 ]; do
 	opt="$1"
 	case $opt in
 		-h|--help)
@@ -43,6 +46,11 @@ while [[ $# -gt 0 ]]; do
 			shift
 			shift
 			;;
 		--date-stamp)
 			DATE="$2"
 			shift
 			shift
 			;;
 		*)
 			echo "Unexpected argument found: $1"
 			echo
@@ -58,24 +66,33 @@ if [ -z "$OUTPUT_DIR" ]; then
 	OUTPUT_DIR="$SCRIPT_DIR/build"
 fi
 if [ -z "$DATE" ]; then
   DATE=$(date --utc +%Y%m%d)
 fi
 # Default scylla product/version tags
 PRODUCT=scylla
 VERSION=4.6.11
 VERSION=5.2.19
 if test -f version
 then
 	SCYLLA_VERSION=$(cat version | awk -F'-' '{print $1}')
 	SCYLLA_RELEASE=$(cat version | awk -F'-' '{print $2}')
 else
 	DATE=$(date +%Y%m%d)
 	GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1)
 	SCYLLA_VERSION=$VERSION
 	# For custom package builds, replace "0" with "counter.your_name",
 	# where counter starts at 1 and increments for successive versions.
 	# This ensures that the package manager will select your custom
 	# package over the standard release.
 	SCYLLA_BUILD=0
 	SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
 	if [ -z "$SCYLLA_RELEASE" ]; then
 		DATE=$(date --utc +%Y%m%d)
 		GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1 --abbrev=12)
 		# For custom package builds, replace "0" with "counter.your_name",
 		# where counter starts at 1 and increments for successive versions.
 		# This ensures that the package manager will select your custom
 		# package over the standard release.
 		SCYLLA_BUILD=0
 		SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
 	elif [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
 		echo "setting SCYLLA_RELEASE only makes sense in clean builds" 1>&2
 		exit 1
 	fi
 fi
 if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then

1

abseil

Submodule abseil deleted from f70eadadd7

									
										15

absl-flat_hash_map.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "absl-flat_hash_map.hh"

									
										15

absl-flat_hash_map.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										29

alternator/auth.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "alternator/error.hh"

				@@ -34,7 +21,6 @@

				#include "service/storage_proxy.hh"

				#include "alternator/executor.hh"

				#include "cql3/selection/selection.hh"

				#include "database.hh"

				#include "query-result-set.hh"

				#include "cql3/result_set.hh"

				#include <seastar/core/coroutine.hh>

				@@ -137,33 +123,34 @@ std::string get_signature(std::string_view access_key_id, std::string_view secre

				}

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {

				    schema_ptr schema = proxy.get_db().local().find_schema("system_auth", "roles");

				    schema_ptr schema = proxy.data_dictionary().find_schema("system_auth", "roles");

				    partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));

				    dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};

				    std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};

				    const column_definition* salted_hash_col = schema->get_column_definition(bytes("salted_hash"));

				    if (!salted_hash_col) {

				        co_return coroutine::make_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));

				        co_await coroutine::return_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));

				    }

				    auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});

				    auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());

				    auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));

				    auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,

				            proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));

				    auto cl = auth::password_authenticator::consistency_for_user(username);

				    service::client_state client_state{service::client_state::internal_tag()};

				    service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,

				            service::storage_proxy::coordinator_query_options(executor::default_timeout(), empty_service_permit(), client_state));

				    cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());

				    cql3::selection::result_set_builder builder(*selection, gc_clock::now());

				    query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));

				    auto result_set = builder.build();

				    if (result_set->empty()) {

				        co_return coroutine::make_exception(api_error::unrecognized_client(format("User not found: {}", username)));

				        co_await coroutine::return_exception(api_error::unrecognized_client(format("User not found: {}", username)));

				    }

				    const bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column

				    if (!salted_hash) {

				        co_return coroutine::make_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));

				        co_await coroutine::return_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));

				    }

				    co_return value_cast<sstring>(utf8_type->deserialize(*salted_hash));

				}

									
										17

alternator/auth.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -35,7 +22,7 @@ namespace alternator {

				using hmac_sha256_digest = std::array<char, 32>;

				using key_cache = utils::loading_cache<std::string, std::string>;

				using key_cache = utils::loading_cache<std::string, std::string, 1>;

				std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,

				        std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,

									
										15

alternator/conditions.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <list>

									
										15

alternator/conditions.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				/*

									
										79

alternator/controller.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <seastar/net/dns.hh>

				@@ -27,6 +14,8 @@

				#include "db/config.hh"

				#include "cdc/generation_service.hh"

				#include "service/memory_limiter.hh"

				#include "auth/service.hh"

				#include "service/qos/service_level_controller.hh"

				using namespace seastar;

				@@ -41,6 +30,8 @@ controller::controller(

				        sharded<db::system_distributed_keyspace>& sys_dist_ks,

				        sharded<cdc::generation_service>& cdc_gen_svc,

				        sharded<service::memory_limiter>& memory_limiter,

				        sharded<auth::service>& auth_service,

				        sharded<qos::service_level_controller>& sl_controller,

				        const db::config& config)

				    : _gossiper(gossiper)

				    , _proxy(proxy)

				@@ -48,12 +39,32 @@ controller::controller(

				    , _sys_dist_ks(sys_dist_ks)

				    , _cdc_gen_svc(cdc_gen_svc)

				    , _memory_limiter(memory_limiter)

				    , _auth_service(auth_service)

				    , _sl_controller(sl_controller)

				    , _config(config)

				{

				}

				future<> controller::start() {

				sstring controller::name() const {

				    return "alternator";

				}

				sstring controller::protocol() const {

				    return "dynamodb";

				}

				sstring controller::protocol_version() const {

				    return version;

				}

				std::vector<socket_address> controller::listen_addresses() const {

				    return _listen_addresses;

				}

				future<> controller::start_server() {

				    return seastar::async([this] {

				        _listen_addresses.clear();

				        auto preferred = _config.listen_interface_prefer_ipv6() ? std::make_optional(net::inet_address::family::INET6) : std::nullopt;

				        auto family = _config.enable_ipv6_dns_lookup() || preferred ? std::nullopt : std::make_optional(net::inet_address::family::INET);

				@@ -67,25 +78,27 @@ future<> controller::start() {

				        rmw_operation::set_default_write_isolation(_config.alternator_write_isolation());

				        executor::set_default_timeout(std::chrono::milliseconds(_config.alternator_timeout_in_ms()));

				        net::inet_address addr;

				        try {

				            addr = net::dns::get_host_by_name(_config.alternator_address(), family).get0().addr_list.front();

				        } catch (...) {

				            std::throw_with_nested(std::runtime_error(fmt::format("Unable to resolve alternator_address {}", _config.alternator_address())));

				        }

				        net::inet_address addr = utils::resolve(_config.alternator_address, family).get0();

				        auto get_cdc_metadata = [] (cdc::generation_service& svc) { return std::ref(svc.get_cdc_metadata()); };

				        _executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();

				        _server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper)).get();

				        _server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();

				        // Note: from this point on, if start_server() throws for any reason,

				        // it must first call stop_server() to stop the executor and server

				        // services we just started - or Scylla will cause an assertion

				        // failure when the controller object is destroyed in the exception

				        // unwinding.

				        std::optional<uint16_t> alternator_port;

				        if (_config.alternator_port()) {

				            alternator_port = _config.alternator_port();

				            _listen_addresses.push_back({addr, *alternator_port});

				        }

				        std::optional<uint16_t> alternator_https_port;

				        std::optional<tls::credentials_builder> creds;

				        if (_config.alternator_https_port()) {

				            alternator_https_port = _config.alternator_https_port();

				            _listen_addresses.push_back({addr, *alternator_https_port});

				            creds.emplace();

				            auto opts = _config.alternator_encryption_options();

				            if (opts.empty()) {

				@@ -102,7 +115,13 @@ future<> controller::start() {

				            }

				            opts.erase("require_client_auth");

				            opts.erase("truststore");

				            utils::configure_tls_creds_builder(creds.value(), std::move(opts)).get();

				            try {

				                utils::configure_tls_creds_builder(creds.value(), std::move(opts)).get();

				            } catch(...) {

				                logger.error("Failed to set up Alternator TLS credentials: {}", std::current_exception());

				                stop_server().get();

				                std::throw_with_nested(std::runtime_error("Failed to set up Alternator TLS credentials"));

				            }

				        }

				        bool alternator_enforce_authorization = _config.alternator_enforce_authorization();

				        _server.invoke_on_all(

				@@ -110,6 +129,10 @@ future<> controller::start() {

				            return server.init(addr, alternator_port, alternator_https_port, creds, alternator_enforce_authorization,

				                    &_memory_limiter.local().get_semaphore(),

				                    _config.max_concurrent_requests_per_shard);

				        }).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {

				            logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",

				                    addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF", ep);

				            return stop_server().then([ep = std::move(ep)] { return make_exception_future<>(ep); });

				        }).then([addr, alternator_port, alternator_https_port] {

				            logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",

				                    addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");

				@@ -117,12 +140,20 @@ future<> controller::start() {

				    });

				}

				future<> controller::stop() {

				future<> controller::stop_server() {

				    return seastar::async([this] {

				        if (!_ssg) {

				            return;

				        }

				        _server.stop().get();

				        _executor.stop().get();

				        _listen_addresses.clear();

				        destroy_smp_service_group(_ssg.value()).get();

				    });

				}

				future<> controller::request_stop_server() {

				    return stop_server();

				}

				}

									
										46

alternator/controller.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -24,6 +11,8 @@

				#include <seastar/core/sharded.hh>

				#include <seastar/core/smp.hh>

				#include "protocol_server.hh"

				namespace service {

				class storage_proxy;

				class migration_manager;

				@@ -45,22 +34,38 @@ class gossiper;

				}

				namespace auth {

				class service;

				}

				namespace qos {

				class service_level_controller;

				}

				namespace alternator {

				// This is the official DynamoDB API version.

				// It represents the last major reorganization of that API, and all the features

				// that were added since did NOT increment this version string.

				constexpr const char* version = "2012-08-10";

				using namespace seastar;

				class executor;

				class server;

				class controller {

				class controller : public protocol_server {

				    sharded<gms::gossiper>& _gossiper;

				    sharded<service::storage_proxy>& _proxy;

				    sharded<service::migration_manager>& _mm;

				    sharded<db::system_distributed_keyspace>& _sys_dist_ks;

				    sharded<cdc::generation_service>& _cdc_gen_svc;

				    sharded<service::memory_limiter>& _memory_limiter;

				    sharded<auth::service>& _auth_service;

				    sharded<qos::service_level_controller>& _sl_controller;

				    const db::config& _config;

				    std::vector<socket_address> _listen_addresses;

				    sharded<executor> _executor;

				    sharded<server> _server;

				    std::optional<smp_service_group> _ssg;

				@@ -73,10 +78,17 @@ public:

				        sharded<db::system_distributed_keyspace>& sys_dist_ks,

				        sharded<cdc::generation_service>& cdc_gen_svc,

				        sharded<service::memory_limiter>& memory_limiter,

				        sharded<auth::service>& auth_service,

				        sharded<qos::service_level_controller>& sl_controller,

				        const db::config& config);

				    future<> start();

				    future<> stop();

				    virtual sstring name() const override;

				    virtual sstring protocol() const override;

				    virtual sstring protocol_version() const override;

				    virtual std::vector<socket_address> listen_addresses() const override;

				    virtual future<> start_server() override;

				    virtual future<> stop_server() override;

				    virtual future<> request_stop_server() override;

				};

				}

									
										25

alternator/error.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -36,7 +23,7 @@ namespace alternator {

				// api_error into a JSON object, and that is returned to the user.

				class api_error final : public std::exception {

				public:

				    using status_type = httpd::reply::status_type;

				    using status_type = http::reply::status_type;

				    status_type _http_code;

				    std::string _type;

				    std::string _msg;

				@@ -83,8 +70,14 @@ public:

				    static api_error request_limit_exceeded(std::string msg) {

				        return api_error("RequestLimitExceeded", std::move(msg));

				    }

				    static api_error serialization(std::string msg) {

				        return api_error("SerializationException", std::move(msg));

				    }

				    static api_error table_not_found(std::string msg) {

				        return api_error("TableNotFoundException", std::move(msg));

				    }

				    static api_error internal(std::string msg) {

				        return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);

				        return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);

				    }

				    // Provide the "std::exception" interface, to make it easier to print this

1036

alternator/executor.cc

File diff suppressed because it is too large.

									
										58

alternator/executor.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -73,6 +60,16 @@ public:

				    explicit make_jsonable(rjson::value&& value);

				    std::string to_json() const override;

				};

				/**

				 * Make return type for serializing the object "streamed",

				 * i.e. direct to HTTP output stream. Note: only useful for

				 * (very) large objects as there are overhead issues with this

				 * as well, but for massive lists of return objects this can

				 * help avoid large allocations/many re-allocs

				 */ 

				json::json_return_type make_streamed(rjson::value&&);

				struct json_string : public json::jsonable {

				    std::string _value;

				public:

				@@ -84,9 +81,10 @@ namespace parsed {

				class path;

				};

				const std::map<sstring, sstring>& get_tags_of_table(schema_ptr schema);

				future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map);

				schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);

				bool is_alternator_keyspace(const sstring& ks_name);

				// Wraps the db::get_tags_of_table and throws if the table is missing the tags extension.

				const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema);

				// An attribute_path_map object is used to hold data for various attributes

				// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path

				@@ -146,6 +144,11 @@ template<typename T>

				using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;

				using attrs_to_get_node = attribute_path_map_node<std::monostate>;

				// attrs_to_get lists which top-level attribute are needed, and possibly also

				// which part of the top-level attribute is really needed (when nested

				// attribute paths appeared in the query).

				// Most code actually uses optional<attrs_to_get>. There, a disengaged

				// optional means we should get all attributes, not specific ones.

				using attrs_to_get = attribute_path_map<std::monostate>;

				@@ -193,12 +196,11 @@ public:

				    future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request);

				    future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);

				    future<> start();

				    future<> stop() { return make_ready_future<>(); }

				    future<> create_keyspace(std::string_view keyspace_name);

				    static sstring table_name(const schema&);

				    static db::timeout_clock::time_point default_timeout();

				    static void set_default_timeout(db::timeout_clock::duration timeout);

				@@ -210,27 +212,31 @@ public:

				private:

				    friend class rmw_operation;

				    static bool is_alternator_keyspace(const sstring& ks_name);

				    static sstring make_keyspace_name(const sstring& table_name);

				    static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);

				    static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);

				public:    

				public:

				    static std::optional<rjson::value> describe_single_item(schema_ptr,

				        const query::partition_slice&,

				        const cql3::selection::selection&,

				        const query::result&,

				        const attrs_to_get&);

				        const std::optional<attrs_to_get>&);

				    static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,

				        const query::partition_slice&& slice,

				        shared_ptr<cql3::selection::selection> selection,

				        foreign_ptr<lw_shared_ptr<query::result>> query_result,

				        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);

				    static void describe_single_item(const cql3::selection::selection&,

				        const std::vector<bytes_opt>&,

				        const attrs_to_get&,

				        const std::optional<attrs_to_get>&,

				        rjson::value&,

				        bool = false);

				    void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;

				    void supplement_table_info(rjson::value& descr, const schema& schema) const;

				    void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;

				    static void add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);

				    static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);

				    static void supplement_table_stream_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);

				};

				}

									
										23

alternator/expressions.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "expressions.hh"

				@@ -42,7 +29,7 @@

				namespace alternator {

				template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>

				Result do_with_parser(std::string input, Func&& f) {

				Result do_with_parser(std::string_view input, Func&& f) {

				    expressionsLexer::InputStreamType input_stream{

				        reinterpret_cast<const ANTLR_UINT8*>(input.data()),

				        ANTLR_ENC_UTF8,

				@@ -57,7 +44,7 @@ Result do_with_parser(std::string input, Func&& f) {

				}

				parsed::update_expression

				parse_update_expression(std::string query) {

				parse_update_expression(std::string_view query) {

				    try {

				        return do_with_parser(query,  std::mem_fn(&expressionsParser::update_expression));

				    } catch (...) {

				@@ -66,7 +53,7 @@ parse_update_expression(std::string query) {

				}

				std::vector<parsed::path>

				parse_projection_expression(std::string query) {

				parse_projection_expression(std::string_view query) {

				    try {

				        return do_with_parser(query,  std::mem_fn(&expressionsParser::projection_expression));

				    } catch (...) {

				@@ -75,7 +62,7 @@ parse_projection_expression(std::string query) {

				}

				parsed::condition_expression

				parse_condition_expression(std::string query) {

				parse_condition_expression(std::string_view query) {

				    try {

				        return do_with_parser(query,  std::mem_fn(&expressionsParser::condition_expression));

				    } catch (...) {

15

alternator/expressions.g

@@ -3,20 +3,7 @@
  */
 /*
  * This file is part of Scylla.
  *
  * Scylla is free software: you can redistribute it and/or modify
  * it under the terms of the GNU Affero General Public License as published by
  * the Free Software Foundation, either version 3 of the License, or
  * (at your option) any later version.
  *
  * Scylla is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
  * You should have received a copy of the GNU Affero General Public License
  * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.
  * SPDX-License-Identifier: AGPL-3.0-or-later
  */
 /*

									
										21

alternator/expressions.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -39,9 +26,9 @@ public:

				    using runtime_error::runtime_error;

				};

				parsed::update_expression parse_update_expression(std::string query);

				std::vector<parsed::path> parse_projection_expression(std::string query);

				parsed::condition_expression parse_condition_expression(std::string query);

				parsed::update_expression parse_update_expression(std::string_view query);

				std::vector<parsed::path> parse_projection_expression(std::string_view query);

				parsed::condition_expression parse_condition_expression(std::string_view query);

				void resolve_update_expression(parsed::update_expression& ue,

				        const rjson::value* expression_attribute_names,

									
										15

alternator/expressions_types.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

alternator/rmw_operation.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										111

alternator/serialization.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "utils/base64.hh"

				@@ -27,11 +14,14 @@

				#include "rapidjson/writer.h"

				#include "concrete_types.hh"

				#include "cql3/type_json.hh"

				#include "position_in_partition.hh"

				static logging::logger slogger("alternator-serialization");

				namespace alternator {

				bool is_alternator_keyspace(const sstring& ks_name);

				type_info type_info_from_string(std::string_view type) {

				    static thread_local const std::unordered_map<std::string_view, type_info> type_infos = {

				        {"S", {alternator_type::S, utf8_type}},

				@@ -83,7 +73,7 @@ struct from_json_visitor {

				    }

				    // default

				    void operator()(const abstract_type& t) const {

				        bo.write(from_json_object(t, v, cql_serialization_format::internal()));

				        bo.write(from_json_object(t, v));

				    }

				};

				@@ -175,31 +165,42 @@ bytes get_key_column_value(const rjson::value& item, const column_definition& co

				}

				// Parses the JSON encoding for a key value, which is a map with a single

				// entry, whose key is the type (expected to match the key column's type)

				// and the value is the encoded value.

				bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column) {

				// entry whose key is the type and the value is the encoded value.

				// If this type does not match the desired "type_str", an api_error::validation

				// error is thrown (the "name" parameter is the name of the column which will

				// be mentioned in the exception message).

				// If the type does match, a reference to the encoded value is returned.

				static const rjson::value& get_typed_value(const rjson::value& key_typed_value, std::string_view type_str, std::string_view name, std::string_view value_name) {

				    if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1 ||

				            !key_typed_value.MemberBegin()->value.IsString()) {

				        throw api_error::validation(

				                format("Malformed value object for key column {}: {}",

				                        column.name_as_text(), key_typed_value));

				                format("Malformed value object for {} {}: {}",

				                        value_name, name, key_typed_value));

				    }

				    auto it = key_typed_value.MemberBegin();

				    if (it->name != type_to_string(column.type)) {

				    if (rjson::to_string_view(it->name) != type_str) {

				        throw api_error::validation(

				                format("Type mismatch: expected type {} for key column {}, got type {}",

				                        type_to_string(column.type), column.name_as_text(), it->name));

				                format("Type mismatch: expected type {} for {} {}, got type {}",

				                        type_str, value_name, name, it->name));

				    }

				    std::string_view value_view = rjson::to_string_view(it->value);

				    return it->value;

				}

				// Parses the JSON encoding for a key value, which is a map with a single

				// entry, whose key is the type (expected to match the key column's type)

				// and the value is the encoded value.

				bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column) {

				    auto& value = get_typed_value(key_typed_value, type_to_string(column.type), column.name_as_text(), "key column");

				    std::string_view value_view = rjson::to_string_view(value);

				    if (value_view.empty()) {

				        throw api_error::validation(

				                format("The AttributeValue for a key attribute cannot contain an empty string value. Key: {}", column.name_as_text()));

				    }

				    if (column.type == bytes_type) {

				        return rjson::base64_decode(it->value);

				        return rjson::base64_decode(value);

				    } else {

				        return column.type->from_string(rjson::to_string_view(it->value));

				        return column.type->from_string(value_view);

				    }

				}

				@@ -250,6 +251,39 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {

				    return clustering_key::from_exploded(raw_ck);

				}

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {

				    auto ck = ck_from_json(item, schema);

				    if (is_alternator_keyspace(schema->ks_name())) {

				        return position_in_partition::for_key(std::move(ck));

				    }

				    const auto region_item = rjson::find(item, scylla_paging_region);

				    const auto weight_item = rjson::find(item, scylla_paging_weight);

				    if (bool(region_item) != bool(weight_item)) {

				        throw api_error::validation("Malformed value object: region and weight has to be either both missing or both present");

				    }

				    partition_region region;

				    bound_weight weight;

				    if (region_item) {

				        auto region_view = rjson::to_string_view(get_typed_value(*region_item, "S", scylla_paging_region, "key region"));

				        auto weight_view = rjson::to_string_view(get_typed_value(*weight_item, "N", scylla_paging_weight, "key weight"));

				        auto region = parse_partition_region(region_view);

				        if (weight_view == "-1") {

				            weight = bound_weight::before_all_prefixed;

				        } else if (weight_view == "0") {

				            weight = bound_weight::equal;

				        } else if (weight_view == "1") {

				            weight = bound_weight::after_all_prefixed;

				        } else {

				            throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));

				        }

				        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);

				    }

				    if (ck.is_empty()) {

				        return position_in_partition::for_partition_start();

				    }

				    return position_in_partition::for_key(std::move(ck));

				}

				big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {

				    if (!v.IsObject() || v.MemberCount() != 1) {

				        throw api_error::validation(format("{}: invalid number object", diagnostic));

				@@ -259,11 +293,9 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {

				        throw api_error::validation(format("{}: expected number, found type '{}'", diagnostic, it->name));

				    }

				    try {

				        if (it->value.IsNumber()) {

				             // FIXME(sarna): should use big_decimal constructor with numeric values directly:

				            return big_decimal(rjson::print(it->value));

				        }

				        if (!it->value.IsString()) {

				            // We shouldn't reach here. Callers normally validate their input

				            // earlier with validate_value().

				            throw api_error::validation(format("{}: improperly formatted number constant", diagnostic));

				        }

				        return big_decimal(rjson::to_string_view(it->value));

				@@ -272,6 +304,21 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {

				    }

				}

				std::optional<big_decimal> try_unwrap_number(const rjson::value& v) {

				    if (!v.IsObject() || v.MemberCount() != 1) {

				        return std::nullopt;

				    }

				    auto it = v.MemberBegin();

				    if (it->name != "N" || !it->value.IsString()) {

				        return std::nullopt;

				    }

				    try {

				        return big_decimal(rjson::to_string_view(it->value));

				    } catch (const marshal_exception& e) {

				        return std::nullopt;

				    }

				}

				const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v) {

				    if (!v.IsObject() || v.MemberCount() != 1) {

				        return {"", nullptr};

				@@ -279,7 +326,7 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&

				    auto it = v.MemberBegin();

				    const std::string it_key = it->name.GetString();

				    if (it_key != "SS" && it_key != "BS" && it_key != "NS") {

				        return {"", nullptr};

				        return {std::move(it_key), nullptr};

				    }

				    return std::make_pair(it_key, &(it->value));

				}

				@@ -349,7 +396,7 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&

				    auto [set1_type, set1] = unwrap_set(v1);

				    auto [set2_type, set2] = unwrap_set(v2);

				    if (set1_type != set2_type) {

				        throw api_error::validation(format("Mismatched set types: {} and {}", set1_type, set2_type));

				        throw api_error::validation(format("Set DELETE type mismatch: {} and {}", set1_type, set2_type));

				    }

				    if (!set1 || !set2) {

				        throw api_error::validation("UpdateExpression: DELETE operation can only be performed on a set");
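The `get_typed_value` helper introduced in this file validates a single-entry typed map before returning its encoded value. The check can be sketched in isolation as follows. This is a minimal illustration, not the Scylla implementation: `json_object` is a hypothetical stand-in for `rjson::value`, `typed_value` is a hypothetical name, and `std::invalid_argument` replaces `api_error::validation`.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for a JSON object: member name -> string value.
using json_object = std::map<std::string, std::string>;

// Returns the encoded value when `obj` has exactly one member whose name
// equals `expected_type`; throws std::invalid_argument otherwise.
// `name` and `value_name` only feed the error message.
const std::string& typed_value(const json_object& obj, std::string_view expected_type,
                               std::string_view name, std::string_view value_name) {
    if (obj.size() != 1) {
        throw std::invalid_argument("Malformed value object for " +
                std::string(value_name) + " " + std::string(name));
    }
    const auto& [type, value] = *obj.begin();
    if (type != expected_type) {
        throw std::invalid_argument("Type mismatch: expected type " +
                std::string(expected_type) + " for " + std::string(value_name) +
                " " + std::string(name) + ", got type " + type);
    }
    return value;
}

// Small self-check exercising the accepting path.
void typed_value_example() {
    json_object key{{"S", "hello"}};
    assert(typed_value(key, "S", "p", "key column") == "hello");
}
```

Passing the expected type and the descriptive names as parameters is what lets one validator serve both key columns and the paging region and weight attributes.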

									
										26

alternator/serialization.hh
									
												
				@@ -3,32 +3,22 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include <string>

				#include <string_view>

				#include <optional>

				#include "types.hh"

				#include "schema_fwd.hh"

				#include "keys.hh"

				#include "utils/rjson.hh"

				#include "utils/big_decimal.hh"

				class position_in_partition;

				namespace alternator {

				enum class alternator_type : int8_t {

				@@ -45,6 +35,9 @@ struct type_representation {

				    data_type dtype;

				};

				inline constexpr std::string_view scylla_paging_region(":scylla:paging:region");

				inline constexpr std::string_view scylla_paging_weight(":scylla:paging:weight");

				type_info type_info_from_string(std::string_view type);

				type_representation represent_type(alternator_type atype);

				@@ -59,11 +52,16 @@ rjson::value json_key_column_value(bytes_view cell, const column_definition& col

				partition_key pk_from_json(const rjson::value& item, schema_ptr schema);

				clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);

				// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it.  Otherwise,

				// raises ValidationException with diagnostic.

				big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);

				// try_unwrap_number is like unwrap_number, but returns an unset optional

				// when the given v does not encode a number.

				std::optional<big_decimal> try_unwrap_number(const rjson::value& v);

				// Check if a given JSON object encodes a set (i.e., it is a {"SS": [...]}, or "NS", "BS"

				// and returns set's type and a pointer to that set. If the object does not encode a set,

				// returned value is {"", nullptr}
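The difference between `unwrap_number` (throws on bad input) and `try_unwrap_number` (returns an unset optional) can be illustrated with a small sketch. It rests on assumptions: `json_object` is a hypothetical stand-in for `rjson::value`, `try_unwrap_integer` is a hypothetical name, and a plain `long long` parsed with `std::from_chars` replaces `big_decimal`.

```cpp
#include <cassert>
#include <charconv>
#include <map>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical stand-in for a JSON object: member name -> string value.
using json_object = std::map<std::string, std::string>;

// Returns the number encoded as {"N": "<digits>"}, or std::nullopt when the
// object has the wrong shape, the wrong type tag, or unparsable text.
std::optional<long long> try_unwrap_integer(const json_object& v) {
    if (v.size() != 1) {
        return std::nullopt;
    }
    const auto& [type, text] = *v.begin();
    if (type != "N") {
        return std::nullopt;
    }
    long long result = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto [ptr, ec] = std::from_chars(first, last, result);
    // Reject parse errors and trailing garbage such as "12abc".
    if (ec != std::errc() || ptr != last) {
        return std::nullopt;
    }
    return result;
}

// Small self-check exercising the accepting path.
void try_unwrap_integer_example() {
    json_object n{{"N", "42"}};
    assert(try_unwrap_integer(n) == 42);
}
```

A caller that only wants to know whether a value is numeric can test the optional instead of catching an exception.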

									
										106

alternator/server.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "alternator/server.hh"

				@@ -26,13 +13,14 @@

				#include <seastar/core/coroutine.hh>

				#include <seastar/json/json_elements.hh>

				#include <seastar/util/defer.hh>

				#include <seastar/util/short_streams.hh>

				#include "seastarx.hh"

				#include "error.hh"

				#include "service/qos/service_level_controller.hh"

				#include "utils/rjson.hh"

				#include "auth.hh"

				#include <cctype>

				#include "service/storage_proxy.hh"

				#include "locator/snitch_base.hh"

				#include "gms/gossiper.hh"

				#include "utils/overloaded_functor.hh"

				#include "utils/fb_utilities.hh"

				@@ -40,6 +28,8 @@

				static logging::logger slogger("alternator-server");

				using namespace httpd;

				using request = http::request;

				using reply = http::reply;

				namespace alternator {

				@@ -164,8 +154,10 @@ public:

				protected:

				    void generate_error_reply(reply& rep, const api_error& err) {

				        rep._content += "{\"__type\":\"com.amazonaws.dynamodb.v20120810#" + err._type + "\"," +

				                "\"message\":\"" + err._msg + "\"}";

				        rjson::value results = rjson::empty_object();

				        rjson::add(results, "__type", rjson::from_string("com.amazonaws.dynamodb.v20120810#" + err._type));

				        rjson::add(results, "message", err._msg);

				        rep._content = rjson::print(std::move(results));

				        rep._status = err._http_code;

				        slogger.trace("api_handler error case: {}", rep._content);

				    }

				@@ -211,10 +203,9 @@ protected:

				        // It's very easy to get a list of all live nodes on the cluster,

				        // using _gossiper().get_live_members(). But getting

				        // just the list of live nodes in this DC needs more elaborate code:

				        sstring local_dc = locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(

				                utils::fb_utilities::get_broadcast_address());

				        std::unordered_set<gms::inet_address> local_dc_nodes =

				                _proxy.get_token_metadata_ptr()->get_topology().get_datacenter_endpoints().at(local_dc);

				        auto& topology = _proxy.get_token_metadata_ptr()->get_topology();

				        sstring local_dc = topology.get_datacenter();

				        std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);

				        for (auto& ip : local_dc_nodes) {

				            if (_gossiper.is_alive(ip)) {

				                rjson::push_back(results, rjson::from_string(ip.to_sstring()));

				@@ -246,7 +237,7 @@ protected:

				future<std::string> server::verify_signature(const request& req, const chunked_content& content) {

				    if (!_enforce_authorization) {

				        slogger.debug("Skipping authorization");

				        return make_ready_future<std::string>("<unauthenticated request>");

				        return make_ready_future<std::string>();

				    }

				    auto host_it = req._headers.find("Host");

				    if (host_it == req._headers.end()) {

				@@ -376,7 +367,9 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_

				        tracing::add_session_param(trace_state, "alternator_op", op);

				        tracing::add_query(trace_state, truncated_content_view(query, buf));

				        tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());

				        tracing::set_username(trace_state, auth::authenticated_user(username));

				        if (!username.empty()) {

				            tracing::set_username(trace_state, auth::authenticated_user(username));

				        }

				    }

				    return trace_state;

				}

				@@ -399,7 +392,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr

				    }

				    auto units = co_await std::move(units_fut);

				    assert(req->content_stream);

				    chunked_content content = co_await httpd::read_entire_stream(*req->content_stream);

				    chunked_content content = co_await util::read_entire_stream(*req->content_stream);

				    auto username = co_await verify_signature(*req, content);

				    if (slogger.is_enabled(log_level::trace)) {

				@@ -419,7 +412,11 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr

				    auto leave = defer([this] () noexcept { _pending_requests.leave(); });

				    //FIXME: Client state can provide more context, e.g. client's endpoint address

				    // We use unique_ptr because client_state cannot be moved or copied

				    executor::client_state client_state{executor::client_state::internal_tag()};

				    executor::client_state client_state = username.empty()

				        ? service::client_state{service::client_state::internal_tag()}

				        : service::client_state{service::client_state::internal_tag(), _auth_service, _sl_controller, username};

				    co_await client_state.maybe_update_per_service_level_params();

				    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);

				    tracing::trace(trace_state, op);

				    rjson::value json_request = co_await _json_parser.parse(std::move(content));

				@@ -452,12 +449,14 @@ void server::set_routes(routes& r) {

				//FIXME: A way to immediately invalidate the cache should be considered,

				// e.g. when the system table which stores the keys is changed.

				// For now, this propagation may take up to 1 minute.

				server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper)

				server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller)

				        : _http_server("http-alternator")

				        , _https_server("https-alternator")

				        , _executor(exec)

				        , _proxy(proxy)

				        , _gossiper(gossiper)

				        , _auth_service(auth_service)

				        , _sl_controller(sl_controller)

				        , _key_cache(1024, 1min, slogger)

				        , _enforce_authorization(false)

				        , _enabled_servers{}

				@@ -532,6 +531,9 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos

				        {"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				            return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request));

				        }},

				        {"DescribeContinuousBackups", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				            return e.describe_continuous_backups(client_state, std::move(permit), std::move(json_request));

				        }},

				    } {

				}

				@@ -545,36 +547,28 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:

				                " must be specified in order to init an alternator HTTP server instance"));

				    }

				    return seastar::async([this, addr, port, https_port, creds] {

				        try {

				            _executor.start().get();

				        _executor.start().get();

				            if (port) {

				                set_routes(_http_server._routes);

				                _http_server.set_content_length_limit(server::content_length_limit);

				                _http_server.set_content_streaming(true);

				                _http_server.listen(socket_address{addr, *port}).get();

				                _enabled_servers.push_back(std::ref(_http_server));

				            }

				            if (https_port) {

				                set_routes(_https_server._routes);

				                _https_server.set_content_length_limit(server::content_length_limit);

				                _https_server.set_content_streaming(true);

				                _https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {

				                    if (ep) {

				                        slogger.warn("Exception loading {}: {}", files, ep);

				                    } else {

				                        slogger.info("Reloaded {}", files);

				                    }

				                }).get0());

				                _https_server.listen(socket_address{addr, *https_port}).get();

				                _enabled_servers.push_back(std::ref(_https_server));

				            }

				        } catch (...) {

				            slogger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",

				                    addr, port ? std::to_string(*port) : "OFF", https_port ? std::to_string(*https_port) : "OFF", std::current_exception());

				            std::throw_with_nested(std::runtime_error(

				                    format("Failed to set up Alternator HTTP server on {} port {}, TLS port {}",

				                            addr, port ? std::to_string(*port) : "OFF", https_port ? std::to_string(*https_port) : "OFF")));

				        if (port) {

				            set_routes(_http_server._routes);

				            _http_server.set_content_length_limit(server::content_length_limit);

				            _http_server.set_content_streaming(true);

				            _http_server.listen(socket_address{addr, *port}).get();

				            _enabled_servers.push_back(std::ref(_http_server));

				        }

				        if (https_port) {

				            set_routes(_https_server._routes);

				            _https_server.set_content_length_limit(server::content_length_limit);

				            _https_server.set_content_streaming(true);

				            _https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {

				                if (ep) {

				                    slogger.warn("Exception loading {}: {}", files, ep);

				                } else {

				                    slogger.info("Reloaded {}", files);

				                }

				            }).get0());

				            _https_server.listen(socket_address{addr, *https_port}).get();

				            _enabled_servers.push_back(std::ref(_https_server));

				        }

				    });

				}

				@@ -631,7 +625,7 @@ future<> server::json_parser::stop() {

				const char* api_error::what() const noexcept {

				    if (_what_string.empty()) {

				        _what_string = format("{} {}: {}", _http_code, _type, _msg);

				        _what_string = format("{} {}: {}", static_cast<int>(_http_code), _type, _msg);

				    }

				    return _what_string.c_str();

				}
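The `generate_error_reply` change in this file replaces string concatenation with a JSON builder. A likely motivation is that a message containing a quote or backslash would otherwise yield invalid JSON. The sketch below illustrates the escaping concern only; `escape_json` and `error_body` are hypothetical helpers written for this example, not part of `rjson`, and the escaper handles just the characters needed here.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical minimal escaper: quotes, backslashes and control characters.
// A real JSON library handles more cases (e.g. invalid UTF-8).
std::string escape_json(std::string_view s) {
    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        default:
            if (c < 0x20) {
                out += "\\u00";
                out += hex[c >> 4];
                out += hex[c & 0xf];
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}

// Builds an error body in the shape used by generate_error_reply, escaping
// both fields so the result stays valid JSON.
std::string error_body(std::string_view type, std::string_view msg) {
    std::string out = "{\"__type\":\"com.amazonaws.dynamodb.v20120810#";
    out += escape_json(type);
    out += "\",\"message\":\"";
    out += escape_json(msg);
    out += "\"}";
    return out;
}

// Small self-check: the embedded quotes are escaped rather than copied raw.
void error_body_example() {
    assert(error_body("ValidationException", "bad \"x\"") ==
           R"({"__type":"com.amazonaws.dynamodb.v20120810#ValidationException","message":"bad \"x\""})");
}
```

With plain concatenation the same message would terminate the JSON string early at the first embedded quote.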

									
										26

alternator/server.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -28,6 +15,7 @@

				#include <seastar/net/tls.hh>

				#include <optional>

				#include "alternator/auth.hh"

				#include "service/qos/service_level_controller.hh"

				#include "utils/small_vector.hh"

				#include "utils/updateable_value.hh"

				#include <seastar/core/units.hh>

				@@ -39,7 +27,7 @@ using chunked_content = rjson::chunked_content;

				class server {

				    static constexpr size_t content_length_limit = 16*MB;

				    using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,

				            tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<request>)>;

				            tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;

				    using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;

				    http_server _http_server;

				@@ -47,6 +35,8 @@ class server {

				    executor& _executor;

				    service::storage_proxy& _proxy;

				    gms::gossiper& _gossiper;

				    auth::service& _auth_service;

				    qos::service_level_controller& _sl_controller;

				    key_cache _key_cache;

				    bool _enforce_authorization;

				@@ -78,7 +68,7 @@ class server {

				    json_parser _json_parser;

				public:

				    server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper);

				    server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);

				    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				            bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

				@@ -86,8 +76,8 @@ public:

				private:

				    void set_routes(seastar::httpd::routes& r);

				    // If verification succeeds, returns the authenticated user's username

				    future<std::string> verify_signature(const seastar::httpd::request&, const chunked_content&);

				    future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);

				    future<std::string> verify_signature(const seastar::http::request&, const chunked_content&);

				    future<executor::request_return_type> handle_api_request(std::unique_ptr<http::request> req);

				};

				}

									
										15

alternator/stats.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "stats.hh"

									
										15

alternator/stats.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										113

alternator/streams.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <type_traits>

				@@ -28,7 +15,6 @@

				#include "utils/base64.hh"

				#include "log.hh"

				#include "database.hh"

				#include "db/config.hh"

				#include "cdc/log.hh"

				@@ -47,7 +33,6 @@

				#include "gms/feature_service.hh"

				#include "executor.hh"

				#include "tags_extension.hh"

				#include "rmw_operation.hh"

				/**

				@@ -89,8 +74,8 @@ struct rapidjson::internal::TypeHelper<ValueType, utils::UUID>

				    : public from_string_helper<ValueType, utils::UUID>

				{};

				static db_clock::time_point as_timepoint(const utils::UUID& uuid) {

				    return db_clock::time_point{utils::UUID_gen::unix_timestamp(uuid)};

				static db_clock::time_point as_timepoint(const table_id& tid) {

				    return db_clock::time_point{utils::UUID_gen::unix_timestamp(tid.uuid())};

				}

				/**

				@@ -121,6 +106,9 @@ public:

				    stream_arn(const UUID& uuid)

				        : UUID(uuid)

				    {}

				    stream_arn(const table_id& tid)

				        : UUID(tid.uuid())

				    {}

				    stream_arn(std::string_view v)

				        : UUID(v.substr(1))

				    {

				@@ -155,24 +143,29 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				    auto limit = rjson::get_opt<int>(request, "Limit").value_or(std::numeric_limits<int>::max());

				    auto streams_start = rjson::get_opt<stream_arn>(request, "ExclusiveStartStreamArn");

				    auto table = find_table(_proxy, request);

				    auto& db = _proxy.get_db().local();

				    auto& cfs = db.get_column_families();

				    auto i = cfs.begin();

				    auto e = cfs.end();

				    auto db = _proxy.data_dictionary();

				    auto cfs = db.get_tables();

				    if (limit < 1) {

				        throw api_error::validation("Limit must be 1 or more");

				    }

				    // TODO: the unordered_map here is not really well suited for partial

				    // querying - we're sorting on local hash order, and creating a table

				    // between queries may or may not miss info. But that should be rare,

				    // and we can probably expect this to be a single call.

				    // # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never

				    // generate duplicates in a paged listing here. Can obviously miss things if they 

				    // are added between paged calls and end up with a "smaller" UUID/ARN, but that 

				    // is to be expected.

				    std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {

				        return t1.schema()->id().uuid() < t2.schema()->id().uuid();

				    });

				    auto i = cfs.begin();

				    auto e = cfs.end();

				    if (streams_start) {

				        i = std::find_if(i, e, [&](const std::pair<utils::UUID, lw_shared_ptr<column_family>>& p) {

				            return p.first == streams_start 

				                && cdc::get_base_table(db, *p.second->schema())

				                && is_alternator_keyspace(p.second->schema()->ks_name())

				        i = std::find_if(i, e, [&](const data_dictionary::table& t) {

				            return t.schema()->id().uuid() == streams_start

				                && cdc::get_base_table(db.real_database(), *t.schema())

				                && is_alternator_keyspace(t.schema()->ks_name())

				                ;

				        });

				        if (i != e) {

				@@ -186,7 +179,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				    std::optional<stream_arn> last;

				    for (;limit > 0 && i != e; ++i) {

				        auto s = i->second->schema();

				        auto s = i->schema();

				        auto& ks_name = s->ks_name();

				        auto& cf_name = s->cf_name();

				@@ -196,14 +189,14 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				        if (table && ks_name != table->ks_name()) {

				            continue;

				        }

				        if (cdc::is_log_for_some_table(db, ks_name, cf_name)) {

				            if (table && table != cdc::get_base_table(db, *s)) {

				        if (cdc::is_log_for_some_table(db.real_database(), ks_name, cf_name)) {

				            if (table && table != cdc::get_base_table(db.real_database(), *s)) {

				                continue;

				            }

				            rjson::value new_entry = rjson::empty_object();

				            last = i->first;

				            last = i->schema()->id();

				            rjson::add(new_entry, "StreamArn", *last);

				            rjson::add(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));

				            rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(table_name(*s))));

				@@ -424,7 +417,7 @@ using namespace std::string_literals;

				 * This will be a partial overlap, but it is the best we can do.

				 */

				static std::chrono::seconds confidence_interval(const database& db) {

				static std::chrono::seconds confidence_interval(data_dictionary::database db) {

				    return std::chrono::seconds(db.get_config().alternator_streams_time_window_s());

				}

				@@ -442,12 +435,12 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");

				    schema_ptr schema, bs;

				    auto& db = _proxy.get_db().local();

				    auto db = _proxy.data_dictionary();

				    try {

				        auto& cf = db.find_column_family(stream_arn);

				        auto cf = db.find_column_family(table_id(stream_arn));

				        schema = cf.schema();

				        bs = cdc::get_base_table(_proxy.get_db().local(), *schema);

				        bs = cdc::get_base_table(db.real_database(), *schema);

				    } catch (...) {        

				    }

				@@ -505,7 +498,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    // filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)

				    auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);

				    return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, &db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {

				    return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {

				        auto e = topologies.end();

				        auto prev = e;

				@@ -726,18 +719,18 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    }

				    auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");

				    auto& db = _proxy.get_db().local();

				    auto db = _proxy.data_dictionary();

				    schema_ptr schema = nullptr;

				    std::optional<shard_id> sid;

				    try {

				        auto& cf = db.find_column_family(stream_arn);

				        auto cf = db.find_column_family(table_id(stream_arn));

				        schema = cf.schema();

				        sid = rjson::get<shard_id>(request, "ShardId");

				    } catch (...) {

				    }

				    if (!schema || !cdc::get_base_table(db, *schema) || !is_alternator_keyspace(schema->ks_name())) {

				    if (!schema || !cdc::get_base_table(db.real_database(), *schema) || !is_alternator_keyspace(schema->ks_name())) {

				        throw api_error::resource_not_found("Invalid StreamArn");

				    }

				    if (!sid) {

				@@ -814,12 +807,12 @@ future<executor::request_return_type> executor::get_records(client_state& client

				        throw api_error::validation("Limit must be 1 or more");

				    }

				    auto& db = _proxy.get_db().local();

				    auto db = _proxy.data_dictionary();

				    schema_ptr schema, base;

				    try {

				        auto& log_table = db.find_column_family(iter.table);

				        auto log_table = db.find_column_family(table_id(iter.table));

				        schema = log_table.schema();

				        base = cdc::get_base_table(db, *schema);

				        base = cdc::get_base_table(db.real_database(), *schema);

				    } catch (...) {        

				    }

				@@ -847,14 +840,14 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");

				    static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");

				    auto key_names = boost::copy_range<attrs_to_get>(

				    std::optional<attrs_to_get> key_names = boost::copy_range<attrs_to_get>(

				        boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))

				        | boost::adaptors::transformed([&] (const column_definition& cdef) {

				            return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })

				    );

				    // Include all base table columns as values (in case pre or post is enabled).

				    // This will include attributes not stored in the frozen map column

				    auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()

				    std::optional<attrs_to_get> attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()

				        // this will include the :attrs column, which we will also force evaluating. 

				        // But not having this set empty forces out any cdc columns from actual result 

				        | boost::adaptors::transformed([] (const column_definition& cdef) {

				@@ -891,11 +884,11 @@ future<executor::request_return_type> executor::get_records(client_state& client

				        ++mul;

				    }

				    auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),

				            query::row_limit(limit * mul));

				            query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));

				    return _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(

				            [this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {       

				        cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());

				        cql3::selection::result_set_builder builder(*selection, gc_clock::now());

				        query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));

				        auto result_set = builder.build();

				@@ -1023,8 +1016,8 @@ future<executor::request_return_type> executor::get_records(client_state& client

				        // ugh. figure out if we are at end-of-shard

				        auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();

				        return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {

				        return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret), nrecords](db_clock::time_point ts) mutable {

				            auto& shard = iter.shard;            

				            if (shard.time < ts && ts < high_ts) {

				@@ -1041,24 +1034,28 @@ future<executor::request_return_type> executor::get_records(client_state& client

				                rjson::add(ret, "NextShardIterator", iter);

				            }

				            _stats.api_operations.get_records_latency.add(std::chrono::steady_clock::now() - start_time);

				            // TODO: determine a better threshold...

				            if (nrecords > 10) {

				                return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));

				            }

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				        });

				    });

				}

				void executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder) const {

				void executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {

				    auto stream_enabled = rjson::find(stream_specification, "StreamEnabled");

				    if (!stream_enabled || !stream_enabled->IsBool()) {

				        throw api_error::validation("StreamSpecification needs boolean StreamEnabled");

				    }

				    if (stream_enabled->GetBool()) {

				        auto& db = _proxy.get_db().local();

				        auto db = sp.data_dictionary();

				        if (!db.features().cluster_supports_cdc()) {

				        if (!db.features().cdc) {

				            throw api_error::validation("StreamSpecification: streams (CDC) feature not enabled in cluster.");

				        }

				        if (!db.features().cluster_supports_alternator_streams()) {

				        if (!db.features().alternator_streams) {

				            throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");

				        }

				@@ -1090,11 +1087,11 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche

				    }

				}

				void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema) const {

				void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp) {

				    auto& opts = schema.cdc_options();

				    if (opts.enabled()) {

				        auto& db = _proxy.get_db().local();

				        auto& cf = db.find_column_family(schema.ks_name(), cdc::log_name(schema.cf_name()));

				        auto db = sp.data_dictionary();

				        auto cf = db.find_table(schema.ks_name(), cdc::log_name(schema.cf_name()));

				        stream_arn arn(cf.schema()->id());

				        rjson::add(descr, "LatestStreamArn", arn);

				        rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));

									
										808

alternator/ttl.cc
									
												
				@@ -3,30 +3,56 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <chrono>

				#include <cstdint>

				#include <exception>

				#include <optional>

				#include <seastar/core/sstring.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/core/sleep.hh>

				#include <seastar/core/future.hh>

				#include <seastar/core/lowres_clock.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				#include <boost/multiprecision/cpp_int.hpp>

				#include "executor.hh"

				#include "exceptions/exceptions.hh"

				#include "gms/gossiper.hh"

				#include "gms/inet_address.hh"

				#include "inet_address_vectors.hh"

				#include "locator/abstract_replication_strategy.hh"

				#include "log.hh"

				#include "gc_clock.hh"

				#include "replica/database.hh"

				#include "service_permit.hh"

				#include "timestamp.hh"

				#include "service/storage_proxy.hh"

				#include "service/pager/paging_state.hh"

				#include "service/pager/query_pagers.hh"

				#include "gms/feature_service.hh"

				#include "database.hh"

				#include "sstables/types.hh"

				#include "mutation.hh"

				#include "types.hh"

				#include "types/map.hh"

				#include "utils/rjson.hh"

				#include "utils/big_decimal.hh"

				#include "utils/fb_utilities.hh"

				#include "cql3/selection/selection.hh"

				#include "cql3/values.hh"

				#include "cql3/query_options.hh"

				#include "cql3/column_identifier.hh"

				#include "alternator/executor.hh"

				#include "alternator/controller.hh"

				#include "alternator/serialization.hh"

				#include "dht/sharder.hh"

				#include "db/config.hh"

				#include "db/tags/utils.hh"

				#include "ttl.hh"

				static logging::logger tlogger("alternator_ttl");

				namespace alternator {

				@@ -41,8 +67,8 @@ static const sstring TTL_TAG_KEY("system:ttl_attribute");

				future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.update_time_to_live++;

				    if (!_proxy.get_db().local().features().cluster_supports_alternator_ttl()) {

				        co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator_ttl' experimental feature is enabled on all nodes.");

				    if (!_proxy.data_dictionary().features().alternator_ttl) {

				        co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");

				    }

				    schema_ptr schema = get_table(_proxy, request);

				@@ -68,24 +94,25 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				    }

				    sstring attribute_name(v->GetString(), v->GetStringLength());

				    std::map<sstring, sstring> tags_map = get_tags_of_table(schema);

				    if (enabled) {

				        if (tags_map.contains(TTL_TAG_KEY)) {

				            co_return api_error::validation("TTL is already enabled");

				    co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {

				        if (enabled) {

				            if (tags_map.contains(TTL_TAG_KEY)) {

				                throw api_error::validation("TTL is already enabled");

				            }

				            tags_map[TTL_TAG_KEY] = attribute_name;

				        } else {

				            auto i = tags_map.find(TTL_TAG_KEY);

				            if (i == tags_map.end()) {

				                throw api_error::validation("TTL is already disabled");

				            } else if (i->second != attribute_name) {

				                throw api_error::validation(format(

				                    "Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",

				                    attribute_name, i->second));

				            }

				            tags_map.erase(TTL_TAG_KEY);

				        }

				        tags_map[TTL_TAG_KEY] = attribute_name;

				    } else {

				        auto i = tags_map.find(TTL_TAG_KEY);

				        if (i == tags_map.end()) {

				            co_return api_error::validation("TTL is already disabled");

				        } else if (i->second != attribute_name) {

				            co_return api_error::validation(format(

				                "Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",

				                attribute_name, i->second));

				        }

				        tags_map.erase(TTL_TAG_KEY);

				    }

				    co_await update_tags(_mm, schema, std::move(tags_map));

				    });

				    // Prepare the response, which contains a TimeToLiveSpecification

				    // basically identical to the request's

				    rjson::value response = rjson::empty_object();

				@@ -96,7 +123,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.describe_time_to_live++;

				    schema_ptr schema = get_table(_proxy, request);

				    std::map<sstring, sstring> tags_map = get_tags_of_table(schema);

				    std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);

				    rjson::value desc = rjson::empty_object();

				    auto i = tags_map.find(TTL_TAG_KEY);

				    if (i == tags_map.end()) {

				@@ -110,4 +137,713 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta

				    co_return make_jsonable(std::move(response));

				}
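The enable/disable rules that `update_time_to_live` applies to the table's tag map can be illustrated in isolation. The following is a minimal sketch, not the actual implementation: `apply_ttl_request` and `ttl_tag_key` are hypothetical names, a plain `std::map` stands in for the tag map, and `std::invalid_argument` stands in for `api_error::validation`.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for TTL_TAG_KEY in the real code.
static const std::string ttl_tag_key = "system:ttl_attribute";

// Apply an enable/disable request to a tag map. Throws std::invalid_argument
// in the cases where the real code reports a validation error.
void apply_ttl_request(std::map<std::string, std::string>& tags,
                       bool enabled,
                       const std::string& attribute_name) {
    auto it = tags.find(ttl_tag_key);
    if (enabled) {
        // Enabling twice is rejected, even for the same attribute.
        if (it != tags.end()) {
            throw std::invalid_argument("TTL is already enabled");
        }
        tags[ttl_tag_key] = attribute_name;
        return;
    }
    if (it == tags.end()) {
        throw std::invalid_argument("TTL is already disabled");
    }
    // Disabling must name the attribute that is currently enabled.
    if (it->second != attribute_name) {
        throw std::invalid_argument("a different attribute is enabled");
    }
    tags.erase(it);
}
```

Running the check and the update in one callback, as the new `db::modify_tags` call does, keeps the validation and the modification of the tag map together.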

				// expiration_service is a sharded service responsible for cleaning up expired

				// items in all tables with per-item expiration enabled. Currently, this means

				// Alternator tables with TTL configured via an UpdateTimeToLive request.

				//

				// Here is a brief overview of how the expiration service works:

				//

				// An expiration thread on each shard periodically scans the items (i.e.,

				// rows) owned by this shard, looking for items whose chosen expiration-time

				// attribute indicates they are expired, and deletes those items.

				// The expiration-time "attribute" can be either an actual Scylla column

				// (must be numeric) or an Alternator "attribute" - i.e., an element in

				// the ATTRS_COLUMN_NAME map<utf8,bytes> column where the numeric expiration

				// time is encoded in DynamoDB's JSON encoding inside the bytes value.

				// To avoid scanning the same items RF times in RF replicas, only one node is

				// responsible for scanning a token range at a time. Normally, this is the

				// node owning this range as a "primary range" (the first node in the ring

				// with this range), but when this node is down, the secondary owner (the

				// second in the ring) may take over.

				// An expiration thread is responsible for all tables which need expiration

				// scans. Currently, the different tables are scanned sequentially (not in

				// parallel).

				// The expiration thread scans items using CL=QUORUM to ensure that it reads

				// a consistent expiration-time attribute. This means that the items are read

				// locally and in addition QUORUM-1 additional nodes (one additional node

				// when RF=3) need to read the data and send digests.

				// When the expiration thread decides that an item has expired and wants

				// to delete it, it does it using a CL=QUORUM write. This allows this

				// deletion to be visible for consistent (quorum) reads. The deletion,

				// like user deletions, will also appear on the CDC log and therefore

				// Alternator Streams if enabled - currently as ordinary deletes (the

				// userIdentity flag is currently missing; this is issue #11523).

				expiration_service::expiration_service(data_dictionary::database db, service::storage_proxy& proxy, gms::gossiper& g)

				        : _db(db)

				        , _proxy(proxy)

				        , _gossiper(g)

				{

				}

				// Convert the big_decimal used to represent expiration time to an integer.

				// Any fractional part is dropped. If the number is negative or invalid,

				// 0 is returned, and if it's too high, the maximum unsigned long is returned.

				static unsigned long bigdecimal_to_ul(const big_decimal& bd) {

				    // The big_decimal format has an integer mantissa of arbitrary length

				    // "unscaled_value" and then a (power of 10) exponent "scale".

				    if (bd.unscaled_value() <= 0) {

				        return 0;

				    }

				    if (bd.scale() == 0) {

				        // The fast path, when the expiration time is an integer, scale==0.

				        return static_cast<unsigned long>(bd.unscaled_value());

				    }

				    // Because the mantissa can be of arbitrary length, we work on it

				    // as a string. TODO: find a less ugly algorithm.

				    auto str = bd.unscaled_value().str();

				    if (bd.scale() > 0) {

				        int len = str.length();

				        if (len < bd.scale()) {

				            return 0;

				        }

				        str = str.substr(0, len-bd.scale());

				    } else {

				        if (bd.scale() < -20) {

				            return std::numeric_limits<unsigned long>::max();

				        }

				        for (int i = 0; i < -bd.scale(); i++) {

				            str.push_back('0');

				        }

				    }

				    // strtoul() returns ULONG_MAX if the number is too large, or 0 if not

				    // a number.

				    return strtoul(str.c_str(), nullptr, 10);

				}
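The string-based truncation in `bigdecimal_to_ul()` can be tried without the `big_decimal` type. The following is a minimal sketch under stated assumptions: `truncate_decimal` is a hypothetical helper that takes the mantissa as a decimal string and the power-of-ten `scale` as an `int`, so the value is `digits * 10^(-scale)`. Unlike the original, it has no separate fast path for `scale == 0`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <string>

// Truncate digits * 10^(-scale) to an unsigned long.
// Negative or zero values yield 0; values that are too large yield ULONG_MAX.
unsigned long truncate_decimal(std::string digits, int scale) {
    if (digits.empty() || digits[0] == '-' || digits == "0") {
        return 0;
    }
    if (scale > 0) {
        // Positive scale: the last `scale` digits are the fractional part.
        if (digits.size() < static_cast<std::size_t>(scale)) {
            return 0;
        }
        digits.resize(digits.size() - static_cast<std::size_t>(scale));
    } else if (scale < 0) {
        // Negative scale: multiply by 10^(-scale) by appending zeros.
        if (scale < -20) {
            return std::numeric_limits<unsigned long>::max();
        }
        digits.append(static_cast<std::size_t>(-scale), '0');
    }
    // strtoul() returns ULONG_MAX on overflow and 0 for an empty string.
    return std::strtoul(digits.c_str(), nullptr, 10);
}
```

For example, `truncate_decimal("12345", 2)` represents 123.45 and truncates to 123, while `truncate_decimal("12", -3)` represents 12000.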

				// The following is_expired() functions all check if an item with the given

				// expiration time has expired, according to the DynamoDB API rules.

				// The rules are:

				// 1. If the expiration time attribute's value is not a number type,

				//    the item is not expired.

				// 2. The expiration time is measured in seconds since the UNIX epoch.

				// 3. If the expiration time is more than 5 years in the past, it is assumed

				//    to be malformed and ignored - and the item does not expire.

				static bool is_expired(gc_clock::time_point expiration_time, gc_clock::time_point now) {

				    return expiration_time <= now &&

				           expiration_time > now - std::chrono::years(5);

				}

				static bool is_expired(const big_decimal& expiration_time, gc_clock::time_point now) {

				    unsigned long t = bigdecimal_to_ul(expiration_time);

				    // We assume - and the assumption turns out to be correct - that the

				    // epoch of gc_clock::time_point and the one used by the DynamoDB protocol

				    // are the same (the UNIX epoch in UTC). The resolution (seconds) is also

				    // the same.

				    return is_expired(gc_clock::time_point(gc_clock::duration(std::chrono::seconds(t))), now);

				}

				static bool is_expired(const rjson::value& expiration_time, gc_clock::time_point now) {

				    std::optional<big_decimal> n = try_unwrap_number(expiration_time);

				    return n && is_expired(*n, now);

				}
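The expiration window checked by `is_expired()` can be sketched with plain `std::chrono::seconds`. This is an illustration rather than the code above: `is_expired_sketch` and `five_years` are hypothetical names, and timestamps are passed as seconds since the UNIX epoch instead of `gc_clock::time_point`. The constant uses the same 31556952-second year as `std::chrono::years`.

```cpp
#include <cassert>
#include <chrono>

using seconds = std::chrono::seconds;

// Five average Gregorian years, matching std::chrono::years(5).
constexpr seconds five_years{5LL * 31556952};

// An item is expired if its expiration time has passed, but not if that
// time is five or more years in the past (treated as malformed).
bool is_expired_sketch(seconds expiration, seconds now) {
    return expiration <= now && expiration > now - five_years;
}
```

With `now` at 1700000000 seconds, an expiration time of 1699999990 counts as expired, while 0 (the epoch) falls outside the five-year window and does not.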

				// expire_item() expires an item - i.e., deletes it as appropriate for

				// expiration - with CL=QUORUM and (FIXME!) in a way Alternator Streams

				// understands it is an expiration event - not a user-initiated deletion.

				static future<> expire_item(service::storage_proxy& proxy,

				                            const service::query_state& qs,

				                            const std::vector<bytes_opt>& row,

				                            schema_ptr schema,

				                            api::timestamp_type ts) {

				    // Prepare the row key to delete

				    // NOTICE: the order of columns is guaranteed by the fact that selection::wildcard

				    // is used, which indicates that columns appear in the order defined by

				    // schema::all_columns_in_select_order() - partition key columns go first,

				    // immediately followed by clustering key columns

				    std::vector<bytes> exploded_pk;

				    const unsigned pk_size = schema->partition_key_size();

				    const unsigned ck_size = schema->clustering_key_size();

				    for (unsigned c = 0; c < pk_size; ++c) {

				        const auto& row_c = row[c];

				        if (!row_c) {

				            // This shouldn't happen - all key columns must have values.

				            // But if it ever happens, let's just *not* expire the item.

				            // FIXME: log or increment a metric if this happens.

				            return make_ready_future<>();

				        }

				        exploded_pk.push_back(*row_c);

				    }

				    auto pk = partition_key::from_exploded(exploded_pk);

				    mutation m(schema, pk);

				    // If there's no clustering key, a tombstone should be created directly

				    // on a partition, not on a clustering row - otherwise it will look like

				    // an open-ended range tombstone, which will crash on KA/LA sstable format.

				    // See issue #6035

				    if (ck_size == 0) {

				        m.partition().apply(tombstone(ts, gc_clock::now()));

				    } else {

				        std::vector<bytes> exploded_ck;

				        for (unsigned c = pk_size; c < pk_size + ck_size; ++c) {

				            const auto& row_c = row[c];

				            if (!row_c) {

				                // This shouldn't happen - all key columns must have values.

				                // But if it ever happens, let's just *not* expire the item.

				                // FIXME: log or increment a metric if this happens.

				                return make_ready_future<>();

				            }

				            exploded_ck.push_back(*row_c);

				        }

				        auto ck = clustering_key::from_exploded(exploded_ck);

				        m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));

				    }

				    std::vector<mutation> mutations;

				    mutations.push_back(std::move(m));

				    return proxy.mutate(std::move(mutations),

				        db::consistency_level::LOCAL_QUORUM,

				        executor::default_timeout(), // FIXME - which timeout?

				        qs.get_trace_state(), qs.get_permit(),

				        db::allow_per_partition_rate_limit::no);

				}

				static size_t random_offset(size_t min, size_t max) {

				    static thread_local std::default_random_engine re{std::random_device{}()};

				    std::uniform_int_distribution<size_t> dist(min, max);

				    return dist(re);

				}

				// Get a list of secondary token ranges for the given node, and the primary

				// node responsible for each of these token ranges.

				// A "secondary range" is a range of tokens where for each token, the second

				// node (in ring order) out of the RF replicas that hold this token is the

				// given node.

				// In the expiration scanner, we want to scan a secondary range but only if

				// this range's primary node is down. For this we need to return not just

				// a list of this node's secondary ranges - but also the primary owner of

				// each of those ranges.

				static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary_ranges(

				        const locator::effective_replication_map_ptr& erm,

				        gms::inet_address ep) {

				    const auto& tm = *erm->get_token_metadata_ptr();

				    const auto& sorted_tokens = tm.sorted_tokens();

				    std::vector<std::pair<dht::token_range, gms::inet_address>> ret;

				    if (sorted_tokens.empty()) {

				        on_internal_error(tlogger, "Token metadata is empty");

				    }

				    auto prev_tok = sorted_tokens.back();

				    for (const auto& tok : sorted_tokens) {

				        inet_address_vector_replica_set eps = erm->get_natural_endpoints(tok);

				        if (eps.size() <= 1 || eps[1] != ep) {

				            prev_tok = tok;

				            continue;

				        }

				        // Add the range (prev_tok, tok] to ret. However, if the range wraps

				        // around, split it to two non-wrapping ranges.

				        if (prev_tok < tok) {

				            ret.emplace_back(

				                dht::token_range{

				                    dht::token_range::bound(prev_tok, false),

				                    dht::token_range::bound(tok, true)},

				                eps[0]);

				        } else {

				            ret.emplace_back(

				                dht::token_range{

				                    dht::token_range::bound(prev_tok, false),

				                    std::nullopt},

				                eps[0]);

				            ret.emplace_back(

				                dht::token_range{

				                    std::nullopt,

				                    dht::token_range::bound(tok, true)},

				                eps[0]);

				        }

				        prev_tok = tok;

				    }

				    return ret;

				}

				// A class for iterating over all the token ranges *owned* by this shard.

				// To avoid code duplication, it is a template with two distinct cases -

				// <primary> and <secondary>:

				//

				// In the <primary> case, we consider a token *owned* by this shard if:

				// 1. This node is a replica for this token.

				// 2. Moreover, this node is the *primary* replica of the token (i.e., the

				//    first replica in the ring).

				// 3. In this node, this shard is responsible for this token.

				// We will use this definition of which shard in the cluster owns which tokens

				// to split the expiration scanner's work between all the shards of the

				// system.

				//

				// In the <secondary> case, we consider a token *owned* by this shard if:

				// 1. This node is the *secondary* replica for this token (i.e., the second

				//    replica in the ring).

				// 2. The primary replica for this token is currently marked down.

				// 3. In this node, this shard is responsible for this token.

				// We use the <secondary> case to handle the possibility that some of the

				// nodes in the system are down. A dead node will not be expiring

				// the tokens owned by it, so we want the secondary owner to take over its

				// primary ranges.

				//

				// FIXME: need to decide how to choose primary ranges in multi-DC setup!

				// We could call get_primary_ranges_within_dc() below instead of get_primary_ranges().

				// NOTICE: Iteration currently starts from a random token range in order to improve

				// the chances of covering all ranges during a scan when restarts occur.

				// A more deterministic way would be to regularly persist the scanning state,

				// but that incurs overhead that we want to avoid if not needed.

				enum primary_or_secondary_t {primary, secondary};

				template<primary_or_secondary_t primary_or_secondary>

				class token_ranges_owned_by_this_shard {

				    // ranges_holder_primary holds just the primary ranges themselves

				    class ranges_holder_primary {

				        const dht::token_range_vector _token_ranges;

				     public:

				        ranges_holder_primary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)

				            : _token_ranges(erm->get_primary_ranges(ep)) {}

				        std::size_t size() const { return _token_ranges.size(); }

				        const dht::token_range& operator[](std::size_t i) const {

				            return _token_ranges[i];

				        }

				        bool should_skip(std::size_t i) const {

				            return false;

				        }

				    };

				    // ranges_holder_secondary holds the secondary token ranges plus each

				    // range's primary owner, needed to implement should_skip().

				    class ranges_holder_secondary {

				        std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;

				        gms::gossiper& _gossiper;

				     public:

				        ranges_holder_secondary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)

				            : _token_ranges(get_secondary_ranges(erm, ep))

				            , _gossiper(g) {}

				        std::size_t size() const { return _token_ranges.size(); }

				        const dht::token_range& operator[](std::size_t i) const {

				            return _token_ranges[i].first;

				        }

				        // range i should be skipped if its primary owner is alive.

				        bool should_skip(std::size_t i) const {

				            return _gossiper.is_alive(_token_ranges[i].second);

				        }

				    };

				    schema_ptr _s;

				    // _token_ranges will contain a list of token ranges owned by this node.

				    // We'll further need to split each such range into the pieces owned by

				    // the current shard, using _intersecter.

				    using ranges_holder = std::conditional_t<

				            primary_or_secondary == primary_or_secondary_t::primary,

				            ranges_holder_primary,

				            ranges_holder_secondary>;

				    const ranges_holder _token_ranges;

				    // NOTICE: _range_idx is used modulo _token_ranges size when accessing

				    // the data to ensure that it doesn't go out of bounds

				    size_t _range_idx;

				    size_t _end_idx;

				    std::optional<dht::selective_token_range_sharder> _intersecter;

				public:

				    token_ranges_owned_by_this_shard(replica::database& db, gms::gossiper& g, schema_ptr s)

				        :  _s(s)

				        , _token_ranges(db.find_keyspace(s->ks_name()).get_effective_replication_map(),

				                g, utils::fb_utilities::get_broadcast_address())

				        , _range_idx(random_offset(0, _token_ranges.size() - 1))

				        , _end_idx(_range_idx + _token_ranges.size())

				    {

				        tlogger.debug("Generating token ranges starting from base range {} of {}", _range_idx, _token_ranges.size());

				    }

				    // Return the next token_range owned by this shard, or nullopt when the

				    // iteration ends.

				    std::optional<dht::token_range> next() {

				        // We may need three or more iterations in the following loop if a

				        // vnode doesn't intersect with the given shard at all (such a small

				        // vnode is unlikely, but possible). The loop cannot be infinite

				        // because each iteration of the loop advances _range_idx.

				        for (;;) {

				            if (_intersecter) {

				                std::optional<dht::token_range> ret = _intersecter->next();

				                if (ret) {

				                    return ret;

				                }

				                // done with this range, go to next one

				                ++_range_idx;

				                _intersecter = std::nullopt;

				            }

				            if (_range_idx == _end_idx) {

				                return std::nullopt;

				            }

				            // If should_skip(), the range should be skipped. This happens for

				            // a secondary range whose primary owning node is still alive.

				            while (_token_ranges.should_skip(_range_idx % _token_ranges.size())) {

				                ++_range_idx;

				                if (_range_idx == _end_idx) {

				                    return std::nullopt;

				                }

				            }

				            _intersecter.emplace(_s->get_sharder(), _token_ranges[_range_idx % _token_ranges.size()], this_shard_id());

				        }

				    }

				    // Same as next(), just return a partition_range instead of token_range

				    std::optional<dht::partition_range> next_partition_range() {

				        std::optional<dht::token_range> ret = next();

				        if (ret) {

				            return dht::to_partition_range(*ret);

				        } else {

				            return std::nullopt;

				        }

				    }

				};

				// Precomputed information needed to perform a scan on partition ranges

				struct scan_ranges_context {

				    schema_ptr s;

				    bytes column_name;

				    std::optional<std::string> member;

				    ::shared_ptr<cql3::selection::selection> selection;

				    std::unique_ptr<service::query_state> query_state_ptr;

				    std::unique_ptr<cql3::query_options> query_options;

				    ::lw_shared_ptr<query::read_command> command;

				    scan_ranges_context(schema_ptr s, service::storage_proxy& proxy, bytes column_name, std::optional<std::string> member)

				        : s(s)

				        , column_name(column_name)

				        , member(member)

				    {

				        // FIXME: don't read the entire items - read only parts of it.

				        // We must read the key columns (to be able to delete) and also

				        // the requested attribute. If the requested attribute is a map's

				        // member we may be forced to read the entire map - but it would

				        // be good if we can read only the single item of the map - it

				        // should be possible (and a must for issue #7751!).

				        lw_shared_ptr<service::pager::paging_state> paging_state = nullptr;

				        auto regular_columns = boost::copy_range<query::column_id_vector>(

				            s->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));

				        selection = cql3::selection::selection::wildcard(s);

				        query::partition_slice::option_set opts = selection->get_query_options();

				        opts.set<query::partition_slice::option::allow_short_read>();

				        // It is important that the scan bypass cache to avoid polluting it:

				        opts.set<query::partition_slice::option::bypass_cache>();

				        std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};

				        auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);

				        command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));

				        executor::client_state client_state{executor::client_state::internal_tag()};

				        tracing::trace_state_ptr trace_state;

				        // NOTICE: empty_service_permit is used because the TTL service has fixed parallelism

				        query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, empty_service_permit());

				        // FIXME: What should we do on multi-DC? Will we run the expiration on the same ranges on all

				        // DCs or only once for each range? If the latter, we need to change the CLs in the

				        // scanner and deleter.

				        db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;

				        query_options = std::make_unique<cql3::query_options>(cl, std::vector<cql3::raw_value>{});

				        query_options = std::make_unique<cql3::query_options>(std::move(query_options), std::move(paging_state));

				    }

				};

				// Scan data in a list of token ranges in one table, looking for expired

				// items and deleting them.

				// Because of issue #9167, partition_ranges must have a single partition

				// range for this code to work correctly.

				static future<> scan_table_ranges(

				        service::storage_proxy& proxy,

				        const scan_ranges_context& scan_ctx,

				        dht::partition_range_vector&& partition_ranges,

				        abort_source& abort_source,

				        named_semaphore& page_sem,

				        expiration_service::stats& expiration_stats)

				{

				    const schema_ptr& s = scan_ctx.s;

				    assert (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.

				    auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,

				            *scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);

				    while (!p->is_exhausted()) {

				        if (abort_source.abort_requested()) {

				            co_return;

				        }

				        auto units = co_await get_units(page_sem, 1);

				        // We don't need to limit page size in number of rows because there is

				        // a builtin limit of the page's size in bytes. Setting this limit to

				        // 1 is useful for debugging the paging code with moderate-size data.

				        uint32_t limit = std::numeric_limits<uint32_t>::max();

				        // Read a page, and if that times out, try again after a small sleep.

				        // If we didn't catch the timeout exception, it would cause the scan to

				        // be aborted and only be restarted at the next scanning period.

				        // If we retry too many times, give up and restart the scan later.

				        std::unique_ptr<cql3::result_set> rs;

				        for (int retries=0; ; retries++) {

				            try {

				                // FIXME: which timeout?

				                rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());

				                break;

				            } catch(exceptions::read_timeout_exception&) {

				                tlogger.warn("expiration scanner read timed out, will retry: {}",

				                    std::current_exception());

				            }

				            // If we didn't break out of this loop, add a minimal sleep

				            if (retries >= 10) {

				                // Don't get stuck forever asking for the same page, maybe there's

				                // a bug or a real problem in several replicas. Give up on

				                // this scan and retry the scan from a random position later,

				                // in the next scan period.

				                throw runtime_exception("scanner thread failed after too many timeouts for the same page");

				            }

				            co_await sleep_abortable(std::chrono::seconds(1), abort_source);

				        }

				        auto rows = rs->rows();

				        auto meta = rs->get_metadata().get_names();

				        std::optional<unsigned> expiration_column;

				        for (unsigned i = 0; i < meta.size(); i++) {

				            const cql3::column_specification& col = *meta[i];

				            if (col.name->name() == scan_ctx.column_name) {

				                expiration_column = i;

				                break;

				            }

				        }

				        if (!expiration_column) {

				            continue;

				        }

				        for (const auto& row : rows) {

				            const bytes_opt& cell = row[*expiration_column];

				            if (!cell) {

				                continue;

				            }

				            auto v = meta[*expiration_column]->type->deserialize(*cell);

				            bool expired = false;

				            // FIXME: don't recalculate "now" all the time

				            auto now = gc_clock::now();

				            if (scan_ctx.member) {

				                // In this case, the expiration-time attribute we're

				                // looking for is a member in a map, saved serialized

				                // into bytes using Alternator's serialization (basically

				                // a JSON serialized into bytes)

				                // FIXME: is it possible to find a specific member of a map

				                // without iterating through it like we do here and compare

				                // the key?

				                for (const auto& entry : value_cast<map_type_impl::native_type>(v)) {

				                    std::string attr_name = value_cast<sstring>(entry.first);

				                    if (attr_name == *scan_ctx.member) {

				                        bytes value = value_cast<bytes>(entry.second);

				                        rjson::value json = deserialize_item(value);

				                        expired = is_expired(json, now);

				                        break;

				                    }

				                }

				            } else {

				                // For a real column to contain an expiration time, it

				                // must be a numeric type.

				                // FIXME: Currently we only support decimal_type (which is

				                // what Alternator uses), but other numeric types can be

				                // supported as well to make this feature more useful in CQL.

				                // Note that kind::decimal is also checked above.

				                big_decimal n = value_cast<big_decimal>(v);

				                expired = is_expired(n, now);

				            }

				            if (expired) {

				                expiration_stats.items_deleted++;

				                // FIXME: maybe don't recalculate new_timestamp() all the time

				                // FIXME: if expire_item() throws on timeout, we need to retry it.

				                auto ts = api::new_timestamp();

				                co_await expire_item(proxy, *scan_ctx.query_state_ptr, row, s, ts);

				            }

				        }

				        // FIXME: once in a while, persist p->state(), so on reboot

				        // we don't start from scratch.

				    }

				}

				// scan_table() scans, in one table, data "owned" by this shard, looking for

				// expired items and deleting them.

				// We consider each node to "own" its primary token ranges, i.e., the tokens

				// for which this node is the first replica in the ring. Inside the node, each

				// shard "owns" subranges of the node's token ranges - according to the node's

				// sharding algorithm.

				// When a node goes down, the token ranges owned by it would not be scanned

				// and items in those token ranges would not expire, so this function

				// additionally scans token ranges whose primary owner is down and for

				// which this node is the range's secondary owner.

				// If the TTL (expiration-time scanning) feature is not enabled for this

				// table, scan_table() returns false without doing anything. Remember that the

				// TTL feature may be enabled later so this function will need to be called

				// again when the feature is enabled.

				// Currently this function scans the entire table (or, rather the parts owned

				// by this shard) at full rate, once. In the future (FIXME) we should consider

				// how to pace this scan, how and when to repeat it, how to interleave or

				// parallelize scanning of multiple tables, and how to continue scans after a

				// reboot.

				static future<bool> scan_table(

				    service::storage_proxy& proxy,

				    data_dictionary::database db,

				    gms::gossiper& gossiper,

				    schema_ptr s,

				    abort_source& abort_source,

				    named_semaphore& page_sem,

				    expiration_service::stats& expiration_stats)

				{

				    // Check if an expiration-time attribute is enabled for this table.

				    // If not, just return false immediately.

				    // FIXME: the setting of the TTL may change in the middle of a long scan!

				    std::optional<std::string> attribute_name = db::find_tag(*s, TTL_TAG_KEY);

				    if (!attribute_name) {

				        co_return false;

				    }

				    // attribute_name may be one of the schema's columns (in Alternator, this

				    // means it's a key column), or an element in Alternator's attrs map

				    // encoded in Alternator's JSON encoding.

				    // FIXME: To make this less Alternator-specific, we should encode in the

				    // single key's value three things:

				    // 1. The name of a column

				    // 2. Optionally if column is a map, a member in the map

				    // 3. The deserializer for the value: CQL or Alternator (JSON).

				    // The deserializer can be guessed: If the given column or map item is

				    // numeric, it can be used directly. If it is a "bytes" type, it needs to

				    // be deserialized using Alternator's deserializer.

				    bytes column_name = to_bytes(*attribute_name);

				    const column_definition *cd = s->get_column_definition(column_name);

				    std::optional<std::string> member;

				    if (!cd) {

				        member = std::move(attribute_name);

				        column_name = bytes(executor::ATTRS_COLUMN_NAME);

				        cd = s->get_column_definition(column_name);

				        tlogger.info("table {} TTL enabled with attribute {} in {}", s->cf_name(), *member, executor::ATTRS_COLUMN_NAME);

				    } else {

				        tlogger.info("table {} TTL enabled with attribute {}", s->cf_name(), *attribute_name);

				    }

				    if (!cd) {

				        tlogger.info("table {} TTL column is missing, not scanning", s->cf_name());

				        co_return false;

				    }

				    data_type column_type = cd->type;

				    // Verify that the column has the right type: If "member" exists

				    // the column must be a map, and if it doesn't, the column must

				    // (currently) be a decimal_type. If the column has the wrong type

				    // nothing can get expired in this table, and it's pointless to

				    // scan it.

				    if ((member && column_type->get_kind() != abstract_type::kind::map) ||

				        (!member && column_type->get_kind() != abstract_type::kind::decimal)) {

				        tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());

				        co_return false;

				    }

				    expiration_stats.scan_table++;

				    // FIXME: need to pace the scan, not do it all at once.

				    scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};

				    token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), gossiper, s);

				    while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {

				        // Note that because of issue #9167 we need to run a separate

				        // query on each partition range, and can't pass several of

				        // them into one partition_range_vector.

				        dht::partition_range_vector partition_ranges;

				        partition_ranges.push_back(std::move(*range));

				        // FIXME: if scanning a single range fails, including network errors,

				        // we fail the entire scan (and rescan from the beginning). Need to

				        // reconsider this. Saving the scan position might be a good enough

				        // solution for this problem.

				        co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				    }

				    // If each node only scans its own primary ranges, then when any node is

				    // down part of the token range will not get scanned. This can be viewed

				    // as acceptable (when the node comes back online, it will resume its scan),

				    // but as noted in issue #9787, we can allow more prompt expiration

				    // by tasking another node to take over scanning of the dead node's primary

				    // ranges. What we do here is that this node will also check expiration

				    // on its *secondary* ranges - but only those whose primary owner is down.

				    token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), gossiper, s);

				    while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {

				        expiration_stats.secondary_ranges_scanned++;

				        dht::partition_range_vector partition_ranges;

				        partition_ranges.push_back(std::move(*range));

				        co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				    }

				    co_return true;

				}

				future<> expiration_service::run() {

				    // FIXME: don't just tight-loop, think about timing, pace, and

				    // store position in durable storage, etc.

				    // FIXME: think about working on different tables in parallel.

				    // also need to notice when a new table is added, a table is

				    // deleted or when ttl is enabled or disabled for a table!

				    for (;;) {

				        auto start = lowres_clock::now();

				        // _db.get_tables() may change under our feet during a

				        // long-living loop, so we must keep our own copy of the list of

				        // schemas.

				        std::vector<schema_ptr> schemas;

				        for (auto cf : _db.get_tables()) {

				            schemas.push_back(cf.schema());

				        }

				        for (schema_ptr s : schemas) {

				            co_await coroutine::maybe_yield();

				            if (shutting_down()) {

				                co_return;

				            }

				            try {

				                co_await scan_table(_proxy, _db, _gossiper, s, _abort_source, _page_sem, _expiration_stats);

				            } catch (...) {

				                // The scan of a table may fail in the middle for many

				                // reasons, including network failure and even the table

				                // being removed. We'll continue scanning this table later

				                // (if it still exists). In any case it's important to catch

				                // the exception and not let the scanning service die for

				                // good.

				                // If the table has been deleted, it is expected that the scan

				                // will fail at some point, and even a warning is excessive.

				                if (_db.has_schema(s->ks_name(), s->cf_name())) {

				                    tlogger.warn("table {}.{} expiration scan failed: {}",

				                        s->ks_name(), s->cf_name(), std::current_exception());

				                } else {

				                    tlogger.info("expiration scan failed when table {}.{} was deleted",

				                        s->ks_name(), s->cf_name());

				                }

				            }

				        }

				        _expiration_stats.scan_passes++;

				        // The TTL scanner runs above once over all tables, at full steam.

				        // After completing such a scan, we sleep until it's time to start

				        // another scan. TODO: If the scan went too fast, we can slow it down

				        // in the next iteration by reducing the scanner's scheduling-group

				        // share (if using a separate scheduling group), or introduce

				        // finer-grain sleeps into the scanning code.

				        std::chrono::milliseconds scan_duration(std::chrono::duration_cast<std::chrono::milliseconds>(lowres_clock::now() - start));

				        std::chrono::milliseconds period(long(_db.get_config().alternator_ttl_period_in_seconds() * 1000));

				        if (scan_duration < period) {

				            try {

				                tlogger.info("sleeping {} seconds until next period", (period - scan_duration).count()/1000.0);

				                co_await seastar::sleep_abortable(period - scan_duration, _abort_source);

				            } catch(seastar::sleep_aborted&) {}

				        } else {

				            tlogger.warn("scan took {} seconds, longer than period - not sleeping", scan_duration.count()/1000.0);

				        }

				    }

				}

				future<> expiration_service::start() {

				    // Called by main() on each shard to start the expiration-service

				    // thread. Just runs run() in the background and allows stop().

				    if (_db.features().alternator_ttl) {

				        if (!shutting_down()) {

				            _end = run().handle_exception([] (std::exception_ptr ep) {

				                tlogger.error("expiration_service failed: {}", ep);

				            });

				        }

				    }

				    return make_ready_future<>();

				}

				future<> expiration_service::stop() {

				    if (_abort_source.abort_requested()) {

				        throw std::logic_error("expiration_service::stop() called a second time");

				    }

				    _abort_source.request_abort();

				    if (!_end) {

				        // if _end was not set, start() was never called

				        return make_ready_future<>();

				    }

				    return std::move(*_end);

				}

				expiration_service::stats::stats() {

				    _metrics.add_group("expiration", {

				        seastar::metrics::make_total_operations("scan_passes", scan_passes,

				            seastar::metrics::description("number of passes over the database")),

				        seastar::metrics::make_total_operations("scan_table", scan_table,

				            seastar::metrics::description("number of table scans (counting each scan of each table that enabled expiration)")),

				        seastar::metrics::make_total_operations("items_deleted", items_deleted,

				            seastar::metrics::description("number of items deleted after expiration")),

				        seastar::metrics::make_total_operations("secondary_ranges_scanned", secondary_ranges_scanned,

				            seastar::metrics::description("number of token ranges scanned by this node while their primary owner was down")),

				    });

				}

				} // namespace alternator
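
The wrap-around handling in get_secondary_ranges() above can be illustrated in isolation. The following is a minimal sketch, not ScyllaDB code: `simple_range` and `ring_ranges` are hypothetical names, tokens are modeled as plain `int64_t` values, and replica ownership is ignored, so every range (prev_tok, tok] is emitted rather than only the secondary ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for dht::token_range: the half-open range (start, end].
struct simple_range {
    std::optional<int64_t> start; // exclusive bound; nullopt means unbounded below
    std::optional<int64_t> end;   // inclusive bound; nullopt means unbounded above
};

// Hypothetical helper: for each token in ring order, emit the range
// (prev_tok, tok]. The single range that wraps around the end of the ring
// is split into two non-wrapping ranges, as in get_secondary_ranges().
std::vector<simple_range> ring_ranges(const std::vector<int64_t>& sorted_tokens) {
    assert(std::is_sorted(sorted_tokens.begin(), sorted_tokens.end()));
    std::vector<simple_range> ret;
    if (sorted_tokens.empty()) {
        return ret;
    }
    // Start with the last token so that the first iteration sees the wrap.
    int64_t prev_tok = sorted_tokens.back();
    for (int64_t tok : sorted_tokens) {
        if (prev_tok < tok) {
            ret.push_back({prev_tok, tok});
        } else {
            // Wrapping range: (prev_tok, +inf) followed by (-inf, tok].
            ret.push_back({prev_tok, std::nullopt});
            ret.push_back({std::nullopt, tok});
        }
        prev_tok = tok;
    }
    return ret;
}
```

For the tokens {10, 20, 30}, this sketch yields four ranges in order: (30, +inf), (-inf, 10], (10, 20] and (20, 30]. Only the first iteration wraps, because the tokens are sorted and prev_tok starts at the largest one.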

									
										80

alternator/ttl.hh
									
										Normal file
									
												
				@@ -0,0 +1,80 @@

				/*

				 * Copyright 2021-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include "seastarx.hh"

				#include <seastar/core/sharded.hh>

				#include <seastar/core/abort_source.hh>

				#include <seastar/core/semaphore.hh>

				#include "data_dictionary/data_dictionary.hh"

				namespace gms {

				class gossiper;

				}

				namespace replica {

				class database;

				}

				namespace service {

				    class storage_proxy;

				}

				namespace alternator {

				// expiration_service is a sharded service responsible for cleaning up expired

				// items in all tables with per-item expiration enabled. Currently, this means

				// Alternator tables with TTL configured via an UpdateTimeToLive request.

				class expiration_service final : public seastar::peering_sharded_service<expiration_service> {

				public:

				    // Object holding per-shard statistics related to the expiration service.

				    // While this object is alive, these metrics are also registered to be

				    // visible by the metrics REST API, with the "expiration_" prefix.

				    class stats {

				    public:

				        stats();

				        uint64_t scan_passes = 0;

				        uint64_t scan_table = 0;

				        uint64_t items_deleted = 0;

				        uint64_t secondary_ranges_scanned = 0;

				    private:

				        // The metric_groups object holds this stat object's metrics registered

				        // as long as the stats object is alive.

				        seastar::metrics::metric_groups _metrics;

				    };

				private:

				    data_dictionary::database _db;

				    service::storage_proxy& _proxy;

				    gms::gossiper& _gossiper;

				    // _end is set by start(), and resolves when the background service

				    // started by it ends. To ask the background service to end, _abort_source

				    // should be triggered. stop() below uses both _abort_source and _end.

				    std::optional<future<>> _end;

				    abort_source _abort_source;

				    // Ensures that at most 1 page of scan results at a time is processed by the TTL service

				    named_semaphore _page_sem{1, named_semaphore_exception_factory{"alternator_ttl"}};

				    bool shutting_down() { return _abort_source.abort_requested(); }

				    stats _expiration_stats;

				public:

				    // sharded_service<expiration_service>::start() creates this object on

				    // all shards, so calls this constructor on each shard. Later, the

				    // additional start() function should be invoked on all shards.

				    expiration_service(data_dictionary::database, service::storage_proxy&, gms::gossiper&);

				    future<> start();

				    future<> run();

				    // sharded_service<expiration_service>::stop() calls the following stop()

				    // method on each shard. This stop() asks the service on this shard to

				    // shut down as quickly as it can. The returned future indicates when the

				    // service is no longer running.

				    // stop() may be called even before start(), but may only be called once -

				    // calling it twice will result in an exception.

				    future<> stop();

				};

				} // namespace alternator
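
The stop() contract documented in the header above (stop() may be called before start(), but only once) can be modeled without Seastar. The following is a minimal single-threaded sketch under stated assumptions: `background_service` is a hypothetical class, `step()` stands in for one iteration of the real coroutine-based run() loop, and the bool returned by stop() replaces the future<> that the real service returns.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical single-threaded model of the expiration_service lifecycle.
class background_service {
    bool _abort_requested = false; // models _abort_source
    bool _started = false;         // models "_end has a value"
    bool _finished = false;        // models "_end has resolved"
    unsigned _iterations = 0;
public:
    // Starts the background loop unless shutdown was already requested.
    void start() {
        if (!_abort_requested) {
            _started = true;
        }
    }
    // Runs one unit of background work. Returns false once the loop has
    // exited or if it was never started.
    bool step() {
        assert(!_finished || _abort_requested);
        if (!_started || _finished) {
            return false;
        }
        if (_abort_requested) {
            _finished = true;
            return false;
        }
        ++_iterations;
        return true;
    }
    // Requests shutdown and waits for the loop to exit. Returns true if
    // there was a background loop to wait for. Throws on a second call.
    bool stop() {
        if (_abort_requested) {
            throw std::logic_error("stop() called a second time");
        }
        _abort_requested = true;
        if (!_started) {
            return false; // start() was never called, nothing to wait for
        }
        while (step()) {}
        return true;
    }
    unsigned iterations() const { return _iterations; }
};
```

The ordering in stop() mirrors the real method: the double-call check comes first, then the abort request, and only then the check for whether anything was ever started.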

									
										15

amplify.yml
									
										Normal file
									
												
				@@ -0,0 +1,15 @@

				version: 1

				applications:

				  - frontend:

				      phases:

				        build:

				          commands:

				            - make setupenv

				            - make dirhtml

				      artifacts:

				        baseDirectory: _build/dirhtml

				        files:

				          - '**/*'

				      cache:

				        paths: []

				    appRoot: docs

									
										29

api/api-doc/authorization_cache.json
									
										Normal file
									
												
				@@ -0,0 +1,29 @@

				{

				  "apiVersion":"0.0.1",

				  "swaggerVersion":"1.2",

				  "basePath":"{{Protocol}}://{{Host}}",

				  "resourcePath":"/authorization_cache",

				  "produces":[

				    "application/json"

				  ],

				  "apis":[

				    {

				      "path":"/authorization_cache/reset",

				      "operations":[

				        {

				          "method":"POST",

				          "summary":"Reset cache",

				          "type":"void",

				          "nickname":"authorization_cache_reset",

				          "produces":[

				            "application/json"

				          ],

				          "parameters":[

				          ]

				        }

				      ]

				    }

				  ],

				  "models":{

				  }

				}

									
										42

api/api-doc/compaction_manager.json
									
				@@ -102,7 +102,47 @@

				               "parameters":[

				                  {

				                     "name":"type",

				                     "description":"the type of compaction to stop. Can be one of: - COMPACTION - VALIDATION - CLEANUP - SCRUB - INDEX_BUILD",

				                     "description":"The type of compaction to stop. Can be one of: COMPACTION | CLEANUP | SCRUB | UPGRADE | RESHAPE",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/compaction_manager/stop_keyspace_compaction/{keyspace}",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Stop all running compaction-like tasks in the given keyspace and tables having the provided type.",

				               "type":"void",

				               "nickname":"stop_keyspace_compaction",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace to stop compaction in",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"tables",

				                     "description":"Comma-separated tables to stop compaction in",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"type",

				                     "description":"The type of compaction to stop. Can be one of: COMPACTION | CLEANUP | SCRUB | UPGRADE | RESHAPE",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

									
										43

api/api-doc/raft.json
									
										Normal file
									
				@@ -0,0 +1,43 @@

				{

				   "apiVersion":"0.0.1",

				   "swaggerVersion":"1.2",

				   "basePath":"{{Protocol}}://{{Host}}",

				   "resourcePath":"/raft",

				   "produces":[

				      "application/json"

				   ],

				   "apis":[

				      {

				         "path":"/raft/trigger_snapshot/{group_id}",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Triggers snapshot creation and log truncation for the given Raft group",

				               "type":"string",

				               "nickname":"trigger_snapshot",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"group_id",

				                     "description":"The ID of the group which should get snapshotted",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"timeout",

				                     "description":"Timeout in seconds after which the endpoint returns a failure. If not provided, 60s is used.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"long",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      }

				   ]

				}
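
A minimal client-side sketch of how the URL for the `trigger_snapshot` operation described above might be assembled. The helper name `trigger_snapshot_url` and the base address `http://localhost:10000` are assumptions for illustration and do not come from this diff. Only the path, the `group_id` path parameter and the optional `timeout` query parameter are taken from the API description.

```python
from urllib.parse import quote, urlencode


def trigger_snapshot_url(base, group_id, timeout=None):
    """Build the POST URL for /raft/trigger_snapshot/{group_id}.

    `base` is a hypothetical REST API address supplied by the caller.
    `timeout` is in seconds; when omitted, the query string is left out
    and the server-side default described in the API doc (60s) applies.
    """
    # Percent-encode the path parameter so unusual characters stay in one segment.
    url = f"{base.rstrip('/')}/raft/trigger_snapshot/{quote(str(group_id), safe='')}"
    if timeout is not None:
        url += "?" + urlencode({"timeout": int(timeout)})
    return url


if __name__ == "__main__":
    # The group ID below is a placeholder, not a real Raft group.
    print(trigger_snapshot_url("http://localhost:10000", "example-group-id", timeout=30))
```

The request itself would be sent with the `POST` method, as declared in the operation.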

									
										73

api/api-doc/storage_service.json
									
				@@ -624,7 +624,7 @@

				                  },

				                  {

				                     "name":"kn",

				                     "description":"Comma seperated keyspaces name to snapshot",

				                     "description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -632,7 +632,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"the column family to snapshot",

				                     "description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -667,7 +667,7 @@

				                  },

				                  {

				                     "name":"kn",

				                     "description":"Comma seperated keyspaces name that their snapshot will be deleted",

				                     "description":"Comma-separated keyspaces name that their snapshot will be deleted",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -723,7 +723,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -755,7 +755,39 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/keyspace_offstrategy_compaction/{keyspace}",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Perform offstrategy compaction, if needed, in a single keyspace",

				               "type":"boolean",

				               "nickname":"perform_keyspace_offstrategy_compaction",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace to operate on",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma-separated table names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -807,6 +839,19 @@

				                     ],

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"quarantine_mode",

				                     "description":"Controls whether to scrub quarantined sstables (default INCLUDE)",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "enum":[

				                        "INCLUDE",

				                        "EXCLUDE",

				                        "ONLY"

				                     ],

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace to query about",

				@@ -817,7 +862,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -857,7 +902,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -889,7 +934,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1183,7 +1228,7 @@

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Removes token (and all data associated with enpoint that had it) from the ring",

				               "summary":"Removes a node from the cluster. Replicated data that logically belonged to this node is redistributed among the remaining nodes.",

				               "type":"void",

				               "nickname":"remove_node",

				               "produces":[

				@@ -1200,7 +1245,7 @@

				                  },

				                  {

				                     "name":"ignore_nodes",

				                     "description":"List of dead nodes to ingore in removenode operation",

				                     "description":"Comma-separated list of dead nodes to ignore in removenode operation. Use the same method for all nodes to ignore: either Host IDs or ip addresses.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1901,7 +1946,7 @@

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Reset local schema",

				               "summary":"Forces this node to recalculate versions of schema objects.",

				               "type":"void",

				               "nickname":"reset_local_schema",

				               "produces":[

				@@ -2028,7 +2073,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2055,7 +2100,7 @@

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Comma seperated column family names",

				                     "description":"Comma-separated column family names",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2596,7 +2641,7 @@

				            "version":{

				               "type":"string",

				               "enum":[

				                  "ka", "la", "mc", "md"

				                  "ka", "la", "mc", "md", "me"

				               ],

				               "description":"SSTable version"

				            },

									
										39

api/api-doc/system.json
									
				@@ -52,6 +52,45 @@

				            }

				         ]

				      },

				      {

				         "path":"/system/log",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Write a message to the Scylla log",

				               "type":"void",

				               "nickname":"write_log_message",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"message",

				                     "description":"The message to write to the log",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"level",

				                     "description":"The logging level to use",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "enum":[

				                        "error",

				                        "warn",

				                        "info",

				                        "debug",

				                        "trace"

				                     ],

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/system/drop_sstable_caches",

				         "operations":[

									
										305

api/api-doc/task_manager.json
									
										Normal file
									
				@@ -0,0 +1,305 @@

				{

				    "apiVersion":"0.0.1",

				    "swaggerVersion":"1.2",

				    "basePath":"{{Protocol}}://{{Host}}",

				    "resourcePath":"/task_manager",

				    "produces":[

				       "application/json"

				    ],

				    "apis":[

				       {

				          "path":"/task_manager/list_modules",

				          "operations":[

				             {

				                "method":"GET",

				                "summary":"Get all modules names",

				                "type":"array",

				                "items":{

				                   "type":"string"

				                },

				                "nickname":"get_modules",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                ]

				             }

				          ]

				       },

				       {

				          "path":"/task_manager/list_module_tasks/{module}",

				          "operations":[

				             {

				                "method":"GET",

				                "summary":"Get a list of tasks",

				                "type":"array",

				                "items":{

				                    "type":"task_stats"

				                },

				                "nickname":"get_tasks",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                    {

				                        "name":"module",

				                        "description":"The module to query about",

				                        "required":true,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"path"

				                    },

				                    {

				                        "name":"internal",

				                        "description":"Boolean flag indicating whether internal tasks should be shown (false by default)",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"boolean",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"keyspace",

				                        "description":"The keyspace to query about",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"table",

				                        "description":"The table to query about",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    }

				                ]

				             }

				          ]

				       },

				       {

				          "path":"/task_manager/task_status/{task_id}",

				          "operations":[

				             {

				                "method":"GET",

				                "summary":"Get task status",

				                "type":"task_status",

				                "nickname":"get_task_status",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                    {

				                        "name":"task_id",

				                        "description":"The uuid of a task to query about",

				                        "required":true,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"path"

				                    }

				                ]

				             }

				          ]

				       },

				       {

				          "path":"/task_manager/abort_task/{task_id}",

				          "operations":[

				             {

				                "method":"POST",

				                "summary":"Abort running task and its descendants",

				                "type":"void",

				                "nickname":"abort_task",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                   {

				                      "name":"task_id",

				                      "description":"The uuid of a task to abort",

				                      "required":true,

				                      "allowMultiple":false,

				                      "type":"string",

				                      "paramType":"path"

				                   }

				                ]

				             }

				          ]

				       },

				       {

				        "path":"/task_manager/wait_task/{task_id}",

				        "operations":[

				           {

				              "method":"GET",

				              "summary":"Wait for a task to complete",

				              "type":"task_status",

				              "nickname":"wait_task",

				              "produces":[

				                 "application/json"

				              ],

				              "parameters":[

				                 {

				                    "name":"task_id",

				                    "description":"The uuid of a task to wait for",

				                    "required":true,

				                    "allowMultiple":false,

				                    "type":"string",

				                    "paramType":"path"

				                 }

				              ]

				           }

				        ]

				     },

				     {

				      "path":"/task_manager/task_status_recursive/{task_id}",

				      "operations":[

				         {

				            "method":"GET",

				            "summary":"Get statuses of the task and all its descendants",

				            "type":"array",

				            "items":{

				               "type":"task_status"

				            },

				            "nickname":"get_task_status_recursively",

				            "produces":[

				               "application/json"

				            ],

				            "parameters":[

				                {

				                    "name":"task_id",

				                    "description":"The uuid of a task to query about",

				                    "required":true,

				                    "allowMultiple":false,

				                    "type":"string",

				                    "paramType":"path"

				                }

				            ]

				         }

				      ]

				    }

				    ],

				    "models":{

				       "task_stats" :{

				           "id": "task_stats",

				           "description":"A task statistics object",

				           "properties":{

				             "task_id":{

				                "type":"string",

				                "description":"The uuid of a task"

				             },

				             "state":{

				                "type":"string",

				                "enum":[

				                  "created",

				                  "running",

				                  "done",

				                  "failed"

				                ],

				                "description":"The state of a task"

				             },

				             "type":{

				                "type":"string",

				                "description":"The description of the task"

				             },

				             "keyspace":{

				                "type":"string",

				                "description":"The keyspace the task is working on (if applicable)"

				             },

				             "table":{

				                "type":"string",

				                "description":"The table the task is working on (if applicable)"

				             },

				             "entity":{

				                "type":"string",

				                "description":"Task-specific entity description"

				             },

				             "sequence_number":{

				                "type":"long",

				                "description":"The running sequence number of the task"

				             }

				           }

				       },

				       "task_status":{

				          "id":"task_status",

				          "description":"A task status object",

				          "properties":{

				             "id":{

				                "type":"string",

				                "description":"The uuid of the task"

				             },

				             "type":{

				                "type":"string",

				                "description":"The description of the task"

				             },

				             "state":{

				               "type":"string",

				               "enum":[

				                 "created",

				                 "running",

				                 "done",

				                 "failed"

				               ],

				                "description":"The state of the task"

				             },

				             "is_abortable":{

				                "type":"boolean",

				                "description":"Boolean flag indicating whether the task can be aborted"

				             },

				             "start_time":{

				                "type":"datetime",

				                "description":"The start time of the task"

				             },

				             "end_time":{

				                "type":"datetime",

				                "description":"The end time of the task (unspecified when the task is not completed)"

				             },

				             "error":{

				                "type":"string",

				                "description":"Error string, if the task failed"

				             },

				             "parent_id":{

				               "type":"string",

				               "description":"The uuid of the parent task"

				            },

				            "sequence_number":{

				               "type":"long",

				               "description":"The running sequence number of the task"

				            },

				            "shard":{

				               "type":"long",

				               "description":"The number of a shard the task is running on"

				            },

				            "keyspace":{

				               "type":"string",

				               "description":"The keyspace the task is working on (if applicable)"

				            },

				            "table":{

				               "type":"string",

				               "description":"The table the task is working on (if applicable)"

				            },

				            "entity":{

				               "type":"string",

				               "description":"Task-specific entity description"

				            },

				            "progress_units":{

				               "type":"string",

				               "description":"A description of the progress units"

				            },

				            "progress_total":{

				               "type":"double",

				               "description":"The total number of units to complete for the task"

				            },

				            "progress_completed":{

				               "type":"double",

				               "description":"The number of units completed so far"

				            },

				            "children_ids":{

				               "type":"array",

				                "items":{

				                    "type":"string"

				                },

				               "description":"Task IDs of children of this task"

				            }

				          }

				       }

				    }

				 }
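
A rough sketch of how a client might interpret a `task_status` object as modelled above. The helper names (`task_status_url`, `is_finished`, `progress_fraction`), the base address and the sample values are all hypothetical. Only the endpoint path, the field names and the `state` enum values are taken from the API description. Treating `done` and `failed` as the terminal states is an assumption based on their names.

```python
from urllib.parse import quote

# Assumption: of the four declared states, these two mean the task has stopped.
TERMINAL_STATES = {"done", "failed"}


def task_status_url(base, task_id):
    """Build the GET URL for /task_manager/task_status/{task_id}."""
    return f"{base.rstrip('/')}/task_manager/task_status/{quote(str(task_id), safe='')}"


def is_finished(status):
    """Return True when the task_status dict reports a terminal state."""
    return status.get("state") in TERMINAL_STATES


def progress_fraction(status):
    """Return progress_completed / progress_total, or None if total is unknown."""
    total = status.get("progress_total") or 0
    if total <= 0:
        return None
    return status.get("progress_completed", 0) / total


if __name__ == "__main__":
    # Hand-written sample shaped like the task_status model; not real output.
    sample = {
        "id": "example-task-id",
        "state": "running",
        "progress_units": "units",
        "progress_total": 200.0,
        "progress_completed": 50.0,
    }
    print(task_status_url("http://localhost:10000", sample["id"]))
    print(is_finished(sample), progress_fraction(sample))
```

A polling loop could call `task_status_url` repeatedly until `is_finished` returns true, although the `wait_task` operation above appears to offer the same result in a single request.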

									
										177

api/api-doc/task_manager_test.json
									
										Normal file
									
				@@ -0,0 +1,177 @@

				{

				    "apiVersion":"0.0.1",

				    "swaggerVersion":"1.2",

				    "basePath":"{{Protocol}}://{{Host}}",

				    "resourcePath":"/task_manager_test",

				    "produces":[

				       "application/json"

				    ],

				    "apis":[

				       {

				          "path":"/task_manager_test/test_module",

				          "operations":[

				             {

				                "method":"POST",

				                "summary":"Register test module in task manager",

				                "type":"void",

				                "nickname":"register_test_module",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                ]

				             },

				             {

				                "method":"DELETE",

				                "summary":"Unregister test module in task manager",

				                "type":"void",

				                "nickname":"unregister_test_module",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                ]

				             }

				          ]

				       },

				       {

				          "path":"/task_manager_test/test_task",

				          "operations":[

				             {

				                "method":"POST",

				                "summary":"Register test task",

				                "type":"string",

				                "nickname":"register_test_task",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                    {

				                        "name":"task_id",

				                        "description":"The uuid of a task to register",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"shard",

				                        "description":"The shard of the task",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"long",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"parent_id",

				                        "description":"The uuid of a parent task",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"keyspace",

				                        "description":"The keyspace the task is working on",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"table",

				                        "description":"The table the task is working on",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    },

				                    {

				                        "name":"entity",

				                        "description":"Task-specific entity description",

				                        "required":false,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    }

				                ]

				             },

				             {

				                "method":"DELETE",

				                "summary":"Unregister test task",

				                "type":"void",

				                "nickname":"unregister_test_task",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                    {

				                        "name":"task_id",

				                        "description":"The uuid of a task to register",

				                        "required":true,

				                        "allowMultiple":false,

				                        "type":"string",

				                        "paramType":"query"

				                    }

				                ]

				             }

				          ]

				       },

				       {

				          "path":"/task_manager_test/finish_test_task/{task_id}",

				          "operations":[

				             {

				                "method":"POST",

				                "summary":"Finish test task",

				                "type":"void",

				                "nickname":"finish_test_task",

				                "produces":[

				                   "application/json"

				                ],

				                "parameters":[

				                   {

				                      "name":"task_id",

				                      "description":"The uuid of a task to finish",

				                      "required":true,

				                      "allowMultiple":false,

				                      "type":"string",

				                      "paramType":"path"

				                   },

				                   {

				                      "name":"error",

				                      "description":"The error with which the task fails (if it does)",

				                      "required":false,

				                      "allowMultiple":false,

				                      "type":"string",

				                      "paramType":"query"

				                   }

				                ]

				             }

				          ]

				       },

				       {

				         "path":"/task_manager_test/ttl",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Set ttl in seconds and get last value",

				               "type":"long",

				               "nickname":"get_and_update_ttl",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"ttl",

				                     "description":"The number of seconds for which a task will be kept in memory after it finishes",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"long",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      }

				    ]

				 }
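The API description above distinguishes parameters with `"paramType":"path"` (such as `task_id` in `/task_manager_test/finish_test_task/{task_id}`) from those with `"paramType":"query"` (such as `error` and `ttl`). The following is a minimal sketch of how a client might assemble such request URLs. `build_url` is a hypothetical helper, not part of Scylla, and it assumes the values are already URL-safe (no percent-encoding is performed).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (not part of Scylla): substitutes "{name}" placeholders
// with path parameters and appends the query parameters as "?a=b&c=d".
// Assumption: values are URL-safe, so no percent-encoding is done.
std::string build_url(std::string path_template,
                      const std::map<std::string, std::string>& path_params,
                      const std::map<std::string, std::string>& query_params) {
    for (const auto& [name, value] : path_params) {
        const std::string placeholder = "{" + name + "}";
        const auto pos = path_template.find(placeholder);
        // Every path parameter must have a placeholder in the template.
        assert(pos != std::string::npos);
        path_template.replace(pos, placeholder.size(), value);
    }
    char sep = '?';
    for (const auto& [name, value] : query_params) {
        path_template += sep;
        path_template += name + "=" + value;
        sep = '&';
    }
    return path_template;
}
```

For example, `build_url("/task_manager_test/ttl", {}, {{"ttl", "10"}})` yields `/task_manager_test/ttl?ttl=10`.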

									
										121

api/api.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "api.hh"

				@@ -37,10 +24,14 @@

				#include "compaction_manager.hh"

				#include "hinted_handoff.hh"

				#include "error_injection.hh"

				#include "authorization_cache.hh"

				#include <seastar/http/exception.hh>

				#include "stream_manager.hh"

				#include "system.hh"

				#include "api/config.hh"

				#include "task_manager.hh"

				#include "task_manager_test.hh"

				#include "raft.hh"

				logging::logger apilog("api");

				@@ -49,7 +40,7 @@ namespace api {

				static std::unique_ptr<reply> exception_reply(std::exception_ptr eptr) {

				    try {

				        std::rethrow_exception(eptr);

				    } catch (const no_such_keyspace& ex) {

				    } catch (const replica::no_such_keyspace& ex) {

				        throw bad_param_exception(ex.what());

				    }

				    // We are never going to get here

				@@ -109,9 +100,9 @@ future<> unset_rpc_controller(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_rpc_controller(ctx, r); });

				}

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs) {

				    return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs] (http_context& ctx, routes& r) {

				            set_storage_service(ctx, r, ss, g.local(), cdc_gs);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks) {

				    return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs, &sys_ks] (http_context& ctx, routes& r) {

				            set_storage_service(ctx, r, ss, g.local(), cdc_gs, sys_ks);

				        });

				}

				@@ -139,6 +130,17 @@ future<> unset_server_repair(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_repair(ctx, r); });

				}

				future<> set_server_authorization_cache(http_context &ctx, sharded<auth::service> &auth_service) {

				    return register_api(ctx, "authorization_cache",

				                "The authorization cache API", [&auth_service] (http_context &ctx, routes &r) {

				                     set_authorization_cache(ctx, r, auth_service);

				                 });

				}

				future<> unset_server_authorization_cache(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_authorization_cache(ctx, r); });

				}

				future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl) {

				    return ctx.http_server.set_routes([&ctx, &snap_ctl] (routes& r) { set_snapshot(ctx, r, snap_ctl); });

				}

				@@ -147,8 +149,14 @@ future<> unset_server_snapshot(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_snapshot(ctx, r); });

				}

				future<> set_server_snitch(http_context& ctx) {

				    return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);

				future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch) {

				    return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", [&snitch] (http_context& ctx, routes& r) {

				        set_endpoint_snitch(ctx, r, snitch);

				    });

				}

				future<> unset_server_snitch(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_endpoint_snitch(ctx, r); });

				}

				future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g) {

				@@ -180,9 +188,15 @@ future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_se

				                });

				}

				future<> set_server_stream_manager(http_context& ctx) {

				future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm) {

				    return register_api(ctx, "stream_manager",

				                "The stream manager API", set_stream_manager);

				                "The stream manager API", [&sm] (http_context& ctx, routes& r) {

				                    set_stream_manager(ctx, r, sm);

				                });

				}

				future<> unset_server_stream_manager(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_stream_manager(ctx, r); });

				}

				future<> set_server_cache(http_context& ctx) {

				@@ -240,5 +254,68 @@ future<> set_server_done(http_context& ctx) {

				    });

				}

				future<> set_server_task_manager(http_context& ctx) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx](routes& r) {

				        rb->register_function(r, "task_manager",

				                "The task manager API");

				        set_task_manager(ctx, r);

				    });

				}

				#ifndef SCYLLA_BUILD_MODE_RELEASE

				future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx, &cfg = *cfg](routes& r) mutable {

				        rb->register_function(r, "task_manager_test",

				                "The task manager test API");

				        set_task_manager_test(ctx, r, cfg);

				    });

				}

				#endif

				future<> set_server_raft(http_context& ctx, sharded<service::raft_group_registry>& raft_gr) {

				    auto rb = std::make_shared<api_registry_builder>(ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx, &raft_gr] (routes& r) {

				        rb->register_function(r, "raft", "The Raft API");

				        set_raft(ctx, r, raft_gr);

				    });

				}

				future<> unset_server_raft(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });

				}

				void req_params::process(const request& req) {

				    // Process mandatory parameters

				    for (auto& [name, ent] : params) {

				        if (!ent.is_mandatory) {

				            continue;

				        }

				        try {

				            ent.value = req.param[name];

				        } catch (std::out_of_range&) {

				            throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));

				        }

				    }

				    // Process optional parameters

				    for (auto& [name, value] : req.query_parameters) {

				        try {

				            auto& ent = params.at(name);

				            if (ent.is_mandatory) {

				                throw httpd::bad_param_exception(fmt::format("Parameter '{}' is expected to be provided as part of the request url", name));

				            }

				            ent.value = value;

				        } catch (std::out_of_range&) {

				            throw httpd::bad_param_exception(fmt::format("Unsupported optional parameter '{}'", name));

				        }

				    }

				}

				}
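The `req_params::process` function added above fills mandatory parameters from the URL path and lets query parameters set only declared, optional entries. A self-contained sketch of that control flow follows. It makes several assumptions: `fake_request`, `param_def` and `param_table` are hypothetical stand-ins for the Seastar request and the `req_params` internals, and `std::invalid_argument` replaces `httpd::bad_param_exception`.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the HTTP request: path_params plays the role of
// req.param, query_params the role of req.query_parameters.
struct fake_request {
    std::unordered_map<std::string, std::string> path_params;
    std::unordered_map<std::string, std::string> query_params;
};

// Simplified version of req_params::def from the diff.
struct param_def {
    std::optional<std::string> value;  // default value, if any
    bool is_mandatory = false;
};

using param_table = std::unordered_map<std::string, param_def>;

// std::invalid_argument stands in for httpd::bad_param_exception.
void process(param_table& params, const fake_request& req) {
    // Mandatory parameters must be present in the URL path.
    for (auto& [name, ent] : params) {
        if (!ent.is_mandatory) {
            continue;
        }
        auto it = req.path_params.find(name);
        if (it == req.path_params.end()) {
            throw std::invalid_argument("Mandatory parameter '" + name + "' was not provided");
        }
        ent.value = it->second;
    }
    // Query parameters may only set declared, optional entries.
    for (const auto& [name, value] : req.query_params) {
        auto it = params.find(name);
        if (it == params.end()) {
            throw std::invalid_argument("Unsupported optional parameter '" + name + "'");
        }
        if (it->second.is_mandatory) {
            throw std::invalid_argument("Parameter '" + name + "' is expected in the request url");
        }
        it->second.value = value;
    }
}

// Usage example: one mandatory path parameter, one optional query parameter.
void example_process() {
    param_table params;
    params["keyspace"] = param_def{std::nullopt, true};
    params["cf"] = param_def{std::string("all"), false};

    fake_request req;
    req.path_params["keyspace"] = "ks1";
    process(params, req);

    assert(params["keyspace"].value == "ks1");
    assert(params["cf"].value == "all");  // default kept: no query override
}
```

Rejecting unknown query parameters up front, as the original does, surfaces client typos as errors instead of silently ignoring them.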

									
										91

api/api.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -94,13 +81,6 @@ inline std::vector<sstring> split(const sstring& text, const char* separator) {

				    return boost::split(tokens, text, boost::is_any_of(separator));

				}

				/**

				 * Split a column family parameter

				 */

				inline std::vector<sstring> split_cf(const sstring& cf) {

				    return split(cf, ",");

				}

				/**

				 * A helper function to sum values on a distributed object that

				 * has a get_stats method.

				@@ -157,6 +137,14 @@ future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_

				    });

				}

				template<class T, class F>

				future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {

				        return make_ready_future<json::json_return_type>(timer_to_json(val));

				    });

				}

				inline int64_t min_int64(int64_t a, int64_t b) {

				    return std::min(a,b);

				}

				@@ -257,6 +245,67 @@ public:

				    operator T() const { return value; }

				};

				using mandatory = bool_class<struct mandatory_tag>;

				class req_params {

				public:

				    struct def {

				        std::optional<sstring> value;

				        mandatory is_mandatory = mandatory::no;

				        def(std::optional<sstring> value_ = std::nullopt, mandatory is_mandatory_ = mandatory::no)

				            : value(std::move(value_))

				            , is_mandatory(is_mandatory_)

				        { }

				        def(mandatory is_mandatory_)

				            : is_mandatory(is_mandatory_)

				        { }

				    };

				private:

				    std::unordered_map<sstring, def> params;

				public:

				    req_params(std::initializer_list<std::pair<sstring, def>> l) {

				        for (const auto& [name, ent] : l) {

				            add(std::move(name), std::move(ent));

				        }

				    }

				    void add(sstring name, def ent) {

				        params.emplace(std::move(name), std::move(ent));

				    }

				    void process(const request& req);

				    const std::optional<sstring>& get(const char* name) const {

				        return params.at(name).value;

				    }

				    template <typename T = sstring>

				    const std::optional<T> get_as(const char* name) const {

				        return get(name);

				    }

				    template <typename T = sstring>

				    requires std::same_as<T, bool>

				    const std::optional<bool> get_as(const char* name) const {

				        auto value = get(name);

				        if (!value) {

				            return std::nullopt;

				        }

				        std::transform(value->begin(), value->end(), value->begin(), ::tolower);

				        if (value == "true" || value == "yes" || value == "1") {

				            return true;

				        }

				        if (value == "false" || value == "no" || value == "0") {

				            return false;

				        }

				        throw boost::bad_lexical_cast{};

				    }

				};

				utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val);

				}
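The `get_as<bool>` overload above lower-cases the raw string and accepts `true`/`yes`/`1` and `false`/`no`/`0`, leaving a missing parameter as an empty optional. A minimal standalone sketch of the same rule is shown here. `parse_bool` is a hypothetical free function, and `std::invalid_argument` is substituted for the `boost::bad_lexical_cast` thrown in the diff so that the example needs only the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical free-function version of the boolean conversion.
// Taking the argument by value gives us a private copy to lower-case.
std::optional<bool> parse_bool(std::optional<std::string> value) {
    if (!value) {
        return std::nullopt;  // parameter absent: nothing to convert
    }
    // Going through unsigned char keeps std::tolower well-defined for
    // characters outside the basic ASCII range.
    std::transform(value->begin(), value->end(), value->begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (*value == "true" || *value == "yes" || *value == "1") {
        return true;
    }
    if (*value == "false" || *value == "no" || *value == "0") {
        return false;
    }
    // The diff throws boost::bad_lexical_cast here; this sketch uses a
    // standard exception instead.
    throw std::invalid_argument("not a boolean: " + *value);
}

// Usage example.
void example_parse_bool() {
    assert(parse_bool(std::string("Yes")) == true);
    assert(parse_bool(std::string("0")) == false);
    assert(parse_bool(std::nullopt) == std::nullopt);
}
```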

									
										53

api/api_init.hh
									
												
				@@ -3,43 +3,40 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include <seastar/http/httpd.hh>

				#include <seastar/core/future.hh>

				#include "database_fwd.hh"

				#include "replica/database_fwd.hh"

				#include "tasks/task_manager.hh"

				#include "seastarx.hh"

				using request = http::request;

				using reply = http::reply;

				namespace service {

				class load_meter;

				class storage_proxy;

				class storage_service;

				class raft_group_registry;

				} // namespace service

				class sstables_loader;

				namespace streaming {

				class stream_manager;

				}

				namespace locator {

				class token_metadata;

				class shared_token_metadata;

				class snitch_ptr;

				} // namespace locator

				@@ -51,6 +48,7 @@ class config;

				namespace view {

				class view_builder;

				}

				class system_keyspace;

				}

				namespace netw { class messaging_service; }

				class repair_service;

				@@ -62,21 +60,24 @@ class gossiper;

				}

				namespace auth { class service; }

				namespace api {

				struct http_context {

				    sstring api_dir;

				    sstring api_doc;

				    httpd::http_server_control http_server;

				    distributed<database>& db;

				    distributed<replica::database>& db;

				    distributed<service::storage_proxy>& sp;

				    service::load_meter& lmeter;

				    const sharded<locator::shared_token_metadata>& shared_token_metadata;

				    sharded<tasks::task_manager>& tm;

				    http_context(distributed<database>& _db,

				    http_context(distributed<replica::database>& _db,

				            distributed<service::storage_proxy>& _sp,

				            service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm)

				            : db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm) {

				            service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm, sharded<tasks::task_manager>& _tm)

				            : db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm), tm(_tm) {

				    }

				    const locator::token_metadata& get_token_metadata();

				@@ -84,8 +85,9 @@ struct http_context {

				future<> set_server_init(http_context& ctx);

				future<> set_server_config(http_context& ctx, const db::config& cfg);

				future<> set_server_snitch(http_context& ctx);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs);

				future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);

				future<> unset_server_snitch(http_context& ctx);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks);

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);

				future<> unset_server_sstables_loader(http_context& ctx);

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);

				@@ -96,6 +98,8 @@ future<> set_transport_controller(http_context& ctx, cql_transport::controller&

				future<> unset_transport_controller(http_context& ctx);

				future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl);

				future<> unset_rpc_controller(http_context& ctx);

				future<> set_server_authorization_cache(http_context& ctx, sharded<auth::service> &auth_service);

				future<> unset_server_authorization_cache(http_context& ctx);

				future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);

				future<> unset_server_snapshot(http_context& ctx);

				future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);

				@@ -103,12 +107,17 @@ future<> set_server_load_sstable(http_context& ctx);

				future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);

				future<> unset_server_messaging_service(http_context& ctx);

				future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss);

				future<> set_server_stream_manager(http_context& ctx);

				future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm);

				future<> unset_server_stream_manager(http_context& ctx);

				future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g);

				future<> unset_hinted_handoff(http_context& ctx);

				future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g);

				future<> set_server_cache(http_context& ctx);

				future<> set_server_compaction_manager(http_context& ctx);

				future<> set_server_done(http_context& ctx);

				future<> set_server_task_manager(http_context& ctx);

				future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg);

				future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);

				future<> unset_server_raft(http_context&);

				}

									
										33

api/authorization_cache.cc
									
										Normal file
									
												
				@@ -0,0 +1,33 @@

				/*

				 * Copyright (C) 2022-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "api/api-doc/authorization_cache.json.hh"

				#include "api/authorization_cache.hh"

				#include "api/api.hh"

				#include "auth/common.hh"

				namespace api {

				using namespace json;

				void set_authorization_cache(http_context& ctx, routes& r, sharded<auth::service> &auth_service) {

				    httpd::authorization_cache_json::authorization_cache_reset.set(r, [&auth_service] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        co_await auth_service.invoke_on_all([] (auth::service& auth) -> future<>  {

				            auth.reset_authorization_cache();

				            return make_ready_future<>();

				        });

				        co_return json_void();

				    });

				}

				void unset_authorization_cache(http_context& ctx, routes& r) {

				    httpd::authorization_cache_json::authorization_cache_reset.unset(r);

				}

				}

									
										18

api/authorization_cache.hh
									
										Normal file
									
												
				@@ -0,0 +1,18 @@

				/*

				 * Copyright (C) 2022-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include "api.hh"

				namespace api {

				void set_authorization_cache(http_context& ctx, routes& r, sharded<auth::service> &auth_service);

				void unset_authorization_cache(http_context& ctx, routes& r);

				}

									
										31

api/cache_service.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "cache_service.hh"

				@@ -208,7 +195,7 @@ void set_cache_service(http_context& ctx, routes& r) {

				    });

				    cs::get_row_capacity.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return ctx.db.map_reduce0([](database& db) -> uint64_t {

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().region().occupancy().used_space();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);

				@@ -216,26 +203,26 @@ void set_cache_service(http_context& ctx, routes& r) {

				    });

				    cs::get_row_hits.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_requests.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_hit_rate.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), [](const column_family& cf) {

				        return map_reduce_cf(ctx, ratio_holder(), [](const replica::column_family& cf) {

				            return ratio_holder(cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count(),

				                    cf.get_row_cache().stats().hits.count());

				        }, std::plus<ratio_holder>());

				    });

				    cs::get_row_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				@@ -243,7 +230,7 @@ void set_cache_service(http_context& ctx, routes& r) {

				    });

				    cs::get_row_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate() + cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				@@ -253,7 +240,7 @@ void set_cache_service(http_context& ctx, routes& r) {

				    cs::get_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        // In origin row size is the weighted size.

				        // We currently do not support weights, so we use num entries instead

				        return ctx.db.map_reduce0([](database& db) -> uint64_t {

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().partitions();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);

				@@ -261,7 +248,7 @@ void set_cache_service(http_context& ctx, routes& r) {

				    });

				    cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return ctx.db.map_reduce0([](database& db) -> uint64_t {

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().partitions();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);
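The cache endpoints above compute per-table values and combine them with `std::plus`. For the hit rate, the combined value is a pair of totals (requests and hits) rather than a per-table ratio. The sketch below illustrates that aggregation on plain vectors. It is an assumption-laden illustration: `ratio` is a hypothetical stand-in for the `ratio_holder` type, whose definition is not shown in this diff, and `cache_stats` replaces the real row-cache statistics.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for ratio_holder: keeps numerator and denominator
// separate so that partial results can be added before dividing.
struct ratio {
    double total = 0;  // hits + misses
    double part = 0;   // hits
    ratio operator+(const ratio& o) const { return {total + o.total, part + o.part}; }
    double value() const { return total == 0 ? 0.0 : part / total; }
};

// Hypothetical per-table counters.
struct cache_stats {
    std::uint64_t hits;
    std::uint64_t misses;
};

// Map each table to a ratio, then reduce with operator+ (the role played by
// map_reduce_cf and std::plus<ratio_holder> in the diff).
double hit_rate(const std::vector<cache_stats>& tables) {
    const ratio sum = std::accumulate(
        tables.begin(), tables.end(), ratio{},
        [](const ratio& acc, const cache_stats& s) {
            return acc + ratio{static_cast<double>(s.hits + s.misses),
                               static_cast<double>(s.hits)};
        });
    return sum.value();
}

// Usage example: 4 hits out of 8 requests across two tables.
void example_hit_rate() {
    assert(hit_rate({{3, 1}, {1, 3}}) == 0.5);
}
```

Summing numerators and denominators separately weights each table by its request count, which averaging per-table ratios would not do.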

									
										15

api/cache_service.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										22

api/collectd.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "collectd.hh"

				@@ -42,8 +29,11 @@ static auto transformer(const std::vector<collectd_value>& values) {

				        case scollectd::data_type::GAUGE:

				            collected_value.values.push(v.d());

				            break;

				        case scollectd::data_type::DERIVE:

				            collected_value.values.push(v.i());

				        case scollectd::data_type::COUNTER:

				            collected_value.values.push(v.ui());

				            break;

				        case scollectd::data_type::REAL_COUNTER:

				            collected_value.values.push(v.d());

				            break;

				        default:

				            collected_value.values.push(v.ui());

									
										15

api/collectd.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										295

api/column_family.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "column_family.hh"

				@@ -27,7 +14,7 @@

				#include "sstables/metadata_collector.hh"

				#include "utils/estimated_histogram.hh"

				#include <algorithm>

				#include "db/system_keyspace_view_types.hh"

				#include "db/system_keyspace.hh"

				#include "db/data_listeners.hh"

				#include "storage_service.hh"

				#include "unimplemented.hh"

				@@ -56,57 +43,57 @@ std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name) {

				    return std::make_tuple(name.substr(0, pos), name.substr(end));

				}

				const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const database& db) {

				const table_id& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {

				    try {

				        return db.find_uuid(ks, cf);

				    } catch (std::out_of_range& e) {

				        throw bad_param_exception(format("Column family '{}:{}' not found", ks, cf));

				    } catch (replica::no_such_column_family& e) {

				        throw bad_param_exception(e.what());

				    }

				}

				const utils::UUID& get_uuid(const sstring& name, const database& db) {

				const table_id& get_uuid(const sstring& name, const replica::database& db) {

				    auto [ks, cf] = parse_fully_qualified_cf_name(name);

				    return get_uuid(ks, cf, db);

				}

				future<> foreach_column_family(http_context& ctx, const sstring& name, function<void(column_family&)> f) {

				future<> foreach_column_family(http_context& ctx, const sstring& name, function<void(replica::column_family&)> f) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.invoke_on_all([f, uuid](database& db) {

				    return ctx.db.invoke_on_all([f, uuid](replica::database& db) {

				        f(db.find_column_family(uuid));

				    });

				}

				future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& name,

				        int64_t column_family_stats::*f) {

				    return map_reduce_cf(ctx, name, int64_t(0), [f](const column_family& cf) {

				        int64_t replica::column_family_stats::*f) {

				    return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {

				        return cf.get_stats().*f;

				    }, std::plus<int64_t>());

				}

				future<json::json_return_type>  get_cf_stats(http_context& ctx,

				        int64_t column_family_stats::*f) {

				    return map_reduce_cf(ctx, int64_t(0), [f](const column_family& cf) {

				        int64_t replica::column_family_stats::*f) {

				    return map_reduce_cf(ctx, int64_t(0), [f](const replica::column_family& cf) {

				        return cf.get_stats().*f;

				    }, std::plus<int64_t>());

				}

				static future<json::json_return_type>  get_cf_stats_count(http_context& ctx, const sstring& name,

				        utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				    return map_reduce_cf(ctx, name, int64_t(0), [f](const column_family& cf) {

				        utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {

				        return (cf.get_stats().*f).hist.count;

				    }, std::plus<int64_t>());

				}

				static future<json::json_return_type>  get_cf_stats_sum(http_context& ctx, const sstring& name,

				        utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				        utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([uuid, f](database& db) {

				    return ctx.db.map_reduce0([uuid, f](replica::database& db) {

				        // Histogram information is a sample of the actual load,

				        // so to estimate the sum, we multiply the mean

				        // by the count. The information is gathered in nanoseconds

				        // but reported in microseconds.

				        column_family& cf = db.find_column_family(uuid);

				        replica::column_family& cf = db.find_column_family(uuid);

				        return ((cf.get_stats().*f).hist.count/1000.0) * (cf.get_stats().*f).hist.mean;

				    }, 0.0, std::plus<double>()).then([](double res) {

				        return make_ready_future<json::json_return_type>((int64_t)res);

				@@ -115,16 +102,16 @@ static future<json::json_return_type>  get_cf_stats_sum(http_context& ctx, const

				static future<json::json_return_type>  get_cf_stats_count(http_context& ctx,

				        utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				    return map_reduce_cf(ctx, int64_t(0), [f](const column_family& cf) {

				        utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    return map_reduce_cf(ctx, int64_t(0), [f](const replica::column_family& cf) {

				        return (cf.get_stats().*f).hist.count;

				    }, std::plus<int64_t>());

				}

				static future<json::json_return_type>  get_cf_histogram(http_context& ctx, const sstring& name,

				        utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				    utils::UUID uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const database& p) {

				        utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const replica::database& p) {

				        return (p.find_column_family(uuid).get_stats().*f).hist;},

				            utils::ihistogram(),

				            std::plus<utils::ihistogram>())

				@@ -133,8 +120,20 @@ static future<json::json_return_type>  get_cf_histogram(http_context& ctx, const

				    });

				}

				static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				    std::function<utils::ihistogram(const database&)> fun = [f] (const database& db)  {

				static future<json::json_return_type>  get_cf_histogram(http_context& ctx, const sstring& name,

				        utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const replica::database& p) {

				        return (p.find_column_family(uuid).get_stats().*f).hist;},

				            utils::ihistogram(),

				            std::plus<utils::ihistogram>())

				            .then([](const utils::ihistogram& val) {

				                return make_ready_future<json::json_return_type>(to_json(val));

				    });

				}

				static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    std::function<utils::ihistogram(const replica::database&)> fun = [f] (const replica::database& db)  {

				        utils::ihistogram res;

				        for (auto i : db.get_column_families()) {

				            res += (i.second->get_stats().*f).hist;

				@@ -149,9 +148,9 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils:

				}

				static future<json::json_return_type>  get_cf_rate_and_histogram(http_context& ctx, const sstring& name,

				        utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				    utils::UUID uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const database& p) {

				        utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const replica::database& p) {

				        return (p.find_column_family(uuid).get_stats().*f).rate();},

				            utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>())

				@@ -160,8 +159,8 @@ static future<json::json_return_type>  get_cf_rate_and_histogram(http_context& c

				    });

				}

				static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram column_family_stats::*f) {

				    std::function<utils::rate_moving_average_and_histogram(const database&)> fun = [f] (const database& db)  {

				static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {

				    std::function<utils::rate_moving_average_and_histogram(const replica::database&)> fun = [f] (const replica::database& db)  {

				        utils::rate_moving_average_and_histogram res;

				        for (auto i : db.get_column_families()) {

				            res += (i.second->get_stats().*f).rate();

				@@ -176,12 +175,12 @@ static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ct

				}

				static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ctx, const sstring& name) {

				    return map_reduce_cf(ctx, name, int64_t(0), [](const column_family& cf) {

				    return map_reduce_cf(ctx, name, int64_t(0), [](const replica::column_family& cf) {

				        return cf.get_unleveled_sstables();

				    }, std::plus<int64_t>());

				}

				static int64_t min_partition_size(column_family& cf) {

				static int64_t min_partition_size(replica::column_family& cf) {

				    int64_t res = INT64_MAX;

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());

				@@ -189,7 +188,7 @@ static int64_t min_partition_size(column_family& cf) {

				    return (res == INT64_MAX) ? 0 : res;

				}

				static int64_t max_partition_size(column_family& cf) {

				static int64_t max_partition_size(replica::column_family& cf) {

				    int64_t res = 0;

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);

				@@ -197,7 +196,7 @@ static int64_t max_partition_size(column_family& cf) {

				    return res;

				}

				static integral_ratio_holder mean_partition_size(column_family& cf) {

				static integral_ratio_holder mean_partition_size(replica::column_family& cf) {

				    integral_ratio_holder res;

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        auto c = i->get_stats_metadata().estimated_partition_size.count();

				@@ -223,7 +222,7 @@ static json::json_return_type sum_map(const std::unordered_map<sstring, uint64_t

				static future<json::json_return_type>  sum_sstable(http_context& ctx, const sstring name, bool total) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([uuid, total](database& db) {

				    return ctx.db.map_reduce0([uuid, total](replica::database& db) {

				        std::unordered_map<sstring, uint64_t> m;

				        auto sstables = (total) ? db.find_column_family(uuid).get_sstables_including_compacted_undeleted() :

				                db.find_column_family(uuid).get_sstables();

				@@ -239,7 +238,7 @@ static future<json::json_return_type>  sum_sstable(http_context& ctx, const sstr

				static future<json::json_return_type> sum_sstable(http_context& ctx, bool total) {

				    return map_reduce_cf_raw(ctx, std::unordered_map<sstring, uint64_t>(), [total](column_family& cf) {

				    return map_reduce_cf_raw(ctx, std::unordered_map<sstring, uint64_t>(), [total](replica::column_family& cf) {

				        std::unordered_map<sstring, uint64_t> m;

				        auto sstables = (total) ? cf.get_sstables_including_compacted_undeleted() :

				                cf.get_sstables();

				@@ -252,7 +251,7 @@ static future<json::json_return_type> sum_sstable(http_context& ctx, bool total)

				    });

				}

				future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const column_family&)> f) {

				future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const replica::column_family&)> f) {

				    return map_reduce_cf_raw(ctx, name, utils::time_estimated_histogram(), f, utils::time_estimated_histogram_merge).then([](const utils::time_estimated_histogram& res) {

				        return make_ready_future<json::json_return_type>(time_to_json_histogram(res));

				    });

				@@ -275,7 +274,7 @@ public:

				    }

				};

				static double get_compression_ratio(column_family& cf) {

				static double get_compression_ratio(replica::column_family& cf) {

				    sum_ratio<double> result;

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        auto compression_ratio = i->get_compression_ratio();

				@@ -334,14 +333,14 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](column_family& cf) {

				            return cf.active_memtable().partition_count();

				        return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));

				        }, std::plus<>());

				    });

				    cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t{0}, [](column_family& cf) {

				            return cf.active_memtable().partition_count();

				        return map_reduce_cf(ctx, uint64_t{0}, [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));

				        }, std::plus<>());

				    });

				@@ -354,26 +353,34 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				            return cf.active_memtable().region().occupancy().total_space();

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {

				                return active_memtable->region().occupancy().total_space();

				            }), uint64_t(0));

				        }, std::plus<int64_t>());

				    });

				    cf::get_all_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.active_memtable().region().occupancy().total_space();

				        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {

				                return active_memtable->region().occupancy().total_space();

				            }), uint64_t(0));

				        }, std::plus<int64_t>());

				    });

				    cf::get_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				            return cf.active_memtable().region().occupancy().used_space();

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {

				                return active_memtable->region().occupancy().used_space();

				            }), uint64_t(0));

				        }, std::plus<int64_t>());

				    });

				    cf::get_all_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.active_memtable().region().occupancy().used_space();

				        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {

				                return active_memtable->region().occupancy().used_space();

				            }), uint64_t(0));

				        }, std::plus<int64_t>());

				    });

				@@ -387,15 +394,15 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        warn(unimplemented::cause::INDEXES);

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {

				            return cf.occupancy().total_space();

				        }, std::plus<int64_t>());

				    });

				    cf::get_all_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        warn(unimplemented::cause::INDEXES);

				        return ctx.db.map_reduce0([](const database& db){

				            return db.dirty_memory_region_group().memory_used();

				        return ctx.db.map_reduce0([](const replica::database& db){

				            return db.dirty_memory_region_group().real_memory_used();

				        }, int64_t(0), std::plus<int64_t>()).then([](int res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				@@ -403,29 +410,31 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        warn(unimplemented::cause::INDEXES);

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {

				            return cf.occupancy().used_space();

				        }, std::plus<int64_t>());

				    });

				    cf::get_all_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        warn(unimplemented::cause::INDEXES);

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.active_memtable().region().occupancy().used_space();

				        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				            return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {

				                return active_memtable->region().occupancy().used_space();

				            }), uint64_t(0));

				        }, std::plus<int64_t>());

				    });

				    cf::get_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx,req->param["name"] ,&column_family_stats::memtable_switch_count);

				        return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::memtable_switch_count);

				    });

				    cf::get_all_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family_stats::memtable_switch_count);

				        return get_cf_stats(ctx, &replica::column_family_stats::memtable_switch_count);

				    });

				    // FIXME: this refers to partitions, not rows.

				    cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {

				            utils::estimated_histogram res(0);

				            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				                res.merge(i->get_stats_metadata().estimated_partition_size);

				@@ -437,7 +446,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    // FIXME: this refers to partitions, not rows.

				    cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {

				            uint64_t res = 0;

				            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				                res += i->get_stats_metadata().estimated_partition_size.count();

				@@ -448,7 +457,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {

				            utils::estimated_histogram res(0);

				            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				                res.merge(i->get_stats_metadata().estimated_cells_count);

				@@ -465,87 +474,87 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_pending_flushes.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx,req->param["name"] ,&column_family_stats::pending_flushes);

				        return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::pending_flushes);

				    });

				    cf::get_all_pending_flushes.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family_stats::pending_flushes);

				        return get_cf_stats(ctx, &replica::column_family_stats::pending_flushes);

				    });

				    cf::get_read.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_count(ctx,req->param["name"] ,&column_family_stats::reads);

				        return get_cf_stats_count(ctx,req->param["name"] ,&replica::column_family_stats::reads);

				    });

				    cf::get_all_read.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_count(ctx, &column_family_stats::reads);

				        return get_cf_stats_count(ctx, &replica::column_family_stats::reads);

				    });

				    cf::get_write.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_count(ctx, req->param["name"] ,&column_family_stats::writes);

				        return get_cf_stats_count(ctx, req->param["name"] ,&replica::column_family_stats::writes);

				    });

				    cf::get_all_write.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_count(ctx, &column_family_stats::writes);

				        return get_cf_stats_count(ctx, &replica::column_family_stats::writes);

				    });

				    cf::get_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, req->param["name"], &column_family_stats::reads);

				        return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);

				    });

				    cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, req->param["name"], &column_family_stats::reads);

				        return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);

				    });

				    cf::get_read_latency.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_sum(ctx,req->param["name"] ,&column_family_stats::reads);

				        return get_cf_stats_sum(ctx,req->param["name"] ,&replica::column_family_stats::reads);

				    });

				    cf::get_write_latency.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_sum(ctx, req->param["name"] ,&column_family_stats::writes);

				        return get_cf_stats_sum(ctx, req->param["name"] ,&replica::column_family_stats::writes);

				    });

				    cf::get_all_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, &column_family_stats::writes);

				        return get_cf_histogram(ctx, &replica::column_family_stats::writes);

				    });

				    cf::get_all_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, &column_family_stats::writes);

				        return get_cf_rate_and_histogram(ctx, &replica::column_family_stats::writes);

				    });

				    cf::get_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, req->param["name"], &column_family_stats::writes);

				        return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);

				    });

				    cf::get_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, req->param["name"], &column_family_stats::writes);

				        return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);

				    });

				    cf::get_all_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, &column_family_stats::writes);

				        return get_cf_histogram(ctx, &replica::column_family_stats::writes);

				    });

				    cf::get_all_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, &column_family_stats::writes);

				        return get_cf_rate_and_histogram(ctx, &replica::column_family_stats::writes);

				    });

				    cf::get_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				            return cf.get_compaction_strategy().estimated_pending_compactions(cf);

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {

				            return cf.estimate_pending_compactions();

				        }, std::plus<int64_t>());

				    });

				    cf::get_all_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.get_compaction_strategy().estimated_pending_compactions(cf);

				        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				            return cf.estimate_pending_compactions();

				        }, std::plus<int64_t>());

				    });

				    cf::get_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, req->param["name"], &column_family_stats::live_sstable_count);

				        return get_cf_stats(ctx, req->param["name"], &replica::column_family_stats::live_sstable_count);

				    });

				    cf::get_all_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family_stats::live_sstable_count);

				        return get_cf_stats(ctx, &replica::column_family_stats::live_sstable_count);

				    });

				    cf::get_unleveled_sstables.set(r, [&ctx] (std::unique_ptr<request> req) {

				@@ -601,7 +610,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_false_positive();

				@@ -610,7 +619,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_false_positive();

				@@ -619,7 +628,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_recent_false_positive();

				@@ -628,7 +637,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_recent_false_positive();

				@@ -637,31 +646,31 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_all_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {

				        return map_reduce_cf(ctx, ratio_holder(), [] (replica::column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_all_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {

				        return map_reduce_cf(ctx, ratio_holder(), [] (replica::column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_size();

				@@ -670,7 +679,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_size();

				@@ -679,7 +688,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_memory_size();

				@@ -688,7 +697,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_memory_size();

				@@ -697,7 +706,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->get_summary().memory_footprint();

				@@ -706,7 +715,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (replica::column_family& cf) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->get_summary().memory_footprint();

				@@ -753,7 +762,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_true_snapshots_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        auto uuid = get_uuid(req->param["name"], ctx.db.local());

				        return ctx.db.local().find_column_family(uuid).get_snapshot_details().then([](

				                const std::unordered_map<sstring, column_family::snapshot_details>& sd) {

				                const std::unordered_map<sstring, replica::column_family::snapshot_details>& sd) {

				            int64_t res = 0;

				            for (auto i : sd) {

				                res += i.second.total;

				@@ -782,7 +791,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_row_cache_hit.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				@@ -790,7 +799,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_row_cache_hit.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				@@ -798,7 +807,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_row_cache_miss.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				@@ -806,7 +815,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_all_row_cache_miss.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				@@ -815,36 +824,36 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_cas_prepare.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {

				            return cf.get_stats().estimated_cas_prepare;

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {

				            return cf.get_stats().cas_prepare.histogram();

				        });

				    });

				    cf::get_cas_propose.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {

				            return cf.get_stats().estimated_cas_accept;

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {

				            return cf.get_stats().cas_accept.histogram();

				        });

				    });

				    cf::get_cas_commit.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {

				            return cf.get_stats().estimated_cas_learn;

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {

				            return cf.get_stats().cas_learn.histogram();

				        });

				    });

				    cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {

				            return cf.get_stats().estimated_sstable_per_read;

				        },

				        utils::estimated_histogram_merge, utils_json::estimated_histogram());

				    });

				    cf::get_tombstone_scanned_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, req->param["name"], &column_family_stats::tombstone_scanned);

				        return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::tombstone_scanned);

				    });

				    cf::get_live_scanned_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, req->param["name"], &column_family_stats::live_scanned);

				        return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::live_scanned);

				    });

				    cf::get_col_update_time_delta_histogram.set(r, [] (std::unique_ptr<request> req) {

				@@ -856,15 +865,15 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_auto_compaction.set(r, [&ctx] (const_req req) {

				        const utils::UUID& uuid = get_uuid(req.param["name"], ctx.db.local());

				        column_family& cf = ctx.db.local().find_column_family(uuid);

				        auto uuid = get_uuid(req.param["name"], ctx.db.local());

				        replica::column_family& cf = ctx.db.local().find_column_family(uuid);

				        return !cf.is_auto_compaction_disabled_by_user();

				    });

				    cf::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {

				        return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (database& db) {

				            auto g = database::autocompaction_toggle_guard(db);

				            return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				        return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {

				            auto g = replica::database::autocompaction_toggle_guard(db);

				            return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {

				                cf.enable_auto_compaction();

				            }).then([g = std::move(g)] {

				                return make_ready_future<json::json_return_type>(json_void());

				@@ -873,10 +882,10 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {

				        return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (database& db) {

				            auto g = database::autocompaction_toggle_guard(db);

				            return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				                cf.disable_auto_compaction();

				        return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {

				            auto g = replica::database::autocompaction_toggle_guard(db);

				            return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {

				                return cf.disable_auto_compaction();

				            }).then([g = std::move(g)] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				@@ -896,7 +905,7 @@ void set_column_family(http_context& ctx, routes& r) {

				            }

				            std::vector<sstring> res;

				            auto uuid = get_uuid(ks, cf_name, ctx.db.local());

				            column_family& cf = ctx.db.local().find_column_family(uuid);

				            replica::column_family& cf = ctx.db.local().find_column_family(uuid);

				            res.reserve(cf.get_index_manager().list_indexes().size());

				            for (auto&& i : cf.get_index_manager().list_indexes()) {

				                if (!vp.contains(secondary_index::index_table_name(i.metadata().name()))) {

				@@ -924,8 +933,8 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_compression_ratio.set(r, [&ctx](std::unique_ptr<request> req) {

				        auto uuid = get_uuid(req->param["name"], ctx.db.local());

				        return ctx.db.map_reduce(sum_ratio<double>(), [uuid](database& db) {

				            column_family& cf = db.find_column_family(uuid);

				        return ctx.db.map_reduce(sum_ratio<double>(), [uuid](replica::database& db) {

				            replica::column_family& cf = db.find_column_family(uuid);

				            return make_ready_future<double>(get_compression_ratio(cf));

				        }).then([] (const double& result) {

				            return make_ready_future<json::json_return_type>(result);

				@@ -933,20 +942,20 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {

				            return cf.get_stats().estimated_read;

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {

				            return cf.get_stats().reads.histogram();

				        });

				    });

				    cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {

				            return cf.get_stats().estimated_write;

				        return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {

				            return cf.get_stats().writes.histogram();

				        });

				    });

				    cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<request> req) {

				        sstring strategy = req->get_query_param("class_name");

				        return foreach_column_family(ctx, req->param["name"], [strategy](column_family& cf) {

				        return foreach_column_family(ctx, req->param["name"], [strategy](replica::column_family& cf) {

				            cf.set_compaction_strategy(sstables::compaction_strategy::type(strategy));

				        }).then([] {

				                return make_ready_future<json::json_return_type>(json_void());

				@@ -970,7 +979,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_sstable_count_per_level.set(r, [&ctx](std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const column_family& cf) {

				        return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const replica::column_family& cf) {

				            return cf.sstable_count_per_level();

				        }, concat_sstable_count_per_level).then([](const std::vector<uint64_t>& res) {

				            return make_ready_future<json::json_return_type>(res);

				@@ -981,7 +990,7 @@ void set_column_family(http_context& ctx, routes& r) {

				        auto key = req->get_query_param("key");

				        auto uuid = get_uuid(req->param["name"], ctx.db.local());

				        return ctx.db.map_reduce0([key, uuid] (database& db) {

				        return ctx.db.map_reduce0([key, uuid] (replica::database& db) {

				            return db.find_column_family(uuid).get_sstables_by_partition_key(key);

				        }, std::unordered_set<sstring>(),

				            [](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {

				@@ -1013,7 +1022,7 @@ void set_column_family(http_context& ctx, routes& r) {

				        if (req->get_query_param("split_output") != "") {

				            fail(unimplemented::cause::API);

				        }

				        return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				        return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {

				            return cf.compact_all_sstables();

				        }).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

									
										41

api/column_family.hh
									
												
				@@ -3,27 +3,14 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include "api.hh"

				#include "api/api-doc/column_family.json.hh"

				#include "database.hh"

				#include "replica/database.hh"

				#include <seastar/core/future-util.hh>

				#include <any>

				@@ -31,17 +18,17 @@ namespace api {

				void set_column_family(http_context& ctx, routes& r);

				const utils::UUID& get_uuid(const sstring& name, const database& db);

				future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(column_family&)> f);

				const table_id& get_uuid(const sstring& name, const replica::database& db);

				future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f);

				template<class Mapper, class I, class Reducer>

				future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    using mapper_type = std::function<std::unique_ptr<std::any>(database&)>;

				    using mapper_type = std::function<std::unique_ptr<std::any>(replica::database&)>;

				    using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;

				    return ctx.db.map_reduce0(mapper_type([mapper, uuid](database& db) {

				    return ctx.db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {

				        return std::make_unique<std::any>(I(mapper(db.find_column_family(uuid))));

				    }), std::make_unique<std::any>(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {

				        return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));

				@@ -68,15 +55,15 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& n

				    });

				}

				future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const column_family&)> f);

				future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const replica::column_family&)> f);

				struct map_reduce_column_families_locally {

				    std::any init;

				    std::function<std::unique_ptr<std::any>(column_family&)> mapper;

				    std::function<std::unique_ptr<std::any>(replica::column_family&)> mapper;

				    std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;

				    future<std::unique_ptr<std::any>> operator()(database& db) const {

				    future<std::unique_ptr<std::any>> operator()(replica::database& db) const {

				        auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));

				        return do_for_each(db.get_column_families(), [res, this](const std::pair<utils::UUID, seastar::lw_shared_ptr<table>>& i) {

				        return do_for_each(db.get_column_families(), [res, this](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) {

				            *res = reducer(std::move(*res), mapper(*i.second.get()));

				        }).then([res] {

				            return std::move(*res);

				@@ -87,9 +74,9 @@ struct map_reduce_column_families_locally {

				template<class Mapper, class I, class Reducer>

				future<I> map_reduce_cf_raw(http_context& ctx, I init,

				        Mapper mapper, Reducer reducer) {

				    using mapper_type = std::function<std::unique_ptr<std::any>(column_family&)>;

				    using mapper_type = std::function<std::unique_ptr<std::any>(replica::column_family&)>;

				    using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;

				    auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (column_family& cf) mutable {

				    auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (replica::column_family& cf) mutable {

				        return std::make_unique<std::any>(I(mapper(cf)));

				    });

				    auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {

				@@ -111,10 +98,10 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,

				}

				future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& name,

				        int64_t column_family_stats::*f);

				        int64_t replica::column_family_stats::*f);

				future<json::json_return_type>  get_cf_stats(http_context& ctx,

				        int64_t column_family_stats::*f);

				        int64_t replica::column_family_stats::*f);

				std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);

									
										21

api/commitlog.cc
									
												
				@@ -3,26 +3,13 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "commitlog.hh"

				#include "db/commitlog/commitlog.hh"

				#include "api/api-doc/commitlog.json.hh"

				#include "database.hh"

				#include "replica/database.hh"

				#include <vector>

				namespace api {

				@@ -31,7 +18,7 @@ template<typename T>

				static auto acquire_cl_metric(http_context& ctx, std::function<T (db::commitlog*)> func) {

				    typedef T ret_type;

				    return ctx.db.map_reduce0([func = std::move(func)](database& db) {

				    return ctx.db.map_reduce0([func = std::move(func)](replica::database& db) {

				        if (db.commitlog() == nullptr) {

				            return make_ready_future<ret_type>();

				        }

				@@ -47,7 +34,7 @@ void set_commitlog(http_context& ctx, routes& r) {

				        auto res = make_shared<std::vector<sstring>>();

				        return ctx.db.map_reduce([res](std::vector<sstring> names) {

				            res->insert(res->end(), names.begin(), names.end());

				        }, [](database& db) {

				        }, [](replica::database& db) {

				            if (db.commitlog() == nullptr) {

				                return make_ready_future<std::vector<sstring>>(std::vector<sstring>());

				            }

									
										15

api/commitlog.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										59

api/compaction_manager.cc
									
												
				@@ -3,28 +3,19 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <seastar/core/coroutine.hh>

				#include "compaction_manager.hh"

				#include "compaction/compaction_manager.hh"

				#include "api/api-doc/compaction_manager.json.hh"

				#include "db/system_keyspace.hh"

				#include "column_family.hh"

				#include "unimplemented.hh"

				#include "storage_service.hh"

				#include <utility>

				namespace api {

				@@ -34,7 +25,7 @@ using namespace json;

				static future<json::json_return_type> get_cm_stats(http_context& ctx,

				        int64_t compaction_manager::stats::*f) {

				    return ctx.db.map_reduce0([f](database& db) {

				    return ctx.db.map_reduce0([f](replica::database& db) {

				        return db.get_compaction_manager().get_stats().*f;

				    }, int64_t(0), std::plus<int64_t>()).then([](const int64_t& res) {

				        return make_ready_future<json::json_return_type>(res);

				@@ -50,10 +41,9 @@ static std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_ha

				    return std::move(a);

				}

				void set_compaction_manager(http_context& ctx, routes& r) {

				    cm::get_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return ctx.db.map_reduce0([](database& db) {

				        return ctx.db.map_reduce0([](replica::database& db) {

				            std::vector<cm::summary> summaries;

				            const compaction_manager& cm = db.get_compaction_manager();

				@@ -75,11 +65,11 @@ void set_compaction_manager(http_context& ctx, routes& r) {

				    });

				    cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return ctx.db.map_reduce0([&ctx](database& db) {

				        return ctx.db.map_reduce0([&ctx](replica::database& db) {

				            return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&ctx, &db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {

				                return do_for_each(db.get_column_families(), [&tasks](const std::pair<utils::UUID, seastar::lw_shared_ptr<table>>& i) {

				                    table& cf = *i.second.get();

				                    tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.get_compaction_strategy().estimated_pending_compactions(cf);

				                return do_for_each(db.get_column_families(), [&tasks](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) -> future<> {

				                    replica::table& cf = *i.second.get();

				                    tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.estimate_pending_compactions();

				                    return make_ready_future<>();

				                }).then([&tasks] {

				                    return std::move(tasks);

				@@ -109,17 +99,36 @@ void set_compaction_manager(http_context& ctx, routes& r) {

				    cm::stop_compaction.set(r, [&ctx] (std::unique_ptr<request> req) {

				        auto type = req->get_query_param("type");

				        return ctx.db.invoke_on_all([type] (database& db) {

				        return ctx.db.invoke_on_all([type] (replica::database& db) {

				            auto& cm = db.get_compaction_manager();

				            cm.stop_compaction(type);

				            return cm.stop_compaction(type);

				        }).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto ks_name = validate_keyspace(ctx, req->param);

				        auto table_names = parse_tables(ks_name, ctx, req->query_parameters, "tables");

				        if (table_names.empty()) {

				            table_names = map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());

				        }

				        auto type = req->get_query_param("type");

				        co_await ctx.db.invoke_on_all([&ks_name, &table_names, type] (replica::database& db) {

				            auto& cm = db.get_compaction_manager();

				            return parallel_for_each(table_names, [&db, &cm, &ks_name, type] (sstring& table_name) {

				                auto& t = db.find_column_family(ks_name, table_name);

				                return t.parallel_foreach_table_state([&] (compaction::table_state& ts) {

				                    return cm.stop_compaction(type, &ts);

				                });

				            });

				        });

				        co_return json_void();

				    });

				    cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.get_compaction_strategy().estimated_pending_compactions(cf);

				        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				            return cf.estimate_pending_compactions();

				        }, std::plus<int64_t>());

				    });

									
										15

api/compaction_manager.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

api/config.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "api/config.hh"

									
										15

api/config.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										62

api/endpoint_snitch.cc
									
				@@ -3,46 +3,66 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "locator/token_metadata.hh"

				#include "locator/snitch_base.hh"

				#include "locator/production_snitch_base.hh"

				#include "endpoint_snitch.hh"

				#include "api/api-doc/endpoint_snitch_info.json.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include "utils/fb_utilities.hh"

				namespace api {

				void set_endpoint_snitch(http_context& ctx, routes& r) {

				void set_endpoint_snitch(http_context& ctx, routes& r, sharded<locator::snitch_ptr>& snitch) {

				    static auto host_or_broadcast = [](const_req req) {

				        auto host = req.get_query_param("host");

				        return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);

				    };

				    httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));

				    httpd::endpoint_snitch_info_json::get_datacenter.set(r, [&ctx](const_req req) {

				        auto& topology = ctx.shared_token_metadata.local().get()->get_topology();

				        auto ep = host_or_broadcast(req);

				        if (!topology.has_endpoint(ep)) {

				            // Cannot return error here, nodetool status can race, request

				            // info about just-left node and not handle it nicely

				            return sstring(locator::production_snitch_base::default_dc);

				        }

				        return topology.get_datacenter(ep);

				    });

				    httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));

				    httpd::endpoint_snitch_info_json::get_rack.set(r, [&ctx](const_req req) {

				        auto& topology = ctx.shared_token_metadata.local().get()->get_topology();

				        auto ep = host_or_broadcast(req);

				        if (!topology.has_endpoint(ep)) {

				            // Cannot return error here, nodetool status can race, request

				            // info about just-left node and not handle it nicely

				            return sstring(locator::production_snitch_base::default_rack);

				        }

				        return topology.get_rack(ep);

				    });

				    httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_name();

				    httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [&snitch] (const_req req) {

				        return snitch.local()->get_name();

				    });

				    httpd::storage_service_json::update_snitch.set(r, [&snitch](std::unique_ptr<request> req) {

				        locator::snitch_config cfg;

				        cfg.name = req->get_query_param("ep_snitch_class_name");

				        return locator::i_endpoint_snitch::reset_snitch(snitch, cfg).then([] {

				            return make_ready_future<json::json_return_type>(json::json_void());

				        });

				    });

				}

				void unset_endpoint_snitch(http_context& ctx, routes& r) {

				    httpd::endpoint_snitch_info_json::get_datacenter.unset(r);

				    httpd::endpoint_snitch_info_json::get_rack.unset(r);

				    httpd::endpoint_snitch_info_json::get_snitch_name.unset(r);

				    httpd::storage_service_json::update_snitch.unset(r);

				}

				}

									
										22

api/endpoint_snitch.hh
									
				@@ -3,28 +3,20 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include "api.hh"

				namespace locator {

				class snitch_ptr;

				}

				namespace api {

				void set_endpoint_snitch(http_context& ctx, routes& r);

				void set_endpoint_snitch(http_context& ctx, routes& r, sharded<locator::snitch_ptr>&);

				void unset_endpoint_snitch(http_context& ctx, routes& r);

				}

									
										17

api/error_injection.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "api/api-doc/error_injection.json.hh"

				@@ -25,7 +12,7 @@

				#include <seastar/http/exception.hh>

				#include "log.hh"

				#include "utils/error_injection.hh"

				#include "seastar/core/future-util.hh"

				#include <seastar/core/future-util.hh>

				namespace api {

									
										15

api/error_injection.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										108

api/failure_detector.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "failure_detector.hh"

				@@ -30,77 +17,86 @@ namespace fd = httpd::failure_detector_json;

				void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				    fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {

				        std::vector<fd::endpoint_state> res;

				        for (auto i : g.endpoint_state_map) {

				            fd::endpoint_state val;

				            val.addrs = boost::lexical_cast<std::string>(i.first);

				            val.is_alive = i.second.is_alive();

				            val.generation = i.second.get_heart_beat_state().get_generation();

				            val.version = i.second.get_heart_beat_state().get_heart_beat_version();

				            val.update_time = i.second.get_update_timestamp().time_since_epoch().count();

				            for (auto a : i.second.get_application_state_map()) {

				                fd::version_value version_val;

				                // We return the enum index and not it's name to stay compatible to origin

				                // method that the state index are static but the name can be changed.

				                version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);

				                version_val.value = a.second.value;

				                version_val.version = a.second.version;

				                val.application_state.push(version_val);

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            std::vector<fd::endpoint_state> res;

				            for (auto i : g.get_endpoint_states()) {

				                fd::endpoint_state val;

				                val.addrs = boost::lexical_cast<std::string>(i.first);

				                val.is_alive = i.second.is_alive();

				                val.generation = i.second.get_heart_beat_state().get_generation();

				                val.version = i.second.get_heart_beat_state().get_heart_beat_version();

				                val.update_time = i.second.get_update_timestamp().time_since_epoch().count();

				                for (auto a : i.second.get_application_state_map()) {

				                    fd::version_value version_val;

				                    // We return the enum index and not it's name to stay compatible to origin

				                    // method that the state index are static but the name can be changed.

				                    version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);

				                    version_val.value = a.second.value;

				                    version_val.version = a.second.version;

				                    val.application_state.push(version_val);

				                }

				                res.push_back(val);

				            }

				            res.push_back(val);

				        }

				        return make_ready_future<json::json_return_type>(res);

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    fd::get_up_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {

				        return gms::get_up_endpoint_count(g).then([](int res) {

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            int res = g.get_up_endpoint_count();

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    fd::get_down_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {

				        return gms::get_down_endpoint_count(g).then([](int res) {

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            int res = g.get_down_endpoint_count();

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    fd::get_phi_convict_threshold.set(r, [] (std::unique_ptr<request> req) {

				        return gms::get_phi_convict_threshold().then([](double res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				        return make_ready_future<json::json_return_type>(8);

				    });

				    fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {

				        return gms::get_simple_states(g).then([](const std::map<sstring, sstring>& map) {

				            return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(map));

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            std::map<sstring, sstring> nodes_status;

				            for (auto& entry : g.get_endpoint_states()) {

				                nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");

				            }

				            return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));

				        });

				    });

				    fd::set_phi_convict_threshold.set(r, [](std::unique_ptr<request> req) {

				        double phi = atof(req->get_query_param("phi").c_str());

				        return gms::set_phi_convict_threshold(phi).then([]() {

				            return make_ready_future<json::json_return_type>("");

				        });

				        return make_ready_future<json::json_return_type>("");

				    });

				    fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {

				        return get_endpoint_state(g, req->param["addr"]).then([](const sstring& state) {

				            return make_ready_future<json::json_return_type>(state);

				        return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {

				            auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));

				            if (!state) {

				                return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));

				            }

				            std::stringstream ss;

				            g.append_endpoint_state(ss, *state);

				            return make_ready_future<json::json_return_type>(sstring(ss.str()));

				        });

				    });

				    fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {

				        return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {

				            std::vector<fd::endpoint_phi_value> res;

				            auto now = gms::arrival_window::clk::now();

				            for (auto& p : map) {

				                fd::endpoint_phi_value val;

				                val.endpoint = p.first.to_sstring();

				                val.phi = p.second.phi(now);

				                res.emplace_back(std::move(val));

				            }

				            return make_ready_future<json::json_return_type>(res);

				        });

				        std::map<gms::inet_address, gms::arrival_window> map;

				        std::vector<fd::endpoint_phi_value> res;

				        auto now = gms::arrival_window::clk::now();

				        for (auto& p : map) {

				            fd::endpoint_phi_value val;

				            val.endpoint = p.first.to_sstring();

				            val.phi = p.second.phi(now);

				            res.emplace_back(std::move(val));

				        }

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

									
										15

api/failure_detector.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										39

api/gossiper.cc
									
				@@ -3,22 +3,11 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <seastar/core/coroutine.hh>

				#include "gossiper.hh"

				#include "api/api-doc/gossiper.json.hh"

				#include "gms/gossiper.hh"

				@@ -27,19 +16,23 @@ namespace api {

				using namespace json;

				void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {

				    httpd::gossiper_json::get_down_endpoint.set(r, [&g] (const_req req) {

				        auto res = g.get_unreachable_members();

				        return container_to_vec(res);

				    httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto res = co_await g.get_unreachable_members_synchronized();

				        co_return json::json_return_type(container_to_vec(res));

				    });

				    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (const_req req) {

				        auto res = g.get_live_members();

				        return container_to_vec(res);

				    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) {

				        return g.get_live_members_synchronized().then([] (auto res) {

				            return make_ready_future<json::json_return_type>(container_to_vec(res));

				        });

				    });

				    httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (const_req req) {

				        gms::inet_address ep(req.param["addr"]);

				        return g.get_endpoint_downtime(ep);

				    httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        gms::inet_address ep(req->param["addr"]);

				        // synchronize unreachable_members on all shards

				        co_await g.get_unreachable_members_synchronized();

				        co_return g.get_endpoint_downtime(ep);

				    });

				    httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<request> req) {

									
										15

api/gossiper.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

api/hinted_handoff.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <algorithm>

									
										15

api/hinted_handoff.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										19

api/lsa.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "api/api-doc/lsa.json.hh"

				@@ -26,7 +13,7 @@

				#include <seastar/http/exception.hh>

				#include "utils/logalloc.hh"

				#include "log.hh"

				#include "database.hh"

				#include "replica/database.hh"

				namespace api {

				@@ -35,7 +22,7 @@ static logging::logger alogger("lsa-api");

				void set_lsa(http_context& ctx, routes& r) {

				    httpd::lsa_json::lsa_compact.set(r, [&ctx](std::unique_ptr<request> req) {

				        alogger.info("Triggering compaction");

				        return ctx.db.invoke_on_all([] (database&) {

				        return ctx.db.invoke_on_all([] (replica::database&) {

				            logalloc::shard_tracker().reclaim(std::numeric_limits<size_t>::max());

				        }).then([] {

				            return json::json_return_type(json::json_void());

									
										15

api/lsa.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

api/messaging_service.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "messaging_service.hh"

									
										15

api/messaging_service.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										70

api/raft.cc
									
										Normal file
									
				@@ -0,0 +1,70 @@

				/*

				 * Copyright (C) 2024-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <seastar/core/coroutine.hh>

				#include "api/api.hh"

				#include "api/api-doc/raft.json.hh"

				#include "service/raft/raft_group_registry.hh"

				using namespace seastar::httpd;

				extern logging::logger apilog;

				namespace api {

				namespace r = httpd::raft_json;

				using namespace json;

				void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr) {

				    r::trigger_snapshot.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {

				        raft::group_id gid{utils::UUID{req->param["group_id"]}};

				        auto timeout_dur = std::invoke([timeout_str = req->get_query_param("timeout")] {

				            if (timeout_str.empty()) {

				                return std::chrono::seconds{60};

				            }

				            auto dur = std::stoll(timeout_str);

				            if (dur <= 0) {

				                throw std::runtime_error{"Timeout must be a positive number."};

				            }

				            return std::chrono::seconds{dur};

				        });

				        std::atomic<bool> found_srv{false};

				        co_await raft_gr.invoke_on_all([gid, timeout_dur, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {

				            auto* srv = raft_gr.find_server(gid);

				            if (!srv) {

				                co_return;

				            }

				            found_srv = true;

				            abort_on_expiry aoe(lowres_clock::now() + timeout_dur);

				            apilog.info("Triggering Raft group {} snapshot", gid);

				            auto result = co_await srv->trigger_snapshot(&aoe.abort_source());

				            if (result) {

				                apilog.info("New snapshot for Raft group {} created", gid);

				            } else {

				                apilog.info("Could not create new snapshot for Raft group {}, no new entries applied", gid);

				            }

				        });

				        if (!found_srv) {

				            throw std::runtime_error{fmt::format("Server for group ID {} not found", gid)};

				        }

				        co_return json_void{};

				    });

				}

				void unset_raft(http_context&, httpd::routes& r) {

				    r::trigger_snapshot.unset(r);

				}

				}

									
										18

api/raft.hh
									
										Normal file
									
				@@ -0,0 +1,18 @@

				/*

				 * Copyright (C) 2023-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include "api_init.hh"

				namespace api {

				void set_raft(http_context& ctx, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr);

				void unset_raft(http_context& ctx, httpd::routes& r);

				}

									
										46

api/storage_proxy.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "storage_proxy.hh"

				@@ -26,8 +13,8 @@

				#include "service/storage_service.hh"

				#include "db/config.hh"

				#include "utils/histogram.hh"

				-#include "database.hh"

				-#include "seastar/core/scheduling_specific.hh"

				+#include "replica/database.hh"

				+#include <seastar/core/scheduling_specific.hh>

				namespace api {

				@@ -35,6 +22,9 @@ namespace sp = httpd::storage_proxy_json;

				using proxy = service::storage_proxy;

				using namespace json;

				+utils::time_estimated_histogram timed_rate_moving_average_summary_merge(utils::time_estimated_histogram a, const utils::timed_rate_moving_average_summary_and_histogram& b) {

				+    return a.merge(b.histogram());

				+}

				/**

				 * This function implements a two-dimensional map reduce where

				@@ -68,10 +58,10 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				 * @param initial_value - the initial value to use for both aggregations

				 * @return A future that resolves to the result of the aggregation.

				 */

				-template<typename V, typename Reducer, typename F>

				+template<typename V, typename Reducer, typename F, typename C>

				future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				-        V F::*f, Reducer reducer, V initial_value) {

				-    return two_dimensional_map_reduce(d, [f] (F& stats) {

				+        C F::*f, Reducer reducer, V initial_value) {

				+    return two_dimensional_map_reduce(d, [f] (F& stats) -> V {

				        return stats.*f;

				    }, reducer, initial_value);

				}

				@@ -125,10 +115,10 @@ utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimat

				    return res;

				}

				-static future<json::json_return_type>  sum_estimated_histogram(http_context& ctx, utils::time_estimated_histogram service::storage_proxy_stats::stats::*f) {

				-    return two_dimensional_map_reduce(ctx.sp, f, utils::time_estimated_histogram_merge,

				-            utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {

				+static future<json::json_return_type>  sum_estimated_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {

				+    return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {

				+        return (stats.*f).histogram();

				+    }, utils::time_estimated_histogram_merge, utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {

				        return make_ready_future<json::json_return_type>(time_to_json_histogram(val));

				    });

				}

				@@ -143,7 +133,7 @@ static future<json::json_return_type>  sum_estimated_histogram(http_context& ctx

				    });

				}

				-static future<json::json_return_type>  total_latency(http_context& ctx, utils::timed_rate_moving_average_and_histogram service::storage_proxy_stats::stats::*f) {

				+static future<json::json_return_type>  total_latency(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {

				    return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {

				            return (stats.*f).hist.mean * (stats.*f).hist.count;

				        }, std::plus<double>(), 0.0).then([](double val) {

				@@ -163,7 +153,7 @@ static future<json::json_return_type>  total_latency(http_context& ctx, utils::t

				template<typename F>

				future<json::json_return_type>

				sum_histogram_stats_storage_proxy(distributed<proxy>& d,

				-        utils::timed_rate_moving_average_and_histogram F::*f) {

				+        utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

				        return (stats.*f).hist;

				    }, std::plus<utils::ihistogram>(), utils::ihistogram()).

				@@ -183,7 +173,7 @@ sum_histogram_stats_storage_proxy(distributed<proxy>& d,

				template<typename F>

				future<json::json_return_type>

				sum_timer_stats_storage_proxy(distributed<proxy>& d,

				-        utils::timed_rate_moving_average_and_histogram F::*f) {

				+        utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

				        return (stats.*f).rate();

				@@ -504,14 +494,14 @@ void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_se

				    });

				    sp::get_read_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				-        return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_read);

				+        return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::read);

				    });

				    sp::get_read_latency.set(r, [&ctx](std::unique_ptr<request> req) {

				        return total_latency(ctx, &service::storage_proxy_stats::stats::read);

				    });

				    sp::get_write_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				-        return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_write);

				+        return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::write);

				    });

				    sp::get_write_latency.set(r, [&ctx](std::unique_ptr<request> req) {

									
										15

api/storage_proxy.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

634

api/storage_service.cc

File diff suppressed because it is too large

									
										51

api/storage_service.hh
									
				@@ -3,24 +3,13 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include <iostream>

				#include <seastar/core/sharded.hh>

				#include "api.hh"

				#include "db/data_listeners.hh"

				@@ -32,6 +21,7 @@ class snapshot_ctl;

				namespace view {

				class view_builder;

				}

				+class system_keyspace;

				}

				namespace netw { class messaging_service; }

				class repair_service;

				@@ -46,7 +36,30 @@ class gossiper;

				namespace api {

				-void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs);

				+// verify that the keyspace parameter is found, otherwise a bad_param_exception exception is thrown

				+// containing the description of the respective keyspace error.

				+sstring validate_keyspace(http_context& ctx, const parameters& param);

				+// splits a request parameter assumed to hold a comma-separated list of table names

				+// verify that the tables are found, otherwise a bad_param_exception exception is thrown

				+// containing the description of the respective no_such_column_family error.

				+// Returns an empty vector if no parameter was found.

				+// If the parameter is found and empty, returns a list of all table names in the keyspace.

				+std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);

				+struct table_info {

				+    sstring name;

				+    table_id id;

				+};

				+// splits a request parameter assumed to hold a comma-separated list of table names

				+// verify that the tables are found, otherwise a bad_param_exception exception is thrown

				+// containing the description of the respective no_such_column_family error.

				+// Returns a vector of all table infos given by the parameter, or

				+// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.

				+std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);

				+void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ls);

				void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader);

				void unset_sstables_loader(http_context& ctx, routes& r);

				void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb);

				@@ -61,4 +74,10 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				void unset_snapshot(http_context& ctx, routes& r);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);

				-}

				+} // namespace api

				+namespace std {

				+std::ostream& operator<<(std::ostream& os, const api::table_info& ti);

				+} // namespace std
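
The comments on `parse_tables` in this header describe three cases: the parameter is absent, the parameter is present but empty, or it holds a comma-separated list of table names. The following is a minimal standalone sketch of that contract, not the ScyllaDB implementation. The names `parse_tables_sketch` and `known_tables` are hypothetical, and `std::invalid_argument` stands in for `bad_param_exception`.

```cpp
#include <algorithm>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the list of tables in one keyspace.
using table_list = std::vector<std::string>;

// Sketch of the documented contract:
//  - parameter missing        -> empty vector
//  - parameter present, empty -> every table in the keyspace
//  - otherwise                -> the listed tables, each checked against known_tables
std::vector<std::string> parse_tables_sketch(
        const table_list& known_tables,
        const std::unordered_map<std::string, std::string>& query_params,
        const std::string& param_name) {
    auto it = query_params.find(param_name);
    if (it == query_params.end()) {
        return {};
    }
    if (it->second.empty()) {
        return known_tables;
    }
    std::vector<std::string> names;
    std::istringstream in(it->second);
    std::string name;
    while (std::getline(in, name, ',')) {
        if (std::find(known_tables.begin(), known_tables.end(), name) == known_tables.end()) {
            // Assumption: the real code throws bad_param_exception here.
            throw std::invalid_argument("no such table: " + name);
        }
        names.push_back(name);
    }
    return names;
}
```

By the header comments, `parse_table_infos` differs in one case: a missing parameter is treated like an empty one and yields all tables in the keyspace.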

									
										54

api/stream_manager.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "stream_manager.hh"

				@@ -87,13 +74,13 @@ static hs::stream_state get_state(

				    return state;

				}

				-void set_stream_manager(http_context& ctx, routes& r) {

				+void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_manager>& sm) {

				    hs::get_current_streams.set(r,

				-            [] (std::unique_ptr<request> req) {

				-                return streaming::get_stream_manager().invoke_on_all([] (auto& sm) {

				+            [&sm] (std::unique_ptr<request> req) {

				+                return sm.invoke_on_all([] (auto& sm) {

				                    return sm.update_all_progress_info();

				-                }).then([] {

				-                    return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& stream) {

				+                }).then([&sm] {

				+                    return sm.map_reduce0([](streaming::stream_manager& stream) {

				                        std::vector<hs::stream_state> res;

				                        for (auto i : stream.get_initiated_streams()) {

				                            res.push_back(get_state(*i.second.get()));

				@@ -109,17 +96,17 @@ void set_stream_manager(http_context& ctx, routes& r) {

				                });

				            });

				-    hs::get_all_active_streams_outbound.set(r, [](std::unique_ptr<request> req) {

				-        return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& stream) {

				+    hs::get_all_active_streams_outbound.set(r, [&sm](std::unique_ptr<request> req) {

				+        return sm.map_reduce0([](streaming::stream_manager& stream) {

				            return stream.get_initiated_streams().size();

				        }, 0, std::plus<int64_t>()).then([](int64_t res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				-    hs::get_total_incoming_bytes.set(r, [](std::unique_ptr<request> req) {

				+    hs::get_total_incoming_bytes.set(r, [&sm](std::unique_ptr<request> req) {

				        gms::inet_address peer(req->param["peer"]);

				-        return streaming::get_stream_manager().map_reduce0([peer](streaming::stream_manager& sm) {

				+        return sm.map_reduce0([peer](streaming::stream_manager& sm) {

				            return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {

				                return sbytes.bytes_received;

				            });

				@@ -128,8 +115,8 @@ void set_stream_manager(http_context& ctx, routes& r) {

				        });

				    });

				-    hs::get_all_total_incoming_bytes.set(r, [](std::unique_ptr<request> req) {

				-        return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& sm) {

				+    hs::get_all_total_incoming_bytes.set(r, [&sm](std::unique_ptr<request> req) {

				+        return sm.map_reduce0([](streaming::stream_manager& sm) {

				            return sm.get_progress_on_all_shards().then([] (auto sbytes) {

				                return sbytes.bytes_received;

				            });

				@@ -138,9 +125,9 @@ void set_stream_manager(http_context& ctx, routes& r) {

				        });

				    });

				-    hs::get_total_outgoing_bytes.set(r, [](std::unique_ptr<request> req) {

				+    hs::get_total_outgoing_bytes.set(r, [&sm](std::unique_ptr<request> req) {

				        gms::inet_address peer(req->param["peer"]);

				-        return streaming::get_stream_manager().map_reduce0([peer] (streaming::stream_manager& sm) {

				+        return sm.map_reduce0([peer] (streaming::stream_manager& sm) {

				            return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {

				                return sbytes.bytes_sent;

				            });

				@@ -149,8 +136,8 @@ void set_stream_manager(http_context& ctx, routes& r) {

				        });

				    });

				-    hs::get_all_total_outgoing_bytes.set(r, [](std::unique_ptr<request> req) {

				-        return streaming::get_stream_manager().map_reduce0([](streaming::stream_manager& sm) {

				+    hs::get_all_total_outgoing_bytes.set(r, [&sm](std::unique_ptr<request> req) {

				+        return sm.map_reduce0([](streaming::stream_manager& sm) {

				            return sm.get_progress_on_all_shards().then([] (auto sbytes) {

				                return sbytes.bytes_sent;

				            });

				@@ -160,4 +147,13 @@ void set_stream_manager(http_context& ctx, routes& r) {

				    });

				}

				+void unset_stream_manager(http_context& ctx, routes& r) {

				+    hs::get_current_streams.unset(r);

				+    hs::get_all_active_streams_outbound.unset(r);

				+    hs::get_total_incoming_bytes.unset(r);

				+    hs::get_all_total_incoming_bytes.unset(r);

				+    hs::get_total_outgoing_bytes.unset(r);

				+    hs::get_all_total_outgoing_bytes.unset(r);

				+}

				}
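
The handlers in this file sum a per-shard value by calling `map_reduce0` with a mapper, an initial value of 0, and `std::plus<int64_t>`. The sketch below shows the same map-then-fold shape in plain, synchronous C++ over a vector, with no Seastar dependency. `shard_stats`, `map_reduce0_sketch`, and the two `total_*` helpers are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical per-shard state; stands in for one stream_manager instance.
struct shard_stats {
    int64_t bytes_received;
    int64_t bytes_sent;
};

// Map each shard to one number, then fold the results with std::plus,
// starting from 0. The real map_reduce0 runs the mapper on every shard
// asynchronously; this sketch runs on one thread over a plain vector.
template <typename Mapper>
int64_t map_reduce0_sketch(const std::vector<shard_stats>& shards, Mapper mapper) {
    return std::accumulate(shards.begin(), shards.end(), int64_t{0},
            [&mapper] (int64_t acc, const shard_stats& s) {
                return std::plus<int64_t>()(acc, mapper(s));
            });
}

int64_t total_incoming_bytes(const std::vector<shard_stats>& shards) {
    return map_reduce0_sketch(shards, [] (const shard_stats& s) { return s.bytes_received; });
}

int64_t total_outgoing_bytes(const std::vector<shard_stats>& shards) {
    return map_reduce0_sketch(shards, [] (const shard_stats& s) { return s.bytes_sent; });
}
```

Starting the fold from 0 means an empty shard list yields 0 rather than an error.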

									
										18

api/stream_manager.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				@@ -25,6 +12,7 @@

				namespace api {

				-void set_stream_manager(http_context& ctx, routes& r);

				+void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_manager>& sm);

				+void unset_stream_manager(http_context& ctx, routes& r);

				}

									
										29

api/system.cc
									
				@@ -3,20 +3,7 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "api/api-doc/system.json.hh"

				@@ -25,7 +12,7 @@

				#include <seastar/core/reactor.hh>

				#include <seastar/http/exception.hh>

				#include "log.hh"

				-#include "database.hh"

				+#include "replica/database.hh"

				extern logging::logger apilog;

				@@ -74,9 +61,19 @@ void set_system(http_context& ctx, routes& r) {

				        return json::json_void();

				    });

				+    hs::write_log_message.set(r, [](const_req req) {

				+        try {

				+            logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));

				+            apilog.log(level, "/system/log: {}", std::string(req.get_query_param("message")));

				+        } catch (boost::bad_lexical_cast& e) {

				+            throw bad_param_exception("Unknown logging level " + req.get_query_param("level"));

				+        }

				+        return json::json_void();

				+    });

				    hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {

				        apilog.info("Dropping sstable caches");

				-        return ctx.db.invoke_on_all([] (database& db) {

				+        return ctx.db.invoke_on_all([] (replica::database& db) {

				            return db.drop_caches();

				        }).then([] {

				            apilog.info("Caches dropped");

									
										15

api/system.hh
									
				@@ -3,20 +3,7 @@

				 */

				/*

				- * This file is part of Scylla.

				- *

				- * Scylla is free software: you can redistribute it and/or modify

				- * it under the terms of the GNU Affero General Public License as published by

				- * the Free Software Foundation, either version 3 of the License, or

				- * (at your option) any later version.

				- *

				- * Scylla is distributed in the hope that it will be useful,

				- * but WITHOUT ANY WARRANTY; without even the implied warranty of

				- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				- * GNU General Public License for more details.

				- *

				- * You should have received a copy of the GNU General Public License

				- * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				+ * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										231

api/task_manager.cc
									
										Normal file
									
				@@ -0,0 +1,231 @@

				/*

				 * Copyright (C) 2022-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include <seastar/core/coroutine.hh>

				#include "task_manager.hh"

				#include "api/api-doc/task_manager.json.hh"

				#include "db/system_keyspace.hh"

				#include "column_family.hh"

				#include "unimplemented.hh"

				#include "storage_service.hh"

				#include <utility>

				#include <boost/range/adaptors.hpp>

				namespace api {

				namespace tm = httpd::task_manager_json;

				using namespace json;

				inline bool filter_tasks(tasks::task_manager::task_ptr task, std::unordered_map<sstring, sstring>& query_params) {

				    return (!query_params.contains("keyspace") || query_params["keyspace"] == task->get_status().keyspace) &&

				        (!query_params.contains("table") || query_params["table"] == task->get_status().table);

				}

				struct full_task_status {

				    tasks::task_manager::task::status task_status;

				    std::string type;

				    tasks::task_manager::task::progress progress;

				    std::string module;

				    tasks::task_id parent_id;

				    tasks::is_abortable abortable;

				    std::vector<std::string> children_ids;

				};

				struct task_stats {

				    task_stats(tasks::task_manager::task_ptr task)

				        : task_id(task->id().to_sstring())

				        , state(task->get_status().state)

				        , type(task->type())

				        , keyspace(task->get_status().keyspace)

				        , table(task->get_status().table)

				        , entity(task->get_status().entity)

				        , sequence_number(task->get_status().sequence_number)

				    { }

				    sstring task_id;

				    tasks::task_manager::task_state state;

				    std::string type;

				    std::string keyspace;

				    std::string table;

				    std::string entity;

				    uint64_t sequence_number;

				};

				tm::task_status make_status(full_task_status status) {

				    auto start_time = db_clock::to_time_t(status.task_status.start_time);

				    auto end_time = db_clock::to_time_t(status.task_status.end_time);

				    ::tm st, et;

				    ::gmtime_r(&end_time, &et);

				    ::gmtime_r(&start_time, &st);

				    tm::task_status res{};

				    res.id = status.task_status.id.to_sstring();

				    res.type = status.type;

				    res.state = status.task_status.state;

				    res.is_abortable = bool(status.abortable);

				    res.start_time = st;

				    res.end_time = et;

				    res.error = status.task_status.error;

				    res.parent_id = status.parent_id.to_sstring();

				    res.sequence_number = status.task_status.sequence_number;

				    res.shard = status.task_status.shard;

				    res.keyspace = status.task_status.keyspace;

				    res.table = status.task_status.table;

				    res.entity = status.task_status.entity;

				    res.progress_units = status.task_status.progress_units;

				    res.progress_total = status.progress.total;

				    res.progress_completed = status.progress.completed;

				    res.children_ids = std::move(status.children_ids);

				    return res;

				}

				future<full_task_status> retrieve_status(const tasks::task_manager::foreign_task_ptr& task) {

				    if (task.get() == nullptr) {

				        co_return coroutine::return_exception(httpd::bad_param_exception("Task not found"));

				    }

				    auto progress = co_await task->get_progress();

				    full_task_status s;

				    s.task_status = task->get_status();

				    s.type = task->type();

				    s.parent_id = task->get_parent_id();

				    s.abortable = task->is_abortable();

				    s.module = task->get_module_name();

				    s.progress.completed = progress.completed;

				    s.progress.total = progress.total;

				    std::vector<std::string> ct{task->get_children().size()};

				    boost::transform(task->get_children(), ct.begin(), [] (const auto& child) {

				        return child->id().to_sstring();

				    });

				    s.children_ids = std::move(ct);

				    co_return s;

				}

				void set_task_manager(http_context& ctx, routes& r) {

				    tm::get_modules.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        std::vector<std::string> v = boost::copy_range<std::vector<std::string>>(ctx.tm.local().get_modules() | boost::adaptors::map_keys);

				        co_return v;

				    });

				    tm::get_tasks.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        using chunked_stats = utils::chunked_vector<task_stats>;

				        auto internal = tasks::is_internal{req_param<bool>(*req, "internal", false)};

				        std::vector<chunked_stats> res = co_await ctx.tm.map([&req, internal] (tasks::task_manager& tm) {

				            chunked_stats local_res;

				            auto module = tm.find_module(req->param["module"]);

				            const auto& filtered_tasks = module->get_tasks() | boost::adaptors::filtered([&params = req->query_parameters, internal] (const auto& task) {

				                return (internal || !task.second->is_internal()) && filter_tasks(task.second, params);

				            });

				            for (auto& [task_id, task] : filtered_tasks) {

				                local_res.push_back(task_stats{task});

				            }

				            return local_res;

				        });

				        std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				            auto s = std::move(os);

				            auto res = std::move(r);

				            co_await s.write("[");

				            std::string delim = "";

				            for (auto& v: res) {

				                for (auto& stats: v) {

				                    co_await s.write(std::exchange(delim, ", "));

				                    tm::task_stats ts;

				                    ts = stats;

				                    co_await formatter::write(s, ts);

				                }

				            }

				            co_await s.write("]");

				            co_await s.close();

				        };

				        co_return std::move(f);

				    });

				    tm::get_task_status.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};

				        auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {

				            auto state = task->get_status().state;

				            if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {

				                task->unregister_task();

				            }

				            co_return std::move(task);

				        }));

				        auto s = co_await retrieve_status(task);

				        co_return make_status(s);

				    });

				    tm::abort_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};

				        co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {

				            if (!task->is_abortable()) {

				                co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));

				            }

				            co_await task->abort();

				        });

				        co_return json_void();

				    });

				    tm::wait_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};

				        auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) {

				            return task->done().then_wrapped([task] (auto f) {

				                task->unregister_task();

				                // done() is called only because we want the task to be complete before getting its status.

				                // The future should be ignored here as the result does not matter.

				                f.ignore_ready_future();

				                return make_foreign(task);

				            });

				        }));

				        auto s = co_await retrieve_status(task);

				        co_return make_status(s);

				    });

				    tm::get_task_status_recursively.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto& _ctx = ctx;

				        auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};

				        std::queue<tasks::task_manager::foreign_task_ptr> q;

				        utils::chunked_vector<full_task_status> res;

				        // Get requested task.

				        auto task = co_await tasks::task_manager::invoke_on_task(_ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {

				            auto state = task->get_status().state;

				            if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {

				                task->unregister_task();

				            }

				            co_return task;

				        }));

				        // Push children's statuses in BFS order.

				        q.push(co_await task.copy());   // Task cannot be moved since we need it to be alive during whole loop execution.

				        while (!q.empty()) {

				            auto& current = q.front();

				            res.push_back(co_await retrieve_status(current));

				            for (auto& child: current->get_children()) {

				                q.push(co_await child.copy());

				            }

				            q.pop();

				        }

				        std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				            auto s = std::move(os);

				            auto res = std::move(r);

				            co_await s.write("[");

				            std::string delim = "";

				            for (auto& status: res) {

				                co_await s.write(std::exchange(delim, ", "));

				                co_await formatter::write(s, make_status(status));

				            }

				            co_await s.write("]");

				            co_await s.close();

				        };

				        co_return f;

				    });

				}

				}
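
The `get_task_status_recursively` handler in this file collects a task and its descendants in BFS order with a `std::queue`. The same traversal is sketched below in standalone, synchronous C++. `task_node` and `collect_ids_bfs` are hypothetical names, and the sketch uses `std::shared_ptr` where the handler walks `foreign_task_ptr` objects and awaits each status.

```cpp
#include <memory>
#include <queue>
#include <string>
#include <vector>

// Hypothetical task node holding an id and its child tasks.
struct task_node {
    std::string id;
    std::vector<std::shared_ptr<task_node>> children;
};

// Visit the root first, then its children, then grandchildren, and so on.
std::vector<std::string> collect_ids_bfs(const std::shared_ptr<task_node>& root) {
    std::vector<std::string> ids;
    if (!root) {
        return ids;
    }
    std::queue<std::shared_ptr<task_node>> q;
    q.push(root);
    while (!q.empty()) {
        // Copy the pointer so the node stays alive after pop().
        auto current = q.front();
        q.pop();
        ids.push_back(current->id);
        for (const auto& child : current->children) {
            q.push(child);
        }
    }
    return ids;
}
```

The handler instead keeps a reference to `q.front()` and pops only after the children are queued, which keeps the current task alive for the whole loop body without an extra copy.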

									
										17

api/task_manager.hh
									
										Normal file
									
				@@ -0,0 +1,17 @@

				/*

				 * Copyright (C) 2022-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

				#include "api.hh"

				namespace api {

				void set_task_manager(http_context& ctx, routes& r);

				}

									
										107

api/task_manager_test.cc
									
										Normal file
									
				@@ -0,0 +1,107 @@

				/*

				 * Copyright (C) 2022-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#ifndef SCYLLA_BUILD_MODE_RELEASE

				#include <seastar/core/coroutine.hh>

				#include "task_manager_test.hh"

				#include "api/api-doc/task_manager_test.json.hh"

				#include "tasks/test_module.hh"

				namespace api {

				namespace tmt = httpd::task_manager_test_json;

				using namespace json;

				void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg) {

				    tmt::register_test_module.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) {

				            auto m = make_shared<tasks::test_module>(tm);

				            tm.register_module("test", m);

				        });

				        co_return json_void();

				    });

				    tmt::unregister_test_module.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) -> future<> {

				            auto module_name = "test";

				            auto module = tm.find_module(module_name);

				            co_await module->stop();

				        });

				        co_return json_void();

				    });

				    tmt::register_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        sharded<tasks::task_manager>& tms = ctx.tm;

				        auto it = req->query_parameters.find("task_id");

				        auto id = it != req->query_parameters.end() ? tasks::task_id{utils::UUID{it->second}} : tasks::task_id::create_null_id();

				        it = req->query_parameters.find("shard");

				        unsigned shard = it != req->query_parameters.end() ? boost::lexical_cast<unsigned>(it->second) : 0;

				        it = req->query_parameters.find("keyspace");

				        std::string keyspace = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("table");

				        std::string table = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("entity");

				        std::string entity = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("parent_id");

				        tasks::task_info data;

				        if (it != req->query_parameters.end()) {

				            data.id = tasks::task_id{utils::UUID{it->second}};

				            auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(ctx.tm, data.id);

				            data.shard = parent_ptr->get_status().shard;

				        }

				        auto module = tms.local().find_module("test");

				        id = co_await module->make_task<tasks::test_task_impl>(shard, id, keyspace, table, entity, data);

				        co_await tms.invoke_on(shard, [id] (tasks::task_manager& tm) {

				            auto it = tm.get_all_tasks().find(id);

				            if (it != tm.get_all_tasks().end()) {

				                it->second->start();

				            }

				        });

				        co_return id.to_sstring();

				    });

				    tmt::unregister_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};

				        co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {

				            tasks::test_task test_task{task};

				            co_await test_task.unregister_task();

				        });

				        co_return json_void();

				    });

				    tmt::finish_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};

				        auto it = req->query_parameters.find("error");

				        bool fail = it != req->query_parameters.end();

				        std::string error = fail ? it->second : "";

				        co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {

				            tasks::test_task test_task{task};

				            if (fail) {

				                test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));

				            } else {

				                test_task.finish();

				            }

				            return make_ready_future<>();

				        });

				        co_return json_void();

				    });

				    tmt::get_and_update_ttl.set(r, [&ctx, &cfg] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        uint32_t ttl = cfg.task_ttl_seconds();

				        co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);

				        co_return json::json_return_type(ttl);

				    });

				}

				}

				#endif
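`register_test_task` above repeats one pattern for each optional query parameter: look the key up with `find`, and fall back to a default when it is absent. A small sketch of that pattern as helper functions is shown here. It assumes the parameters behave like a `std::unordered_map<std::string, std::string>`; the names `query_params`, `get_or`, and `get_unsigned_or` are hypothetical and are not part of Seastar or ScyllaDB, and `std::stoul` stands in for the `boost::lexical_cast` used in the diff.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Assumed shape of the request's query parameters (illustrative only).
using query_params = std::unordered_map<std::string, std::string>;

// Return the value stored under key, or def when the key is absent.
std::string get_or(const query_params& q, const std::string& key, std::string def) {
    auto it = q.find(key);
    return it != q.end() ? it->second : std::move(def);
}

// Same lookup, parsed as an unsigned integer; std::stoul throws on malformed input.
unsigned get_unsigned_or(const query_params& q, const std::string& key, unsigned def) {
    auto it = q.find(key);
    return it != q.end() ? static_cast<unsigned>(std::stoul(it->second)) : def;
}
```

Using `find` rather than `operator[]` keeps the lookup read-only, so a missing parameter does not insert an empty entry into the map.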

									
										22

api/task_manager_test.hh
									
										Normal file
									
												
				@@ -0,0 +1,22 @@

				/*

				 * Copyright (C) 2022-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#ifndef SCYLLA_BUILD_MODE_RELEASE

				#pragma once

				#include "api.hh"

				#include "db/config.hh"

				namespace api {

				void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg);

				}

				#endif

									
										51

atomic_cell.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "atomic_cell.hh"

				@@ -79,36 +66,48 @@ atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)

				    set_view(_data);

				}

				// Based on:

				//  - org.apache.cassandra.db.AbstractCell#reconcile()

				//  - org.apache.cassandra.db.BufferExpiringCell#reconcile()

				//  - org.apache.cassandra.db.BufferDeletedCell#reconcile()

				// Based on Cassandra's resolveRegular function:

				//  - https://github.com/apache/cassandra/blob/e4f31b73c21b04966269c5ac2d3bd2562e5f6c63/src/java/org/apache/cassandra/db/rows/Cells.java#L79-L119

				//

				// Note: the ordering algorithm for cell is the same as for rows,

				// except that the cell value is used to break a tie in case all other attributes are equal.

				// See compare_row_marker_for_merge.

				std::strong_ordering

				compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {

				    // Largest write timestamp wins.

				    if (left.timestamp() != right.timestamp()) {

				        return left.timestamp() <=> right.timestamp();

				    }

				    // Tombstones always win reconciliation with live cells of the same timestamp

				    if (left.is_live() != right.is_live()) {

				        return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;

				    }

				    if (left.is_live()) {

				        auto c = compare_unsigned(left.value(), right.value()) <=> 0;

				        if (c != 0) {

				            return c;

				        }

				        // Prefer expiring cells (which will become tombstones at some future date) over live cells.

				        // See https://issues.apache.org/jira/browse/CASSANDRA-14592

				        if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {

				            // prefer expiring cells.

				            return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;

				        }

				        // If both are expiring, choose the cell with the latest expiry or derived write time.

				        if (left.is_live_and_has_ttl()) {

				            // Prefer cell with latest expiry

				            if (left.expiry() != right.expiry()) {

				                return left.expiry() <=> right.expiry();

				            } else {

				                // prefer the cell that was written later,

				                // so it survives longer after it expires, until purged.

				            } else if (right.ttl() != left.ttl()) {

				                // The cell write time is derived by (expiry - ttl).

				                // Prefer the cell that was written later,

				                // so it survives longer after it expires, until purged,

				                // as it become purgeable gc_grace_seconds after it was written.

				                //

				                // Note that this is an extension to Cassandra's algorithm

				                // which stops at the expiration time, and if equal,

				                // move forward to compare the cell values.

				                return right.ttl() <=> left.ttl();

				            }

				        }

				        // The cell with the largest value wins, if all other attributes of the cells are identical.

				        // This is quite arbitrary, but still required to break the tie in a deterministic way.

				        return compare_unsigned(left.value(), right.value());

				    } else {

				        // Both are deleted
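The comments in this hunk describe a tie-breaking chain for live cells: write timestamp, then liveness, then presence of a TTL, then expiry, then the derived write time (expiry minus TTL), and finally the value. The sketch below restates that chain on a simplified, hypothetical `cell` struct so the order of the checks can be read in one place. It is not ScyllaDB's `atomic_cell_view`; it returns a negative, zero, or positive `int` instead of `std::strong_ordering`, uses `std::string::compare` as a stand-in for `compare_unsigned`, and treats two dead cells as a tie because that branch is cut off in the hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical simplified cell, for illustration only.
struct cell {
    int64_t timestamp;
    bool live;
    std::string value;              // meaningful only when live
    std::optional<int64_t> expiry;  // set only for expiring cells
    int64_t ttl = 0;                // meaningful only when expiry is set
};

// Three-way helper: -1, 0, or 1.
template <typename T>
int cmp(const T& a, const T& b) {
    return a < b ? -1 : (b < a ? 1 : 0);
}

// Positive result means `l` wins the merge, negative means `r` wins.
int compare_for_merge(const cell& l, const cell& r) {
    // Largest write timestamp wins.
    if (l.timestamp != r.timestamp) {
        return cmp(l.timestamp, r.timestamp);
    }
    // A tombstone beats a live cell with the same timestamp.
    if (l.live != r.live) {
        return l.live ? -1 : 1;
    }
    if (!l.live) {
        return 0;  // assumption: the dead-vs-dead rules are not shown in the hunk
    }
    // An expiring cell beats a non-expiring one.
    if (l.expiry.has_value() != r.expiry.has_value()) {
        return l.expiry.has_value() ? 1 : -1;
    }
    if (l.expiry.has_value()) {
        // Latest expiry wins.
        if (*l.expiry != *r.expiry) {
            return cmp(*l.expiry, *r.expiry);
        }
        // Equal expiry: the smaller TTL implies the later write, which wins.
        if (l.ttl != r.ttl) {
            return cmp(r.ttl, l.ttl);
        }
    }
    // Everything else equal: break the tie deterministically on the value.
    return cmp(l.value.compare(r.value), 0);
}
```

With equal timestamps, an expiring cell with `expiry = 100, ttl = 5` beats one with `expiry = 100, ttl = 20`, since its derived write time (95) is later than the other's (80).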

									
										15

atomic_cell.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

atomic_cell_hash.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

atomic_cell_or_collection.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

									
										15

auth/allow_all_authenticator.cc
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#include "auth/allow_all_authenticator.hh"

									
										15

auth/allow_all_authenticator.hh
									
												
				@@ -3,20 +3,7 @@

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 * SPDX-License-Identifier: AGPL-3.0-or-later

				 */

				#pragma once

Compare commits

6325 Commits next-4.6 ... branch-5.2

1 .gitattributes vendored
42 .github/CODEOWNERS vendored
17 .github/workflows/docs-amplify-enhanced.yaml vendored Normal file
35 .github/workflows/docs-pages.yaml vendored Normal file
29 .github/workflows/docs-pages@v2.yaml vendored
28 .github/workflows/docs-pr.yaml vendored Normal file
25 .github/workflows/docs-pr@v1.yaml vendored
4 .gitignore vendored
6 .gitmodules vendored
3 .mailmap Normal file
51 CMakeLists.txt
2 CONTRIBUTING.md
36 HACKING.md
6 README.md
39 SCYLLA-VERSION-GEN
1 abseil
15 absl-flat_hash_map.cc
15 absl-flat_hash_map.hh
29 alternator/auth.cc
17 alternator/auth.hh
15 alternator/conditions.cc
15 alternator/conditions.hh
79 alternator/controller.cc
46 alternator/controller.hh
25 alternator/error.hh
1036 alternator/executor.cc
58 alternator/executor.hh
23 alternator/expressions.cc
15 alternator/expressions.g
21 alternator/expressions.hh
15 alternator/expressions_types.hh
15 alternator/rmw_operation.hh
111 alternator/serialization.cc
26 alternator/serialization.hh
106 alternator/server.cc
26 alternator/server.hh
15 alternator/stats.cc
15 alternator/stats.hh
113 alternator/streams.cc
808 alternator/ttl.cc
80 alternator/ttl.hh Normal file
15 amplify.yml Normal file
29 api/api-doc/authorization_cache.json Normal file
42 api/api-doc/compaction_manager.json
43 api/api-doc/raft.json Normal file
73 api/api-doc/storage_service.json
39 api/api-doc/system.json
305 api/api-doc/task_manager.json Normal file
177 api/api-doc/task_manager_test.json Normal file
121 api/api.cc
91 api/api.hh
53 api/api_init.hh
33 api/authorization_cache.cc Normal file
18 api/authorization_cache.hh Normal file
31 api/cache_service.cc
15 api/cache_service.hh
22 api/collectd.cc
15 api/collectd.hh
295 api/column_family.cc
41 api/column_family.hh
21 api/commitlog.cc
15 api/commitlog.hh
59 api/compaction_manager.cc
15 api/compaction_manager.hh
15 api/config.cc
15 api/config.hh
62 api/endpoint_snitch.cc
22 api/endpoint_snitch.hh
17 api/error_injection.cc
15 api/error_injection.hh
108 api/failure_detector.cc
15 api/failure_detector.hh
39 api/gossiper.cc
15 api/gossiper.hh
15 api/hinted_handoff.cc
15 api/hinted_handoff.hh
19 api/lsa.cc
15 api/lsa.hh


15

api/messaging_service.cc

View File