scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 05:26:58 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	3cb01f218f	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev.git/raft-confchange-test-v4: raft: fix spelling raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::in_memory_size() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-03-03 16:29:40 +01:00
Tomasz Grabiec	0dc57db248	Revert "Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja" This reverts commit `f94f70cda8`, reversing changes made to `5206a97915`. Not the latest version of the series was merged. Revert prior to merging the latest one.	2021-03-03 16:29:02 +01:00
Avi Kivity	5f4bf18387	Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros" This reverts commit `31909515b3`, reversing changes made to `ef97adc72a`. It shows many serious regressions in dtest. Fixes #8197.	2021-03-02 13:21:22 +02:00
Botond Dénes	257c295cff	cql_query_test: add unit test for the more efficient range scan result format The most user-visible aspect of this change is range scans which select a small subset of the columns. These queries work as the user expects them to work: unselected columns are not included in determining the size of the result (or that of the page). This is the aspect this test is checking for. While at it, also test single partition queries too.	2021-03-02 08:01:53 +02:00
Botond Dénes	fe280271a6	cql_query_test: test_query_limit: clean up scheduling groups Destroy scheduling groups created for this test, so other tests can create scheduling groups with the same name, without conflicts.	2021-03-02 07:53:53 +02:00
Avi Kivity	8747c684e0	Merge 'Move timeouts to client state' from Piotr Sarna This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests. The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867). Closes #8140 * github.com:scylladb/scylla: treewide: remove timeout config from query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options service: add timeout config to client state	2021-03-01 20:34:35 +02:00
Tomasz Grabiec	cb0b8d1903	row_cache: Zap dummy entries when populating or reading a range This will prevent accumulation of unnecessary dummy entries. A single-partition populating scan with clustering key restrictions will insert dummy entries positioned at the boundaries of the clustering query range to mark the newly populated range as continuous. Those dummy entries may accumulate with time, increasing the cost of the scan, which needs to walk over them. In some workloads we could prevent this. If a populating query overlaps with dummy entries, we could erase the old dummy entry since it will not be needed, it will fall inside a broader continuous range. This will be the case for time series workloads which scan with a decreasing (newest) lower bound. Refs #8153. _last_row is now updated atomically with _next_row. Before, _last_row was moved first. If an exception was thrown and the section was retried, this could cause the wrong entry to be removed (new next instead of old last) by the new algorithm. I don't think this was causing problems before this patch. The problem is not solved for all the cases. After this patch, we remove dummies only when there is a single MVCC version. We could patch apply_monotonically() to also do it, so that dummies which are inside continuous ranges are eventually removed, but this is left for later. perf_row_cache_reads output after that patch shows that the second scan touches no dummies: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 265320 Scanning read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB] read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB] Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>	2021-03-01 20:34:35 +02:00
Avi Kivity	31909515b3	Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that keeps all its versions that are referenced somewhere and provides a way of getting a reference to an immutable version of the set. Each sstable in the set is associated with the versions it is alive in, and is removed when all such versions don't have references anymore. To avoid copying, the object holding all sstables in the set version is changed to a new structure, sstable_list, which was previously an alias for std::unordered_set<shared_sstable>, and which implements most of the methods of an unordered_set, but its iterator uses the actual set with all sstables from all referenced versions and iterates over those sstables that belong to the captured version. The methods that modify the set's contents give strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. To release shared_sstables as soon as possible (i.e. when all references to versions that contain them die), each time a version is removed, all sstables that were referenced exclusively by this version are erased. We are able to find these sstables efficiently by storing, for each version, all sstables that were added and erased in it, and, when a version is removed, merging it with the next one. When a version that adds an sstable gets merged with a version that removes it, this sstable is erased. 
Fixes #2622 Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.com Closes #8111 * github.com:scylladb/scylla: sstables: add test for checking the latency of updating the sstable_set in a table sstables: move column_family_test class from test/boost to test/lib sstables: use fast copying of the sstable_set instead of rebuilding it sstables: replace the sstable_set with a versioned structure sstables: remove potential ub sstables: make sstable_set constructor less error-prone	2021-03-01 14:16:36 +02:00
Botond Dénes	694f8a4ec6	mutation_fragment_stream_validating_filter: make validation levels more fine-grained Currently key order validation for the mutation fragment stream validating filter is all or nothing. Either no keys (partition or clustering) are validated or all of them. As we suspect that clustering key order validation would add a significant overhead, this discourages turning key validation on, which means we miss out on partition key monotonicity validation which has a much more moderate cost. This patch makes this configurable in a more fine-grained fashion, providing separate levels for partition and clustering key monotonicity validation. As the choice for the default validation level is not as clear-cut as before, the default value for the validation level is removed in the validating filter's constructor.	2021-03-01 07:49:23 +02:00
Botond Dénes	1d9b5911fe	time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free The optimal path of said method mistakenly captures `pos` (a local variable) in its reader factory method and passes a temporary range implicitly constructed from said `pos` as the range parameter to the sstable reader. This will lead to the sstable reader using a dangling range and will result in returning no result for queries. This patch fixes this bug and adds a unit test to cover this code path. Fixes #8138. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>	2021-02-26 23:57:25 +02:00
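The bug class described in `1d9b5911fe` can be sketched in isolation. This is a minimal illustration under stated assumptions, not Scylla's code: `key_range`, `reader` and `make_reader_factory` are hypothetical names. The point it shows is that a reader factory should own the range it hands out, instead of passing a reference to a temporary built from a local variable.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-ins; none of these are Scylla types.
struct key_range {
    std::string start;
    std::string end;
};

struct reader {
    // The reader shares ownership of the range, so the range cannot be
    // destroyed while the reader is still using it.
    std::shared_ptr<const key_range> range;

    bool covers(const std::string& key) const {
        return range->start <= key && key <= range->end;
    }
};

// Buggy shape (not shown as code because it is undefined behaviour):
// constructing the range as a temporary from a local `pos` and letting the
// reader keep a reference to it leaves the reader with a dangling range.
//
// Safer shape: build the range once, keep it alive in the closure, and let
// every reader created by the factory share it.
inline std::function<reader()> make_reader_factory(const std::string& pos) {
    auto owned = std::make_shared<const key_range>(key_range{pos, pos});
    return [owned] { return reader{owned}; };
}

// Small usage check: the reader stays valid after the factory is gone.
inline void reader_factory_demo() {
    reader r = make_reader_factory("k1")();
    assert(r.covers("k1"));
    assert(!r.covers("k2"));
}
```

Shared ownership is only one way to express the fix; storing the range by value inside the reader would work equally well for a sketch like this.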
Botond Dénes	dd5a601aaa	result_memory_accounter: abort unpaged queries hitting the global limit The `result_memory_accounter` terminates a query if it reaches either the global or shard-local limit. This used to be so only for paged queries, unpaged ones could grow indefinitely (until the node OOM'd). This was changed in `fea5067` which enforces the local limit on unpaged queries as well, by aborting them. However a loophole remained in the code: `result_memory_accounter::check_and_update()` has another stop condition, besides `check_local_limit()`, it also checks the global limit. This stop condition was not updated to enforce itself on unpaged queries by aborting them, instead it silently terminated them, causing them to return less data than requested. This was masked by most queries reaching the local limit first. This patch fixes this by aborting unpaged mutation queries when they hit the global limit. Fixes: #8162 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>	2021-02-26 23:43:16 +02:00
Botond Dénes	bc1fcd3db2	multishard_combining_reader: only read from needed shards The multishard combining reader currently assumes that all shards have data for the read range. This however is not always true and in extreme cases (like reading a single token) it can lead to huge read amplification. Avoid this by not pushing shards to `_shard_selection_min_heap` if the first token they are expected to produce falls outside of the read range. Also change the read ahead algorithm to select the shards from `_shard_selection_min_heap`, instead of walking them in shard order. This was wrong in two ways: * Shards may be ordered differently with respect to the first partition they will produce; reading ahead on the next shard in shard order might not bring in data on the next shard the read will continue on. Shard order is only correct when starting a new range and shards are iterated over in the order they own tokens according to the sharding algorithm. * Shards that may not have data relevant to the read range are also considered for read ahead. After this patch, the multishard reader will only read from shards that have data relevant to the read range, both in the case of normal reads and also for read-ahead. Fixes: #8161 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>	2021-02-26 23:29:20 +02:00
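The shard selection idea in `bc1fcd3db2` can be sketched roughly as follows. This is an illustration under assumptions, not the multishard reader itself: tokens are modelled as plain integers, and `shard_hint` and `shards_to_read` are hypothetical names. Shards whose first expected token falls outside the read range are never pushed onto the min-heap, and the remaining shards are visited in token order instead of shard order.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical model: each shard is described only by the first token it is
// expected to produce for the current read.
using token = std::int64_t;

struct shard_hint {
    unsigned shard;
    token first_token;
};

// Returns the shards worth reading from, ordered by the first token they
// will produce. Shards with nothing inside [range_start, range_end] are
// skipped, which avoids read amplification for narrow ranges.
inline std::vector<unsigned> shards_to_read(const std::vector<shard_hint>& hints,
                                            token range_start, token range_end) {
    using entry = std::pair<token, unsigned>;
    // std::greater turns std::priority_queue into a min-heap on the token.
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap;
    for (const auto& h : hints) {
        if (h.first_token >= range_start && h.first_token <= range_end) {
            heap.emplace(h.first_token, h.shard);
        }
    }
    std::vector<unsigned> order;
    while (!heap.empty()) {
        order.push_back(heap.top().second);
        heap.pop();
    }
    return order;
}

// Usage check: shard 2 has no data in the range and is never selected.
inline void shard_selection_demo() {
    std::vector<shard_hint> hints{{0, 50}, {1, 10}, {2, 500}, {3, 30}};
    std::vector<unsigned> expected{1, 3, 0};
    assert(shards_to_read(hints, 0, 100) == expected);
}
```

Read-ahead would pick its candidates from the same heap, so it also follows token order and never touches the skipped shards.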
Piotr Sarna	0e0282cdf1	Merge ' cdc: move (most of) CDC generation management to a new service' from Kamil Braun Currently all management of CDC generations happens in storage_service, which is a big ball of mud that does many unrelated things. This PR introduces a new service crafted to handle CDC generation management: listening and reacting to generation changes in the cluster. We plug the service in, initializing it in main and test code, passing a reference to storage_service and having storage_service call the service (using the `after_join` method): the service only starts doing its job after the node joins the token ring (either on bootstrap or restart). Some parts of generation management still remain in storage_service: the bootstrap procedure, which happens inside storage_service, must also do some initialization regarding CDC generations, for example: on restart it must retrieve the latest known generation timestamp from disk; on bootstrap it must create a new generation and announce it to other nodes. The order of these operations w.r.t the rest of the startup procedure is important, hence the startup procedure is the only right place for them. We may try decoupling these services even more in follow-up PRs, but that requires a bit of careful reasoning. What this PR does is a low-hanging fruit. Still, what remains in storage_service is a small part of the entire CDC generation management logic; most of it has been moved to the new service. This includes listening for generation changes and updating the data structures for performing CDC log writes (cdc::metadata). Furthermore these handling functions now return futures (and are internally coroutines), where previously they required a seastar::async context. This PR is a prerequisite to fixing #7985. The fact that all the CDC generation management code was in storage_service is technical debt. It will be easier to modify the management algorithms when they sit in their own module. 
Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java Closes #8172 * github.com:scylladb/scylla: cdc: move (most of) CDC generation management code to the new service cdc: coroutinize make_new_cdc_generation cdc: coroutinize update_streams_description cdc: introduce cdc::generation_service main: move cdc_service initialization just prior to storage_service initialization	2021-02-26 12:42:27 +01:00
Piotr Sarna	c5214eb096	treewide: remove timeout config from query options Timeout config is now stored in each connection, so there's no point in tracking it inside each query as well. This patch removes timeout_config from query_options and follows by removing now unnecessary parameters of many functions and constructors.	2021-02-25 17:20:27 +01:00
Tomasz Grabiec	ecb6c56a2a	Merge 'lsa: background reclaim' from Avi Kivity This series adds background reclaim to lsa, with the goal that most large allocations can be satisfied from available free memory, and reclaim work can be done from a preemptible context. If the workload has free cpu, then background reclaim will utilize that free cpu, reducing latency for the main workload. Otherwise, background reclaim will compete with the main workload, but since that work needs to happen anyway, throughput will not be reduced. A unit test is added to verify it works. Fixes #1634. Closes #8044 * github.com:scylladb/scylla: test: logalloc_test: test background reclaim logalloc: reduce gap between std min_free and logalloc min_free logalloc: background reclaim logalloc: preemptible reclaim	2021-02-24 13:23:30 +01:00
Avi Kivity	15d3797e97	test: logalloc_test: test background reclaim Test that the background reclaimer is able to compete with a fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU" is fully randomized. If the background reclaimer is disabled, the test fails as soon as the 20MB "gap" is exhausted. With the reclaimer enabled, it is able to free memory ahead of the allocations.	2021-02-23 19:42:42 +02:00
Kamil Braun	d4937daaea	cdc: introduce cdc::generation_service This commit introduces a new service crafted to handle CDC generation management: listening and reacting to generation changes in the cluster. The implementation is a stub for now, the service reacts to generation changes by simply logging the event. The commit plugs the service in, initializing it in main and test code, passing a reference to storage_service and having storage_service start the service (using the `after_join` method): the service only starts doing its job after the node joins the token ring (either on bootstrap or restart).	2021-02-22 12:45:43 +01:00
Konstantin Osipov	95ee8e1b90	raft: fix spelling Fix spelling of a few comments.	2021-02-19 22:56:26 +03:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, for wide rows down to 2x times. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells). 
The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), pref(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Konstantin Osipov	32952a744a	raft: add a unit test for voting Test duplicate votes, votes from non-members and voting in joint configuration.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	132db931da	raft: add tracker test	2021-02-18 16:04:44 +03:00
Konstantin Osipov	6e3932bbc7	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	e58a3e42ca	raft: add a unit test for raft::log	2021-02-18 16:04:44 +03:00
Konstantin Osipov	cb035a7c8d	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-18 16:04:43 +03:00
Konstantin Osipov	97a16c0f77	raft: extend single_node_is_quiet test	2021-02-18 16:04:43 +03:00
Avi Kivity	f0950e023d	Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams. --- Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. We add an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. 
when some CDC log tables may contain data in these generations). Closes #8116 * github.com:scylladb/scylla: tests: add a simple CDC cql pytest cdc: add config option to disable streams rewriting cdc: rewrite streams to the new description table cql3: query_processor: improve internal paged query API cdc: introduce no_generation_data_exception exception type docs: cdc: mention system.cdc_local table cdc: coroutinize do_update_streams_description sys_dist_ks: split CDC streams table partitions into clustered rows cdc: use chunked_vector for streams in streams_version cdc: remove `streams_version::expired` field system_distributed_keyspace: use mutation API to insert CDC streams storage_service: don't use `sys_dist_ks` before it is started	2021-02-18 12:49:43 +02:00
Kamil Braun	4ef736a0a3	cql3: query_processor: improve internal paged query API The `query_processor::query` method allowed internal paged queries. However, it was quite limited, hardcoding a number of parameters: consistency level, timeout config, page size. This commit makes the following improvements: 1. Rename `query` to `query_internal` to make it obvious that this API is supposed to be used for internal queries only 2. Extend the method to take consistency level, timeout config, and page size as parameters 3. Remove unused overloads of `query_internal` 4. Fix a bunch of typos / grammar issues in the docstring	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
Tomasz Grabiec	f94f70cda8	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev/raft-confchange-test: raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::length() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-02-18 10:55:59 +01:00
Botond Dénes	c3b4c3f451	evictable_reader: reset _range_override after fast-forwarding `_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure. Fixes: #8059 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>	2021-02-17 19:11:00 +02:00
Benny Halevy	35256d1b92	treewide: explicitly use flat_mutation_reader_opt Unlike flat_mutation_reader_opt that is defined using optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate to `false` after being moved, only after it is explicitly reset. Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader> to make it easier to check if it was closed before it's destroyed or being assigned-over. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>	2021-02-17 17:57:34 +02:00
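The `std::optional` behaviour that motivates `35256d1b92` is easy to check in isolation. The sketch below uses a hypothetical `fake_reader` type in place of `flat_mutation_reader`; it only demonstrates the standard-library side of the argument, namely that a moved-from `std::optional` still reports that it holds a value until `reset()` is called.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-in for a move-only reader; not Scylla's type.
struct fake_reader {
    std::unique_ptr<int> impl = std::make_unique<int>(42);
};

// After a move, the source optional is still engaged: it holds a
// moved-from fake_reader, so a boolean check does not reveal the move.
inline bool engaged_after_move() {
    std::optional<fake_reader> src = fake_reader{};
    std::optional<fake_reader> dst = std::move(src);
    (void)dst;
    return src.has_value();
}

// Only an explicit reset() disengages the source optional.
inline bool engaged_after_reset() {
    std::optional<fake_reader> src = fake_reader{};
    std::optional<fake_reader> dst = std::move(src);
    (void)dst;
    src.reset();
    return src.has_value();
}

inline void optional_move_demo() {
    assert(engaged_after_move());    // still "true" after being moved from
    assert(!engaged_after_reset());  // "false" only after reset()
}
```

A wrapper that becomes empty when moved from, as the commit describes for `flat_mutation_reader_opt`, makes "was this reader closed or moved away?" answerable with a plain boolean test.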
Avi Kivity	c63e26e26f	Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8048 * github.com:scylladb/scylla: cdc: Limit size of topology description cdc: Extract create_stream_ids from topology_description_generator	2021-02-17 15:43:53 +02:00
Piotr Jastrzebski	649f254863	cdc: Limit size of topology description Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-02-17 13:24:40 +01:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Michał Chojnowski	6b8a69e01f	test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions The off-by-one error would cause test_multishard_combining_reader_non_strictly_monotonic_positions to fail if the added range_tombstones filled the buffer exactly to the end. In such situation, with the old loop condition, make_fragments_with_non_monotonic_positions would add one range_tombstone too many to the deque, violating the test assumptions.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	5b79d6ca4c	test: mutation_test: remove an obsolete assertion Due to small value optimizations, the removed assertions are not true in general. Until now, atomic_cell did not use small value optimizations, but it will after upcoming changes.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	aa60f28a09	test: mutation_test: initialize an uninitialized variable It was assumed to be zero-initialized, but C++ does not guarantee that. It has to be initialized explicitly.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	52bd190bb3	test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test sstable_run_based_compaction_test assumed that sstables are freed immediately after they are fully processed. However, since commit `b524f96a74`, mutation_reader_merger releases sstables in batches of 4, which breaks the assumption. This fix adjusts the test accordingly. Until now, the test only kept working by chance: by coincidence, the number of test sstables processed by merging_reader in a single fill_buffer() call was divisible by 4. Since the test checks happen between those calls, the test never witnessed a situation when an sstable was fully processed, but not released yet. The error was noticed during the work on an upcoming patch which changes the size of mutation_fragment, and reduces the number of test sstables processed in a single fill_buffer() call, which breaks the test.	2021-02-16 21:35:14 +01:00
Konstantin Osipov	d293966366	raft: add a unit test for voting Test duplicate votes, votes from non-members and voting in joint configuration.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	1bdb3fc8a9	raft: add tracker test	2021-02-16 23:15:16 +03:00
Konstantin Osipov	63965f46f4	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	6ee3aedcc2	raft: add a unit test for raft::log	2021-02-16 23:12:01 +03:00
Konstantin Osipov	6c14775b20	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-16 21:05:44 +03:00
Pavel Emelyanov	aa85bc790b	test: Add tests for radix tree Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Tomasz Grabiec	508f928220	tests: sstables: Test sstable write fails on missing partition_end mid-stream Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>	2021-02-15 15:45:49 +02:00
Wojciech Mitros	693b4e0fcd	sstables: move column_family_test class from test/boost to test/lib Column_family_test allows performing private methods on column_family's sstable_set. It may be useful not only in the boost tests, so it's moved from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh. sstable_test.hh includes sstable_test_env.hh, so no includes need to be changed. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	aa0cd940d6	sstables: replace the sstable_set with a versioned structure Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that allows copying without actually copying all the sstables in the set, while providing the same methods (and some extra) without majorly decreasing their speed. This is achieved by associating all copies with sstable_set versions which hold the changes that were performed in them, and references to the versions that were copied, a.k.a. their parents. The set represented by a version is the result of combining all changes of its ancestors. This causes most methods of the version to have a time complexity dependent on the number of its ancestors. To limit this number, versions that represent copies that have already been deleted are merged with their descendants. The strategy used for deciding when and with which of its children should a version be merged heavily depends on the use case of sstable_sets: there is a main copy of the set in a table class which undergoes many insertions and deletions, and there are copies of it in compaction or mutation readers which are further copied or edited few or zero times. It's worth mentioning that when a copy is made, the copied set should not be modified anymore, because it would also modify the results given by the copy. In order to still allow modifying the copied set, if a change is to be performed on it, the version associated with this set is replaced with a new version depending on the previous one. As we can see, in our use case there is a main chain of versions (with changes from the table), and smaller branches of versions that start from a version from this chain, but are deleted soon after.
In such case we can merge a version when it has exactly one descendant, as this limits the number of concurrent ancestors of a version to the number of copies of its ancestors are concurrently used. During each such merge, the parent version is removed and the child version is modified so that all operations on it give the same results. In order to preserve the same interface, the sstable_set still contains a lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for unordered_set<shared_sstable>) is now a new structure. Each sstable_set contains a sstable_list but not every sstable_list has to be contained by a sstable_set, and we also want to allow fast copying of sstable_lists, so the reference to the sstable_set_version is kept by the sstable_lists and the sstable_set can access the sstable_set_version it's associated with through its sstable_list. Accessing sstables that are elements of a certain sstable_set copy(so the select, select_sstable_runs and sstable_list's iterator) get results from containers that hold all sstables from all versions(which are stored in a single, shared "versioned_sstable_set_data" structure), and then filter out these sstables that aren't present in the version in question. This version of the sstable_set allows adding and erasing the same sstable repeatedly. Inserting and erasing from the set modifies the containers in a version only when it has an actual effect: if an sstable has been added in the parent version, and hasn't been erased in the child version, adding it again will have no effect. This ensures that when merging versions, the versions have disjoint sets of added, and erased sstables (an sstable can still be added in one and erased in the second). 
It's worth noting that if an sstable has been added in one of the merged sets and erased in the second, the version that remains after merging doesn't need to have any info about the sstable's inclusion in the set - it can be inferred from the changes in previous versions (and it doesn't matter if the sstable has been erased before or after being added). To release pointers to sstables as soon as possible (i.e. when all references to versions that contain them die), if an sstable is added/erased in all child versions that are based on a version which has no external references, this change gets removed from these versions and added to the parent version. If an sstable's insertion gets overwritten as a result, we might be able to remove the sstable completely from the set. We know how many times this needs to happen by counting, for each sstable, in how many different versions it has been added. When a change that adds an sstable gets merged with a change that removes it, or when such a change simply gets deleted alongside its associated version, this count is reduced, and when an sstable gets added to a version that doesn't already contain it, this count is increased. The methods that modify the set's contents give a strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. Fixes #2622 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Avi Kivity	9cbbf40710	Merge "register_inactive_read: error handling" from Benny " Currently, register_inactive_read accepts an eviction_notify_handler to be called when the inactive_read is evicted. However, in case there was an error in register_inactive_read the notification function isn't called, leaving behind state that needs to be cleaned up. This series separates the register_inactive_reader interface into 2 parts: 1. register_inactive_reader(flat_mutation_reader) - which just registers the reader and returns an inactive_read_handle, if permitted. Otherwise, the notification handler is not called (it is not known yet) and the caller is not expected to do anything fancy at this point that will require cleanup. This optimizes the server when overloaded since we do less work that we'd need to undo in case the reader_concurrency_semaphore runs out of resources. 2. After register_inactive_reader succeeded to return a valid inactive_read_handle, the caller sets up its local state and may call `set_notify_handler` to set the optional notify_handler and ttl on the i_r_h. After this state, the notify_handler will be called when the inactive_reader is evicted, for any reason. querier_cache::insert_querier was modified to use the above procedure and to handle (and log/ignore) any error in the process. inactive_read_handle and inactive_read keeping track of each other was simplified by keeping an iterator in the handle and a backpointer in the inactive_read object. The former is used to evict the reader and to set the notify_handler and/or ttl without having to lookup the i_r. The latter is used to invalidate the i_r_h when the i_r is destroyed.
Test: unit(release), querier_cache_test(debug) " * tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla: querier_cache: insert_querier: ignore errors to register inactive reader querier_cache: insert_querier: handle errors querier_utils: mark functions noexcept reader_concurrency_semaphore: register_inactive_read: make noexcept reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional reader_concurrency_semaphore: inactive_read: use intrusive list reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged reader_concurrency_semaphore: inactive_read_handle: swap definition order reader_lifecycle_policy: retire low level try_resume method reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader	2021-02-10 19:09:21 +02:00
Konstantin Osipov	41387225c3	raft: extend single_node_is_quiet test	2021-02-09 17:04:13 +03:00
Piotr Sarna	4acc6fecf0	Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz The same trick is used as in C*: `79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)` The edited CQL test relied on quietly accepting non-existing DCs, so it had to be removed. Also, one boost-test referred to nonexistent `datacenter2` and had to be removed. Fixes #7595 Closes #8056 github.com:scylladb/scylla: tests: Adjusted tests for DC checking in NTS locator: Check DC names in NTS	2021-02-09 14:45:20 +02:00

820 Commits