scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-08 07:53:20 +00:00

Author	SHA1	Message	Date
Avi Kivity	8747c684e0	Merge 'Move timeouts to client state' from Piotr Sarna This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests. The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867). Closes #8140 * github.com:scylladb/scylla: treewide: remove timeout config from query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options service: add timeout config to client state	2021-03-01 20:34:35 +02:00
Tomasz Grabiec	cb0b8d1903	row_cache: Zap dummy entries when populating or reading a range This will prevent accumulation of unnecessary dummy entries. A single-partition populating scan with clustering key restrictions will insert dummy entries positioned at the boundaries of the clustering query range to mark the newly populated range as continuous. Those dummy entries may accumulate with time, increasing the cost of the scan, which needs to walk over them. In some workloads we could prevent this. If a populating query overlaps with dummy entries, we could erase the old dummy entry since it will not be needed, it will fall inside a broader continuous range. This will be the case for time series workloads which scan with a decreasing (newest) lower bound. Refs #8153. _last_row is now updated atomically with _next_row. Before, _last_row was moved first. If an exception was thrown and the section was retried, this could cause the wrong entry to be removed (new next instead of old last) by the new algorithm. I don't think this was causing problems before this patch. The problem is not solved for all the cases. After this patch, we remove dummies only when there is a single MVCC version. We could patch apply_monotonically() to also do it, so that dummies which are inside continuous ranges are eventually removed, but this is left for later. perf_row_cache_reads output after that patch shows that the second scan touches no dummies: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 265320 Scanning read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB] read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB] Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>	2021-03-01 20:34:35 +02:00
Avi Kivity	31909515b3	Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that keeps all its versions that are referenced somewhere and provides a way of getting a reference to an immutable version of the set. Each sstable in the set is associated with the versions it is alive in, and is removed when all such versions don't have references anymore. To avoid copying, the object holding all sstables in the set version is changed to a new structure, sstable_list, which was previously an alias for std::unordered_set<shared_sstable>, and which implements most of the methods of an unordered_set, but its iterator uses the actual set with all sstables from all referenced versions and iterates over those sstables that belong to the captured version. The methods that modify the set's contents give a strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. To release shared_sstables as soon as possible (i.e. when all references to versions that contain them die), each time a version is removed, all sstables that were referenced exclusively by this version are erased. We are able to find these sstables efficiently by storing, for each version, all sstables that were added and erased in it, and, when a version is removed, merging it with the next one. When a version that adds an sstable gets merged with a version that removes it, this sstable is erased. 
Fixes #2622 Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.com Closes #8111 * github.com:scylladb/scylla: sstables: add test for checking the latency of updating the sstable_set in a table sstables: move column_family_test class from test/boost to test/lib sstables: use fast copying of the sstable_set instead of rebuilding it sstables: replace the sstable_set with a versioned structure sstables: remove potential ub sstables: make sstable_set constructor less error-prone	2021-03-01 14:16:36 +02:00
Avi Kivity	ef97adc72a	Merge "Validate token monotonicity on the sstable write path" from Botond " We have recently seen out-of-order partitions getting into sstables causing major disruption later on. Given the damage caused, it was again raised that we should enable partition key monotonicity validation unconditionally in the sstable write path. This was also raised in the past but dismissed as key validation was suspected (but not measured) to add considerable per-fragment overhead. One of the problems was that the key monotonicity validation was all or nothing. It either validated all (clustering and partition) key monotonicity or none of it. This series takes a second look at this and solves the all-or-nothing problem by making the configuration of the key monotonicity check more fine grained, allowing for enabling just token monotonicity validation separately, then enables it unconditionally. Refs: #7623 Tests: unit(release) " * 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla: sstables: enable token monotonicity validation by default mutation_fragment_stream_validator: add token validation level mutation_fragment_stream_validating_filter: make validation levels more fine-grained	2021-03-01 11:23:51 +02:00
Botond Dénes	694f8a4ec6	mutation_fragment_stream_validating_filter: make validation levels more fine-grained Currently key order validation for the mutation fragment stream validating filter is all or nothing. Either no keys (partition or clustering) are validated or all of them. As we suspect that clustering key order validation would add a significant overhead, this discourages turning key validation on, which means we miss out on partition key monotonicity validation which has a much more moderate cost. This patch makes this configurable in a more fine-grained fashion, providing separate levels for partition and clustering key monotonicity validation. As the choice for the default validation level is not as clear-cut as before, the default value for the validation level is removed in the validating filter's constructor.	2021-03-01 07:49:23 +02:00
Avi Kivity	d980f550d1	Merge 'row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows' from Tomasz Grabiec fill_buffer() will keep scanning until _lower_bound_changed is true, even if preemption is signaled, so that the reader makes forward progress. Before the patch, we did not update _lower_bound on touching a dummy entry. The read will not respect preemption until we hit a non-dummy row. If there is a lot of dummy rows, that can cause reactor stalls. Fix that by updating _lower_bound on dummy entries as well. Refs #8153. Tested with perf_row_cache_reads: ``` $ build/release/test/perf/perf_row_cache_reads -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 373929 Scanning read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB] read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB] ``` Notice that max preemption latency is low in the second "read:" line. Closes #8167 * github.com:scylladb/scylla: row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows tests: perf: Introduce perf_row_cache_reads row_cache: Add metric for dummy row hits	2021-02-28 21:00:20 +02:00
Botond Dénes	1d9b5911fe	time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free The optimal path of said method mistakenly captures `pos` (a local variable) in its reader factory method and passes a temporary range implicitly constructed from said `pos` as the range parameter to the sstable reader. This will lead to the sstable reader using a dangling range and will result in returning no result for queries. This patch fixes this bug and adds a unit test to cover this code path. Fixes #8138. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>	2021-02-26 23:57:25 +02:00
Botond Dénes	dd5a601aaa	result_memory_accounter: abort unpaged queries hitting the global limit The `result_memory_accounter` terminates a query if it reaches either the global or shard-local limit. This used to be so only for paged queries, unpaged ones could grow indefinitely (until the node OOM'd). This was changed in `fea5067` which enforces the local limit on unpaged queries as well, by aborting them. However a loophole remained in the code: `result_memory_accounter::check_and_update()` has another stop condition, besides `check_local_limit()`, it also checks the global limit. This stop condition was not updated to enforce itself on unpaged queries by aborting them, instead it silently terminated them, causing them to return less data than requested. This was masked by most queries reaching the local limit first. This patch fixes this by aborting unpaged mutation queries when they hit the global limit. Fixes: #8162 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>	2021-02-26 23:43:16 +02:00
Botond Dénes	bc1fcd3db2	multishard_combining_reader: only read from needed shards The multishard combining reader currently assumes that all shards have data for the read range. This however is not always true and in extreme cases (like reading a single token) it can lead to huge read amplification. Avoid this by not pushing shards to `_shard_selection_min_heap` if the first token they are expected to produce falls outside of the read range. Also change the read ahead algorithm to select the shards from `_shard_selection_min_heap`, instead of walking them in shard order. This was wrong in two ways: * Shards may be ordered differently with respect to the first partition they will produce; reading ahead on the next shard in shard order might not bring in data on the next shard the read will continue on. Shard order is only correct when starting a new range and shards are iterated over in the order they own tokens according to the sharding algorithm. * Shards that may not have data relevant to the read range are also considered for read ahead. After this patch, the multishard reader will only read from shards that have data relevant to the read range, both in the case of normal reads and also for read-ahead. Fixes: #8161 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>	2021-02-26 23:29:20 +02:00
Piotr Sarna	0e0282cdf1	Merge ' cdc: move (most of) CDC generation management to a new service' from Kamil Braun Currently all management of CDC generations happens in storage_service, which is a big ball of mud that does many unrelated things. This PR introduces a new service crafted to handle CDC generation management: listening and reacting to generation changes in the cluster. We plug the service in, initializing it in main and test code, passing a reference to storage_service and having storage_service call the service (using the `after_join` method): the service only starts doing its job after the node joins the token ring (either on bootstrap or restart). Some parts of generation management still remain in storage_service: the bootstrap procedure, which happens inside storage_service, must also do some initialization regarding CDC generations, for example: on restart it must retrieve the latest known generation timestamp from disk; on bootstrap it must create a new generation and announce it to other nodes. The order of these operations w.r.t the rest of the startup procedure is important, hence the startup procedure is the only right place for them. We may try decoupling these services even more in follow-up PRs, but that requires a bit of careful reasoning. What this PR does is a low-hanging fruit. Still, what remains in storage_service is a small part of the entire CDC generation management logic; most of it has been moved to the new service. This includes listening for generation changes and updating the data structures for performing CDC log writes (cdc::metadata). Furthermore these handling functions now return futures (and are internally coroutines), where previously they required a seastar::async context. This PR is a prerequisite to fixing #7985. The fact that all the CDC generation management code was in storage_service is technical debt. It will be easier to modify the management algorithms when they sit in their own module. 
Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java Closes #8172 * github.com:scylladb/scylla: cdc: move (most of) CDC generation management code to the new service cdc: coroutinize make_new_cdc_generation cdc: coroutinize update_streams_description cdc: introduce cdc::generation_service main: move cdc_service initialization just prior to storage_service initialization	2021-02-26 12:42:27 +01:00
Kamil Braun	e2f03e4aba	cdc: move (most of) CDC generation management code to the new service Currently all management of CDC generations happens in storage_service, which is a big ball of mud that does many unrelated things. Previous commits have introduced a new service for managing CDC generations. This code moves most of the relevant code to this new service. However, some part still remains in storage_service: the bootstrap procedure, which happens inside storage_service, must also do some initialization regarding CDC generations, for example: on restart it must retrieve the latest known generation timestamp from disk; on bootstrap it must create a new generation and announce it to other nodes. The order of these operations w.r.t the rest of the startup procedure is important, hence the startup procedure is the only right place for them. Still, what remains in storage_service is a small part of the entire CDC generation management logic; most of it has been moved to the new service. This includes listening for generation changes and updating the data structures for performing CDC log writes (cdc::metadata). Furthermore these functions now return futures (and are internally coroutines), where previously they required a seastar::async context.	2021-02-26 12:06:12 +01:00
Tomasz Grabiec	52e411df36	tests: perf: Introduce perf_row_cache_reads Tests performance of various read patterns from the row cache. Example: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Filling memtable Rows in cache: 0 Populating with dummy rows Rows in cache: 373929 Scanning read: 156.288986 [ms], preemption: {count: 702, 99%: 0.545791 [ms], max: 0.537537 [ms]}, cache: 99/100 [MB] read: 106.480766 [ms], preemption: {count: 6, 99%: 0.006866 [ms], max: 106.496168 [ms]}, cache: 99/100 [MB]	2021-02-26 01:20:38 +01:00
Piotr Sarna	c5214eb096	treewide: remove timeout config from query options Timeout config is now stored in each connection, so there's no point in tracking it inside each query as well. This patch removes timeout_config from query_options and follows by removing now unnecessary parameters of many functions and constructors.	2021-02-25 17:20:27 +01:00
Piotr Sarna	7ceafda70a	service: add timeout config to client state Future patches will use this per-connection timeout config to allow setting different timeouts for each session, based on roles.	2021-02-25 17:20:26 +01:00
Nadav Har'El	750d7903be	cql-pytest: fix some comments in util.py Fix some incorrect comments, pasted from other files or mentioning wrong names. No other changes except comments Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210225133237.1403891-1-nyh@scylladb.com>	2021-02-25 16:00:20 +02:00
Tomasz Grabiec	ecb6c56a2a	Merge 'lsa: background reclaim' from Avi Kivity This series adds background reclaim to lsa, with the goal that most large allocations can be satisfied from available free memory, and reclaim work can be done from a preemptible context. If the workload has free cpu, then background reclaim will utilize that free cpu, reducing latency for the main workload. Otherwise, background reclaim will compete with the main workload, but since that work needs to happen anyway, throughput will not be reduced. A unit test is added to verify it works. Fixes #1634. Closes #8044 * github.com:scylladb/scylla: test: logalloc_test: test background reclaim logalloc: reduce gap between std min_free and logalloc min_free logalloc: background reclaim logalloc: preemptible reclaim	2021-02-24 13:23:30 +01:00
Avi Kivity	15d3797e97	test: logalloc_test: test background reclaim Test that the background reclaimer is able to compete with a fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU" is fully randomized. If the background reclaimer is disabled, the test fails as soon as the 20MB "gap" is exhausted. With the reclaimer enabled, it is able to free memory ahead of the allocations.	2021-02-23 19:42:42 +02:00
Nadav Har'El	d905e71a90	Alternator: add support for CORS protocol This patch adds to Alternator support for the CORS (Cross-Origin Resource Sharing) protocol - a simple extension over the HTTP protocol which browsers use when Javascript code contacts HTTP-based servers. Although we usually think of Alternator as being used in a three-tier application, in some setups there is no middle layer and the user's browser, running Javascript code, wants to communicate directly with the database. However, for security reasons, by default Javascript loaded from domain X is not allowed to communicate with different domains Y. The CORS protocol is meant to allow this, and Alternator needs to participate in this protocol if it is to be used directly from Javascript in browsers. To implement CORS, Alternator needs to respond to the OPTIONS method which it didn't allow before - with certain headers based on the input headers. It also needs to do some of these things for the regular methods (mostly, POST). The patch includes a comprehensive test that runs against both Alternator and DynamoDB and shows that Alternator handles these headers and methods the same as DynamoDB. Additionally, I tested manually a Javascript DynamoDB client - which didn't work prior to this patch (the browser reported CORS errors), and works after this patch. Fixes #8025. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>	2021-02-23 13:15:03 +01:00
Kamil Braun	d4937daaea	cdc: introduce cdc::generation_service This commit introduces a new service crafted to handle CDC generation management: listening and reacting to generation changes in the cluster. The implementation is a stub for now, the service reacts to generation changes by simply logging the event. The commit plugs the service in, initializing it in main and test code, passing a reference to storage_service and having storage_service start the service (using the `after_join` method): the service only starts doing its job after the node joins the token ring (either on bootstrap or restart).	2021-02-22 12:45:43 +01:00
Kamil Braun	8e72c33d7c	main: move cdc_service initialization just prior to storage_service initialization As a preparation for introducing CDC generation management service. cdc_service will depend on the generation service. But the generation service needs some other services to work properly. In particular, it uses the local database, so it should be initialized after the local database. The only service that will need the cdc generation service is storage_service, so we can place the generation service initialization code right before storage_service initialization code. So the order will be cdc_generation_service -> cdc_service -> storage_service.	2021-02-22 12:43:10 +01:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, for wide rows down to 2x times. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells). 
The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), pref(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Nadav Har'El	02dde2aca1	cql-pytest: port Cassandra's unit test validation/entities/json_test In this patch, we port validation/entities/json_test.java, containing 21 tests for various JSON-related operations - SELECT JSON, INSERT JSON, and the fromJson() and toJson() functions. In porting these tests, I uncovered 19 (!!) previously unknown bugs in Scylla: Refs #7911: Failed fromJson() should result in FunctionFailure error, not an internal error. Refs #7912: fromJson() should allow null parameter. Refs #7914: fromJson() integer overflow should cause an error, not silent wrap-around. Refs #7915: fromJson() should accept "true" and "false" also as strings. Refs #7944: fromJson() should not accept the empty string "" as a number. Refs #7949: fromJson() fails to set a map<ascii, int>. Refs #7954: fromJson() fails to set null tuple elements. Refs #7972: toJson() truncates some doubles to integers. Refs #7988: toJson() produces invalid JSON for columns with "time" type. Refs #7997: toJson() is missing a timezone on timestamp. Refs #8001: Documented unit "µs" not supported for assigning a "duration" type. Refs #8002: toJson() of decimal type doesn't use exponents so can produce huge output. Refs #8077: SELECT JSON output for function invocations should be compatible with Cassandra. Refs #8078: SELECT JSON ignores the "AS" specification. Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest error, not internal error. Refs #8086: INSERT JSON cannot handle user-defined types with case- sensitive component names. Refs #8087: SELECT JSON incorrectly quotes strings inside map keys. Refs #8092: SELECT JSON missing null component after adding field to UDT definition. Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY. Due to these bugs, 8 out of the 21 tests here currently xfail and one has to be skipped (issue #8100 causes the sanitizer to detect a use after free, and crash Scylla). 
As usual in these sort of tests, all 21 tests pass when running against Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>	2021-02-18 20:44:04 +02:00
Avi Kivity	f0950e023d	Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams. --- Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. We add an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. 
when some CDC log tables may contain data in these generations). Closes #8116 * github.com:scylladb/scylla: tests: add a simple CDC cql pytest cdc: add config option to disable streams rewriting cdc: rewrite streams to the new description table cql3: query_processor: improve internal paged query API cdc: introduce no_generation_data_exception exception type docs: cdc: mention system.cdc_local table cdc: coroutinize do_update_streams_description sys_dist_ks: split CDC streams table partitions into clustered rows cdc: use chunked_vector for streams in streams_version cdc: remove `streams_version::expired` field system_distributed_keyspace: use mutation API to insert CDC streams storage_service: don't use `sys_dist_ks` before it is started	2021-02-18 12:49:43 +02:00
Kamil Braun	4bf28aad7a	tests: add a simple CDC cql pytest	2021-02-18 11:44:59 +01:00
Kamil Braun	4ef736a0a3	cql3: query_processor: improve internal paged query API The `query_processor::query` method allowed internal paged queries. However, it was quite limited, hardcoding a number of parameters: consistency level, timeout config, page size. This commit makes the following improvements: 1. Rename `query` to `query_internal` to make it obvious that this API is supposed to be used for internal queries only 2. Extend the method to take consistency level, timeout config, and page size as parameters 3. Remove unused overloads of `query_internal` 4. Fix a bunch of typos / grammar issues in the docstring	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
Kamil Braun	3d7b990300	system_distributed_keyspace: use mutation API to insert CDC streams The `storage_proxy::mutate` low-level API is much more powerful than the CQL API. This power is not needed for this commit but for the next.	2021-02-18 11:44:59 +01:00
Tomasz Grabiec	f94f70cda8	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev/raft-confchange-test: raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::length() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-02-18 10:55:59 +01:00
Botond Dénes	c3b4c3f451	evictable_reader: reset _range_override after fast-forwarding `_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure. Fixes: #8059 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>	2021-02-17 19:11:00 +02:00
Benny Halevy	35256d1b92	treewide: explicitly use flat_mutation_reader_opt Unlike flat_mutation_reader_opt that is defined using optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate to `false` after being moved, only after it is explicitly reset. Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader> to make it easier to check if it was closed before it's destroyed or being assigned-over. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>	2021-02-17 17:57:34 +02:00
Avi Kivity	c63e26e26f	Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8048 * github.com:scylladb/scylla: cdc: Limit size of topology description cdc: Extract create_stream_ids from topology_description_generator	2021-02-17 15:43:53 +02:00
Piotr Jastrzebski	649f254863	cdc: Limit size of topology description Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-02-17 13:24:40 +01:00
Avi Kivity	001652815c	Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). 
Fixes: #5578 Closes #8106 * github.com:scylladb/scylla: imr: switch back to open-coded description of structures utils: managed_bytes: add a few trivial helper methods utils: fragment_range: move FragmentedView helpers to fragment_range.hh utils: fragment_range: add single_fragmented_mutable_view utils: fragment_range: implement FragmentRange for fragment_range utils: mutable_view: add front() types: remove an unused helper function test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions test: mutation_test: remove an obsolete assertion test: mutation_test: initialize an uninitialized variable test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test	2021-02-17 13:40:16 +02:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Michał Chojnowski	6b8a69e01f	test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions The off-by-one error would cause test_multishard_combining_reader_non_strictly_monotonic_positions to fail if the added range_tombstones filled the buffer exactly to the end. In such situation, with the old loop condition, make_fragments_with_non_monotonic_positions would add one range_tombstone too many to the deque, violating the test assumptions.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	5b79d6ca4c	test: mutation_test: remove an obsolete assertion Due to small value optimizations, the removed assertions are not true in general. Until now, atomic_cell did not use small value optimizations, but it will after upcoming changes.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	aa60f28a09	test: mutation_test: initialize an uninitialized variable It was assumed to be zero-initialized, but C++ does not guarantee that. It has to be initialized explicitly.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	52bd190bb3	test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test sstable_run_based_compaction_test assumed that sstables are freed immediately after they are fully processed. However, since commit `b524f96a74`, mutation_reader_merger releases sstables in batches of 4, which breaks the assumption. This fix adjusts the test accordingly. Until now, the test only kept working by chance: by coincidence, the number of test sstables processed by merging_reader in a single fill_buffer() call was divisible by 4. Since the test checks happen between those calls, the test never witnessed a situation when an sstable was fully processed, but not released yet. The error was noticed during the work on an upcoming patch which changes the size of mutation_fragment, and reduces the number of test sstables processed in a single fill_buffer() call, which breaks the test.	2021-02-16 21:35:14 +01:00
Konstantin Osipov	d293966366	raft: add a unit test for voting Test duplicate votes, votes from non-members and voting in joint configuration.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	1bdb3fc8a9	raft: add tracker test	2021-02-16 23:15:16 +03:00
Konstantin Osipov	63965f46f4	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	6ee3aedcc2	raft: add a unit test for raft::log	2021-02-16 23:12:01 +03:00
Konstantin Osipov	6c14775b20	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-16 21:05:44 +03:00
Nadav Har'El	946e63ee6e	cql-pytest: remove "xfail" tag from two passing tests Issue #7595 was already fixed last week, in commit `b6fb5ee912`, so the two tests which failed because of this issue no longer fail and their "xfail" tag can be removed. Refs #7595. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>	2021-02-16 19:17:22 +02:00
Nadav Har'El	737c1c6cc7	cql-pytest: Additional JSON tests This patch adds several additional tests to test/cql-pytest/test_json.py to reproduce additional bugs or clarify some non-bugs. First, it adds a reproducer for issue #8087, where SELECT JSON may create invalid JSON - because it doesn't quote a string which is part of a map's key. As usual for these reproducers, the test passes on Cassandra, and fails on Scylla (so marked xfail). We have a bigger test translated from Cassandra's unit tests, cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys which demonstrates the same problem, but the test added in this patch is much shorter and focuses on demonstrating exactly where the problem is. Second, this patch adds a test that verifies that SELECT JSON works correctly for UDTs or tuples where one of their components was never set - in such a case the SELECT JSON should also output this component, with a "null" value. And this test works (i.e., produces the same result in Cassandra and Scylla). This test is interesting because it shows that issue #8092 is specific to the case of an altered UDT, and doesn't happen for every case of null component in a UDT. Refs #8087 Refs #8092 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>	2021-02-16 16:05:31 +01:00
Benny Halevy	50ca693a02	main: disable stall detector during startup We see long reactor stalls from `logalloc::prime_segment_pool` in debug mode yet the stall detector's purpose is to detect reactor stalls during normal operation where they can increase the latency of other queries running in parallel. Since this change doesn't actually fix the stalls but rather hides them, the following annotations will just reference the respective github issues rather than auto-close them. Refs #7150 Refs #5192 Refs #5960 Restore blocked_reactor_notify_ms right before starting storage_proxy. Once storage_proxy is up, this node affects cluster latency, and so stalls should be reported so they can be fixed. Test: secondary_index_test --blocked-reactor-notify-ms 1 (release) DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>	2021-02-16 13:28:31 +02:00
Pavel Emelyanov	9baf1226dc	test/memory_footpring: Print radix tree node sizes After switching cells storage onto compact radix tree it becomes useful to know the tree nodes' sizes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:41:09 +03:00
Pavel Emelyanov	1bdfa355ea	row: Remove old storages

Now when the 3rd storage type (radix tree) is all in, old storage can be safely removed. The result is:

1. memory footprint

   sizeof(class row): 112 => 16 bytes
   sizeof(rows_entry): 126 => 120 bytes

   the "in cache" value depends on the number of cells:

   num of cells   master   patch
              1      752     656
              2      808     712
              3      864     768
              4      920     824
              5      968     936
              6     1136     992
            ...
             16     1840    1672
             17     1904    1992 (+88)
             18     1976    2048 (+72)
             19     2048    2104 (+56)
             20     2120    2160 (+40)
             21     2184    2208 (+24)
             22     2256    2264 ( +8)
             23     2328    2320
            ...
             32     2960    2808

   After 32 cells the storage switches into rbtree with 24-bytes per-cell overhead and the radix tree improvement rocketlaunches

             64     7872    6056
            128    15040    9512
            256    29376   18568

2. perf_mutation test is enhanced by this series and the results differ depending on the number of columns used

                        tps value
   --column-count   master   patch
                1    59.9k   57.6k (-3.8%)
                2    59.9k   57.5k
                4    59.8k   57.6k
                8    57.6k   57.7k <- eq
               16    56.3k   57.6k
               32    53.2k   57.4k (+7.9%)

A note on this. Last time 1-column test was ~5% worse which was explained by inline storage of 5 cells that's present on current implementation and was absent in radix tree. An attempt to make inline storage for small radix trees resulted in complete loss of memory footprint gain, but gave fraction of percent to perf_mutation performance. So this version doesn't have inline nodes. The 1.2% improvement from v2 surprisingly came from the tree::clone_from() which in v2 was work-around-ed by slow walk+emplace sequence while this version has the optimized API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:35:06 +03:00
Pavel Emelyanov	aa85bc790b	test: Add tests for radix tree Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Tomasz Grabiec	f86108aef1	Merge "raft: move ticking to external code" from Alejo As Gleb suggested in a previous review, remove ticker from raft and leave calling tick() to external code. While there, tick faster to speed up tests. * https://github.com/alecco/scylla/tree/tests-17-remove-ticker: raft: replication test: reduce ticker from 100ms to 1ms raft: drop ticker from raft	2021-02-15 18:14:03 +02:00
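
Note on commit 35256d1b92 ("treewide: explicitly use flat_mutation_reader_opt"): the message relies on the fact that a moved-from `std::optional<T>` still reports that it holds a value. The following is a minimal C++ sketch of that standard-library behaviour only. `fake_reader` and the helper function names are hypothetical stand-ins invented for illustration; they are not Scylla code, and `optimized_optional` itself is not reproduced here.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a movable resource such as a reader.
struct fake_reader {
    std::string name;
};

// Moving out of a std::optional transfers the contained value but leaves
// the source optional engaged (it now holds a moved-from object).
inline bool engaged_after_move() {
    std::optional<fake_reader> src{fake_reader{"r1"}};
    std::optional<fake_reader> dst = std::move(src);
    (void)dst;
    return src.has_value();
}

// Only an explicit reset() makes the source evaluate to false.
inline bool engaged_after_reset() {
    std::optional<fake_reader> src{fake_reader{"r1"}};
    std::optional<fake_reader> dst = std::move(src);
    (void)dst;
    src.reset();
    return src.has_value();
}

// Convenience check combining both observations.
inline void check_all() {
    assert(engaged_after_move());
    assert(!engaged_after_reset());
}
```

Under these assumptions, a check such as `if (reader_opt)` cannot tell a moved-from `std::optional` apart from a live one, which is the gap the commit message says `flat_mutation_reader_opt` closes.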

1294 Commits