scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 22:25:48 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	ecb6c56a2a	Merge 'lsa: background reclaim' from Avi Kivity This series adds background reclaim to lsa, with the goal that most large allocations can be satisfied from available free memory, and reclaim work can be done from a preemptible context. If the workload has free cpu, then background reclaim will utilize that free cpu, reducing latency for the main workload. Otherwise, background reclaim will compete with the main workload, but since that work needs to happen anyway, throughput will not be reduced. A unit test is added to verify it works. Fixes #1634. Closes #8044 * github.com:scylladb/scylla: test: logalloc_test: test background reclaim logalloc: reduce gap between std min_free and logalloc min_free logalloc: background reclaim logalloc: preemptible reclaim	2021-02-24 13:23:30 +01:00
Avi Kivity	15d3797e97	test: logalloc_test: test background reclaim Test that the background reclaimer is able to compete with a fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU" is fully randomized. If the background reclaimer is disabled, the test fails as soon as the 20MB "gap" is exhausted. With the reclaimer enabled, it is able to free memory ahead of the allocations.	2021-02-23 19:42:42 +02:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, for wide rows down to 2x times. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells). 
The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), pref(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Avi Kivity	f0950e023d	Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams. --- Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. We add an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. 
when some CDC log tables may contain data in these generations). Closes #8116 * github.com:scylladb/scylla: tests: add a simple CDC cql pytest cdc: add config option to disable streams rewriting cdc: rewrite streams to the new description table cql3: query_processor: improve internal paged query API cdc: introduce no_generation_data_exception exception type docs: cdc: mention system.cdc_local table cdc: coroutinize do_update_streams_description sys_dist_ks: split CDC streams table partitions into clustered rows cdc: use chunked_vector for streams in streams_version cdc: remove `streams_version::expired` field system_distributed_keyspace: use mutation API to insert CDC streams storage_service: don't use `sys_dist_ks` before it is started	2021-02-18 12:49:43 +02:00
Kamil Braun	4ef736a0a3	cql3: query_processor: improve internal paged query API The `query_processor::query` method allowed internal paged queries. However, it was quite limited, hardcoding a number of parameters: consistency level, timeout config, page size. This commit does the following improvements: 1. Rename `query` to `query_internal` to make it obvious that this API is supposed to be used for internal queries only 2. Extend the method to take consistency level, timeout config, and page size as parameters 3. Remove unused overloads of `query_internal` 4. Fix a bunch of typos / grammar issues in the docstring	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
Tomasz Grabiec	f94f70cda8	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev/raft-confchange-test: raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::length() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-02-18 10:55:59 +01:00
Botond Dénes	c3b4c3f451	evictable_reader: reset _range_override after fast-forwarding `_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure. Fixes: #8059 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>	2021-02-17 19:11:00 +02:00
Benny Halevy	35256d1b92	treewide: explicitly use flat_mutation_reader_opt Unlike flat_mutation_reader_opt that is defined using optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate to `false` after being moved, only after it is explicitly reset. Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader> to make it easier to check if it was closed before it's destroyed or being assigned-over. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>	2021-02-17 17:57:34 +02:00
Avi Kivity	c63e26e26f	Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8048 * github.com:scylladb/scylla: cdc: Limit size of topology description cdc: Extract create_stream_ids from topology_description_generator	2021-02-17 15:43:53 +02:00
Piotr Jastrzebski	649f254863	cdc: Limit size of topology description Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-02-17 13:24:40 +01:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Michał Chojnowski	6b8a69e01f	test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions The off-by-one error would cause test_multishard_combining_reader_non_strictly_monotonic_positions to fail if the added range_tombstones filled the buffer exactly to the end. In such situation, with the old loop condition, make_fragments_with_non_monotonic_positions would add one range_tombstone too many to the deque, violating the test assumptions.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	5b79d6ca4c	test: mutation_test: remove an obsolete assertion Due to small value optimizations, the removed assertions are not true in general. Until now, atomic_cell did not use small value optimizations, but it will after upcoming changes.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	aa60f28a09	test: mutation_test: initialize an uninitialized variable It was assumed to be zero-initialized, but C++ does not guarantee that. It has to be initialized explicitly.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	52bd190bb3	test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test sstable_run_based_compaction_test assumed that sstables are freed immediately after they are fully processed. However, since commit `b524f96a74`, mutation_reader_merger releases sstables in batches of 4, which breaks the assumption. This fix adjusts the test accordingly. Until now, the test only kept working by chance: by coincidence, the number of test sstables processed by merging_reader in a single fill_buffer() call was divisible by 4. Since the test checks happen between those calls, the test never witnessed a situation when an sstable was fully processed, but not released yet. The error was noticed during the work on an upcoming patch which changes the size of mutation_fragment, and reduces the number of test sstables processed in a single fill_buffer() call, which breaks the test.	2021-02-16 21:35:14 +01:00
Konstantin Osipov	d293966366	raft: add a unit test for voting Test duplicate votes, votes from non-members and voting in joint configuration.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	1bdb3fc8a9	raft: add tracker test	2021-02-16 23:15:16 +03:00
Konstantin Osipov	63965f46f4	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	6ee3aedcc2	raft: add a unit test for raft::log	2021-02-16 23:12:01 +03:00
Konstantin Osipov	6c14775b20	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-16 21:05:44 +03:00
Pavel Emelyanov	aa85bc790b	test: Add tests for radix tree Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Tomasz Grabiec	508f928220	tests: sstables: Test sstable write fails on missing partition_end mid-stream Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>	2021-02-15 15:45:49 +02:00
Avi Kivity	9cbbf40710	Merge "register_inactive_read: error handling" from Benny " Currently, register_inactive_read accepts an eviction_notify_handler to be called when the inactive_read is evicted. However, in case there was an error in register_inactive_read the notification function isn't called leaving behind state that needs to be cleaned up. This series separates the register_inactive_reader interface into 2 parts: 1. register_inactive_reader(flat_mutation_reader) - which just registers the reader and returns an inactive_read_handle, if permitted. Otherwise, the notification handler is not called (it is not known yet) and the caller is not expected to do anything fancy at this point that will require cleanup. This optimizes the server when overloaded since we do less work that we'd need to undo in case the reader_concurrency_semaphore runs out of resources. 2. After register_inactive_reader succeeded to return a valid inactive_read_handle, the caller sets up its local state and may call `set_notify_handler` to set the optional notify_handler and ttl on the i_r_h. After this state, the notify_handler will be called when the inactive_reader is evicted, for any reason. querier_cache::insert_querier was modified to use the above procedure and to handle (and log/ignore) any error in the process. inactive_read_handle and inactive_read keeping track of each other was simplified by keeping an iterator in the handle and a backpointer in the inactive_read object. The former is used to evict the reader and to set the notify_handler and/or ttl without having to lookup the i_r. The latter is used to invalidate the i_r_h when the i_r is destroyed. 
Test: unit(release), querier_cache_test(debug) " * tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla: querier_cache: insert_querier: ignore errors to register inactive reader querier_cache: insert_querier: handle errors querier_utils: mark functions noexcept reader_concurrency_semaphore: register_inactive_read: make noexcept reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional reader_concurrency_semaphore: inactive_read: use intrusive list reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged reader_concurrency_semaphore: inactive_read_handle: swap definition order reader_lifecycle_policy: retire low level try_resume method reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader	2021-02-10 19:09:21 +02:00
Konstantin Osipov	41387225c3	raft: extend single_node_is_quiet test	2021-02-09 17:04:13 +03:00
Piotr Sarna	4acc6fecf0	Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz The same trick is used as in C: `79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)` The edited CQL test relied on quietly accepting non-existing DCs, so it had to be removed. Also, one boost-test referred to nonexistent `datacenter2` and had to be removed. Fixes #7595 Closes #8056 github.com:scylladb/scylla: tests: Adjusted tests for DC checking in NTS locator: Check DC names in NTS	2021-02-09 14:45:20 +02:00
Juliusz Stasiewicz	97bb15b2f2	tests: Adjusted tests for DC checking in NTS CQL test relied on quietly accepting non-existing DCs, so it had to be removed. Also, one boost-test referred to nonexistent `datacenter2` and had to be removed.	2021-02-09 08:29:35 +01:00
Benny Halevy	46c2229b78	reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader Register the inactive reader first with no evict_notify_handler and ttl. Those can be set later, only if registration succeeded. Otherwise, as in the querier example, there is no need to place the querier in the index and erase it on eviction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	e072199b8d	reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error Calling unregister_inactive_read on the wrong semaphore is a blatant bug so better call on_internal_error so it'd be easier to catch and fix. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Konstantin Osipov	adc87aa278	raft: re-lookup progress object after a configuration change Fix raft_fsm_test failure in debug mode. ASAN complained that follower_progress is used in append_entries_reply() after it was destroyed. This could happen if in maybe_commit() we switched to a new configuration and destroyed old progress objects. The fix is to lookup the object one more time after maybe_commit().	2021-02-05 12:40:19 +01:00
Benny Halevy	ca6f5cb0bc	test: commitlog_test: test_allocation_failure: fill memory using smaller allocations commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output). So if there are discontiguous small memory blocks, they can be used to satisfy an allocation even if no contiguous memory blocks are available. To prevent that, as Avi suggested, this change allocates in 128K blocks and frees the last one to succeed (so that we won't fail on allocating continuations). Fixes #8028 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>	2021-02-03 12:21:20 +02:00
Avi Kivity	913d970c64	Merge "Unify inactive readers" from Botond " Currently inactive readers are stored in two different places: * reader concurrency semaphore * querier cache With the latter registering its inactive readers with the former. This is an unnecessarily complex (and possibly surprising) setup that we want to move away from. This series solves this by moving the responsibility of storing inactive reads solely to the reader concurrency semaphore, including all supported eviction policies. The querier cache is now only responsible for indexing queriers and maintaining relevant stats. This makes the ownership of the inactive readers much more clear, hopefully making Benny's work on introducing close() and abort() a little bit easier. Tests: unit(release, debug:v1) " * 'unify-inactive-readers/v2' of https://github.com/denesb/scylla: reader_concurrency_semaphore: store inactive readers directly querier_cache: store readers in the reader concurrency semaphore directly querier_cache: retire memory based cache eviction querier_cache: delegate expiry to the reader_concurrency_semaphore reader_concurrency_semaphore: introduce ttl for inactive reads querier_cache: use new eviction notify mechanism to maintain stats reader_concurrency_semaphore: add eviction notification facility reader_concurrency_semaphore: extract evict code into method evict()	2021-02-03 10:59:04 +02:00
Tomasz Grabiec	873e732042	Merge "Switch partition rows onto B-tree" from Pavel Emelyanov This is the continuation of the row-cache performance improvements, this time -- the rework of clustering keys part. The goal is to solve the same set of problems: - logN eviction complexity - deep and sparse tree Unlike partitions, this cache has one big feature that makes it impossible to just use existing B+ tree: There's no copyable key at hand. The clustering key is the managed_bytes() that is not nothrow-copy-constructible, nor is it hash-able for lookup due to prefix lookup. Thus the choice is the B-tree, which is also N-ary one, but doesn't copy keys around. B-trees are like B+, but can have key:data pairs in inner nodes, thus those nodes may be significantly bigger than B+ ones, that have data-s only in leaf trees. Not to make the memory footprint worse, the tree assumes that keys and data live on the same object (the rows_entry one), and the tree itself manages only the key pointers. Not to invalidate iterators on insert/remove the tree nodes keep pointers on keys, not the keys themselves. The tree uses tri-compare instead of less-compare. This makes the .find and .lower_bound methods do ~10% fewer comparisons on random insert/lookup test. 
Numbers: - memory_footprint: B-tree master rows_entry size: 216 232 1 row in-cache: 968 960 (because of dummy entry) in-memtable: 1006 1022 100 rows in-cache: 50774 50856 in-memtable: 50620 50918 - mutation_test: B-tree master tps.average: 891177 833896 - simple_query: B-tree master tps.median: 71807 71656 tps.maximum: 71847 71708 * xemul/clustering-cache-over-btree-4: mutation_partition: Save one keys comparison partition_snapshot_row_cursor: Remove rows pointer mutation_partition: Use B-tree insertion sugar perf-test : Print B-tree sizes mutation_partition: Switch cache of rows onto B-tree partition_snapshot_reader: Rename cmp to less for explicity mutation_partition: Make insertion bullet-proof mutation_partition: Use tri-compare in non-set places flat_mutation_reader: Use clear() in destroy_current_mutation() rows_entry: Generalize compare utils: Intrusive B-tree (with tests) tests: Generalize bptree compaction test tests: Generalize bptree stress test	2021-02-02 12:26:02 +01:00
Tomasz Grabiec	75eb97b12c	Merge 'Commitlog multi-entry write' from Calle Wilund Fixes #7615 Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles. Guarantees that all entries are written or none are, and that they will be flushed to disk together. Small test included. Closes #7616 * github.com:scylladb/scylla: commitlog_test: Add multi-entry write test commitlog: Add "add_entries" call to allow inputting N mutations commitlog: Make commitlog entries optionally multi-entry commitlog: Move entry_writer definition to cc file	2021-02-02 12:23:19 +01:00
Tomasz Grabiec	7b17969a6e	Merge 'sstable: reader: preempt after every fragment' from Avi Kivity Whenever we push a fragment, we check whether the buffer is full and return proceed::no if so, so that the state machine pauses and lets the consumer continue. This patch adds an additional condition - if preemption is needed, we also return proceed::no. This drops us back to the outer loop (in sstable_mutation_reader::fill_buffer), which will yield to the reactor as part of seastar::do_until(). Two cases (partition_start and partition_end) did not have the check for is_buffer_full(); it is added now. This can trigger if the partition has no rows. Unlike the previous attempt, push_ready_fragments() is not touched. The extra preemption opportunities triggered a preexisting bug in clustering_ranges_walker; it is fixed in the first patch of the series. I tested this by reading from a large partition with a simple schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE. However, even without the patch I only got sporadic stalls with the detector set to 1ms, so it's possible I'm not testing correctly. Test: unit (dev, debug, release) Fixes #7883. Closes #7928 * github.com:scylladb/scylla: sstable: reader: preempt after every fragment clustering_range_walker: fix false discontiguity detected after a static row	2021-02-02 12:21:58 +01:00
Benny Halevy	0fecc78d88	user_function: throw on_internal_error if executed outside a seastar thread Rather than asserting, as seen in #7977. This shouldn't crash the server in production. Add unit test that reproduces this scenario and verifies the internal error exception. Fixes #7977 Test: unit(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>	2021-02-02 13:03:39 +02:00
Calle Wilund	720a47fe8a	commitlog_test: Add multi-entry write test	2021-02-02 10:41:08 +00:00
Avi Kivity	da4fa0629a	Merge "sstables: add sstable_origin to scylla_metadata" from Benny " This series extends the scylla_metadata sstable component to hold an optional textual description of the sstable origin. It describes where the sstables originated from (e.g. memtable, repair, streaming, compaction, etc.) The origin string is provided by the sstable writer via sstable_writer_config, written to the scylla_metadata component, and loaded on sstable::load(). A get_origin() method was added to class sstable to retrieve its origin. It returns an empty string by default if the origin is missing. Compaction now logs the sstable origin for each sstable it compacts, and it generates the sstable origin for all sstables it generates. Regular compaction origin is simply set to "compaction" while other compaction types are mentioned by name, as "cleanup", "resharding", "reshaping", etc. A unit test was added to test the sstable_origin by writing either an empty origin or a random string, and then comparing the origin retrieved by sstable::load to the one written. Test: unit(release) Fixes #7880 " * tag 'sstable-origin-v2' of github.com:bhalevy/scylla: compaction: log sstable origin sstables: scylla_metadata: add support for sstable_origin sstables: sstable_writer_config: add origin member	2021-02-02 10:35:11 +02:00
Pavel Emelyanov	5c0f9a8180	mutation_partition: Switch cache of rows onto B-tree The switch is pretty straightforward, and consists of - change less-compare into tri-compare - rename insert/insert_check into insert_before_hint - use tree::key_grabber in mutation_partition::apply_monotonically to exception-safely transfer a row from one tree to another - explicitly erase the row from tree in rows_entry::on_evicted, there's an O(1) tree::iterator method for this - rewrite rows_entry -> cache_entry transformation in the on_evicted to fit the B-tree API - include the B-tree's external memory usage into stats That's it. The number of keys per node is set to 12 with linear search and linear extension of 20 because - experimenting with tree shows that numbers 8 through 10 keys with linear search show the best performance on stress tests for insert/find-s of keys that are memcmp-able arrays of bytes (which is an approximation of current clustering key compare). More keys work slower, but still better than any bigger value with any type of search up to 64 keys per node - having 12 keys per node is the threshold at which the memory footprint for B-tree becomes smaller than for boost::intrusive::set for partitions with 32+ keys - 20 keys for linear root eats the first-split peak and still performs well in linear search As a result the footprint for B tree is bigger than the one for BST only for trees filled with 21...32 keys by 0.1...0.7 bytes per key. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	2f7c03d84c	utils: Intrusive B-tree (with tests) The design of the tree goes from the row-cache needs, which are 1. Insert/Remove do not invalidate iterators 2. Elements are LSA-manageable 3. Low key overhead 4. External tri-comparator 5. As few actions on insert/remove as possible With the above the design is Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes and N pointers on keys (not keys themselves). Two differences: inner nodes have array of pointers on kids, leaf nodes keep pointer on the tree (to update left- and rightmost tree pointers on node move). Nodes do not keep pointers/references on trees, thus we have O(1) move of any object, but O(logN) to get the tree size. Fortunately, with big keys-per-node value this won't result in too many steps. In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter is for constant-time begin() and end(). Keys are managed by user with the help of embeddable member_hook instance, which is 1 pointer in size. The code was copied from the B+ tree one, then heavily reworked, the internal algorithms turned out to differ quite significantly. For the sake of mutation_partition::apply_monotonically(), which needs to move an element from one tree into another, there's a key_grabber helping wrapper that allows doing this move respecting the exception-safety requirement. As measured by the perf_collections test the B-tree with 8 keys is faster than the std::set, but slower than the B+tree: fill: +13% vs set, -6% vs b+tree; find: +23% vs set, -35% vs b+tree. Another neat thing is that 1-key insertion-removal is ~40% faster than for BST (the same number of allocations, but the key object is smaller, less pointers to set-up and less instructions to execute when linking node with root). v4: - equip insertion methods with on_alloc_point() calls to catch potential exception guarantees violations earlier - add unlink_leftmost_without_rebalance. 
The method is borrowed from boost intrusive set, and is added to kill two birds -- provide it, as it turns out to be popular, and use a bit faster step-by-step tree destruction than plain begin+erase loop v3: - introduce "inline" root node that is embedded into tree object and in which the 1st key is inserted. This greatly improves the 1-key-tree performance, which is pretty common case for rows cache v2: - introduce "linear" root leaf that grows on demand This improves the memory consumption for small trees. This linear node may and should over-grow the NodeSize parameter. This comes from the fact that there are two big per-key memory spikes on small trees -- 1-key root leaf and the first split, when the tree becomes 1-key root with two half-filled leaves. If the linear extension goes above NodeSize it can flatten even the 2nd peak - mitigate the keys indirection a bit Prefetching the keys while doing the intra-node linear scan and the nodes while descending the tree gives ~+5% of fill and find - generalize stress tests for B and B+ trees - cosmetic changes TODO: - fix a few inefficiencies in the core code (walks the sub-tree twice sometimes) - try to optimize the leaf nodes, that are not left-/rightmost not to carry unused tree pointer on board Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:29 +03:00
Avi Kivity	7634a90dd2	clustering_range_walker: fix false discontiguity detected after a static row clustering_range_walker detects when we jump from one row range to another. When a static row is included in the query, the constructor sets up the first before/after bounds to be exactly that static row. That creates an artificial range crossing if the first clustering range is contiguous with the static row. This can cause the index to be consulted needlessly if we happen to fall back to sstable_mutation_reader after reading the static row. A unit test is added. Ref #7883.	2021-02-01 19:32:07 +02:00
Pavel Solodovnikov	9d17a654a6	raft: use null_sharder for raft tables Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>	2021-02-01 18:52:04 +02:00
Tomasz Grabiec	eac9c1d80a	Merge "raft: configuration changes with joint consensus" from Kostja Support configuration changes based on joint consensus. When a user adds a configuration entry, commit an interim "joint consensus" configuration to the log first, and transition to the final configuration once both C_old and C_new configurations accept the joint entry. Misc cleanups. * scylla-dev/raft-config-changes-v2: raft: update README.md raft: add a simple test for configuration changes raft: joint consensus, wire up configuration changes in the API raft: joint consensus, count votes using joint config raft: joint consensus, wire up configuration changes in FSM raft: joint consensus, update progress tracker with joint configuration raft: joint consensus, don't store configuration in FSM raft: joint consensus, keep track of the last confchange index in the log raft: joint consensus, implement helpers in class configuration raft: joint consensus, use unordered_set for server_address list raft: joint consensus, switch configuration to joint raft: rename check_committed() to maybe_commit() raft: fix spelling and add comments	2021-02-01 18:52:04 +02:00
Benny Halevy	77328a936a	sstables: scylla_metadata: add support for sstable_origin Add new scylla_metadata_type::SSTableOrigin. Store and retrieve a sstring to the scylla metadata component. Pass sstable_writer_config::origin from the mx sstable writer and ignore it in the k_l writer. Add unit test to verify the sstable_origin extension using both empty and a random string. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Benny Halevy	22f6023ac3	sstables: sstable_writer_config: add origin member Add a string describing where the sstables originated from (e.g. memtable, repair, streaming, compaction, etc.) If configure_writer is called with a nullptr, the origin will be equal to an empty string. Introduce test_env_sstables_manager that provides an overload of configure_writer with no parameters that calls the base-class' configure_writer with "test" origin. This was to reduce the code churn in this patch and to keep the tests simple. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Konstantin Osipov	b7692af8bc	raft: add a simple test for configuration changes Test adding, removing, and replacing a node. With fix-ups by Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-29 22:07:08 +03:00
Konstantin Osipov	1ca738d9a2	raft: joint consensus, use unordered_set for server_address list	2021-01-29 22:07:07 +03:00
Pavel Emelyanov	575c992a35	test: Bring test_apply_monotonically_is_monotonic back to work The idea of the monotonicity checking test is: try to apply one random partition to another random one while sequentially failing allocations. Each time allocation fails (with the bad_alloc exception) -- check the exception guarantee is respected, then apply (!) the very same two partitions to each other. At the end of the test we make sure, that an exception may pop up at any point of application and it will be safe. This idea is flawed currently. When verifying the guarantee the test moves the 2nd partition and leaves it empty for the next loop iteration. So right on the 2nd attempt to apply partitions it becomes a no-op, doesn't fail and no more exceptions arise. Fix by restoring both partitions at the end of each check. Broken since `74db08165d`. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210129153641.5449-1-xemul@scylladb.com>	2021-01-29 18:47:15 +01:00
Tomasz Grabiec	16eb4c6ce2	Merge "raft: system table backed persistency module" from Pavel Solodovnikov This series contains an initial implementation of raft persistency module that uses `raft` system table as the underlying storage model. "system.raft" table will be used as a backend storage for implementing raft persistence module in Scylla. It combines both raft log, persisted vote and term, and snapshot info. The table is partitioned by group id, thus allowing multi-raft operation. The rest of the table structure mirrors the fields of corresponding core raft structures defined in `raft.hh`, such as `raft::log_entry`. The raft table stores only the latest snapshot id while the actual snapshot will be available in a separate table called `system.raft_snapshots`. The schema of `raft_snapshots` mirrors the fields of `raft::snapshot` structure. IDL definitions are also added for every raft struct so that we automatically provide serialization and deserialization facilities needed both for persistency module and for future RPC implementation. The first patch is a side-change needed to provide complete serialization/deserialization for `bytes_ostream`, which we need when persisting the raft log in the table (since `data` is a variant containing `raft::command` (aka `bytes_ostream`) among others). `bytes_ostream` was lacking `deserialize` function, which is added in the patch. The second patch provides serializer for `lw_shared_ptr<T>` which will be used for `raft::append_entries`, which has a field with `std::vector<const lw_shared_ptr<raft::log_entry>>` type. There is also a patch to extend `fragmented_temporary_buffer` with a static function `allocate_to_fit` that allocates an instance of the fragmented buffer that has a specified size. Individual fragment size is limited to 128kb. The patch-set also contains the test suite covering basic functionality of the persistency module. 
* manmanson/raft-api-impl-v11: raft/sys_table_storage: add basic tests for raft_sys_table_storage raft: introduce `raft_sys_table_storage` class utils: add `fragmented_temporary_buffer::allocate_to_fit` raft: add IDL definitions for raft types raft: create `system.raft` and `system.raft_snapshots` tables serializer: add `serializer<lw_shared_ptr<T>>` specialization serializer: add `deserialize` function overload for `bytes_ostream`	2021-01-29 11:40:39 +02:00
Pavel Solodovnikov	e309502c42	raft/sys_table_storage: add basic tests for raft_sys_table_storage The test suite covers the most basic use cases for the system table backed raft persistency module: * store/load vote and term * store/load snapshot * store snapshot with log tail truncation * store/load log entries * log truncation Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 02:00:27 +03:00


796 Commits