scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 19:21:01 +00:00

Author	SHA1	Message	Date
Piotr Sarna	7ceafda70a	service: add timeout config to client state Future patches will use this per-connection timeout config to allow setting different timeouts for each session, based on roles.	2021-02-25 17:20:26 +01:00
Raphael S. Carvalho	7bf0744d36	reshape/TWCS: Fix off-by-one in threshold check A given time bucket should also be reshaped if its # of sstables has reached the threshold. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210223182634.570648-1-raphaelsc@scylladb.com>	2021-02-24 15:12:40 +02:00
Raphael S. Carvalho	21608bd677	sstables: Fix TWCS reshape for windows with at least min_threshold sstables TWCS reshape was silently ignoring windows which contain at least min_threshold sstables (can happen with data segregation). When resizing candidates, size of multi_window was incorrectly used and it was always empty in this path, which means candidates was always cleared. Fixes #8147. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>	2021-02-24 15:11:19 +02:00
Tomasz Grabiec	ecb6c56a2a	Merge 'lsa: background reclaim' from Avi Kivity This series adds background reclaim to lsa, with the goal that most large allocations can be satisfied from available free memory, and reclaim work can be done from a preemptible context. If the workload has free cpu, then background reclaim will utilize that free cpu, reducing latency for the main workload. Otherwise, background reclaim will compete with the main workload, but since that work needs to happen anyway, throughput will not be reduced. A unit test is added to verify it works. Fixes #1634. Closes #8044 * github.com:scylladb/scylla: test: logalloc_test: test background reclaim logalloc: reduce gap between std min_free and logalloc min_free logalloc: background reclaim logalloc: preemptible reclaim	2021-02-24 13:23:30 +01:00
Piotr Sarna	25f47561cb	transport: fix an outdated comment The comment mentions calling a lambda in-place, but the lambda is no longer there since 2019! Message-Id: <3903c84d5c151415409f28935e328b552dd548f8.1614155567.git.sarna@scylladb.com>	2021-02-24 11:14:01 +02:00
Avi Kivity	15d3797e97	test: logalloc_test: test background reclaim Test that the background reclaimer is able to compete with a fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU" is fully randomized. If the background reclaimer is disabled, the test fails as soon as the 20MB "gap" is exhausted. With the reclaimer enabled, it is able to free memory ahead of the allocations.	2021-02-23 19:42:42 +02:00
Nadav Har'El	d905e71a90	Alternator: add support for CORS protocol This patch adds to Alternator support for the CORS (Cross-Origin Resource Sharing) protocol - a simple extension over the HTTP protocol which browsers use when Javascript code contacts HTTP-based servers. Although we usually think of Alternator as being used in a three-tier application, in some setups there is no middle layer and the user's browser, running Javascript code, wants to communicate directly with the database. However, for security reasons, by default Javascript loaded from domain X is not allowed to communicate with different domains Y. The CORS protocol is meant to allow this, and Alternator needs to participate in this protocol if it is to be used directly from Javascript in browsers. To implement CORS, Alternator needs to respond to the OPTIONS method which it didn't allow before - with certain headers based on the input headers. It also needs to do some of these things for the regular methods (mostly, POST). The patch includes a comprehensive test that runs against both Alternator and DynamoDB and shows that Alternator handles these headers and methods the same as DynamoDB. Additionally, I tested manually a Javascript DynamoDB client - which didn't work prior to this patch (the browser reported CORS errors), and works after this patch. Fixes #8025. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>	2021-02-23 13:15:03 +01:00
Asias He	7018377bd7	messaging_service: Move gossip ack message verb to gossip group Fix a scheduling group leak: INFO [shard 0] gossip - gossiper::run sg=gossip INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip After the fix: INFO [shard 0] gossip - gossiper::run sg=gossip INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip Fixes #7986 Closes #8129	2021-02-23 10:10:00 +02:00
Tomasz Grabiec	fb1d3fe2cf	table: Fix schema mismatch between memtable reader and sstable writer The schema used to create the sstable writer has to be the same as the schema used by the reader, as the former is used to interpret mutation fragments produced by the reader. Commit `9124a70` introduced a deferring point between reader creation and writer creation which can result in schema mismatch if there was a concurrent alter. This could lead to the sstable writer crashing or generating a corrupted sstable. Fixes #7994 Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>	2021-02-22 17:51:00 +02:00
Raphael S. Carvalho	81d773e5d8	compaction_manager: Redefine weight for better control of parallel compactions Compaction manager allows compaction of different weights to proceed in parallel. For example, a small-sized compaction job can happen in parallel to a large-sized one, but similar-sized jobs are serialized. The problem is the current definition of weight, which is the log (base 4) of total size (size of all sstables) of a job. This is what we get with the current weight definition: weight=5 for sizes=[1K, 3K] weight=6 for sizes=[4K, 15K] weight=7 for sizes=[16K, 63K] weight=8 for sizes=[64K, 255K] weight=9 for sizes=[258K, 1019K] weight=10 for sizes=[1M, 3M] weight=11 for sizes=[4M, 15M] weight=12 for sizes=[16M, 63M] weight=13 for sizes=[64M, 254M] weight=14 for sizes=[256M, 1022M] weight=15 for sizes=[1033M, 4078M] weight=16 for sizes=[4119M, 10188M] total weights: 12 Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5 jobs smaller than 1MB could proceed in parallel. High number of parallel compactions can be observed after repair, which potentially produces tons of small sstables of varying sizes. That causes compaction to use a significant amount of resources. To fix this problem, let's add a fixed tax to the size before taking the log, so that jobs smaller than 1M will all have the same weight. Look at what we get with the new weight definition: weight=10 for sizes=[1K, 2M] weight=11 for sizes=[3M, 14M] weight=12 for sizes=[15M, 62M] weight=13 for sizes=[63M, 254M] weight=14 for sizes=[256M, 1022M] weight=15 for sizes=[1033M, 4078M] weight=16 for sizes=[4119M, 10188M] total weights: 7 Fixes #8124. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>	2021-02-22 15:50:29 +02:00
Asias He	554ab035dd	main: Run init_server and join_cluster inside maintenance scheduling group Currently, init_server and join_cluster which initiate the bootstrap and replace operations on the new node run inside the main scheduling group. We should run them inside the maintenance scheduling group to reduce the impact on the user workload. This patch fixes a scheduling group leak for bootstrap and replace operation. Before: [shard 0] storage_service - storage_service::bootstrap sg=main [shard 0] repair - bootstrap_with_repair sg=main After: [shard 0] storage_service - storage_service::bootstrap sg=streaming [shard 0] repair - bootstrap_with_repair sg=streaming Fixes #8130 Closes #8131	2021-02-22 14:55:02 +02:00
Michał Chojnowski	a24f83852e	atomic_cell: fix operator<< for atomic_cell_or_collection operator<< used the wrong criterion for deciding whether the data is stored as atomic_cell or collection_mutation, resulting in catastrophic failure if it was used with frozen collections or UDTs. Since frozen collections and UDTs are stored as atomic_cell, not collection_mutation, the correct criterion is not is_collection(), but is_multi_cell(). Closes #8134	2021-02-22 14:45:34 +02:00
Liu Lan	d2378129a3	docs: fix invalid path in README.mds Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com> Closes #8126	2021-02-21 13:49:12 +02:00
Pekka Enberg	d483922671	Update tools/java submodule * tools/java 0187829d5e...142f517a23 (2): > nodetool: Enable resetlocalschema > sstableloader: Make progress printout less eager.	2021-02-19 12:37:04 +02:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, for wide rows by up to 2x. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells).
The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), pref(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Nadav Har'El	02dde2aca1	cql-pytest: port Cassandra's unit test validation/entities/json_test In this patch, we port validation/entities/json_test.java, containing 21 tests for various JSON-related operations - SELECT JSON, INSERT JSON, and the fromJson() and toJson() functions. In porting these tests, I uncovered 19 (!!) previously unknown bugs in Scylla: Refs #7911: Failed fromJson() should result in FunctionFailure error, not an internal error. Refs #7912: fromJson() should allow null parameter. Refs #7914: fromJson() integer overflow should cause an error, not silent wrap-around. Refs #7915: fromJson() should accept "true" and "false" also as strings. Refs #7944: fromJson() should not accept the empty string "" as a number. Refs #7949: fromJson() fails to set a map<ascii, int>. Refs #7954: fromJson() fails to set null tuple elements. Refs #7972: toJson() truncates some doubles to integers. Refs #7988: toJson() produces invalid JSON for columns with "time" type. Refs #7997: toJson() is missing a timezone on timestamp. Refs #8001: Documented unit "µs" not supported for assigning a "duration" type. Refs #8002: toJson() of decimal type doesn't use exponents so can produce huge output. Refs #8077: SELECT JSON output for function invocations should be compatible with Cassandra. Refs #8078: SELECT JSON ignores the "AS" specification. Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest error, not internal error. Refs #8086: INSERT JSON cannot handle user-defined types with case- sensitive component names. Refs #8087: SELECT JSON incorrectly quotes strings inside map keys. Refs #8092: SELECT JSON missing null component after adding field to UDT definition. Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY. Due to these bugs, 8 out of the 21 tests here currently xfail and one has to be skipped (issue #8100 causes the sanitizer to detect a use after free, and crash Scylla). 
As usual in these sort of tests, all 21 tests pass when running against Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>	2021-02-18 20:44:04 +02:00
Takuya ASADA	32d4ec6b8a	scylla_util.py: resolve /dev/root to get actual device on aws When psutil.disk_partitions() reports / is /dev/root, aws_instance mistakenly reports root partition is part of ephemeral disks, and RAID construction will fail. This prevents the error and reports correct free disks. Fixes #8055 Closes #8040	2021-02-18 20:25:45 +02:00
Avi Kivity	90a7f76fb6	Merge 'cdc: log: fix a use-after-free in process_bytes_visitor' from Michał Chojnowski Due to small value optimization used in `bytes`, views to `bytes` stored in `vector` can be invalidated when the vector resizes, resulting in use-after-free and data corruption. Fix that. Closes #8105 * github.com:scylladb/scylla: cdc: log: avoid an unnecessary copy cdc: log: fix use-after-free in process_bytes_visitor	2021-02-18 20:23:41 +02:00
Michał Chojnowski	96c22cf3f8	cdc: log: avoid an unnecessary copy There is no need to copy `bytes_view` into `bytes` here.	2021-02-18 14:08:18 +01:00
Michał Chojnowski	8cc4f39472	cdc: log: fix use-after-free in process_bytes_visitor Due to small value optimization used in `bytes`, views to `bytes` stored in `vector` can be invalidated when the vector resizes, resulting in use-after-free and data corruption. Fix that. Fixes #8117	2021-02-18 14:08:17 +01:00
Avi Kivity	f0950e023d	Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams. --- Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. We add an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. 
when some CDC log tables may contain data in these generations). Closes #8116 * github.com:scylladb/scylla: tests: add a simple CDC cql pytest cdc: add config option to disable streams rewriting cdc: rewrite streams to the new description table cql3: query_processor: improve internal paged query API cdc: introduce no_generation_data_exception exception type docs: cdc: mention system.cdc_local table cdc: coroutinize do_update_streams_description sys_dist_ks: split CDC streams table partitions into clustered rows cdc: use chunked_vector for streams in streams_version cdc: remove `streams_version::expired` field system_distributed_keyspace: use mutation API to insert CDC streams storage_service: don't use `sys_dist_ks` before it is started	2021-02-18 12:49:43 +02:00
Kamil Braun	4bf28aad7a	tests: add a simple CDC cql pytest	2021-02-18 11:44:59 +01:00
Kamil Braun	841f07e9b7	cdc: add config option to disable streams rewriting Rewriting stream descriptions is a long, expensive, and prone-to-failure operation. Due to #8061 it may consume a lot of memory. In general, it may keep failing (and being retried) endlessly, straining the cluster. As a backdoor we add this flag for potential future needs of admins or field engineers. I don't expect it will ever be used, but it won't hurt and may save us some work in the worst case scenario.	2021-02-18 11:44:59 +01:00
Kamil Braun	9bdd000e97	cdc: rewrite streams to the new description table Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. This commit adds an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. when some CDC log tables may contain data in these generations).	2021-02-18 11:44:59 +01:00
Kamil Braun	4ef736a0a3	cql3: query_processor: improve internal paged query API The `query_processor::query` method allowed internal paged queries. However, it was quite limited, hardcoding a number of parameters: consistency level, timeout config, page size. This commit does the following improvements: 1. Rename `query` to `query_internal` to make it obvious that this API is supposed to be used for internal queries only 2. Extend the method to take consistency level, timeout config, and page size as parameters 3. Remove unused overloads of `query_internal` 4. Fix a bunch of typos / grammar issues in the docstring	2021-02-18 11:44:59 +01:00
Kamil Braun	7c91894ddf	cdc: introduce no_generation_data_exception exception type	2021-02-18 11:44:59 +01:00
Kamil Braun	99cc9b8051	docs: cdc: mention system.cdc_local table	2021-02-18 11:44:59 +01:00
Kamil Braun	44aab61aea	cdc: coroutinize do_update_streams_description	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
Kamil Braun	ba920361b3	cdc: use chunked_vector for streams in streams_version The vector may get quite long (say... 1,6M stream IDs). We prevent a large allocation by using utils::chunked_vector.	2021-02-18 11:44:59 +01:00
Kamil Braun	9ae4467970	cdc: remove `streams_version::expired` field This field was not used anywhere.	2021-02-18 11:44:59 +01:00
Kamil Braun	3d7b990300	system_distributed_keyspace: use mutation API to insert CDC streams The `storage_proxy::mutate` low-level API is much more powerful than the CQL API. This power is not needed for this commit but for the next.	2021-02-18 11:44:59 +01:00
Kamil Braun	0df15ca8cc	storage_service: don't use `sys_dist_ks` before it is started It could happen that system_distributed_keyspace was used by storage_service before it was fully started (inside `handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned (on shard 0). It only checked whether `local_is_initialized()` returns true, so it only ensured that the service is constructed. Currently, sys_dist_ks' `start` only announces migrations, so this was mostly harmless. More concretely: it could result in the node trying to send CQL requests using a table that it didn't yet recognize by calling sys_dist_ks' methods before the `announce_migration` call inside `start` has returned. This would result in an exception; however, the exception would be caught by the caller and the procedure would be retried, succeeding eventually. See `handle_cdc_generation` for details. Still, the initial intention of the code was to wait for the sys_dist_ks service to be fully started before it was used. This commit fixes that.	2021-02-18 11:44:59 +01:00
Tomasz Grabiec	f94f70cda8	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev/raft-confchange-test: raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::length() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-02-18 10:55:59 +01:00
Raphael S. Carvalho	5206a97915	compaction: Fix leak of expired sstable in the backlog tracker Expired sstables are skipped in the compaction setup phase, because they don't need to be actually compacted, but rather only deleted at the end. That is causing such sstables to not be removed from the backlog tracker, meaning that backlog caused by expired sstables will not be removed even after their deletion, which means shares will be higher than needed, making compaction potentially more aggressive than it has to be. To fix this bug, let's manually register these sstables into the monitor, such that they'll be removed from the tracker once compaction completes. Fixes #6054. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>	2021-02-18 11:12:00 +02:00
Takuya ASADA	d7f202f900	dist/debian: fix renaming debian/scylla-* files rule Current renaming rule of debian/scylla-* files is buggy, it fails to install some .service files when custom product name specified. Introduce regex based rewriting instead of adhoc renaming, and fixed wrong renaming rule. Fixes #8113 Closes #8114	2021-02-18 10:35:19 +02:00
Pekka Enberg	843bf57c3c	Update tools/jmx submodule * tools/jmx 949cefc...bf8bb16 (1): > Merge 'dist/debian: fix renaming debian/scylla-* files rule' from Takuya ASADA	2021-02-18 10:35:00 +02:00
Botond Dénes	c3b4c3f451	evictable_reader: reset _range_override after fast-forwarding `_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure. Fixes: #8059 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>	2021-02-17 19:11:00 +02:00
Benny Halevy	4b46793c19	row_cache: scanning_and_populating_reader: add _read_next_partition flag Instead of resetting _reader in scanning_and_populating_reader::fill_buffer in the `reader_finished` case, use a gentler, _read_next_partition flag on which `read_next_partition` will be called in the next iteration. Then, read_next_partition can close _reader only before overwriting it with a new reader. Otherwise, if _reader is always closed in the `reader_finished` case, we end up hitting premature end_of_stream. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>	2021-02-17 19:06:21 +02:00
Benny Halevy	57540dae42	mutation_query: mark reconcilable_result_builder constructor noexcept With result_memory_accounter being nothrow move constructible, reconcilable_result_builder does not throw. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-67-bhalevy@scylladb.com>	2021-02-17 18:56:12 +02:00
Benny Halevy	92e0e84ee5	database: futurize remove In preparation for futurizing the querier_cache api. Coroutinize drop_column_family while at it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>	2021-02-17 18:52:53 +02:00
Benny Halevy	5263ab0e9d	row_cache: read_context: use query-request is_single_partition helper Rather than hand-coding the same logic. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-32-bhalevy@scylladb.com>	2021-02-17 18:29:39 +02:00
Benny Halevy	35256d1b92	treewide: explicitly use flat_mutation_reader_opt Unlike flat_mutation_reader_opt that is defined using optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate to `false` after being moved, only after it is explicitly reset. Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader> to make it easier to check if it was closed before it's destroyed or being assigned-over. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>	2021-02-17 17:57:34 +02:00
Avi Kivity	c63e26e26f	Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8048 * github.com:scylladb/scylla: cdc: Limit size of topology description cdc: Extract create_stream_ids from topology_description_generator	2021-02-17 15:43:53 +02:00
Piotr Jastrzebski	649f254863	cdc: Limit size of topology description Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-02-17 13:24:40 +01:00
Avi Kivity	001652815c	Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). 
Fixes: #5578 Closes #8106 * github.com:scylladb/scylla: imr: switch back to open-coded description of structures utils: managed_bytes: add a few trivial helper methods utils: fragment_range: move FragmentedView helpers to fragment_range.hh utils: fragment_range: add single_fragmented_mutable_view utils: fragment_range: implement FragmentRange for fragment_range utils: mutable_view: add front() types: remove an unused helper function test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions test: mutation_test: remove an obsolete assertion test: mutation_test: initialize an uninitialized variable test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test	2021-02-17 13:40:16 +02:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Michał Chojnowski	25a9569cc4	utils: managed_bytes: add a few trivial helper methods We will use them in the upcoming IMR removal patch.	2021-02-16 23:43:07 +01:00
Michał Chojnowski	3f248ca7cc	utils: fragment_range: move FragmentedView helpers to fragment_range.hh In the upcoming IMR removal patch we will need read_simple() and similar helpers for FragmentedView outside of types.hh. For now, let's move them to fragment_range.hh, where FragmentedView is defined. Since it's a widely included header, we should consider moving them to a more specialized header later.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	8a06a576aa	utils: fragment_range: add single_fragmented_mutable_view We will use it later in the upcoming IMR removal patch.	2021-02-16 21:35:15 +01:00
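
The weight redefinition described in commit 81d773e5d8 above can be illustrated with a small sketch. It is not the compaction_manager code: the function names are hypothetical, the result is floored, and the fixed tax is assumed to be 1 MiB because that value reproduces the second table in the commit message.

```python
# Sketch of the two compaction-weight definitions from commit 81d773e5d8.
# Assumptions (not taken from the source): the function names, the use of
# floor(), and a fixed tax of exactly 1 MiB.

FIXED_TAX = 1024 * 1024  # assumed value; consistent with the commit's second table


def floor_log4(n: int) -> int:
    """Return floor(log4(n)) for n >= 1 using integer arithmetic only."""
    return (n.bit_length() - 1) // 2


def weight_old(total_size: int) -> int:
    """Old definition: log base 4 of the job's total sstable size."""
    return floor_log4(total_size)


def weight_new(total_size: int) -> int:
    """New definition: add a fixed tax to the size before taking the log."""
    return floor_log4(total_size + FIXED_TAX)


K, M = 1024, 1024 * 1024
small_jobs = [1 * K, 4 * K, 16 * K, 64 * K, 256 * K]  # all below 1 MiB

old_weights = sorted({weight_old(s) for s in small_jobs})
new_weights = sorted({weight_new(s) for s in small_jobs})

print(old_weights)  # [5, 6, 7, 8, 9]: five jobs could run in parallel
print(new_weights)  # [10]: jobs below 1 MiB are serialized
```

Under these assumptions, sizes of 1 MiB and above keep roughly the weights they had before, while everything smaller collapses into a single weight.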

25257 Commits