scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 11:55:15 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	cccdd6aaae	compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism Prior to `463d0ab`, only one table could be cleaned up at a time on a given shard. Since then, all tables belonging to a given keyspace are cleaned up in parallel. Cleanup serialization on each shard was enforced with a semaphore, which was incorrectly removed by the aforementioned patch. So the space requirement for cleanup to succeed can be up to the size of the keyspace, increasing the chances of the node running out of space. The node could also run out of memory if there are tons of tables in the keyspace. Memory requirement is at least #_of_tables * 128k (not taking into account write behind, etc). With 5k tables, it's ~0.64G per shard. Also, all tables being cleaned up in parallel will compete for the same disk and CPU bandwidth, making them all much slower, and consequently the operation time is significantly higher. This problem was detected with cleanup, but scrub and upgrade go through the same rewrite procedure, so they're affected by exactly the same problem. Fixes #8247. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com> (cherry picked from commit `7171244844`)	2021-03-18 14:29:38 +02:00
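The memory estimate quoted in the commit above is plain arithmetic. A minimal sketch, assuming "128k" means 128,000 bytes per table (that reading reproduces the quoted ~0.64G; reading it as 128 KiB gives roughly 0.66 GB instead). The function and constant names are illustrative and do not come from the ScyllaDB source.

```python
# Hypothetical helper: only the per-table figure and the table count
# come from the commit message; everything else is illustrative.
PER_TABLE_BYTES = 128_000  # assumption: "128k" read as 128,000 bytes


def cleanup_memory_floor_bytes(num_tables: int,
                               per_table_bytes: int = PER_TABLE_BYTES) -> int:
    """Lower bound on per-shard memory if every table is cleaned up at once."""
    return num_tables * per_table_bytes


if __name__ == "__main__":
    total = cleanup_memory_floor_bytes(5_000)
    print(f"{total / 1e9:.2f} GB per shard")
```

With cleanup serialized again, the same bound applies to one table at a time rather than to the whole keyspace.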
Raphael S. Carvalho	92871a88c3	compaction: Prevent cleanup and regular from compacting the same sstable Due to regression introduced by `463d0ab`, regular can compact in parallel a sstable being compacted by cleanup, scrub or upgrade. This redundancy causes resources to be wasted, write amplification is increased and so does the operation time, etc. That's a potential source of data resurrection because the now-owned data from a sstable being compacted by both cleanup and regular will still exist in the node afterwards, so resurrection can happen if node regains ownership. Fixes #8155. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com> (cherry picked from commit `2cf0c4bbf1`) Includes fixup patch: compaction_manager: Fix use-after-free in rewrite_sstables() Use-after-free introduced by `2cf0c4bbf1`. That's because compacting is moved into then_wrapped() lambda, so it's potentially freed on the next iteration of repeat(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com> (cherry picked from commit `f7cc431477`)	2021-03-11 08:24:56 +02:00
Benny Halevy	85bbf6751d	repair: repair_writer: do not capture lw_shared_ptr cross-shard The shared_from_this lw_shared_ptr must not be accessed across shards. Capturing it in the lambda passed to mutation_writer::distribute_reader_and_consume_on_shards causes exactly that since the captured lw_shared_ptr is copied on other shards, and ends up in memory corruption as seen in #7535 (probably due to lw_shared_ptr._count going out-of-sync when incremented/decremented in parallel on other shards with no synchronization). This was introduced in `289a08072a`. The writer is not needed in the body of this lambda anyway, so it doesn't need to capture it. It is already held by the continuations until the end of the chain. Fixes #7535 Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com> (cherry picked from commit `f93fb55726`)	2021-03-03 21:27:44 +02:00
Hagit Segev	0ac069fdcc	release: prepare for 4.2.4 scylla-4.2.4	2021-03-02 14:52:31 +02:00
Avi Kivity	738f8eaccd	Update seastar submodule * seastar 1266e42c82...0fba7da929 (1): > io_queue: Fix "delay" metrics Fixes #8166.	2021-03-01 13:59:02 +02:00
Avi Kivity	5d32e91e16	Update seastar submodule * seastar f760efe0a0...1266e42c82 (1): > rpc: streaming sink: order outgoing messages Fixes #7552.	2021-03-01 12:22:17 +02:00
Benny Halevy	6c5f6b3f69	large_data_handler: disable deletion of large data entries Currently we decide whether to delete large data entries based on the overall sstable data_size. Since the entries themselves are typically much smaller than the whole sstable (especially cells and rows), this causes overzealous deletions (#7668) and inefficiency in the rows cache due to the large number of range tombstones created. Refs #7575 Test: sstable_3_x_test(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> This patch is targeted for branch-4.3 or earlier. In 4.4, the problem was fixed in #7669, but the fix is out of scope for backporting. Branch: 4.3 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201203130018.1920271-1-bhalevy@scylladb.com> (cherry picked from commit `bb99d7ced6`)	2021-03-01 10:54:33 +02:00
Raphael S. Carvalho	fba26b78d2	sstables: Fix TWCS reshape for windows with at least min_threshold sstables TWCS reshape was silently ignoring windows which contain at least min_threshold sstables (can happen with data segregation). When resizing candidates, size of multi_window was incorrectly used and it was always empty in this path, which means candidates was always cleared. Fixes #8147. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com> (cherry picked from commit `21608bd677`)	2021-02-28 16:43:02 +02:00
Pavel Solodovnikov	06e785994f	large_data_handler: fix segmentation fault when constructing `data_value` from a `nullptr` It turns out that `cql_table_large_data_handler::record_large_rows` and `cql_table_large_data_handler::record_large_cells` were broken for reporting static cells and static rows from the very beginning: In case a large static cell or a large static row is encountered, it tries to execute `db::try_record` with `nullptr` additional values, denoting that there is no clustering key to be recorded. These values are next passed to `qctx.execute_cql()`, which creates `data_value` instances for each statement parameter, hence invoking `data_value(nullptr)`. This uses `const char*` overload which delegates to `std::string_view` ctor overload. It is UB to pass `nullptr` pointer to `std::string_view` ctor. Hence leading to segmentation faults in the aforementioned large data reporting code. What we want here is to make a null `data_value` instead, so just add an overload specifically for `std::nullptr_t`, which will create a null `data_value` with `text` type. A regression test is provided for the issue (written in `cql-pytest` framework). Tests: test/cql-pytest/test_large_cells_rows.py Fixes: #6780 Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com> (cherry picked from commit `219ac2bab5`)	2021-02-23 12:14:12 +02:00
Takuya ASADA	5bc48673aa	scylla_util.py: resolve /dev/root to get actual device on aws When psutil.disk_partitions() reports that / is /dev/root, aws_instance mistakenly reports that the root partition is part of the ephemeral disks, and RAID construction will fail. This prevents the error and reports the correct free disks. Fixes #8055 Closes #8040 (cherry picked from commit `32d4ec6b8a`)	2021-02-21 16:23:45 +02:00
Nadav Har'El	59a01b2981	alternator: fix ValidationException in FilterExpression - and more The first condition expressions we implemented in Alternator were the old "Expected" syntax of conditional updates. That implementation had some specific assumptions on how it handles errors: For example, in the "LT" operator in "Expected", the second operand is always part of the query, so an error in it (e.g., an unsupported type) resulted in a ValidationException error. When we implemented ConditionExpression and FilterExpression, we wrongly used the same functions check_compare(), check_BETWEEN(), etc., to implement them. This results in some inaccurate error handling. The worst example is what happens when you use a FilterExpression with an expression such as "x < y" - this filter is supposed to silently skip items whose "x" and "y" attributes have unsupported or different types, but in our implementation a bad type (e.g., a list) for y resulted in a ValidationException which aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH) we actually noticed the slightly different behavior needed and implemented the same operator twice - with ugly code duplication. But in other operators we missed this problem completely. This patch first adds extensive tests of how the different expressions (Expected, QueryFilter, FilterExpression, ConditionExpression) and the different operators handle various input errors - unsupported types, missing items, incompatible types, etc. Importantly, the tests demonstrate that there is often different behavior depending on whether the bad input comes from the query, or from the item. Some of the new tests fail before this patch, but others pass and were useful to verify that the patch doesn't break anything that already worked correctly previously. As usual, all the tests pass on Cassandra. Finally, this patch fixes all these problems. 
The comparison functions like check_compare() and check_BETWEEN() now not only take the operands, they also take booleans saying if each of the operands came from the query or from an item. The old-syntax caller (Expected or QueryFilter) always says that the first operand is from the item and the second is from the query - but in the new-syntax caller (ConditionExpression or FilterExpression) any or all of the operands can come from the query and need verification. The old duplicated code for check_BEGINS_WITH() - which had a TODO to remove it - is finally removed. Instead we use the same idea of passing booleans saying if each of its operands came from an item or from the query. Fixes #8043 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `653610f4bc`)	2021-02-21 10:06:50 +02:00
Nadav Har'El	5dd49788c1	alternator: fix UpdateItem ADD for non-existent attribute UpdateItem's "ADD" operation usually adds elements to an existing set or adds a number to an existing counter. But it can also be used to create a new set or counter (as if adding to an empty set or zero). We unfortunately did not have a test for this case (creating a new set or counter), and when I wrote such a test now, I discovered the implementation was missing. So this patch adds both the test and the implementation. The new test used to fail before this patch, and passes with it - and passes on DynamoDB. Note that we only had this bug for the newer UpdateItem syntax. For the old AttributeUpdates syntax, we already support ADD actions on missing attributes, and already tested it in test_update_item_add(). I just forgot to test the same thing for the newer syntax, so I missed this bug :-( Fixes #7763. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207085135.2551845-1-nyh@scylladb.com> (cherry picked from commit `a8fdbf31cd`)	2021-02-21 08:58:49 +02:00
Benny Halevy	56cbc9f3ed	stream_session: prepare: fix missing string format argument As seen in mv_populating_from_existing_data_during_node_decommission_test dtest: ``` ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found) ``` Fixes #8067 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com> (cherry picked from commit `d01e7e7b58`)	2021-02-14 13:11:43 +02:00
Avi Kivity	7469896017	table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race on_compaction_completion() updates _sstables_compacted_but_not_deleted through a temporary to avoid an exception causing a partial update: 1. copy _sstables_compacted_but_not_deleted to a temporary 2. update temporary 3. do dangerous stuff 4. move temporary to _sstables_compacted_but_not_deleted This is racy when we have parallel compactions, since step 3 yields. We can have two invocations running in parallel, taking snapshots of the same _sstables_compacted_but_not_deleted in step 1, each modifying it in different ways, and only one of them winning the race and assigning in step 4. With the right timing we can end with extra sstables in _sstables_compacted_but_not_deleted. Before `a5369881b3`, this was a benign race (only resulting in deleted file space not being reclaimed until the service is shut down), but afterwards, extra sstable references result in the service refusing to shut down. This was observed in database_test in debug mode, where the race more or less reliably happens for system.truncated. Fix by using a different method to protect _sstables_compacted_but_not_deleted. We unconditionally update it, and also unconditionally fix it up (on success or failure) using seastar::defer(). The fixup includes a call to rebuild_statistics() which must happen every time we touch the sstable list. Ref #7331. Fixes #8038. BACKPORT NOTES: - Turns out this race prevented deletion of expired sstables because the leaked deleted sstables would be accounted when checking if an expired sstable can be purged. - Switch to unordered_set<>::count() as it's not supported by older compilers. (cherry picked from commit `a43d5079f3`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>	2021-02-14 11:35:57 +02:00
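The four-step copy/update/yield/assign sequence described in the commit above is a classic lost update. The following is a simplified model in Python's `asyncio`, not the Seastar code: `racy_update`, `in_place_update`, and the `state` dictionary are hypothetical names, and the fixup that the real patch performs with `seastar::defer()` is omitted. In this model the lost update shows up as a missing entry, whereas the commit describes extra entries; the mechanism (two snapshots, one winning assignment) is the same.

```python
import asyncio


async def racy_update(state: dict, sstable: str) -> None:
    tmp = set(state["pending"])    # step 1: copy the shared set to a temporary
    tmp.add(sstable)               # step 2: update the temporary
    await asyncio.sleep(0)         # step 3: work that yields to the other task
    state["pending"] = tmp         # step 4: assign back, overwriting the other writer


async def in_place_update(state: dict, sstable: str) -> None:
    state["pending"].add(sstable)  # update the shared set directly
    await asyncio.sleep(0)         # yielding no longer discards the other update


async def _run(update) -> set:
    state = {"pending": set()}
    # Two "compactions" complete concurrently on the same shared state.
    await asyncio.gather(update(state, "sst-a"), update(state, "sst-b"))
    return state["pending"]


def run(update) -> set:
    """Run two concurrent updates and return the resulting shared set."""
    return asyncio.run(_run(update))


if __name__ == "__main__":
    print(sorted(run(racy_update)))      # one of the two updates is lost
    print(sorted(run(in_place_update)))  # both updates survive
```

Both tasks take their snapshot before either assigns, so only one assignment survives in the racy variant.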
Piotr Wojtczak	c7e2711dd4	Validate ascii values when creating from CQL Although the code for it existed already, the validation function hasn't been invoked properly. This change fixes that, adding a validating check when converting from text to specific value type and throwing a marshal exception if some characters are not ASCII. Fixes #5421 Closes #7532 (cherry picked from commit `caa3c471c0`)	2021-02-10 19:37:56 +02:00
Piotr Dulikowski	a2355a35db	hinted handoff: use default timeout for sending orphaned hints This patch causes orphaned hints (hints that were written towards a node that is no longer their replica) to be sent with a default write timeout. This is what is currently done for non-orphaned hints. Previously, the timeout was hardcoded to one hour. This could cause a long delay while shutting down, as hints manager waits until all ongoing hint sending operation finish before stopping itself. Fixes: #7051 (cherry picked from commit `b111fa98ca`)	2021-02-10 10:15:01 +02:00
Piotr Sarna	9e225ab447	Merge 'select_statement: Fix aggregate results on indexed selects (timeouts fixed) ' from Piotr Grabowski Overview Fixes #7355. Before this changes, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below). Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves forward fixing of that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR. It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page). GROUP BY (first commit) Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times): ``` CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck)); CREATE INDEX ks_t on ks.t(v); INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3); INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3); SELECT pk FROM ks.t WHERE v=3 GROUP BY pk; pk ---- 1 1 ``` This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355. Paging (second commit) Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit). The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. 
This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range: ``` CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c)); CREATE INDEX kst_index ON ks.t(c); INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3); INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4); INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5); SELECT COUNT() FROM ks.t WHERE c = 3; count ------- 2 ``` The second commit fixes this issue by properly trimming `row_ranges`. The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`. The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query). The second commit fixes the first two failing examples from issue #7355. Tests (fourth commit) Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s. Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`, `whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios). I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact of those fixed bugs, therefore the tests validate a large number of those scenarios. Configurable internal_paging_size (third commit) Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). 
This change adds new internal_paging_size variable, which is configurable by `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes. Closes #7497 github.com:scylladb/scylla: tests: Add secondary index aggregates tests select_statement: Introduce internal_paging_size select_statement: Fix paging on indexed selects select_statement: Fix GROUP BY on indexed select (cherry picked from commit `8c645f74ce`)	2021-02-08 20:32:36 +02:00
Amnon Heiman	e1205d1d5b	API: Fix aggregation in column_family A few methods in the column_family API were doing the aggregation wrong, specifically for bloom filter disk size. The issue is not always visible; it happens when there are multiple filter files per shard. Fixes #4513 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #8007 (cherry picked from commit `4498bb0a48`)	2021-02-08 17:04:27 +02:00
Avi Kivity	a78402efae	Merge 'Add waiting for flushes on table drops' from Piotr Sarna This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish. Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race: 1. Run a node with `auto_snapshot=false` 2. Schedule a memtable flush (e.g. via nodetool) 3. Get preempted in the middle of the flush 4. Drop the table 5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied) Fixes #7792 Closes #7798 * github.com:scylladb/scylla: database: add flushes to waiting for pending operations table: unify waiting for pending operations database: add a phaser for flush operations database: add waiting for pending streams on table drop (cherry picked from commit `7636799b18`)	2021-02-02 17:23:34 +02:00
Avi Kivity	9fcf790234	row_cache: linearize key in cache_entry::do_read() do_read() does not linearize cache_entry::_key; this can cause a crash with keys larger than 13k. Fixes #7897. Closes #7898 (cherry picked from commit `d508a63d4b`)	2021-01-17 09:30:44 +02:00
Hagit Segev	24346215c2	release: prepare for 4.2.3 scylla-4.2.3	2021-01-04 19:51:12 +02:00
Benny Halevy	918ec5ecb3	compaction: compaction_writer: destroy shared_sstable after the sstable_writer sstable_writer may depend on the sstable throughout its whole lifecycle. If the sstable is freed before the sstable_writer we might hit use-after-free as in the follwing case: ``` std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214 sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519 seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:? (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432 seastar::output_stream<char>::flush() at table.cc:? seastar::output_stream<char>::close() at table.cc:? 
sstables::file_writer::close() at sstables.cc:? sstables::mc::writer::~writer() at writer.cc:? (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790 sstables::mc::writer::~writer() at writer.cc:? flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:? (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>) const at /usr/include/c++/10/bits/unique_ptr.h:85 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, 
std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201 auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, 
noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612 ``` What happens here is that: compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc) : _out(std::move(out)) , _compression_metadata(cm) , _offsets(_compression_metadata->offsets.get_writer()) , _compression(lc) , _full_checksum(ChecksumType::init_checksum()) _compression_metadata points to a buffer held by the sstable object. and _compression_metadata->offsets.get_writer returns a writer that keeps a reference to the segmented_offsets in the sstables::compression that is used in the ~writer -> close path. Fixes #7821 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com> (cherry picked from commit `8a745a0ee0`)	2021-01-04 15:04:34 +02:00
Avi Kivity	7457683328	Revert "Merge 'Move temporaries to value view' from Piotr S" This reverts commit `d1fa0adcbe`. It causes a regression when processing some bind variables. Fixes #7761.	2020-12-24 12:40:46 +02:00
Gleb Natapov	567889d283	mutation_writer: pass exceptions through feed_writer feed_writer() eats exception and transforms it into an end of stream instead. Downstream validators hate when this happens. Fixes #7482 Message-Id: <20201216090038.GB3244976@scylladb.com> (cherry picked from commit `61520a33d6`)	2020-12-16 17:20:11 +02:00
Aleksandr Bykov	c605ed73bf	dist: scylla_util: fix aws_instance.ebs_disks method aws_instance.ebs_disks() method should return ebs disk instead of ephemeral Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com> Closes #7780 (cherry picked from commit `e74dc311e7`)	2020-12-16 11:58:47 +02:00
Takuya ASADA	d0530d8ac2	node_exporter_install: stop service before force installing Stop node-exporter.service before re-install it, to avoid 'Text file busy' error. Fixes #6782 (cherry picked from commit `ef05ea8e91`)	2020-12-15 16:28:25 +02:00
Hagit Segev	696ef24226	release: prepare for 4.2.2 scylla-4.2.2	2020-12-13 20:34:03 +02:00
Avi Kivity	b8fe144301	dist: rpm: uninstall tuned when installing scylla-kernel-conf tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns and other sysctl tunables that we so laboriously tuned, dropping performance by a factor of 5 (due to increased latency). Fix by obsoleting tuned during install (in effect, we are a better tuned, at least for us). Not needed for .deb, since Debian/Ubuntu do not install tuned by default. Fixes #7696 Closes #7776 (cherry picked from commit `615b8e8184`)	2020-12-12 14:30:38 +02:00
Nadav Har'El	62f783be87	alternator: fix broken Scan/Query paging with bytes keys When an Alternator table has partition keys or sort keys of type "bytes" (blobs), a Scan or Query which required paging used to fail - we used an incorrect function to output LastEvaluatedKey (which tells the user where to continue at the next page), and this incorrect function was correct for strings and numbers - but NOT for bytes (for bytes, we need to encode them as base-64). This patch also includes two tests - for bytes partition key and for bytes sort key - that failed before this patch and now pass. The test test_fetch_from_system_tables also used to fail after a Limit was added to it, because one of the tables it scans had a bytes key. That test is also fixed by this patch. Fixes #7768 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207175957.2585456-1-nyh@scylladb.com> (cherry picked from commit `86779664f4`)	2020-12-09 15:16:41 +02:00
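The encoding rule behind the commit above can be sketched as follows: in DynamoDB-style JSON, string and number key values are emitted as text, while bytes values must be base64-encoded. This is an illustrative Python sketch, not Alternator's C++ code; `encode_key_attribute` is a hypothetical name.

```python
import base64


def encode_key_attribute(value):
    """Encode one key value as a DynamoDB-style typed JSON attribute.

    Hypothetical helper for illustration: bytes become {"B": <base64 text>},
    strings become {"S": ...}, and numbers become {"N": <decimal string>}.
    """
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes are not valid JSON text, so they are base64-encoded.
        return {"B": base64.b64encode(bytes(value)).decode("ascii")}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Numbers travel as strings in this JSON format.
        return {"N": str(value)}
    raise TypeError(f"unsupported key type: {type(value).__name__}")


if __name__ == "__main__":
    # A LastEvaluatedKey-like structure with a bytes partition key.
    print({"p": encode_key_attribute(b"\x00\xff"), "c": encode_key_attribute("row1")})
```

Treating a bytes key like a string, as the buggy path effectively did, would skip the base64 step and produce a key the client cannot send back for the next page.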
Piotr Sarna	863e784951	db: fix getting local ranges for size estimates table When getting local ranges, an assumption is made that if a range does not contain an end or when its end is a maximum token, then it must contain a start. This assumption proved untrue during manual tests, so it's now fortified with an additional check. Here's a gdb output for a set of local ranges which causes an assertion failure when calling `get_local_ranges` on it: (gdb) p ranges $1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = { _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = { _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}} Closes #7764 (cherry picked from commit `1cc4ed50c1`)	2020-12-09 15:16:14 +02:00
Nadav Har'El	e5a6199b4d	alternator, test: make test_fetch_from_system_tables faster The test test_fetch_from_system_tables tests Alternator's system-table feature by reading from all system tables. The intention was to confirm we don't crash reading any of them - as they have different schemas and can run into different problems (we had such problems in the initial implementation). The intention was not to read a lot from each table - we only make a single "Scan" call on each, to read one page of data. However, the Scan call did not set a Limit, so the single page can get pretty big. This is not normally a problem, but in extremely slow runs - such as when running the debug build on an extremely overcommitted test machine (e.g., issue #7706) reading this large page may take longer than our default timeout. I'll send a separate patch for the timeout issue, but for now, there is really no reason why we need to read a big page. It is good enough to just read 50 rows (with Limit=50). This will still read all the different types and make the test faster. As an example, in the debug run on my laptop, this test spent 2.4 seconds to read the "compaction_history" table before this patch, and only 0.1 seconds after this patch. 2.4 seconds is close to our default timeout (10 seconds), 0.1 is very far. Fixes #7706 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207075112.2548178-1-nyh@scylladb.com> (cherry picked from commit `220d6dde17`)	2020-12-09 15:15:15 +02:00
Nadav Har'El	abaf6c192a	alternator: fix query with both projection and filtering We had a bug when a Query/Scan had both projection (ProjectionExpression or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter). The problem was that projection left only the requested attributes, and the filter might have needed - and not got - additional attributes. The solution in this patch is to add the generated JSON item also the extra attributes needed by filtering (if any), run the filter on that, and only at the end remove the extra filtering attributes from the item to be returned. The two tests test_query_filter.py::test_query_filter_and_attributes_to_get test_filter_expression.py::test_filter_expression_and_projection_expression Which failed before this patch now pass so we drop their "xfail" tag. Fixes #6951. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `282742a469`)	2020-12-09 14:39:17 +02:00
Eliran Sinvani	ef2f5ed434	consistency level: fix wrong quorum calculation when RF = 0 We used to calculate the number of endpoints for quorum and local_quorum unconditionally as ((rf / 2) + 1). This formula doesn't take into account the corner case where RF = 0; in this situation quorum should also be 0. This commit adds the missing corner case. Tests: Unit Tests (dev) Fixes #6905 Closes #7296 (cherry picked from commit `925cdc9ae1`)	2020-11-29 16:45:14 +02:00
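The quorum rule described in `ef2f5ed434` can be sketched in a few lines. This is a minimal Python illustration of the formula from the commit message, not the C++ code from the patch; the function name `quorum_for` is hypothetical.

```python
def quorum_for(rf: int) -> int:
    """Replicas needed for quorum at replication factor rf.

    Hypothetical helper: applies the (rf / 2) + 1 formula quoted in the
    commit message, plus the RF = 0 corner case the commit adds.
    """
    if rf == 0:
        # With no replicas there is nothing to wait for.
        return 0
    return rf // 2 + 1


for rf in (0, 1, 2, 3, 5):
    print(f"rf={rf} quorum={quorum_for(rf)}")
```

Without the `rf == 0` branch the formula would return 1 for an empty replica set, which is the case the commit message calls out.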
Raphael S. Carvalho	bac40e2512	sstable_directory: Fix 50% space requirement for resharding This is a regression caused by `aebd965f0`. After the sstable_directory changes, resharding now waits for all sstables to be exhausted before releasing reference to them, which prevents their resources like disk space and fd from being released. Let's restore the old behavior of incrementally releasing resources, reducing the space requirement significantly. Fixes #7463. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com> (cherry picked from commit `6f805bd123`)	2020-11-29 15:26:14 +02:00
Asias He	681c4d77bb	repair: Make repair_writer a shared pointer The future of the fiber that writes data into sstables inside the repair_writer is stored in _writer_done like below: class repair_writer { _writer_done[node_idx] = mutation_writer::distribute_reader_and_consume_on_shards().then([this] { ... }).handle_exception([this] { ... }); } The fiber accesses the repair_writer object in the error handling path. We wait for the _writer_done to finish before we destroy the repair_meta object which contains the repair_writer object to avoid the fiber accessing an already freed repair_writer object. To be safer, we can make repair_writer a shared pointer and take a reference in the distribute_reader_and_consume_on_shards code path. Fixes #7406 Closes #7430 (cherry picked from commit `289a08072a`)	2020-11-29 13:30:49 +02:00
Pavel Emelyanov	8572ee9da2	query_pager: Fix continuation handling for noop visitor Before updating the _last_[cp]key (for subsequent .fetch_page()) the pager checks 'if the pager is not exhausted OR the result has data'. The check seems broken: if the pager is not exhausted, but the result is empty the call for keys will unconditionally try to reference the last element from empty vector. The not exhausted condition for empty result can happen if the short_read is set, which, in turn, unconditionally happens upon meeting partition end when visiting the partition with result builder. The correct check should be 'if the pager is not exhausted AND the result has data': the _last_[pc]key-s should be taken for continuation (not exhausted), but can only be taken if the result is not empty (has data). fixes: #7263 tests: unit(dev), but tests don't trigger this corner case Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200921124329.21209-1-xemul@scylladb.com> (cherry picked from commit `550fc734d9`)	2020-11-29 12:01:37 +02:00
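The difference between the two checks in `8572ee9da2` is easiest to see side by side. The sketch below is a Python stand-in, not the pager's C++ code: the function names are hypothetical, a plain list plays the role of the result rows, and Python's `IndexError` stands in for referencing the last element of an empty vector.

```python
from typing import Optional, Sequence


def last_key_old(exhausted: bool, rows: Sequence[str]) -> Optional[str]:
    # Old check: 'not exhausted OR has data'.
    if not exhausted or len(rows) > 0:
        return rows[-1]  # fails when rows is empty and the pager is not exhausted
    return None


def last_key_fixed(exhausted: bool, rows: Sequence[str]) -> Optional[str]:
    # Fixed check: 'not exhausted AND has data'.
    if not exhausted and len(rows) > 0:
        return rows[-1]
    return None


# The corner case from the commit message: not exhausted, but an empty page.
print(last_key_fixed(False, []))
try:
    last_key_old(False, [])
except IndexError as e:
    print("old check failed:", e)
```

The fixed version only records a continuation key when there is both a reason to continue and a row to take the key from.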
Takuya ASADA	95dbac56e5	install.sh: set PATH for relocatable CLI tools in python thunk We currently set PATH for relocatable CLI tools in scylla_util.run() and scylla_util.out(), but it doesn't work for perftune.py, since it's not part of Scylla, does not use scylla_util module. We can set PATH in python thunk instead, it can set PATH for all python scripts. Fixes #7350 (cherry picked from commit `5867af4edd`)	2020-11-29 11:54:42 +02:00
Bentsi Magidovich	eeadeff0dc	scylla_util.py: fix exception handling in curl Retry mechanism didn't work when URLError happened. For example: urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable> Let's catch URLError instead of HTTPError since URLError is a base exception for all exceptions in the urllib module. Fixes: #7569 Closes #7567 (cherry picked from commit `956b97b2a8`)	2020-11-29 11:48:30 +02:00
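A retry loop along the lines described in `eeadeff0dc` might look like the sketch below. It is not the code from `scylla_util.py`: the names `fetch_with_retry` and `_default_fetch` and the `fetch` parameter are hypothetical, and the `fetch` parameter exists only so the loop can be exercised without a network. The one fact it relies on is that `urllib.error.HTTPError` is a subclass of `urllib.error.URLError`, so catching the latter covers both.

```python
import time
import urllib.error
import urllib.request


def _default_fetch(url: str) -> bytes:
    # Plain urlopen; raises URLError (or its subclass HTTPError) on failure.
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=_default_fetch, max_retries=5, retry_interval=0.0):
    """Call fetch(url), retrying on any URLError up to max_retries attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for _ in range(max_retries):
        try:
            return fetch(url)
        except urllib.error.URLError as e:
            # Also catches HTTPError, since it derives from URLError.
            last_error = e
            time.sleep(retry_interval)
    raise last_error
```

Catching only `HTTPError` would let a plain `URLError` such as "Network is unreachable" escape on the first attempt, which matches the failure the commit message describes.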
Takuya ASADA	62f3caab18	dist/redhat: packaging dependencies.conf as normal file, not ghost When we introduced dependencies.conf, we mistakenly added it on rpm as %ghost, but it should be normal file, should be installed normally on package installation. Fixes #7703 Closes #7704 (cherry picked from commit `ba4d54efa3`)	2020-11-29 11:40:22 +02:00
Takuya ASADA	1a4869231a	install.sh: apply sysctl.d files on non-packaging installation We don't apply sysctl.d files on non-packaging installation, apply them just like rpm/deb taking care of that. Fixes #7702 Closes #7705 (cherry picked from commit `5f81f97773`)	2020-11-29 11:35:37 +02:00
Avi Kivity	3568d0cbb6	dist: sysctl: configure more inotify instances Since `f3bcd4d205` ("Merge 'Support SSL Certificate Hot Reloading' from Calle"), we reload certificates as they are modified on disk. This uses inotify, which is limited by a sysctl fs.inotify.max_user_instances, with a default of 128. This is enough for 64 shards only, if both rpc and cql are encrypted; above that startup fails. Increase to 1200, which is enough for 6 instances * 200 shards. Fixes #7700. Closes #7701 (cherry picked from commit `390e07d591`)	2020-11-29 11:04:45 +02:00
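The numbers in `3568d0cbb6` follow from simple division. The sketch below assumes two inotify instances per shard when both rpc and cql are encrypted; that figure is inferred from "128 is enough for 64 shards" and is not stated in the commit message. The helper name `max_shards` is hypothetical.

```python
def max_shards(inotify_limit: int, instances_per_shard: int) -> int:
    """Shards that fit under fs.inotify.max_user_instances."""
    return inotify_limit // instances_per_shard


# Default limit, assuming 2 instances per shard (rpc + cql, both encrypted).
print(max_shards(128, 2))    # 64
# New limit from the commit: 6 instances * 200 shards.
print(max_shards(1200, 6))   # 200
```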
Raphael S. Carvalho	030c2e3270	compaction: Make sure a partition is filtered out only by producer If interposer consumer is enabled, partition filtering will be done by the consumer instead, but that's not possible because only the producer is able to skip to the next partition if the current one is filtered out, so scylla crashes when that happens with a bad function call in queue_reader. This is a regression which started here: `55a8b6e3c9` To fix this problem, let's make sure that partition filtering will only happen on the producer side. Fixes #7590. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com> (cherry picked from commit `13fa2bec4c`)	2020-11-19 14:08:25 +02:00
Piotr Dulikowski	37a5e9ab15	hints: don't read hint files when it's not allowed to send When there are hint files to be sent and the target endpoint is DOWN, end_point_hints_manager works in the following loop: - It reads the first hint file in the queue, - For each hint in the file it decides that it won't be sent because the target endpoint is DOWN, - After realizing that there are some unsent hints, it decides to retry this operation after sleeping 1 second. This causes the first segment to be wholly read over and over again, with 1 second pauses, until the target endpoint becomes UP or leaves the cluster. This causes unnecessary I/O load in the streaming scheduling group. This patch adds a check which prevents end_point_hints_manager from reading the first hint file at all when it is not allowed to send hints. First observed in #6964 Tests: - unit(dev) - hinted handoff dtests Closes #7407 (cherry picked from commit `77a0f1a153`)	2020-11-16 14:30:07 +02:00
Botond Dénes	a15b5d514d	mutation_reader: queue_reader: don't set EOS flag on abort If the consumer happens to check the EOS flag before it hits the exception injected by the abort (by calling fill_buffer()), they can think the stream ended normally and expect it to be valid. However this is not guaranteed when the reader is aborted. To avoid consumers falsely thinking the stream ended normally, don't set the EOS flag on abort at all. Additionally make sure the producer is aborted too on abort. In theory this is not needed as they are the one initiating the abort, but better to be safe than sorry. Fixes: #7411 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20201102100732.35132-1-bdenes@scylladb.com> (cherry picked from commit `f5323b29d9`)	2020-11-15 11:07:38 +02:00
Botond Dénes	064f8f8bcf	types: validate(): linearize values lazily Instead of eagerly linearizing all values as they are passed to validate(), defer linearization to those validators that actually need linearized values. Linearizing large values puts pressure on the memory allocator with large contiguous allocation requests. This is something we are trying to actively avoid, especially if it is not really needed. Turns out the types whose validators really want linearized values are a minority, as most validators just look at the size of the value, and some like bytes don't need validation at all, while usually having large values. This is achieved by templating the validator struct on the view and using the FragmentedRange concept to treat all passed in views (`bytes_view` and `fragmented_temporary_buffer_view`) uniformly. This patch makes no attempt at converting existing validators to work with fragmented buffers, only trivial cases are converted. The major offenders still left are ascii/utf8 and collections. Fixes: #7318 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20201007054524.909420-1-bdenes@scylladb.com> (cherry picked from commit `db56ae695c`)	2020-11-11 10:55:54 +02:00
Amnon Heiman	04fe0a7395	scyllatop/livedata.py: Safe iteration over metrics This patch changes the code that iterates over the metrics to use a copy of the metrics names to make it safe to remove the metrics from the metrics object. Fixes #7488 Signed-off-by: Amnon Heiman <amnon@scylladb.com> (cherry picked from commit `52db99f25f`)	2020-11-08 19:16:13 +02:00
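The pattern described in `04fe0a7395`, iterating over a copy of the names so that entries can be removed during the loop, can be sketched as follows. This is a generic illustration, not the code from `scyllatop/livedata.py`; `prune_metrics` and the `is_stale` predicate are hypothetical.

```python
from typing import Callable, Dict


def prune_metrics(metrics: Dict[str, float], is_stale: Callable[[float], bool]) -> None:
    """Remove stale entries from metrics in place."""
    # list(...) snapshots the names, so deleting from the dict inside the
    # loop does not raise "dictionary changed size during iteration".
    for name in list(metrics.keys()):
        if is_stale(metrics[name]):
            del metrics[name]


live = {"reads": 10.0, "writes": 0.0, "errors": 0.0}
prune_metrics(live, lambda value: value == 0.0)
print(live)
```

Iterating over `metrics.keys()` directly while deleting would raise a `RuntimeError` in Python 3, which is why the snapshot is taken first.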
Calle Wilund	ef26d90868	partition_version: Change range_tombstones() to return chunked_vector Refs #7364 The number of tombstones can be large. As a stopgap measure to just returning a source range (with keepalive), we can at least alleviate the problem by using a chunked vector. Closes #7433 (cherry picked from commit `4b65d67a1a`)	2020-11-08 14:38:32 +02:00
Tomasz Grabiec	790f51c210	sstables: ka/la: Fix abort when next_partition() is called with certain reader state Cleanup compaction is using consume_pausable_in_thread() to skip over disowned partitions, which uses flat_mutation_reader::next_partition(). The implementation of next_partition() for the sstable reader has a bug which may cause the following assertion failure: scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed. This happens when the sstable reader's buffer gets full when we reach the partition end. The last fragment of the partition won't be pushed into the buffer but will stay in the _ready variable. When next_partition() is called in this state, _ready will not be cleared and the fragment will be carried over to the next partition. This will cause assertion failure when the reader attempts to emit the first fragment of the next partition. The fix is to clear _ready when entering a partition, just like we clear _range_tombstones there. Fixes #7553. Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `fb9b5cae05`)	2020-11-08 14:25:47 +02:00
Yaron Kaikov	4fb8ebccff	release: prepare for 4.2.1 scylla-4.2.1	2020-11-08 12:41:06 +02:00
Avi Kivity	d1fa0adcbe	Merge 'Move temporaries to value view' from Piotr S " Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problems, the storage for temporary values is moved from an external data structure to the value views itself. This way, the temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream. Tests: unit(dev) Fixes https://github.com/scylladb/scylla/issues/7019 " `7055297649` ("cql3: remove query_options::linearize and _temporaries") is reverted from this backport since linearize() is still used in this branch. * psarna-move_temporaries_to_value_view: cql3: remove query_options::linearize and _temporaries cql3: remove make_temporary helper function cql3: store temporaries in-place instead of in query_options cql3: add temporary_value to value view cql3: allow moving data out of raw_value cql3: split values.hh into a .cc file (cherry picked from commit `2b308a973f`)	2020-11-05 19:24:23 +02:00

22694 Commits