schema.cc/describe: fix invalid compaction options in schema

There is a typo in schema.cql of snapshot, lack of comma after compaction strategy. It will fail to restore schema by the file. AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'} map_as_cql_param() function has a `first` parameter to smartly add comma, the compaction_strategy_options is always not the first. Fixes #7741 Signed-off-by: Amos Kong <amos@scylladb.com> Closes #7734 (cherry picked from commit 6b1659ee80)
sstable: writer: ka/la: Write row marker cell after row tombstone
2021-03-24 12:58:11 +02:00 · 2021-03-24 10:42:11 +02:00 · 2021-03-21 10:51:36 +02:00 · 2021-03-18 19:20:10 +02:00 · 2021-03-18 14:29:38 +02:00 · 2021-03-11 08:24:56 +02:00
7 changed files with 19 additions and 16 deletions
--- a/repair/row_level.cc
+++ b/repair/row_level.cc
@@ -519,7 +519,7 @@ public:
        table& t = db.local().find_column_family(_schema->id());
        auto writer = shared_from_this();
        _writer_done[node_idx] = mutation_writer::distribute_reader_and_consume_on_shards(_schema, std::move(queue_reader),
-                [&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions, writer] (flat_mutation_reader reader) {
+                [&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions] (flat_mutation_reader reader) {
            auto& t = db.local().find_column_family(reader.schema());
            return db::view::check_needs_view_update_path(_sys_dist_ks->local(), t, reason).then([t = t.shared_from_this(), estimated_partitions, reader = std::move(reader)] (bool use_view_update_path) mutable {
                //FIXME: for better estimations this should be transmitted from remote
--- a/schema.cc
+++ b/schema.cc
@@ -827,7 +827,7 @@ std::ostream& schema::describe(std::ostream& os) const {
    os << "}";
    os << "\n    AND comment = '" << comment()<< "'";
    os << "\n    AND compaction = {'class': '" <<  sstables::compaction_strategy::name(compaction_strategy()) << "'";
-    map_as_cql_param(os, compaction_strategy_options()) << "}";
+    map_as_cql_param(os, compaction_strategy_options(), false) << "}";
    os << "\n    AND compression = {";
    map_as_cql_param(os,  get_compressor_params().get_options());
    os << "}";
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -2546,7 +2546,7 @@ future<> storage_service::rebuild(sstring source_dc) {
                    slogger.info("Streaming for rebuild successful");
                }).handle_exception([] (auto ep) {
                    // This is used exclusively through JMX, so log the full trace but only throw a simple RTE
-                    slogger.warn("Error while rebuilding node: {}", std::current_exception());
+                    slogger.warn("Error while rebuilding node: {}", ep);
                    return make_exception_future<>(std::move(ep));
                });
            });
--- a/sstables/compaction_manager.cc
+++ b/sstables/compaction_manager.cc
@@ -646,10 +646,11 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
    _tasks.push_back(task);

    auto sstables = std::make_unique<std::vector<sstables::shared_sstable>>(get_func(*cf));
+    auto compacting = make_lw_shared<compacting_sstable_registration>(this, *sstables);
    auto sstables_ptr = sstables.get();
    _stats.pending_tasks += sstables->size();

-    task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr] () mutable {
+    task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr, compacting] () mutable {

        // FIXME: lock cf here
        if (!can_proceed(task)) {
@@ -659,7 +660,7 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
        auto sst = sstables_ptr->back();
        sstables_ptr->pop_back();

-        return repeat([this, task, options, sst = std::move(sst)] () mutable {
+        return repeat([this, task, options, sst = std::move(sst), compacting] () mutable {
            column_family& cf = *task->compacting_cf;
            auto sstable_level = sst->get_sstable_level();
            auto run_identifier = sst->run_identifier();
@@ -667,21 +668,22 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
            auto descriptor = sstables::compaction_descriptor({ sst }, cf.get_sstable_set(), service::get_local_compaction_priority(),
                sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, options);

-            auto compacting = make_lw_shared<compacting_sstable_registration>(this, descriptor.sstables);
            // Releases reference to cleaned sstable such that respective used disk space can be freed.
            descriptor.release_exhausted = [compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
                compacting->release_compacting(exhausted_sstables);
            };

-            _stats.pending_tasks--;
-            _stats.active_tasks++;
-            task->compaction_running = true;
-            compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
-            return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
-                return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)] () mutable {
-                    return cf.run_compaction(std::move(descriptor));
+            return with_semaphore(_rewrite_sstables_sem, 1, [this, task, &cf, descriptor = std::move(descriptor)] () mutable {
+                _stats.pending_tasks--;
+                _stats.active_tasks++;
+                task->compaction_running = true;
+                compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
+                return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
+                    return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)]() mutable {
+                        return cf.run_compaction(std::move(descriptor));
+                    });
                });
-            }).then_wrapped([this, task, compacting = std::move(compacting)] (future<> f) mutable {
+            }).then_wrapped([this, task, compacting] (future<> f) mutable {
                task->compaction_running = false;
                _stats.active_tasks--;
                if (!can_proceed(task)) {
--- a/sstables/compaction_manager.hh
+++ b/sstables/compaction_manager.hh
@@ -110,6 +110,7 @@ private:
    std::unordered_map<column_family*, rwlock> _compaction_locks;

    semaphore _custom_job_sem{1};
+    seastar::named_semaphore _rewrite_sstables_sem = {1, named_semaphore_exception_factory{"rewrite sstables"}};

    std::function<void()> compaction_submission_callback();
    // all registered column families are submitted for compaction at a constant interval.
--- a/sstables/leveled_compaction_strategy.cc
+++ b/sstables/leveled_compaction_strategy.cc
@@ -176,7 +176,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input

    unsigned max_filled_level = 0;

-    size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
+    size_t offstrategy_threshold = (mode == reshape_mode::strict) ? std::max(schema->min_compaction_threshold(), 4) : std::max(schema->max_compaction_threshold(), 32);
    size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
    auto tolerance = [mode] (unsigned level) -> unsigned {
        if (mode == reshape_mode::strict) {
--- a/sstables/sstables.cc
+++ b/sstables/sstables.cc
@@ -1734,8 +1734,8 @@ void sstable::write_collection(file_writer& out, const composite& clustering_key
 void sstable::write_clustered_row(file_writer& out, const schema& schema, const clustering_row& clustered_row) {
    auto clustering_key = composite::from_clustering_element(schema, clustered_row.key());

-    maybe_write_row_marker(out, schema, clustered_row.marker(), clustering_key);
    maybe_write_row_tombstone(out, clustering_key, clustered_row);
+    maybe_write_row_marker(out, schema, clustered_row.marker(), clustering_key);

    if (schema.clustering_key_size()) {
        column_name_helper::min_max_components(schema, _collector.min_column_names(), _collector.max_column_names(),
Author	SHA1	Message	Date
Amos Kong	44ec73cfc4	schema.cc/describe: fix invalid compaction options in schema There is a typo in schema.cql of snapshot, lack of comma after compaction strategy. It will fail to restore schema by the file. AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'} map_as_cql_param() function has a `first` parameter to smartly add comma, the compaction_strategy_options is always not the first. Fixes #7741 Signed-off-by: Amos Kong <amos@scylladb.com> Closes #7734 (cherry picked from commit `6b1659ee80`)	2021-03-24 12:58:11 +02:00
Tomasz Grabiec	df6f9a200f	sstable: writer: ka/la: Write row marker cell after row tombstone Row marker has a cell name which sorts after the row tombstone's start bound. The old code was writing the marker first, then the row tombstone, which is incorrect. This was harmeless to our sstable reader, which recognized both as belonging to the current clustering row fragment, and collects both fine. However, if both atoms trigger creation of promoted index blocks, the writer will create a promoted index with entries wich violate the cell name ordering. It's very unlikely to run into in practice, since to trigger promoted index entries for both atoms, the clustering key would be so large so that the size of the marker cell exceeds the desired promoted index block size, which is 64KB by default (but user-controlled via column_index_size_in_kb option). 64KB is also the limit on clustering key size accepted by the system. This was caught by one of our unit tests: sstable_conforms_to_mutation_source_test ...which runs a battery of mutation reader tests with various desired promoted index block sizes, including the target size of 1 byte, which triggers an entry for every atom. The test started to fail for some random seeds after commit `ecb6abe` inside the test_streamed_mutation_forwarding_is_consistent_with_slicing test case, reporting a mutation mismatch in the following line: assert_that(sliced_m).is_equal_to(fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key())); It compares mutations read from the same sstable using different methods, slicing using clustering key restricitons, and fast forwarding. The reported mismatch was that fwd_m contained the row marker, but sliced_m did not. The sstable does contain the marker, so both reads should return it. After reverting the commit which introduced dynamic adjustments, the test passes, but both mutations are missing the marker, both are wrong! They are wrong because the promoted index contians entries whose starting positions violate the ordering, so binary search gets confused and selects the row tombstone's position, which is emitted after the marker, thus skipping over the row marker. The explanation for why the test started to fail after dynamic adjustements is the following. The promoted index cursor works by incrementally parsing buffers fed by the file input stream. It first parses the whole block and then does a binary search within the parsed array. The entries which cursor touches during binary search depend on the size of the block read from the file. The commit which enabled dynamic adjustements causes the block size to be different for subsequent reads, which allows one of the reads to walk over the corrupted entries and read the correct data by selecting the entry corresponding to the row marker. Fixes #8324 Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com> (cherry picked from commit `9272e74e8c`)	2021-03-24 10:42:11 +02:00
Nadav Har'El	2f4a3c271c	storage_service: correct missing exception in logging rebuild failure When failing to rebuild a node, we would print the error with the useless explanation "<no exception>". The problem was a typo in the logging command which used std::current_exception() - which wasn't relevant in that point - instead of "ep". Refs #8089 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210314113118.1690132-1-nyh@scylladb.com> (cherry picked from commit `d73934372d`)	2021-03-21 10:51:36 +02:00
Raphael S. Carvalho	6a11c20b4a	LCS: reshape: tolerate more sstables in level 0 with relaxed mode Relaxed mode, used during initialization, of reshape only tolerates min_threshold (default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in level 0, otherwise boot will have to reshape level 0 every time it crosses the min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32. This change is beneficial because once table is populated, LCS regular compaction can decide to merge those sstables in level 0 into level 1 instead, therefore reducing WA. Refs #8297. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com> (cherry picked from commit `e53cedabb1`)	2021-03-18 19:20:10 +02:00
Raphael S. Carvalho	cccdd6aaae	compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism Prior to `463d0ab`, only one table could be cleaned up at a time on a given shard. Since then, all tables belonging to a given keyspace are cleaned up in parallel. Cleanup serialization on each shard was enforced with a semaphore, which was incorrectly removed by the patch aforementioned. So space requirement for cleanup to succeed can be up to the size of keyspace, increasing the chances of node running out of space. Node could also run out of memory if there are tons of tables in the keyspace. Memory requirement is at least #_of_tables * 128k (not taking into account write behind, etc). With 5k tables, it's ~0.64G per shard. Also all tables being cleaned up in parallel will compete for the same disk and cpu bandwidth, so making them all much slower, and consequently the operation time is significantly higher. This problem was detected with cleanup, but scrub and upgrade go through the same rewrite procedure, so they're affected by exact the same problem. Fixes #8247. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com> (cherry picked from commit `7171244844`)	2021-03-18 14:29:38 +02:00
Raphael S. Carvalho	92871a88c3	compaction: Prevent cleanup and regular from compacting the same sstable Due to regression introduced by `463d0ab`, regular can compact in parallel a sstable being compacted by cleanup, scrub or upgrade. This redundancy causes resources to be wasted, write amplification is increased and so does the operation time, etc. That's a potential source of data resurrection because the now-owned data from a sstable being compacted by both cleanup and regular will still exist in the node afterwards, so resurrection can happen if node regains ownership. Fixes #8155. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com> (cherry picked from commit `2cf0c4bbf1`) Includes fixup patch: compaction_manager: Fix use-after-free in rewrite_sstables() Use-after-free introduced by `2cf0c4bbf1`. That's because compacting is moved into then_wrapped() lambda, so it's potentially freed on the next iteration of repeat(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com> (cherry picked from commit `f7cc431477`)	2021-03-11 08:24:56 +02:00
Benny Halevy	85bbf6751d	repair: repair_writer: do not capture lw_shared_ptr cross-shard The shared_from_this lw_shared_ptr must not be accessed across shards. Capturing it in the lambda passed to mutation_writer::distribute_reader_and_consume_on_shards causes exactly that since the captured lw_shared_ptr is copied on other shards, and ends up in memory corruption as seen in #7535 (probably due to lw_shared_ptr._count going out-of-sync when incremented/decremented in parallel on other shards with no synchronization. This was introduced in `289a08072a`. The writer is not needed in the body of this lambda anyways so it doesn't need to capture it. It is already held by the continuations until the end of the chain. Fixes #7535 Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com> (cherry picked from commit `f93fb55726`)	2021-03-03 21:27:44 +02:00