gossip: Add an option to force gossip generation

Consider three nodes in the cluster, n1, n2 and n3, with gossip generation numbers g1, g2 and g3. All three run a Scylla version with commit 0a52ecb6df (gossip: Fix max generation drift measure).

One year later, the user wants to upgrade n1, n2 and n3 to a new version. When n3 does a rolling restart with the new version, it uses a new generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 reject n3's gossip update and mark n3 as down.

Such unnecessary marking of a node as down can cause availability issues. For example:

DC1: n1, n2
DC2: n3, n4

When n3 and n4 restart, n1 and n2 mark n3 and n4 as down, which makes the whole of DC2 unavailable.

To fix this, we can start the restarting node with a gossip generation that stays within MAX_GENERATION_DIFFERENCE.

Once all the nodes run the version with commit 0a52ecb6df, the option is no longer needed.

Fixes #5164
(cherry picked from commit 743b529c2b)
[tgrabiec: resolved major conflicts in config.hh]
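
The rejection described in this message can be illustrated with a minimal sketch. It assumes that generations are timestamps in seconds and that MAX_GENERATION_DIFFERENCE is one year in seconds; the message names the constant but not its value. `accepts_generation` and `pick_forced_generation` are hypothetical helpers for illustration, not functions from gms/gossiper.cc.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: one year in seconds. The commit message does not state the value.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400LL * 365;

// Hypothetical check: a peer that knows `known_generation` accepts
// `remote_generation` only if it is not too far ahead.
inline bool accepts_generation(int64_t known_generation, int64_t remote_generation) {
    return remote_generation - known_generation <= MAX_GENERATION_DIFFERENCE;
}

// Hypothetical helper: pick a generation for the restarting node that a peer
// holding `peer_generation` still accepts. The fresh value is used when it is
// already within range, otherwise the largest value the peer tolerates.
inline int64_t pick_forced_generation(int64_t fresh_generation, int64_t peer_generation) {
    return std::min(fresh_generation, peer_generation + MAX_GENERATION_DIFFERENCE);
}
```

A value chosen this way could then be supplied through the force_gossip_generation option that this commit adds to db/config.hh.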
7 changed files with 52 additions and 13 deletions
--- a/cql3/restrictions/statement_restrictions.cc
+++ b/cql3/restrictions/statement_restrictions.cc
@@ -380,28 +380,45 @@ std::vector<const column_definition*> statement_restrictions::get_column_defs_fo
    if (need_filtering()) {
        auto& sim = db.find_column_family(_schema).get_index_manager();
        auto [opt_idx, _] = find_idx(sim);
-        auto column_uses_indexing = [&opt_idx] (const column_definition* cdef) {
-            return opt_idx && opt_idx->depends_on(*cdef);
+        auto column_uses_indexing = [&opt_idx] (const column_definition* cdef, ::shared_ptr<single_column_restriction> restr) {
+            return opt_idx && restr && restr->is_supported_by(*opt_idx);
        };
+        auto single_pk_restrs = dynamic_pointer_cast<single_column_partition_key_restrictions>(_partition_key_restrictions);
        if (_partition_key_restrictions->needs_filtering(*_schema)) {
            for (auto&& cdef : _partition_key_restrictions->get_column_defs()) {
-                if (!column_uses_indexing(cdef)) {
+                ::shared_ptr<single_column_restriction> restr;
+                if (single_pk_restrs) {
+                    auto it = single_pk_restrs->restrictions().find(cdef);
+                    if (it != single_pk_restrs->restrictions().end()) {
+                        restr = dynamic_pointer_cast<single_column_restriction>(it->second);
+                    }
+                }
+                if (!column_uses_indexing(cdef, restr)) {
                    column_defs_for_filtering.emplace_back(cdef);
                }
            }
        }
+        auto single_ck_restrs = dynamic_pointer_cast<single_column_clustering_key_restrictions>(_clustering_columns_restrictions);
        const bool pk_has_unrestricted_components = _partition_key_restrictions->has_unrestricted_components(*_schema);
        if (pk_has_unrestricted_components || _clustering_columns_restrictions->needs_filtering(*_schema)) {
            column_id first_filtering_id = pk_has_unrestricted_components ? 0 : _schema->clustering_key_columns().begin()->id +
                    _clustering_columns_restrictions->num_prefix_columns_that_need_not_be_filtered();
            for (auto&& cdef : _clustering_columns_restrictions->get_column_defs()) {
-                if (cdef->id >= first_filtering_id && !column_uses_indexing(cdef)) {
+                ::shared_ptr<single_column_restriction> restr;
+                if (single_ck_restrs) {
+                    auto it = single_ck_restrs->restrictions().find(cdef);
+                    if (it != single_ck_restrs->restrictions().end()) {
+                        restr = dynamic_pointer_cast<single_column_restriction>(it->second);
+                    }
+                }
+                if (cdef->id >= first_filtering_id && !column_uses_indexing(cdef, restr)) {
                    column_defs_for_filtering.emplace_back(cdef);
                }
            }
        }
        for (auto&& cdef : _nonprimary_key_restrictions->get_column_defs()) {
-            if (!column_uses_indexing(cdef)) {
+            auto restr = dynamic_pointer_cast<single_column_restriction>(_nonprimary_key_restrictions->get_restriction(*cdef));
+            if (!column_uses_indexing(cdef, restr)) {
                column_defs_for_filtering.emplace_back(cdef);
            }
        }
--- a/db/config.hh
+++ b/db/config.hh
@@ -735,6 +735,7 @@ public:
    val(shutdown_announce_in_ms, uint32_t, 2 * 1000, Used, "Time a node waits after sending gossip shutdown message in milliseconds. Same as -Dcassandra.shutdown_announce_in_ms in cassandra.") \
    val(developer_mode, bool, false, Used, "Relax environment checks. Setting to true can reduce performance and reliability significantly.") \
    val(skip_wait_for_gossip_to_settle, int32_t, -1, Used, "An integer to configure the wait for gossip to settle. -1: wait normally, 0: do not wait at all, n: wait for at most n polls. Same as -Dcassandra.skip_wait_for_gossip_to_settle in cassandra.") \
+    val(force_gossip_generation, int32_t, -1, Used, "Force gossip to use the generation number provided by user") \
    val(experimental, bool, false, Used, "Set to true to unlock experimental features.") \
    val(lsa_reclamation_step, size_t, 1, Used, "Minimum number of segments to reclaim in a single step") \
    val(prometheus_port, uint16_t, 9180, Used, "Prometheus port, set to zero to disable") \
--- a/dist/redhat/scylla.spec.mustache
+++ b/dist/redhat/scylla.spec.mustache
@@ -15,6 +15,10 @@ Obsoletes:	scylla-server < 1.1
 %global __brp_python_bytecompile %{nil}
 %global __brp_mangle_shebangs %{nil}

+# Prevent find-debuginfo.sh from tampering with scylla's build-id (#5881)
+%undefine _unique_build_ids
+%global _no_recompute_build_ids 1
+
 %description
 Scylla is a highly scalable, eventually consistent, distributed,
 partitioned row DB.
--- a/gms/gossiper.cc
+++ b/gms/gossiper.cc
@@ -1612,11 +1612,15 @@ future<> gossiper::start_gossiping(int generation_nbr, std::map<application_stat
    // message on all cpus and forward them to cpu0 to process.
    return get_gossiper().invoke_on_all([do_bind] (gossiper& g) {
        g.init_messaging_service_handler(do_bind);
-    }).then([this, generation_nbr, preload_local_states] {
+    }).then([this, generation_nbr, preload_local_states] () mutable {
        build_seeds_list();
-        /* initialize the heartbeat state for this localEndpoint */
-        maybe_initialize_local_state(generation_nbr);
+        if (_cfg.force_gossip_generation() > 0) {
+            generation_nbr = _cfg.force_gossip_generation();
+            logger.warn("Use the generation number provided by user: generation = {}", generation_nbr);
+        }
        endpoint_state& local_state = endpoint_state_map[get_broadcast_address()];
+        local_state.set_heart_beat_state_and_update_timestamp(heart_beat_state(generation_nbr));
+        local_state.mark_alive();
        for (auto& entry : preload_local_states) {
            local_state.add_application_state(entry.first, entry.second);
        }
@@ -1820,7 +1824,8 @@ future<> gossiper::do_stop_gossiping() {
        if (my_ep_state && !is_silent_shutdown_state(*my_ep_state)) {
            logger.info("Announcing shutdown");
            add_local_application_state(application_state::STATUS, _value_factory.shutdown(true)).get();
-            for (inet_address addr : _live_endpoints) {
+            auto live_endpoints = _live_endpoints;
+            for (inet_address addr : live_endpoints) {
                msg_addr id = get_msg_addr(addr);
                logger.trace("Sending a GossipShutdown to {}", id);
                ms().send_gossip_shutdown(id, get_broadcast_address()).then_wrapped([id] (auto&&f) {
--- a/locator/simple_strategy.cc
+++ b/locator/simple_strategy.cc
@@ -53,13 +53,13 @@ std::vector<inet_address> simple_strategy::calculate_natural_endpoints(const tok
    endpoints.reserve(replicas);

    for (auto& token : tm.ring_range(t)) {
+        if (endpoints.size() == replicas) {
+           break;
+        }
        auto ep = tm.get_endpoint(token);
        assert(ep);

        endpoints.push_back(*ep);
-        if (endpoints.size() == replicas) {
-           break;
-        }
    }

    return std::move(endpoints.get_vector());
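
To illustrate why the position of the size check matters when RF=0, here is a self-contained sketch of the two loop shapes. It uses a plain `std::vector<int>` in place of the token ring and endpoint types and ignores de-duplication of endpoints; `select_check_after` and `select_check_before` are hypothetical names, not code from locator/simple_strategy.cc.

```cpp
#include <cstddef>
#include <vector>

// Old shape: push first, compare afterwards. With replicas == 0 the size is
// already 1 at the first comparison, so the loop never breaks early.
inline std::vector<int> select_check_after(const std::vector<int>& ring, std::size_t replicas) {
    std::vector<int> endpoints;
    for (int ep : ring) {
        endpoints.push_back(ep);
        if (endpoints.size() == replicas) {
            break;
        }
    }
    return endpoints;
}

// New shape: compare before pushing, so replicas == 0 yields an empty list.
inline std::vector<int> select_check_before(const std::vector<int>& ring, std::size_t replicas) {
    std::vector<int> endpoints;
    for (int ep : ring) {
        if (endpoints.size() == replicas) {
            break;
        }
        endpoints.push_back(ep);
    }
    return endpoints;
}
```

For any non-zero replication factor both shapes return the same prefix of the ring; they differ only for zero.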
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -1440,7 +1440,8 @@ future<> storage_service::drain_on_shutdown() {
            ss._sys_dist_ks.invoke_on_all(&db::system_distributed_keyspace::stop).get();
            slogger.info("Drain on shutdown: system distributed keyspace stopped");

-            get_storage_proxy().invoke_on_all([&ss] (storage_proxy& local_proxy) mutable {
+            get_storage_proxy().invoke_on_all([] (storage_proxy& local_proxy) mutable {
+                auto& ss = service::get_local_storage_service();
                ss.unregister_subscriber(&local_proxy);
                return local_proxy.drain_on_shutdown();
            }).get();
--- a/utils/logalloc.cc
+++ b/utils/logalloc.cc
@@ -2065,6 +2065,17 @@ bool segment_pool::migrate_segment(segment* src, segment* dst)
 #endif

 void tracker::impl::register_region(region::impl* r) {
+    // If needed, increase capacity of regions before taking the reclaim lock,
+    // to avoid failing an allocation when push_back() tries to increase
+    // capacity.
+    //
+    // The capacity increase is atomic (wrt _regions) so it cannot be
+    // observed
+    if (_regions.size() == _regions.capacity()) {
+        auto copy = _regions;
+        copy.reserve(copy.capacity() * 2);
+        _regions = std::move(copy);
+    }
    reclaiming_lock _(*this);
    _regions.push_back(r);
    llogger.debug("Registered region @{} with id={}", r, r->id());
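
The capacity-growing step in the logalloc hunk can be shown in isolation. The sketch below is a hypothetical helper, `grow_if_full`, that follows the same copy, reserve and move-assign sequence on a plain `std::vector`; it is not code from utils/logalloc.cc, and it additionally guards the zero-capacity case, which the hunk does not need to handle explicitly.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// If the vector is full, build a larger copy and move it into place, so that
// the next push_back() does not have to allocate. In the patch this happens
// before the reclaim lock is taken; the lock itself is omitted here.
template <typename T>
void grow_if_full(std::vector<T>& v) {
    if (v.size() == v.capacity()) {
        auto copy = v;
        // Assumption for this sketch: request at least one slot so that an
        // empty vector also ends up with spare capacity.
        copy.reserve(std::max<std::size_t>(1, copy.capacity() * 2));
        v = std::move(copy);
    }
}
```

After the call, a single push_back() appends into the already reserved storage, which is the property the patch relies on while the reclaim lock is held.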
Author	SHA1	Message	Date
Asias He	9b46b9f1a8	gossip: Add an option to force gossip generation Consider three nodes in the cluster, n1, n2 and n3, with gossip generation numbers g1, g2 and g3. All three run a Scylla version with commit `0a52ecb6df` (gossip: Fix max generation drift measure). One year later, the user wants to upgrade n1, n2 and n3 to a new version. When n3 does a rolling restart with the new version, it uses a new generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 reject n3's gossip update and mark n3 as down. Such unnecessary marking of a node as down can cause availability issues. For example: DC1: n1, n2 DC2: n3, n4 When n3 and n4 restart, n1 and n2 mark n3 and n4 as down, which makes the whole of DC2 unavailable. To fix this, we can start the restarting node with a gossip generation that stays within MAX_GENERATION_DIFFERENCE. Once all the nodes run the version with commit `0a52ecb6df`, the option is no longer needed. Fixes #5164 (cherry picked from commit `743b529c2b`) [tgrabiec: resolved major conflicts in config.hh]	2020-03-27 13:08:26 +01:00
Asias He	93da2e2ff0	gossiper: Always use the new generation number A user reported that after a node restart, the restarted node is marked as DOWN by other nodes in the cluster while the node is up and running normally. Consider the following: - n1, n2, n3 in the cluster - n3 shuts itself down - n3 sends the shutdown verb to n1 and n2 - n1 and n2 set n3 in SHUTDOWN status and force the heartbeat version to INT_MAX - n3 restarts - n3 sends gossip shadow rounds to n1 and n2, in storage_service::prepare_to_join - n3 receives the response from n1, in gossiper::handle_ack_msg, since _enabled = false and _in_shadow_round == false, n3 will apply the application state in fiber 1; fiber 1 finishes faster than fiber 2 and sets _in_shadow_round = false - n3 receives the response from n2, in gossiper::handle_ack_msg, since _enabled = false and _in_shadow_round == false, n3 will apply the application state in fiber 2; fiber 2 yields - n3 finishes the shadow round and continues - n3 resets the gossip endpoint_state_map with gossiper.reset_endpoint_state_map() - n3 resumes fiber 2 and applies the application state about n3 into endpoint_state_map; at this point endpoint_state_map contains information from n2, including information about n3 itself - n3 calls gossiper.start_gossiping(generation_number, app_states, ...) with the new generation number generated correctly in storage_service::prepare_to_join, but maybe_initialize_local_state(generation_nbr) will not set the new generation and heartbeat if the endpoint_state_map already contains the node itself - n3 continues with the old generation and heartbeat learned in fiber 2 - n3 continues the gossip loop; in gossiper::run, hbs.update_heart_beat() sets the heartbeat to a number starting from 0 - n1 and n2 will not get updates from n3 because they use the same generation number but n1 and n2 have a larger heartbeat version - n1 and n2 will mark n3 as down even if n3 is alive. To fix, always use the new generation number. Fixes: #5800 Backports: 3.0 3.1 3.2 (cherry picked from commit `62774ff882`)	2020-03-27 12:53:26 +01:00
Piotr Sarna	b764db3f1c	cql: fix qualifying indexed columns for filtering When qualifying columns to be fetched for filtering, we also check whether the target column is used as an index, in which case there is no need to fetch it. However, the check incorrectly assumed that any restriction is eligible for indexing, while this is currently only true for EQ. The fix makes a more specific check and contains many dynamic casts, but these will hopefully be gone once our long-planned "restrictions rewrite" is done. This commit comes with a test. Fixes #5708 Tests: unit(dev) (cherry picked from commit `767ff59418`)	2020-03-22 10:08:48 +01:00
Konstantin Osipov	304d339193	locator: correctly select endpoints if RF=0 SimpleStrategy creates a list of endpoints by iterating over the set of all configured endpoints for the given token, until we reach keyspace replication factor. There is a trivial coding bug when we first add at least one endpoint to the list, and then compare list size and replication factor. If RF=0 this never yields true. Fix by moving the RF check before at least one endpoint is added to the list. Cassandra never had this bug since it uses a less fancy while() loop. Fixes #5962 Message-Id: <20200306193729.130266-1-kostja@scylladb.com> (cherry picked from commit `ac6f64a885`)	2020-03-12 12:10:45 +02:00
Avi Kivity	9f7ba4203d	logalloc: increase capacity of _regions vector outside reclaim lock Reclaim consults the _regions vector, so we don't want it moving around while allocating more capacity. For that we take the reclaim lock. However, that can cause a false-positive OOM during startup: 1. all memory is allocated to LSA as part of priming (`2baa16b371`) 2. the _regions vector is resized from 64k to 128k, requiring a segment to be freed (plenty are free) 3. but reclaiming_lock is taken, so we cannot reclaim anything. To fix, resize the _regions vector outside the lock. Fixes #6003. Message-Id: <20200311091217.1112081-1-avi@scylladb.com> (cherry picked from commit `c020b4e5e2`)	2020-03-12 11:25:50 +02:00
Benny Halevy	8b6a792f81	dist/redhat: scylla.spec.mustache: set _no_recompute_build_ids By default, `/usr/lib/rpm/find-debuginfo.sh` will tamper with the binary's build-id when stripping its debug info, as it is passed the `--build-id-seed <version>.<release>` option. To prevent that, we need to set the following macros: undefine `_unique_build_ids` and set `_no_recompute_build_ids` to 1. Fixes #5881 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `25a763a187`)	2020-03-09 15:22:58 +02:00
Benny Halevy	8a94f6b180	gossiper: do_stop_gossiping: copy live endpoints vector It can be resized asynchronously by mark_dead. Fixes #5701 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200203091344.229518-1-bhalevy@scylladb.com> (cherry picked from commit `f45fabab73`)	2020-02-26 13:00:33 +02:00
Benny Halevy	27209a5b2e	storage_service: drain_on_shutdown: unregister storage_proxy subscribers from local_storage_service Match subscription done in main() and avoid cross shard access to _lifecycle_subscribers vector. Fixes #5385 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Acked-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200123092817.454271-1-bhalevy@scylladb.com> (cherry picked from commit `5b0ea4c114`)	2020-02-25 16:40:31 +02:00
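
The do_stop_gossiping fix (copying the live endpoints before the loop) follows a general pattern that can be sketched outside Seastar. In the sketch below, `erase_endpoint` stands in for whatever shrinks the live list while the loop is running (mark_dead in the commit message), and `notify_all` stands in for the shutdown loop; both names and the use of `std::string` addresses are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a callback that removes an endpoint from the
// live list while the shutdown loop is still in progress.
inline void erase_endpoint(std::vector<std::string>& live, const std::string& addr) {
    live.erase(std::remove(live.begin(), live.end(), addr), live.end());
}

// Calls `notify` once for every endpoint that was live when the loop started.
// Iterating over a snapshot keeps the loop valid even if `notify` modifies
// `live`; iterating over `live` directly would invalidate the loop iterators.
inline std::size_t notify_all(
        std::vector<std::string>& live,
        const std::function<void(std::vector<std::string>&, const std::string&)>& notify) {
    const auto snapshot = live; // copy first, as the patch does
    std::size_t sent = 0;
    for (const auto& addr : snapshot) {
        notify(live, addr);
        ++sent;
    }
    return sent;
}
```

The copy costs one allocation per shutdown, which is negligible next to sending a message to every live endpoint.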