In a large cross-DC cluster, the node receiving a gossip SYN message
might be slow to send the gossip ACK message. ACK messages can be
large if the application state payload is big, e.g.,
CACHE_HITRATES with a lot of tables. As a result, an unbounded number
of pending ACK messages can consume an unbounded amount of memory,
which eventually causes OOM.
To fix, this patch queues the SYN message and handles it later if the
previous ACK message is still being sent. However, we only store the
latest SYN message. Since the latest SYN message from a peer carries the
latest information, it is safe to drop the previous SYN message and
keep only the latest one. After this patch, there can be at most 1
pending SYN message and 1 pending ACK message per peer node.
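The queuing rule above can be modelled in a few lines. This is only a sketch of the idea, not the gossiper code: `SynLimiter`, `PeerState`, `on_syn` and `on_ack_sent` are hypothetical names, and messages are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PeerState:
    ack_in_flight: bool = False        # an ACK to this peer is still being sent
    pending_syn: Optional[str] = None  # only the newest unhandled SYN is kept


class SynLimiter:
    """Keeps at most one in-flight ACK and one pending SYN per peer."""

    def __init__(self) -> None:
        self.peers: Dict[str, PeerState] = {}
        # (peer, syn) pairs for which we started sending an ACK
        self.sent_acks: List[Tuple[str, str]] = []

    def on_syn(self, peer: str, syn: str) -> None:
        state = self.peers.setdefault(peer, PeerState())
        if state.ack_in_flight:
            # Overwrite any older pending SYN: the newest one supersedes it.
            state.pending_syn = syn
            return
        self._start_ack(peer, state, syn)

    def on_ack_sent(self, peer: str) -> None:
        # Called when the previous ACK has finished sending.
        state = self.peers[peer]
        state.ack_in_flight = False
        if state.pending_syn is not None:
            syn, state.pending_syn = state.pending_syn, None
            self._start_ack(peer, state, syn)

    def _start_ack(self, peer: str, state: PeerState, syn: str) -> None:
        state.ack_in_flight = True
        self.sent_acks.append((peer, syn))
```

However many SYN messages arrive while an ACK is in flight, memory use per peer stays constant because each new SYN replaces the stored one.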
Similar to "gossip: Limit number of pending gossip ACK messages", limit
the number of pending gossip ACK2 messages in gossiper::handle_ack_msg.
Fixes #5210
Assume nodes n1 and n2 are in a cluster with generation numbers g1 and
g2. The cluster runs for more than 1 year (MAX_GENERATION_DIFFERENCE).
When n1 reboots with generation g1', which is time based, n2 will see
g1' > g2 + MAX_GENERATION_DIFFERENCE and reject n1's gossip update.
To fix, check the generation drift against the generation value this
node would get if it were restarted now.
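A small worked example of the two checks, as a sketch only. The function names are hypothetical, generations are assumed to be Unix timestamps in seconds, and MAX_GENERATION_DIFFERENCE is assumed to be one year expressed in seconds.

```python
# Assumption: one year in seconds, matching "1 year" in the text above.
MAX_GENERATION_DIFFERENCE = 365 * 24 * 60 * 60


def rejected_by_old_check(remote_generation: int, reference_generation: int) -> bool:
    # Old behaviour: compare against a generation picked long ago (g2).
    return remote_generation > reference_generation + MAX_GENERATION_DIFFERENCE


def rejected_by_new_check(remote_generation: int, now_seconds: int) -> bool:
    # New behaviour: compare against the generation this node would get
    # if it restarted right now, i.e. the current time.
    return remote_generation > now_seconds + MAX_GENERATION_DIFFERENCE


g2 = 1_500_000_000                         # n2 started here
now = g2 + 2 * MAX_GENERATION_DIFFERENCE   # the cluster has run for two years
g1_new = now                               # n1 reboots and picks a time-based generation
```

With these numbers the old check rejects the legitimate reboot of n1, while the new check accepts it and still rejects a generation that lies more than a year in the future.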
This is a backport of CASSANDRA-10969.
Fixes #5164
We would like to share with other nodes
the value of ignore_msb_bits property used by the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know ignore_msb_bits property value for each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We would like to share with other nodes
the number of shards available at the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know how many shards are on each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
- Update the outdated comments in do_stop_gossiping. It was
storage_service not storage_proxy that used the lock. More
importantly, storage_service does not use it any more.
- Drop the unused timer_callback_lock and timer_callback_unlock API
- Use with_semaphore to make sure the semaphore usage is balanced.
- Add a log message in gossiper::do_stop_gossiping when it tries to take
the semaphore, to help debug hangs during shutdown.
Refs: #4891
Refs: #4971
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they took the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places that lacked it but otherwise look ok. No functional
changes.
Because inet_address was initially hardcoded to
ipv4, its wire format is not very forward compatible.
Since we potentially need to communicate with older version nodes, we
manually define the new serial format for inet_address to be:
  ipv4: 4 bytes address
  ipv6: 4 bytes marker 0xffffffff (invalid address)
        16 bytes data -> address
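The format above can be illustrated with a short round-trip sketch. The function names are hypothetical, and network byte order for the address bytes is an assumption; the real serializer lives in the C++ code.

```python
import ipaddress

# Marker that cannot be a valid IPv4 peer address (255.255.255.255).
IPV6_MARKER = b"\xff\xff\xff\xff"


def serialize(addr: str) -> bytes:
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return ip.packed              # 4 bytes, same as the old format
    return IPV6_MARKER + ip.packed    # 4-byte marker + 16 bytes of address


def deserialize(buf: bytes):
    """Return (address, number of bytes consumed)."""
    if len(buf) < 4:
        raise ValueError("buffer too short for an address")
    head = buf[:4]
    if head != IPV6_MARKER:
        return ipaddress.ip_address(head), 4
    if len(buf) < 20:
        raise ValueError("buffer too short for an ipv6 address")
    return ipaddress.ip_address(buf[4:20]), 20
```

An old node reading an ipv4 address sees exactly the bytes it always did, which is what keeps the format compatible.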
It can be invoked with a lambda without the ceremony of creating a
class deriving from gms::feature::listener.
The returned registration object controls the listener's scope.
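The pattern of a callable listener plus a registration object that bounds its lifetime might look like the following. This is a Python analogy, not the gms API: `Feature`, `Registration`, `when_enabled` and `cancel` are all hypothetical names chosen for the illustration.

```python
from typing import Callable, List


class Registration:
    """Keeps a listener registered until cancel() is called or the scope exits."""

    def __init__(self, listeners: List[Callable[[], None]], callback: Callable[[], None]) -> None:
        self._listeners = listeners
        self._callback = callback
        listeners.append(callback)

    def cancel(self) -> None:
        if self._callback in self._listeners:
            self._listeners.remove(self._callback)

    def __enter__(self) -> "Registration":
        return self

    def __exit__(self, *exc) -> None:
        self.cancel()


class Feature:
    def __init__(self, name: str) -> None:
        self.name = name
        self.enabled = False
        self._listeners: List[Callable[[], None]] = []

    def when_enabled(self, callback: Callable[[], None]) -> Registration:
        # Accepts any callable, e.g. a lambda; no listener subclass is needed.
        return Registration(self._listeners, callback)

    def enable(self) -> None:
        if self.enabled:
            return
        self.enabled = True
        for callback in list(self._listeners):
            callback()
```

A listener whose registration has already gone out of scope is not called when the feature is enabled later.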
Store the failure_detector object inside the gossiper object.
- The global sharded<failure_detector> object is gone.
- There is no need to initialize sharded<failure_detector> manually, which
simplifies the code in tests/cql_test_env.cc and init.cc.
n1, n2, n3 in the cluster,
shutdown n1, n2, n3
start n1, n2
start n3: we saw features being enabled from the system table even though n1 and n2 were already up and running in the cluster.
INFO 2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
The problem is that we enable the features too early in the startup process.
We should enable features after gossip has settled.
Fixes #4289
Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>
Three nodes in the cluster node1, node2, node3
Shutdown the whole cluster
Start node1
Start node2, node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't started yet, so node1 sees node3 with
empty features. In get_supported_features(), an empty set of common
features is returned as soon as a node with empty features is seen. To
fix, we should fall back to the features saved in the system table.
Start node3, node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't inserted its own features into the gossip
endpoint_state_map. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix, we should fall back to
the features stored in the system table for such a node in this case.
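Both cases reduce to the same rule: when gossip has no features for a node, use what the system table remembers. A minimal sketch, assuming features are modelled as sets of strings keyed by node name (`common_features` and its parameters are hypothetical, not the signature of get_supported_features()):

```python
from typing import Dict, Set


def common_features(gossip_features: Dict[str, Set[str]],
                    saved_features: Dict[str, Set[str]]) -> Set[str]:
    """Intersect per-node features, falling back to the saved copy."""
    common = None
    for node in set(gossip_features) | set(saved_features):
        # An absent or empty gossip entry falls back to the system table.
        features = gossip_features.get(node) or saved_features.get(node, set())
        common = set(features) if common is None else common & features
    return common if common is not None else set()
```

A node that is known only from the system table, or that is present in gossip with empty features, no longer collapses the result to the empty set.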
Fixes #4225
We saw the log message "Feature FOO is enabled" more than once, as shown
below. It is better to log it only when the feature was not already enabled.
gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - InetAddress 127.0.0.2 is now UP, status = NORMAL
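The intended behaviour can be sketched as an idempotent enable that logs only on the transition from disabled to enabled. `LoggedFeature` is a hypothetical stand-in for the feature class; the return value exists only to make the behaviour easy to check.

```python
import logging

logger = logging.getLogger("gossip")


class LoggedFeature:
    def __init__(self, name: str) -> None:
        self.name = name
        self.enabled = False

    def enable(self) -> bool:
        """Enable the feature; return True only if this call changed the state."""
        if self.enabled:
            return False  # already enabled: nothing to do, nothing to log
        self.enabled = True
        logger.info("Feature %s is enabled", self.name)
        return True
```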
Committer: Avi Kivity <avi@scylladb.com>
Branch: next
Switch to the CMake-ified Seastar
This change allows Scylla to be compiled against the `master` branch of
Seastar.
The necessary changes:
- Add `-Wno-error` to prevent a Seastar warning from terminating the
build
- The new Seastar build system generates the pkg-config files (for
example, `seastar.pc`) at configure time, so we don't need to invoke
Ninja to generate them
- The `-march` argument is no longer inherited from Seastar (correctly),
so it needs to be provided independently
- Define `SEASTAR_TESTING_MAIN` so that the definition of an entry
point is included for all unit test compilation units
- Independently link Scylla against Seastar's compiled copy of fmt in
its build directory
- All test files use the (now public) Seastar testing headers
- Add some missing Seastar headers to source files
[avi: regenerate frozen toolchain, adjust seastar submodule]
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <02141f2e1ecff5cbcd56b32768356c3bf62750c4.1548820547.git.jhaberku@scylladb.com>
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
This lays the groundwork for brokering a node's view update
backlog across the whole cluster. This is needed for when a
coordinator does not contact a given replica for a long time, and uses
a backlog view that is outdated and causes requests to be
unnecessarily delayed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
db::config is a global class; changes in any module can cause changes
in db::config. Therefore, it is a cause of needless recompilation.
Remove some of these dependencies by having consumers of db::config
declare an intermediate config struct that contains only
configuration of interest to them, and have their caller fill it out
(in the case of auth, it already followed this scheme and the patchset
only moves the translation function).
In addition, some outright pointless inclusions of db/config.hh are
removed.
The result is somewhat shorter compile times, and fewer needless
recompiles.
* https://github.com/avikivity/scylla unconfig-1/v1:
config: remove inclusions of db/config.hh from header files
repair: remove unneeded config.hh inclusion
batchlog_manager: remove dependency on db::config
auth: remove permissions_cache dependency on db::config
auth: remove auth::service dependency on db::config
auth: remove unneeded db/config.hh includes
- A new Scylla node always sends application_state::RPC_READY = false when
the node boots and sends application_state::RPC_READY = true when the CQL
server is up
- An old Scylla node that does not support application_state::RPC_READY
never has application_state::RPC_READY in its endpoint_state. We can
only assume its CQL server is up, so we return true here if
application_state::RPC_READY is not present
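The rule for old and new nodes fits in one small function. This is a sketch under assumptions: the endpoint state is modelled as a plain dict, the value as a Python bool, and `is_cql_ready` is a hypothetical name.

```python
from typing import Dict


def is_cql_ready(endpoint_state: Dict[str, bool]) -> bool:
    value = endpoint_state.get("RPC_READY")
    if value is None:
        # Old node: it never publishes RPC_READY, so assume CQL is up.
        return True
    # New node: false while booting, true once the CQL server is up.
    return value
```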
Instead, distribute those inclusions to .cc files that require them. This
reduces rebuilds when config.hh changes, and makes it easier to locate files
that need config disaggregation.
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.
The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.
Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads from it.
Fixes #3953.
* https://github.com/avikivity/3953/v4.1:
storage_service: deinline enable_all_features()
gossiper: keep features registered
tests/gossip: switch to seastar::thread
storage_service: deinline init/deinit functions
gossiper: split feature storage into a new feature_service
gossiper: maybe enable features after start_gossiping()
storage_service: fix gap when feature::when_enabled() doesn't work
Since we may now start with features already registered, we need to enable
features immediately after gossip is started. This case happens in a cluster
that already is fully upgraded on startup. Before this series, features were
only added after this point.
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.
To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untying features from gossip.
Gossiper unregisters enabled features as an optimization. However that makes
decoupling features from gossiper harder. Disable this optimization; since the
number of features is small and normal access is to a single feature at a time,
there is no significant performance or memory loss.
In dtest, we have
self.check_rows_on_node(node1, 2000)
self.check_rows_on_node(node2, 2000)
which introduce the following cluster operations:
1) Initially:
- node1 up
- node2 up
2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)
3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)
Since the order of Operation A and Operation B is not guaranteed, it
is possible that node2 marks node1 as status=shutdown but also marks
node1 as UP.
In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"
As a result, dtest never sees node2 report that node1 is up when it boots
node1, and the test fails.
TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']
In the log we can see node1 marked as DOWN and UP almost at the same time on node2:
INFO 2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
INFO 2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown
Fixes #3940
Tests: dtest with 20 consecutive successful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
* seastar d152f2d...c1e0e5d (6):
> scripts: perftune.py: properly merge parameters from the command line and the configuration file
> fmt: update to 5.2.1
> io_queue: only increment statistics when request is admitted
> Adds `read_first_line.cc` and `read_first_line.hh` to CMake.
> fstream: remove default extent allocation hint
> core/semaphore: Change the access of semaphore_units main ctor
Due to a compile-time fight between fmt and boost::multiprecision, a
lexical_cast was added to mediate.
sprint("%s", var) no longer accepts numeric values, so some sprint()s were
converted to format() calls. Since more may be lurking we'll need to remove
all sprint() calls.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Fixes #3798
Fixes #3694
Tests:
unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)
* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
gms/gossiper: Replicate endpoint states in add_saved_endpoint()
gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
gms/gossiper: Always override states from older generations
Lack of this may result in non-zero shards on some nodes still seeing
STATUS as NORMAL for a node which shut down, in some cases.
mark_as_shutdown() is invoked in reaction to an RPC call initiated by
the node which is shutting down. Another way a node can learn about
another node shutting down is via gossiping with a node which knows
this. In that case, the states will be replicated to non-zero
shards. The node which learnt via mark_as_shutdown() may also
eventually propagate this to non-zero shards, e.g. when it gossips
about it with other nodes, and its local version number at the time of
mark_as_shutdown() was smaller than the one used to set the STATE by
the shutting down node.