Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. In debug and sanitize builds, where
everything runs several times slower, that timeout is too short and
joining nodes are flakily rejected. The log excerpts below show the
failure sequence.
The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:
20:10:57,049 [shard 0] raft_group0 - server 614 entered
'join group0' transition state for 53b01f0b
The joining node begins receiving the raft snapshot 100ms later:
20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:
20:10:57,511 [shard 0] migration_manager - Creating keyspace
system_auth_v2
...
20:10:57,788 [shard 0] migration_manager - Creating
system_auth_v2.role_members
Meanwhile, the coordinator's failure detector pings the joining node.
Under the combined debug and ASan slowdown, the ping RPC times out
~4.6 seconds into the join:
20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
when pinging 53b01f0b: seastar::rpc::timeout_error
(rpc call timed out)
25ms later, the coordinator marks the joining node DOWN and removes it:
20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
Mark node 53b01f0b as DOWN
20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
53b01f0b
The joining node was still retrying the snapshot transfer at that point:
20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then receives the ban notification and aborts:
20:11:01,844 [shard 0] raft_group0 - received notification of being
banned from the cluster
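The race is visible directly in the timestamps above. A small sketch that
converts the log times to milliseconds and computes the gaps (all values
copied from the excerpts):

```python
def log_ts_ms(h: int, m: int, s: int, frac: int) -> int:
    """Convert a log timestamp HH:MM:SS,mmm into milliseconds since midnight."""
    return ((h * 60 + m) * 60 + s) * 1000 + frac

join_accepted = log_ts_ms(20, 10, 57, 49)   # server 614 enters 'join group0'
ping_timeout  = log_ts_ms(20, 11, 1, 643)   # rpc call timed out
marked_down   = log_ts_ms(20, 11, 1, 668)   # node marked DOWN

# ~4.6s elapse between accepting the join and the ping timing out,
# well past the 2000ms failure_detector_timeout_in_ms override.
print(ping_timeout - join_accepted)  # 4594
print(marked_down - ping_timeout)    # 25
```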
Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales the base timeout by MODES_TIMEOUT_FACTOR:
3x for debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release
(2000ms).
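A minimal sketch of the scaling logic described above (the real fixture
lives in conftest.py and may be structured differently; the function below
only mirrors the stated factors):

```python
# Sketch only -- mirrors the factors stated above, not the actual conftest.py code.
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}
BASE_TIMEOUT_MS = 2000  # the value the tests previously hardcoded

def failure_detector_timeout_ms(mode: str) -> int:
    """Scale the base timeout by the build mode's slowdown factor."""
    return BASE_TIMEOUT_MS * MODES_TIMEOUT_FACTOR.get(mode, 1)

print(failure_detector_timeout_ms("debug"))    # 6000
print(failure_detector_timeout_ms("dev"))      # 4000
print(failure_detector_timeout_ms("release"))  # 2000
```

A test would then request the failure_detector_timeout fixture as an
argument and pass the scaled value as failure_detector_timeout_in_ms in
the server config, instead of the literal 2000.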
Test measurements (before -> after fix):
debug mode:
test_replace_with_same_ip_twice 24.02s -> 25.02s
test_banned_node_notification 217.22s -> 221.72s
test_kill_coordinator_during_op 116.11s -> 127.13s
test_node_failure_during_tablet_migration
[streaming-source] 183.25s -> 192.69s
test_replace (4 tests) skipped in debug (skip_in_debug)
test_raft_replace_ignore_nodes skipped in debug (run_in_dev only)
dev mode:
test_replace_different_ip 10.51s -> 11.50s
test_replace_different_ip_using_host_id 10.01s -> 12.01s
test_replace_reuse_ip 10.51s -> 12.03s
test_replace_reuse_ip_using_host_id 13.01s -> 12.01s
test_raft_replace_ignore_nodes 19.52s -> 19.52s
Scylla
What is Scylla?
Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
For more information, please see the ScyllaDB web site.
Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent versions of a C++23 compiler and of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).
Building Scylla
Building Scylla with the frozen toolchain dbuild is as easy as:
$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
For further information, please see:
- Developer documentation for more information on building Scylla.
- Build documentation on how to build Scylla binaries, tests, and packages.
- Docker image build documentation for information on how to build Docker images.
Running Scylla
To start the Scylla server, run:
$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1
This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory.
The --developer-mode flag is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (these checks are not relevant on development workstations).
Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.
For more run options, run:
$ ./tools/toolchain/dbuild ./build/release/scylla --help
Testing
See the test.py manual.
Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.
Documentation
Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.
Training
Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.
Contributing to Scylla
If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.
If you are a developer working on Scylla, please read the developer guidelines.
Contact
- The community forum and Slack channel are for users to discuss configuration, management, and operations of ScyllaDB.
- The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.