scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 09:30:45 +00:00


Dawid Mędrek f0dfe29d88 service: strong_consistency: Abort state_machine::apply when aborting server

The state machine used by strongly consistent tablets may find that the
local schema is insufficient to resolve pending mutations [1]. To deal
with that, it performs a read barrier, which may block for a long
time.

When a strongly consistent tablet is being removed, we'd like to cancel
all ongoing executions of `state_machine::apply`: the shard is no
longer responsible for the tablet, so it doesn't matter what the outcome
is.

---

In the implementation, we abort the operations by simply throwing
an exception from `state_machine::apply` and doing nothing else.
That would normally be a red flag, since it may lead to the instance
being killed on the spot [2].

Fortunately for us, strongly consistent tables use the default Raft
server implementation, i.e. `raft::server_impl`, which actually
handles one type of exception thrown by the method: namely,
`abort_requested_exception`, which is the default exception thrown
by `seastar::abort_source` [3]. We leverage this property.
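
The mechanism can be illustrated with a small, self-contained model.
The sketch below is Python `asyncio`, not Scylla code: `AbortRequested`,
`StateMachine` and `applier_fiber` are hypothetical stand-ins for
`abort_requested_exception`, the state machine and
`server_impl::applier_fiber`, and an `asyncio.Event` plays the role of
an abort source. It only shows the shape of the idea: an `apply` call
blocked on a barrier is woken by the abort request and raises the one
exception type that the applier treats as a clean stop.

```python
import asyncio


class AbortRequested(Exception):
    """Hypothetical stand-in for abort_requested_exception."""


class StateMachine:
    """Toy model, not the real state machine."""

    def __init__(self):
        self.abort_requested = asyncio.Event()  # models an abort source
        self.schema_ready = asyncio.Event()     # models the read barrier completing
        self.applied = []

    async def apply(self, entry):
        # Wait until either the barrier completes or an abort is requested.
        barrier = asyncio.ensure_future(self.schema_ready.wait())
        abort = asyncio.ensure_future(self.abort_requested.wait())
        _, pending = await asyncio.wait(
            {barrier, abort}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        if self.abort_requested.is_set():
            # The shard no longer cares about the outcome: just bail out.
            raise AbortRequested()
        self.applied.append(entry)


async def applier_fiber(sm, entries):
    """Stops cleanly on AbortRequested; any other exception propagates."""
    try:
        for entry in entries:
            await sm.apply(entry)
    except AbortRequested:
        return "stopped"
    return "finished"


async def demo():
    sm = StateMachine()
    applier = asyncio.ensure_future(applier_fiber(sm, ["m1", "m2"]))
    await asyncio.sleep(0)    # let the applier block inside apply()
    sm.abort_requested.set()  # request the abort while apply() is blocked
    return await applier, sm.applied


result = asyncio.run(demo())
```

In this model the barrier never completes, so no entry is applied and
the applier ends only because the abort was requested.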

---

Unfortunately, `raft::server_impl::abort` isn't perfectly suited for
us. If we look into its code, we'll see that the relevant portion of
the procedure boils down to three steps:

1. Prevent scheduling or adding new entries.
2. Wait for the applier fiber.
3. Abort the state machine.

Since aborting the state machine happens only after the applier fiber
has already finished, there will no longer be anything to abort. Either
all executions of `state_machine::apply` have already finished, or they
are hanging and we cannot do anything.
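
The consequence of that ordering can be shown with a toy `asyncio`
model. Everything in it is hypothetical (`AbortRequested`,
`apply_blocked`, `applier_fiber`, `abort_server`, and the timeout,
which exists only so that the demonstration terminates). The second
ordering, signalling the abort before waiting, is included purely for
contrast; it is not claimed to be how `raft::server_impl::abort` works
or how the issue in [4] will be resolved.

```python
import asyncio


class AbortRequested(Exception):
    """Hypothetical stand-in for abort_requested_exception."""


async def apply_blocked(abort_requested):
    # Models state_machine::apply stuck on a barrier that never completes:
    # the abort request is the only way out.
    await abort_requested.wait()
    raise AbortRequested()


async def applier_fiber(abort_requested):
    try:
        await apply_blocked(abort_requested)
    except AbortRequested:
        pass  # a requested abort ends the fiber cleanly


async def abort_server(signal_first, timeout=0.05):
    abort_requested = asyncio.Event()
    applier = asyncio.ensure_future(applier_fiber(abort_requested))
    await asyncio.sleep(0)  # let the applier block inside apply
    if signal_first:
        abort_requested.set()  # contrast case: abort the state machine first
    try:
        # "Wait for the applier fiber", bounded here so the demo terminates.
        await asyncio.wait_for(applier, timeout)
        outcome = "applier finished"
    except asyncio.TimeoutError:
        outcome = "applier hung"
    # "Abort the state machine" as the last step: by now the applier has
    # either finished or been given up on, so nothing is left to abort.
    abort_requested.set()
    return outcome


current_order = asyncio.run(abort_server(signal_first=False))
signal_then_wait = asyncio.run(abort_server(signal_first=True))
```

With the wait placed before the abort, the blocked `apply` never sees
the request and the wait runs into the timeout; with the request sent
first, the applier finishes promptly.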

That's a pre-existing problem that we won't be solving here (even
though it's possible). We hope the problem will be solved, and it seems
likely: the code suggests that the behavior is not intended. For more
details, see e.g. [4].

---

We provide two validation tests. They simulate aborting
`state_machine::apply` in two different scenarios:

* when the table is dropped (which should also cover the case of tablet
  migration),
* when the node is shutting down.

The value of the tests isn't high since they don't ensure that the
state of the group is still valid (though it should be), nor do they
perform any other check. Instead, we rely on the testing framework to
spot any anomalies or errors. That's probably the best we can do at
the moment.

Unfortunately, both tests are marked as skipped because of the current
limitations of `raft::server_impl::abort` described above and in [4].

References:
[1] 4c8dba1
[2] See the description of `raft::state_machine` in `raft/raft.hh`.
[3] See `server_impl::applier_fiber` in `raft/server.cc`.
[4] SCYLLADB-1056

2026-04-09 11:36:51 +02:00

alternator

test/pylib: add typed skip markers plugin

2026-04-08 10:38:56 +03:00

boost

Merge 'Simplify and improve API descibe_ring code flow' from Pavel Emelyanov

2026-04-08 10:50:07 +03:00

broadcast_tables

…

cluster

service: strong_consistency: Abort state_machine::apply when aborting server

2026-04-09 11:36:51 +02:00

cql

…

cqlpy

Merge 'cql3: fix DESCRIBE INDEX WITH INTERNALS name' from Piotr Smaron

2026-04-09 08:37:51 +03:00

ldap

test: ldap: add regression test for double-free on unregistered message ID

2026-04-01 12:57:50 +02:00

lib

Merge 'test: add test_sstable_clone_preserves_staging_state' from Benny Halevy

2026-04-07 17:02:04 +03:00

manual

gossiper: remove the code that was only used in gossiper topology

2026-03-10 10:39:58 +02:00

nodetool

test.py: fix nodetool mock server port collision

2026-04-02 16:24:07 +02:00

perf

test: perf_simple_query: Add 'sstable-format' command-line option

2026-03-18 16:25:20 +01:00

pylib

test/pylib: add typed skip markers plugin

2026-04-08 10:38:56 +03:00

pylib_test

test/pylib: add typed skip markers plugin

2026-04-08 10:38:56 +03:00

raft

test: add tracker voter demotion test to fsm_test.cc

2026-04-08 12:37:19 +02:00

resource

schema: remove calculate_schema_digest function

2026-03-10 10:46:47 +02:00

rest_api

test: add snapshot REST API tests for logical index names

2026-04-08 13:38:17 +02:00

scylla_gdb

test/scylla_gdb: fix flakiness by preparing objects at test time

2026-03-23 16:54:03 +02:00

unit

…

vector_search

Merge 'vector_search: fix SELECT on local vector index' from Karol Nowacki

2026-04-07 17:43:35 +03:00

__init__.py

test/pylib: introduce scale_timeout fixture helper

2026-03-05 13:07:09 +02:00

CMakeLists.txt

Revert "Merge 'vector_search: add validator tests' from Pawel Pery"

2026-02-08 16:29:58 +02:00

conftest.py

test/pylib: add typed skip markers plugin

2026-04-08 10:38:56 +03:00

pytest.ini

test/pylib: add typed skip markers plugin

2026-04-08 10:38:56 +03:00

README.md

…

README.md

Scylla in-source tests.

For details on how to run the tests, see docs/dev/testing.md

Shared C++ utilities and libraries are in lib/; shared Python code is in pylib/.

alternator - Python tests which connect to a single server and use the DynamoDB API
unit, boost, raft - unit tests in C++
cqlpy - Python tests which connect to a single server and use CQL
topology* - tests that set up clusters and add/remove nodes
cql - approval tests that use CQL and pre-recorded output
rest_api - tests for Scylla REST API Port 9000
scylla-gdb - tests for scylla-gdb.py helper script
nodetool - tests for C++ implementation of nodetool

If you can use an existing folder, consider adding your test to it. Create a new folder only for a new large category or subsystem, or when the test environment differs significantly from every existing suite: for example, when you plan to start scylladb with a different configuration, intend to add many tests, and would like them to reuse an existing Scylla cluster (clusters can be reused for tests within the same folder).

To add a new folder, create a new directory, then copy a suite.ini into it and edit it.