The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes}
This endpoint is part of the vnodes-to-tablets migration process and
controls a node's intended_storage_mode in system.topology. The storage
mode represents the node-local data distribution model, i.e., how data
are organized across shards. The node will apply the intended storage
mode to migrating tables upon next restart by resharding their SSTables
(either on vnode boundaries if intended_mode=tablets, or with the static
sharder if intended_mode=vnodes).
Note that this endpoint controls the intended_storage_mode of the local
node only. This has the nice benefit that once the API call returns, the
change has not only been committed to group0 but also applied to the
local node's state machine. This guarantees that the change is part of
the node's local copy upon next restart; no additional read barrier is
needed.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}
Its purpose is to start the migration of a whole keyspace from vnodes to
tablets.
When called, Scylla will synchronously create a tablet map for each
table in the specified keyspace. The tablet maps of all tables are
identical and they mirror the vnode layout; they contain one tablet per
vnode and each tablet uses the same replica hosts and token boundaries
as the corresponding vnode.
The only difference from vnodes lies in the sharding approach. Tablets
are assigned to a single shard - using a round-robin strategy in this
patch - whereas vnodes are distributed evenly across all shards. If the
tablet count per shard is low and tablet sizes are uneven, or some
shards have more tablets than others, performance may degrade during the
migration process. For example, a cluster with i8g.48xlarge (192 vCPUs),
256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet
replicas per shard during the migration. One additional tablet or a
double-sized tablet would cause 25% overcommit.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Part of the vnodes-to-tablets migration is to reshard the SSTables of
each node on vnode boundaries. Resharding is a heavy operation that
runs on startup while the node is offline. Since nodes can restart
for unexpected reasons, we need a flag to do it in a controllable way.
We also need the ability to roll back the migration, which requires
resharding in the opposite direction. This means a node must be aware of
the intended migration direction.
To address both requirements, this patch introduces a new column,
intended_storage_mode, in system.topology. A non-null value indicates
that a node should perform a migration and specifies the migration
direction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Make the table populator migration-aware. If a table is migrating to
tablets, switch from normal resharding to vnode-based resharding.
Vnode-based resharding requires passing a vector of "owned ranges" upon
which resharding will segregate the SSTables. Compute it from the tablet
map. We could also compute them from the vnodes, since tablets are
identical to vnodes during the migration, but in the future we may
switch to a different model (multiple tablets per vnode).
Let the distributed loader decide if a table is migrating or not and
communicate that to the table populator. A table is migrating if the
keyspace replication strategy uses vnodes but the table replication
strategy uses tablets.
Currently, tables cannot enter this "migrating" state; support for this
will be introduced in the next patches.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In the previous patch, reshard compaction was extended with a special
operation mode where SSTables from vnode-based tables are segregated on
vnode boundaries and not with the static sharder. This will later be
wired into vnodes-to-tablets migration.
The problem is that resharding requires a compaction group. With a
vnode-based table, there is only one compaction group per shard, and
this is what the current code utilizes
(`try_get_compaction_group_view_with_static_sharding()`). But the new
operation mode will apply to migrating tables, which use a
`tablet_storage_group_manager`, which creates one compaction group for
each tablet. Some compaction group needs to be selected.
Pick any compaction group that is available on the current shard.
Reshard compaction is an operation that happens early in the startup
process; compaction groups do not own any SSTables yet, so all
compaction groups are equivalent.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In this mode, the output sstables generated by resharding
compaction are segregated by token range, based on the keyspace
vnode-based owned token ranges vector.
A basic unit test was also added to sstable_directory_test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a table is migrating from vnodes to tablets, the cluster is in a
mixed state where some nodes use vnode ERMs and others use tablet ERMs.
The ERM flavor is a node-local property that expresses the node's
storage organization.
Preserve the flavor across token metadata changes. The flavor needs to
be on par with storage, but the storage can change only on startup, as
it requires resharding all SSTables to conform with the flavor.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.
With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Vnodes-to-tablets migrations require cluster-level support: the REST API
and the group0 state need to be supported by all nodes.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.
To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.
The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.
Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:
```yaml
permissions:
contents: read
```
between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27815
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.
In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.
The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.
Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:
```yaml
permissions:
contents: read
```
after line 1. No other lines in the file need to change.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27812
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).
Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is aquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.
Fixes: SCYLLADB-1020
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28999
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.
Closesscylladb/scylladb#29148
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128
Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side
Closesscylladb/scylladb#29110
* github.com:scylladb/scylladb:
encryption: fix deadlock in encrypted_data_source::get()
test_lib: mark `limiting_data_source_impl` as not `final`
Fix formatting after previous patch
Fix indentation after previous patch
test_lib: make limiting_data_source_impl available to tests
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.
Closesscylladb/scylladb#29170
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.
Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
to return early when the data length is zero, consistent with the
existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
scylla_sstable_summary.invoke() and printing an informative message
instead of attempting to read the unpopulated summary.
Fixes: SCYLLADB-1180
Closesscylladb/scylladb#29162
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.
Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.
Fixes: SCYLLADB-1135
Closesscylladb/scylladb#29132
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,
To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.
The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:
```yaml
permissions:
contents: read
issues: write
```
This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27810
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.
Fixes: SCYLLADB-1139
No need to backport yet, appeared only on master.
Closesscylladb/scylladb#29151
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.
Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.
Fixes: SCYLLADB-1152
Closesscylladb/scylladb#29152
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.
This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.
Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.
This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1085.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#29072
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.
Fixes SCYLLADB-1167
Closesscylladb/scylladb#29156
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.
Closesscylladb/scylladb#29031
* github.com:scylladb/scylladb:
vector_search: test: fix flaky test
vector_search: fix race condition on connection timeout
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.
The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.
Fixes#29060
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29061
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.
Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.
Minus one scan_dir user twards scan_dir removal from code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29064
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.
Fixes#29056
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29057
The format string had two {} placeholders but three arguments, the
_upload_id one is skipped from formatting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29053
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously both tests were also split each into two, so we have four tests, and now we have two that can also be squashed, the lines-of-code savings still worth it.
This is the continuation of #28569
Tests improvement, not backporting
Closesscylladb/scylladb#28994
* github.com:scylladb/scylladb:
test: Replace a bunch of ternary operators with an if-else block
test: Squash test_restore_primary_replica_same|different_domain tests
test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:
prefix = f'{cf_dir}/backup'
The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29030
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.
This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. Once more user
of http_context.db is gone.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28996
The test in question uses several helpers from the backup sute, but it doesn't really need them -- the operations it want to perform can be performed with standard pylib methods. "While at it" also collect some dangling effectively unused local variables from this test (these were apparently left from backup tests this one was copied-and-reworked from)
Enhancing tests, not backporting
Closesscylladb/scylladb#29130
* github.com:scylladb/scylladb:
test/refresh: Simplify refresh invocation
test/refresh: Remove r_servers alias for servers
test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
test/refresh: Remove unused wait_for_cql_and_get_hosts import
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access
Fixes#29058
The code appeared in 2026.1, worth fixing it there as well
Closesscylladb/scylladb#29059
* github.com:scylladb/scylladb:
sstables: Fix object storage lister not resetting position in batch vector
sstables: Fix object storage lister skipping entries when filter is active
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()
This patch series introduces a new documentation for exiting guardrails.
Moreover:
- Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
- Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.
Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)
No backport, just new docs and tests.
[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29011
* github.com:scylladb/scylladb:
test: add new guardrail tests matching documentation scenarios
test: add metric assertions to guardrail replication strategy tests
test: use regex matching in guardrail replication strategy tests
test: extract ks_opts helper in test_guardrail_replication_strategy
docs: document CQL guardrails
cql: improve write consistency level guardrail messages
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.
**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.
This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.
example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`
Closes#28402
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.
Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.
Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031
Closesscylladb/scylladb#28993
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently the tombstones written by the tests in this test
file are immediately collectible and with some unlucky timing, some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty pages prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakyness.
Fixes: SCYLLADB-1062
Closesscylladb/scylladb#29077
The test has expectation w.r.t which write makes it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node
However, this was not expressed with CL, and the default CL=ONE allowed
for some nodes missing the writes and this violating the tests
expectations on what data is persent on which nodes. This resulted on
the test being flaky and failing on the data checks.
Use explicit CL for the ingestion to prevent this.
The improvements to the test introduced in
a8dd13731f was of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.
Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102
Closesscylladb/scylladb#29128
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.
Main flows and components:
* The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.
Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.
to use, add to config:
```
enable_logstor: true
experimental_features:
- logstor
```
create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```
INSERT, SELECT, DELETE work as expected
UPDATE not supported yet
no backport - new feature
Closesscylladb/scylladb#28706
* github.com:scylladb/scylladb:
logstor: trigger separator flush for buffers that hold old segments
docs/dev: add logstor documentation
logstor: recover segments into compaction groups
logstor: range read
logstor: change index to btree by token per table
logstor: move segments to replica::compaction_group
db: update dirty mem limits dynamically
logstor: track memory usage
logstor: logstor stats api
logstor: compaction buffer pool
logstor: separator: flush buffer when full
logstor: hold segment until index updates
logstor: truncate table
logstor: enable/disable compaction per table
logstor: separator buffer pool
test: logstor: add separator and compaction tests
logstor: segment and separator barrier
logstor: separator debt controller
logstor: compaction controller
logstor: recovery: recover mixed segments using separator
logstor: wait for pending reads in compaction
logstor: separator
logstor: compaction groups
logstor: cache files for read
logstor: recovery: initial
logstor: add segment generation
logstor: reserve segments for compaction
logstor: index: buckets
logstor: add buffer header
logstor: add group_id
logstor: record generation
logstor: generation utility
logstor: use RIPEMD-160 for index key
test: add test_logstor.py
api: add logstor compaction trigger endpoint
replica: add logstor to db
schema: add logstor cf property
logstor: initial commit
db: disable tablet balancing with logstor
db: add logstor experimental feature flag
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.
New functionality. No backport needed.
This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.
Closesscylladb/scylladb#28960
* github.com:scylladb/scylladb:
db/config: announce ms format as highest supported
db/config: enable `ms` sstable format by default
cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
api/system: add /system/chosen_sstable_version
test/cluster/dtest: reduce num_tokens to 16
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This PR fixes the Installation page:
- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).
Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087
This update affects all supported versions and should be backported as a bug fix.
Closesscylladb/scylladb#29088
* github.com:scylladb/scylladb:
doc: remove the Open Source Example from Installation
doc: replace http with https in the installation instructions