This instruction adds additional safety. The faster we notice that
a node didn't restart properly, the better.
The old gossip-based recovery procedure had a similar recommendation
to verify that each restarting node entered `RECOVERY` mode.
Fixes#25375
This is a documentation improvement. We should backport it to all
branches with the new recovery procedure, so 2025.2 and 2025.3.
Closesscylladb/scylladb#25376
(cherry picked from commit 7b77c6cc4a)
Closesscylladb/scylladb#25439
In one of the previous commits, we made it possible to set
`recovery_leader` on each node just before restarting it. Here, we
update the corresponding documentation.
(cherry picked from commit f408d1fa4f)
This commit:
- Extends the Drivers support table with information on which driver supports tablets
and since which version.
- Adds the driver support policy to the Drivers page.
- Reorganizes the Drivers page to accommodate the updates.
In addition:
- The CPP-over-Rust driver is added to the table.
- The information about Serverless (which we don't support) is removed
and replaced with tablets to correctly describe the contents of the table.
Fixes https://github.com/scylladb/scylladb/issues/19471
Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69Closesscylladb/scylladb#24635
(cherry picked from commit 18b4d4a77c)
Closesscylladb/scylladb#25250
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with general description of frozen semantics for UDTs.
Fixes: #22929Closesscylladb/scylladb#24763
(cherry picked from commit 37ef9efb4e)
Closesscylladb/scylladb#24782
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
- (cherry picked from commit 92b5fe8983)
- (cherry picked from commit 0753643606)
- (cherry picked from commit b0d5462440)
- (cherry picked from commit 093d4f8d69)
- (cherry picked from commit 678deece88)
- (cherry picked from commit 64f8500367)
- (cherry picked from commit b931145a26)
- (cherry picked from commit 3e1c50e9a7)
- (cherry picked from commit 46ff7f9c12)
- (cherry picked from commit ebd9420687)
- (cherry picked from commit aae212a87c)
- (cherry picked from commit 592ca789e2)
- (cherry picked from commit edc2906892)
Parent PR: #24492Closesscylladb/scylladb#24744
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.
(cherry picked from commit 92b5fe8983)
In 2025.2, we don't force enabling the Raft-based topology in the code,
but we stated in the upgrade guides that it's a mandatory step of the
upgrade to 2025.1. We also remind users to enable the Raft-based
topology in the upgrade guides to 2025.2. Hence, we can rely in the
the documentation on the Raft-based topology being enabled. If it is
still disabled, we can just send the user to the upgrade guides. Hence:
- we remove all documentation related to enabling the Raft-based
topology, enabling the Raft-based schema (enabled Raft-based topology
implies enabled Raft-based schema), and the gossip-based topology,
- we can replace the documentation of the old manual recovery procedure
with the documentation of the new manual recovery procedure (done in
the previous commit).
(cherry picked from commit 203ea5d8f9)
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.
We can get rid of the old procedure from the documentation because
we requested users to enable the Raft-based topology during upgrades to
2025.1 and 2025.2.
We leave the note that enabling the Raft-based topology is required to
use the new recovery procedure just in case, since we didn't force
enabling the Raft-based topology in the code.
(cherry picked from commit 4e256182a0)
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.
The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
255 bytes (common filesystem limit for a path component)
- 32 bytes (for the 32-character UUID string)
- 1 byte (for the '-' separator)
- 15 bytes (for the '_scylla_cdc_log' suffix)
- 15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).
This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.
(cherry picked from commit 4577c66a04)
The option "conflicts" with load-and-stream. Tests and doc included.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c0796244bb)
This change adds the --scope option to nodetool refresh.
Like in the case of nodetool restore, you can pass either of:
* node - On the local node.
* rack - On the local rack.
* dc - In the datacenter (DC) where the local node lives.
* all (default) - Everywhere across the cluster.
as scope.
The feature is based on the existing load_and_stream paths, so it
requires passing --load-and-stream to the refresh command.
Also, it is not compatible with the --primary-replica-only option.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#23861
(cherry picked from commit c570941692)
Starting with 2025.1, ScyllaDB versions are no longer called "Enterprise",
but the OS support page still uses that label.
This commit fixes that by replacing "Enterprise" with "ScyllaDB".
This update is required since we've removed "Enterprise" from everywhere else,
including the commands, so having it here is confusing.
Fixes https://github.com/scylladb/scylladb/issues/24179Closesscylladb/scylladb#24181
(cherry picked from commit 2d7db0867c)
Closesscylladb/scylladb#24204
This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page.
The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files.
This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases.
Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment.
Closesscylladb/scylladb#23957
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.
Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.
Fixes: scylladb/scylladb#23278Fixes: scylladb/scylladb#22869
Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga
Closesscylladb/scylladb#23800
* github.com:scylladb/scylladb:
doc: changing topology when changing snitches is no longer supported
test: cluster: introduce test_no_dc_rack_change
storage_service: don't update DC/rack in update_topology_with_local_metadata
main: make dc and rack immutable after bootstrap
test: cluster: remove test_snitch_change
Update the "How to Switch Snitches" document to indicate that changing
topology (i.e. changing node's DC or rack) while changing the snitch is
no longer supported.
Remove a note which said that switching snitches is not supported with
tablets. It was introduced because of the concern that switching a
snitch might change DC or rack of the node, for which our current tablet
load balancer is completely unprepated. Now that changing DC/rack is
forbidden, there doesn't seem to be anything related to snitches which
could cause trouble for tablets.
Current protocol extension that sends tablet info to drivers only does
that if the driver selects a non-replica coordinator for a routable
request. It works well if some node on the replica list is replaced by
other node, or if some replicas are removed from the list. Driver will
at some point send a request to stale replica, and receive new list in
response.
The issue is with extending the list with new replicas. In that case old
replicas are all still correct, so driver will not select any wrong
replica, and will not receive the new list. As far as I know that only
scenario where this could happen is RF increase.
It could be to some degree worked around in the drivers, but it would
add significant complexity (definitely more than any other invalidations
we introduced) while still not being ideal solution. This scenario
should be rare enough, and the consequences of not handling it minor
enough (new replicas not being used as coordinators) that it does not
warrant driver-side solution. Instead this commit adds info about this
to documentation, advising users to restart applications after replica
lists are extended.
It is worth noting that if new tablet feedback protocol extension is
implemented then this problem goes away. See issue #21664.
Closesscylladb/scylladb#23447
After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph.
To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.
Closesscylladb/scylladb#23584
* github.com:scylladb/scylladb:
tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
tablet-mon.py: Center tablet id text properly in the vertical axis
tablet-mon.py: Show migration stage tag in table mode only when migrating
virtual-tables: Introduce system.load_per_node
virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
docs: virtual-tables: Fix instructions
service: tablets: Keep load_stats inside tablet_allocator
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.
In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.
Fixes: https://github.com/scylladb/scylladb/issues/17174.
Needs backport to 2025.1 to prevent out of space during streaming
Closesscylladb/scylladb#23187
* github.com:scylladb/scylladb:
test: add test for rebuild with repair
locator: service: move to rebuild_v2 transition if cluster is upgraded
locator: service: add transition to rebuild_repair stage for rebuild_v2
locator: service: add rebuild_repair tablet transition stage
locator: add maybe_get_primary_replica
locator: service: add rebuild_v2 tablet transition kind
gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
Can be used to query per-node stats about load as seen by the load
balancer.
In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.
The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.
The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.
Fixes: https://github.com/scylladb/scylladb/issues/22409.
Needs backport to 2025.1 that introduces the new tablet repair API
Closesscylladb/scylladb#22905
* github.com:scylladb/scylladb:
docs: nodetool: update repair and add tablet-repair docs
test: nodetool: add tests for cluster repair command
nodetool: add cluster repair command
nodetool: repair: extract getting hosts and dcs to functions
nodetool: repair: warn about repairing tablet keyspaces
nodetool: repair: move keyspace_uses_tablets function
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.