Commit Graph

11801 Commits

Author SHA1 Message Date
Pavel Emelyanov
df6991edd3 test: Do not duplicate sstable twice
The statistics_rewrite test case copies an sstable from resources two
times:

- first time -- explicitly by listing resource components and copying
  files to the test temp dir
- second time -- implicitly, by calling create_links() linking copied
  files by new set in the staging/ subdirectory

The 2nd step is not needed and the history of changes justifies that.

The test itself appeared with 70b793e4d3 and it only contained the 2nd
"copying" -- test linked files from resource directory and then worked
in the newly created set.

Later, commit 59c57861ae added the first step and copied the files
from resource into test temp dir. At this point linking copied files
became pointless, but was preserved. Let's remove it now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21097
2024-10-18 08:31:08 +03:00
Kefu Chai
2e4be56112 build: cmake: link Seastar with Seastar::<COMPONENT>
before this change, we link against the targets defined in Seastar's
source tree. but these targets are not part of Seastar's public
interface -- they are not exposed by Seastar's CMake config files.

so, let's link against the target names qualified by the library module
name. this also prepares for the transition to using Seastar without
including it directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Botond Dénes
e1d8cddd09 test/boost/mutation_test: add test for multishard permit safety
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.
2024-10-17 08:47:50 -04:00
Botond Dénes
5a3fd69374 test/lib/reader_lifecycle_policy: add semaphore factory to constructor
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
2024-10-17 08:47:50 -04:00
Botond Dénes
c8598e21e8 test/lib/reader_lifecycle_policy: rename factory_function
To reader_factor_function. We are about to add a new factory function
parameter, so the current factory_function has to be renamed to
something more specific.
2024-10-17 08:47:50 -04:00
Raphael S. Carvalho
f3ab5e1f1e tests: Fix perf test for load balancer
Broken after introduction of zero-token nodes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#21156
2024-10-17 14:02:31 +02:00
Kamil Braun
f02afefd34 Merge 'raft: consider the gossiper state then sending the group0 state id' from Emil Maskovsky
Skip the advertisement of the group0 state id in case the gossiper is
not active (ready).

Sending the application state when the gossiper is not active caused
a warning being shown in the log about the local endpoint not being
found in the gossiper endpoint state map on a (graceful) node restart.

The local endpoint is initialized on the gossiper startup, so we skip
the state id advertisement until the startup is finished.

Fixes: scylladb/scylladb#21117

No backport: Fixes an issue that is currently only present in master

Closes scylladb/scylladb#21119

* github.com:scylladb/scylladb:
  raft: consider the gossiper state then sending the group0 state id
  raft: add the test for GROUP0_STATE_ID gossip application state
2024-10-17 13:41:15 +03:00
Alexey Novikov
b965729f0a replica: implement memtable_flush_period_in_ms schema option
implement cassandra original schema option memtable_flush_period_in_ms:
Milliseconds before memtables associated with the table are flushed.

there are a few things concerning this patch:
* milliseconds look strange and scary for this option. Unlike Cassandra,
  we use 60000ms (1min) as the minimum value for this option.
* It is a limitation of Cassandra that this option cannot be set
  for system tables. However, sometimes it could be very useful to use
  automatic flushing for such tables: some system tables have little
  traffic and as a result prevent tombstone garbage collection.
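
The minimum described above could be enforced with a check along these lines. This is a minimal sketch, not the patch's code: the function name is hypothetical, and treating 0 as "periodic flush disabled" is an assumption carried over from Cassandra's default for this option.

```python
# Minimum accepted period: 60000ms (1 minute), as stated in the commit message.
MIN_FLUSH_PERIOD_MS = 60_000


def validate_memtable_flush_period(value_ms: int) -> int:
    """Return value_ms if it is acceptable, otherwise raise ValueError.

    Assumption: 0 means "periodic flush disabled", as in Cassandra.
    """
    if value_ms == 0:
        return value_ms
    if value_ms < MIN_FLUSH_PERIOD_MS:
        raise ValueError(
            f"memtable_flush_period_in_ms must be 0 or at least "
            f"{MIN_FLUSH_PERIOD_MS}, got {value_ms}"
        )
    return value_ms
```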

Fixes #20270

Closes scylladb/scylladb#20999
2024-10-17 13:41:15 +03:00
Emil Maskovsky
3f1af268c2 raft: consider the gossiper state then sending the group0 state id
Skip the advertisement of the group0 state id in case the gossiper is
not active (ready).

Sending the application state when the gossiper is not active caused
a warning being shown in the log about the local endpoint not being
found in the gossiper endpoint state map on a (graceful) node restart.

The local endpoint is initialized on the gossiper startup, so we skip
the state id advertisement until the startup is finished.

Fixes: scylladb/scylladb#21117
2024-10-16 19:26:25 +02:00
Emil Maskovsky
65d3d4fd93 raft: add the test for GROUP0_STATE_ID gossip application state
Test that the GROUP0_STATE_ID gossip application state is not causing
the "endpoint_state_map does not contain endpoint" error.

Refs: scylladb/scylladb#21117
2024-10-16 19:21:14 +02:00
Calle Wilund
f2ef75c3da commitlog_test: Up timeout for large entry tests
Fixes #21150

Apparently, on some CI, in debug, these tests can time out (large alloc)
without actually failing what they do. Up the timeout (could consider removing
as well, but...) so they hopefully pass.

Closes scylladb/scylladb#21151
2024-10-16 18:13:04 +03:00
Nadav Har'El
ee0e7a7adf mv: test that operations that should not be allowed on a view, aren't
This patch adds test/cql-pytest tests which verify that all CQL operations
that shouldn't be allowed on a materialized view, actually aren't:

* All operations writing to a table - INSERT, UPDATE, BATCH, DELETE,
  and TRUNCATE - should be rejected when asked to operate on a view.

* All operations with "TABLE" in their name (DROP TABLE, ALTER TABLE,
  DESC TABLE) should be rejected on a view - the ".. MATERIALIZED VIEW"
  operation should be used instead.

* A materialized view cannot get materialized views or indexes of its
  own.

All tests pass on Cassandra (Cassandra 4 or above is needed for the
"DESC" test), and all but one pass on Scylla - Scylla does allow
"DESC TABLE" on a materialized view, unlike Cassandra. I opened an
issue to track that difference: Refs #21026

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21028
2024-10-16 13:43:36 +03:00
Avi Kivity
820509026f schema: replace boost ranges with std ranges
To reduce dependency load, use std ranges instead of boost ranges.

The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.

Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
2024-10-15 16:42:54 +03:00
Piotr Dulikowski
a380a2efd9 test/test_view_build_status: properly wait for v2 in migration test
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after performing the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.

However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.

In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.

Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.
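
The idea of the fix can be sketched with asyncio as follows. This is an illustration only: `wait_for_view_on_host`, `wait_for_view_on_all_hosts` and the `query_rows` callable are hypothetical names, not the helpers used in the test suite.

```python
import asyncio
import time


async def wait_for_view_on_host(query_rows, host, expected_rows,
                                timeout=60.0, period=0.1):
    """Poll one host until it reports the expected number of rows.

    query_rows is a hypothetical async callable that returns the rows a
    SELECT from view_build_status_v2 would yield on the given host.
    """
    deadline = time.monotonic() + timeout
    while True:
        rows = await query_rows(host)
        if len(rows) == expected_rows:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"view not built on {host} within {timeout}s")
        await asyncio.sleep(period)


async def wait_for_view_on_all_hosts(query_rows, hosts, expected_rows, **kwargs):
    """Wait on every host concurrently, so a lagging node is not missed."""
    await asyncio.gather(
        *(wait_for_view_on_host(query_rows, h, expected_rows, **kwargs)
          for h in hosts)
    )
```

Because every host is polled until it individually reports the expected state, a later query on a random node can no longer observe stale data.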

Fixes: scylladb/scylladb#21060

Closes scylladb/scylladb#21091
2024-10-15 14:57:47 +03:00
Pavel Emelyanov
63725b10a8 Merge 'cql: create default superuser if it doesn't exist' from Paweł Zakrzewski
This change reorganizes the way standard_role_manager startup is handled: role_manager::ensure_superuser_is_created() is added, which returns a future that resolves once the superuser is available. We wait for this future before starting the CQL server.

There is a change in behavior: auth::do_after_system_ready is potentially an infinite loop, and we await its result.

Fixes #10481

Reason for no backports: it's not a regression and it's an issue that may only affect a tiny time window during the cluster startup.

Closes scylladb/scylladb#20137

* github.com:scylladb/scylladb:
  test: test_restart_cluster: create the test
  auth: standard_role_manager allows awaiting superuser creation
  auth: coroutinize the standard_role_manager start() function
  auth: don't start server until the superuser is created
2024-10-15 14:56:04 +03:00
Tomasz Grabiec
3e438d23e1 Merge 'Check system.tablets update before putting it into the table' from Pavel Emelyanov
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
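
The sanity check described above might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, replicas are modelled as (host, shard) tuples, and "pending" is taken to mean replicas present in "new_replicas" but not in "replicas".

```python
def validate_tablet_update(replicas, new_replicas):
    """Reject an update that would leave more than one pending replica.

    replicas and new_replicas are iterables of (host, shard) tuples;
    new_replicas is None when the tablet is not in transition.
    """
    if new_replicas is None:
        return
    # Assumption: a pending replica is one that appears only in new_replicas.
    pending = set(new_replicas) - set(replicas)
    if len(pending) > 1:
        raise ValueError(
            f"too many pending replicas: {sorted(pending)}"
        )
```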

fixes #20043

Closes scylladb/scylladb#21020

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
2024-10-15 00:38:59 +02:00
Piotr Smaron
3969ffb39f test: fix flaky test_multidc_alter_tablets_rf
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.
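
Why `if not exists` helps can be shown with a toy model. `FakeSession` below is a hypothetical stand-in that only understands CREATE KEYSPACE; it is not the python driver and does not reproduce the driver issue itself, only the effect of running the same statement twice.

```python
class FakeSession:
    """Toy stand-in for a CQL session; models CREATE KEYSPACE only."""

    def __init__(self):
        self.keyspaces = set()

    def execute(self, cql: str) -> None:
        words = cql.split()
        if [w.upper() for w in words[:2]] != ["CREATE", "KEYSPACE"]:
            raise ValueError("only CREATE KEYSPACE is modelled")
        if_not_exists = [w.upper() for w in words[2:5]] == ["IF", "NOT", "EXISTS"]
        name = words[5] if if_not_exists else words[2]
        if name in self.keyspaces:
            if if_not_exists:
                return  # a duplicated statement becomes a no-op
            raise RuntimeError(f"keyspace {name} already exists")
        self.keyspaces.add(name)


session = FakeSession()
stmt = ("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'replication_factor': 1}")
session.execute(stmt)
session.execute(stmt)  # second execution is harmless
```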

Fixes: scylladb/scylladb#21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

Closes scylladb/scylladb#21056
2024-10-14 16:18:44 +02:00
Avi Kivity
c286ddab38 test: lib: rest_client: use 'http' scheme even when connecting via a unix socket
aiohttp 3.10.5 complains when 'unix+http' is used for a unix-domain
socket. Use 'http', which works with 3.10.5 and the toolchain's 3.9.5.

Closes scylladb/scylladb#21080
2024-10-14 15:32:56 +02:00
Calle Wilund
8eaf00ff11 test::topology: Add test for TLS upgrade and downgrade of internode encryption
Test a rolling upgrade of cluster while active.
Note: This is a unit test version of dtest test. Has the big drawback of not
being able to use cassandra-stress to work and verify the cluster and results

Test moves from none to all to none encryption while writing and then checking
written data.
2024-10-13 23:54:06 +00:00
Avi Kivity
db14a01901 Merge 'Use table id as system.sstables partition key' from Pavel Emelyanov
The system.sstables (a.k.a. sstables registry) primary key is "string location" as partition key and "uuid generation" as clustering key. The "location" part was taken from the table.config.datadir value which, in turn, is a string containing the path to on-disk files if the table is located locally, e.g. /var/lib/scylla/data/ks/cf-abc123. Recently [1] the datadir was moved from table config onto storage options, but this string is still used as the registry key.

Other than being owned by a table with ID, sstables are accessed by restore-from-object-storage code [2]. To make it work, both storage driver and sstable_directory helper class maintain two formats of object prefixes for sstables components. For S3-backed sstables having a record in registry, the path used is s3://bucket/generation/component. For restore code there are user-provided prefixes that do not match the aforementioned pattern. The selection between those two is now made by checking sstable state, which is not obvious and may cause troubles for tiered storage driver.

This patch changes the registry schema so that the partition key becomes "uuid owner" and is set to the table.id() value. This stops S3-backed sstables from using the local path. This change also makes it possible for the storage driver and sstable directory to rely on the storage options alone to tell the different bucket prefix formats from each other.

As a side effect, the make_s3_object_name() helper, that generates the proper object name, becomes explicit for restore-from-S3 usage. Now it relies on the sstable::filename() calling this->prefix() behind the scenes and the latter to return the user-provided prefix, which is pretty fragile construction.

No need to backport (and it's not going to be easy to do it), storage options feature is still experimental

Refs #20675 [1]
Refs #20305 [2]

Closes scylladb/scylladb#20998

* github.com:scylladb/scylladb:
  sstables: Flatten S3 object name making
  sstable_directory: Flatten directory lister creation
  treewide: Rename sstable registry location field to be owner
  system_keyspace: Change sstables registry partition key type
  sstables: Keep location variant on s3 backend too
  storage_options: Use variant on S3 options
  sstables: Split sstable::filename() helper
  sstables: Add s3_storage::owner() helper
2024-10-13 20:08:43 +03:00
Patryk Jędrzejczak
18d3a6480d test: test_read_required_hosts: run with the raft-based topology
When we made the raft-based topology mandatory, all boost
tests started using it. Then, `test_read_required_hosts` started
failing. We left investigating it for later and started running it
with `force-gossip-topology-changes` to make it pass.

Currently, the test doesn't fail with the raft-based topology
anymore. Hence, we remove the FIXME and run the test with a normal
config.

We don't know when and why the test stopped failing. Investigating
it wouldn't be easy, since we don't even know why it failed in the
first place. We suspect that there was some bug that is now fixed.

This patch only fixes a test, there is no need to backport it.

Fixes scylladb/scylladb#18463

Closes scylladb/scylladb#20960
2024-10-11 17:01:20 +02:00
Kamil Braun
96070bb5b3 Merge 'storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from Sergey Zolotukhin
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in filter_for_query(): the map is considered incorrect if the list of replicas contains a node from a data center whose replication factor is 0.
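
The second check can be illustrated with a small sketch. The function name and data shapes are hypothetical: replicas are (node, datacenter) pairs and `rf_per_dc` maps a data center name to its replication factor. The replica-count check in the executors is not shown, since the PR description does not state the exact condition.

```python
def check_replicas_consistent(replicas, rf_per_dc):
    """Raise if any replica lives in a data center whose RF is 0.

    replicas: iterable of (node, dc) pairs.
    rf_per_dc: dict mapping dc name to replication factor; a missing
    entry is treated as RF=0 (an assumption of this sketch).
    """
    offending = [node for node, dc in replicas if rf_per_dc.get(dc, 0) == 0]
    if offending:
        raise ValueError(
            f"replicas {offending} belong to a data center with RF=0"
        )
```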

Please note: This PR does not fix the issue found in scylladb/scylladb#20282; it only adds condition checks to prevent undefined behavior in cases of inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the released versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

Closes scylladb/scylladb#20851

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-11 15:02:02 +02:00
Paweł Zakrzewski
900a6706b8 test: test_restart_cluster: create the test
The purpose of this test is to verify that the cluster is able to boot up again after
a full cluster shutdown, thus exhibiting no issues when connecting to
raft group 0 that is larger than one.
2024-10-11 13:25:07 +02:00
Pavel Emelyanov
031893259a treewide: Rename sstable registry location field to be owner
This is sort of continuation of the previous patch. The partition key in
the registry is now table_id, not string, and is better called "owner",
not "location". This patch is s/location/owner/ over specific places
that include field name in the schema, argument names in registry
maintenance classes and tests accessing the selected row fields by name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 14:11:28 +03:00
Pavel Emelyanov
3315e3a2a9 system_keyspace: Change sstables registry partition key type
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.

This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:48:09 +03:00
Pavel Emelyanov
1181b6b082 storage_options: Use variant on S3 options
Describing S3 storage for an sstable nowadays has two options -- via
sstables registry entry and by using the direct prefix string. The
former is used when putting a keyspace on S3. In this case each sstable
has the corresponding entry in the system.sstables table. The latter is
used by "restore from object storage" code. In that case, sstables don't
have entries in the registry, but are accessed by a specific S3 object
path.

This patch reflects this difference by making s3_options::location be
variant of string prefix and table_id owner. The owner needs more
explanation, here it is.

Today, the system.sstables schema defines partition key to be "string
location" and clustering key to be "UUID generation". The partition key
is table's datadir string, but it's wrong to use it this way. Next
patches will change the partition key to be table's ID (there's table_id
type for it), and before doing it storage options must be prepared to
carry it onboard. This patch does it, but the table_id alternative of
the location is still unused, the rest of the code keeps using the
string location to reference a row in the registry table. Next patches
will eventually make use of the table_id value.
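
The two alternatives carried by the location can be pictured as a tagged union. The sketch below is only an analogy in Python: `S3Options` and `has_registry_entry` are hypothetical names, with a `str` standing for the direct prefix and a `uuid.UUID` standing for the table_id owner.

```python
import uuid
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class S3Options:
    bucket: str
    # str: user-provided object prefix (restore from object storage).
    # uuid.UUID: id of the owning table (sstables tracked in the registry).
    location: Union[str, uuid.UUID]


def has_registry_entry(options: S3Options) -> bool:
    """Tell the two kinds of S3-backed sstables apart by the variant alone."""
    return isinstance(options.location, uuid.UUID)
```

With this shape, code that needs to pick a bucket prefix format can look at the storage options only, without consulting the sstable state.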

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:04:52 +03:00
Kamil Braun
4d99cd2055 Merge 'raft: fast tombstone GC for group0-managed tables' from Emil Maskovsky
Add the gossip state for broadcasting the nodes state_id.

Implemented the Group0 state broadcaster (based on the gossip) that will broadcast the state id of each node and check the minimal state id for the tombstone GC.

When there is a change in the tombstone GC minimal state id, the state broadcaster will update the tombstone GC time for the group0-managed tables.

The main component of the change is the newly added `group0_state_id_handler` that keeps track, broadcasts and receives the last group0 state_ids across all nodes and sets the tombstone GC deletion time accordingly:
* on each group0 change applied, the state_id handler broadcasts the state_id as a gossip state (only if the value has changed)
* the handler checks for the node state ids every refresh period (configurable, 1h by default)
* on every check, the handler figures out the lowest state_id (timeuuid), which is state_id that all of the nodes already have
* the timestamp of this minimum state_id is then used to set the tombstone GC deletion time
* the tombstone GC calculation then uses that deletion time to provide the GC time back to the callers, e.g. when doing the compaction
* (as the time for tombstone GC calculation has 1s granularity, we actually deduct 1s from the determined timestamp, because it can happen that there were some newer mutations received in the same second that were not distributed across the nodes yet)

This change introduces a new flag to the static schema descriptor (`is_group0_table`) that is being checked for this newly added mode in the tombstone GC. We also add a check (in non-release builds only) on every group0 modification that the table has this flag set.

The group0 tombstone GC handling is similar to the "repair" tombstone GC mode in a sense (that the tombstone GC time is determined according to a reconciliation action), however it is not explicitly visible to (nor editable by) the user. And also the tombstone GC calculation is much simpler than the "repair" mode calculation - for example, we always use the whole range (as opposed to the "repair" mode that can have specific repair times set for specific ranges).

We use the group0 configuration to determine the set of nodes (both current and previous in case of joint configuration) - we need to make sure that we account for all the group0 nodes (if any node didn't provide the state_id yet, the current check round will be skipped, i.e. no GC will be done until all known nodes provide their state_id timestamp value).
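
The check round described above can be sketched as follows. This is not the `group0_state_id_handler` code: the function names are hypothetical, state ids are modelled as version 1 (time-based) UUIDs, and the result is a Unix timestamp in whole seconds.

```python
import uuid
from typing import Mapping, Optional, Sequence

# Number of 100ns intervals between the UUID epoch (1582-10-15) and the
# Unix epoch (1970-01-01), as defined for time-based UUIDs.
_UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000


def timeuuid_unix_seconds(state_id: uuid.UUID) -> int:
    """Extract the timestamp of a time-based UUID as whole Unix seconds."""
    return (state_id.time - _UUID_EPOCH_OFFSET_100NS) // 10_000_000


def group0_gc_time(members: Sequence[str],
                   state_ids: Mapping[str, uuid.UUID]) -> Optional[int]:
    """Return the tombstone GC time, or None to skip this check round.

    The round is skipped if any group0 member has not provided a state id.
    Otherwise the lowest state id (by timestamp) is the one every node
    already has, and 1s is deducted to account for the 1s granularity.
    """
    if not members or any(m not in state_ids for m in members):
        return None
    lowest = min((state_ids[m] for m in members), key=lambda u: u.time)
    return timeuuid_unix_seconds(lowest) - 1
```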

Also note that the group0 state_id handling works on all nodes independently, i.e. each node might have its own (possibly different) state depending on the gossip application state propagation. This is however not a problem, as some nodes might be behind, but they will catch up eventually, and this solution has the benefit of being distributed (as opposed to having a central point to handle the state, like for example the topology coordinator that has been considered in the early stages of the design).

Fixes: scylladb/scylla#15607

New feature, should not be backported.

Closes scylladb/scylladb#20394

* github.com:scylladb/scylladb:
  raft: add the check for the group0 tables
  raft: fast tombstone GC for group0-managed tables
  tombstone_gc: refactor the repair map
  raft: flag the group0-managed tables
  gossip: broadcast the group0 state id
  raft/test: add test for the group0 tombstone GC
  treewide: code cleanup and refactoring
2024-10-11 11:52:27 +02:00
Sergey Zolotukhin
132358dc92 tests: Add tests for alter table with RF=1 to RF=0
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.

Test for scylladb/scylladb#20625
2024-10-11 09:38:24 +02:00
Pavel Emelyanov
f09fe4f351 group0_client: Add shared_token_metadata dependency
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 12:27:46 +03:00
Botond Dénes
86fd9ce8fd schema/schema: break circular dependency with replica::database
The schema module (everything in schema/) is supposed to be towards the
leafs in the ScyllaDB inter-module dependency graph. In other words, it
should not depend on many other modules. On the other hand, almost the
entire codebase depends on the schema module itself.
Currently there is a circular dependency between schema and
replica::database, as the latter is a required argument for
schema::describe(). This is bad, not just because of the dependency mess
it introduces, but also because now schema::describe() can only be used
by code which has a reference to the database handy.

This patch breaks this circular dependency, by introducing the
schema_describe_helper interface and providing an implementation for it
in database.hh.

There is another circular dependency: schema <-> replica::table. This is
not addressed by this patch.

Closes scylladb/scylladb#20893
2024-10-10 10:07:26 +03:00
Benny Halevy
3a12ad96c7 sstables: scylla_metadata: add sstable identifier
Keep a copy of the sstable uuid generation in a new
scylla_metadata sstable_identifier attribute.

If the SSTable happens to have a numerical generation
just create a new time-uuid and log a message about that.

Dump this new attribute in scylla sstable dump tool.

And add a unit test to verify that the written (and then
loaded) sstable identifier matches the sstable's generation.

The motivation for this change stems from backup
deduplication.  In essence, an sstable may already have been
backed up in a previous snapshot, and we don't want to
back it up again if it's already present on external storage.

Today this is based on rclone that compares files checksums,
but once scylla will backup the sstables using the native
object-storage stack (#19890), we would like to use the sstable
globally-unique identifier for deduplication.  Although the
uuid-generation is encoded in the sstable path, the latter
may change, e.g. due to intra-node migration, so keep a copy
of the original unique identifier in scylla-metadata, and that
attribute would survive file-based or intra-node migrations.

Fixes scylladb/scylladb#20459

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21002
2024-10-10 08:52:46 +03:00
Avi Kivity
b66479ea98 Merge 'compaction: fix potential data resurrection with file-based migration' from Ferenc Szili
When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:

1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1

If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.

This change fixes this problem by disabling tombstone garbage collection for pending replicas.
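
The scenario can be reproduced with a toy model. It is a deliberate simplification and not ScyllaDB's compaction: each sstable is a dict mapping a key to (write timestamp, value), a value of None is a tombstone, and every tombstone is treated as already expired.

```python
def merge(sstables):
    """Keep, for every key, the cell with the highest write timestamp."""
    out = {}
    for sst in sstables:
        for key, (ts, value) in sst.items():
            if key not in out or ts > out[key][0]:
                out[key] = (ts, value)
    return out


def compact(sstables, tombstone_gc_enabled):
    """Merge sstables; with GC enabled, drop the (expired) tombstones."""
    merged = merge(sstables)
    if tombstone_gc_enabled:
        merged = {k: cell for k, cell in merged.items() if cell[1] is not None}
    return merged


def read(sstables, key):
    """Return the live value for key, or None if absent or deleted."""
    cell = merge(sstables).get(key)
    return None if cell is None else cell[1]


sst1 = {"k": (20, None)}        # expired tombstone for "k"
sst2 = {"other": (5, "x")}      # additional data
sst3 = {"k": (10, "old")}       # data shadowed by the tombstone in sst1

# sst1 and sst2 are compacted on the pending replica before sst3 arrives.
with_gc = read([compact([sst1, sst2], True), sst3], "k")
without_gc = read([compact([sst1, sst2], False), sst3], "k")
```

In this model `with_gc` sees the old value again, while `without_gc` still finds the tombstone and reports the key as deleted.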

This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.

This change has to be backported to all active versions: 6.0, 6.1 and 6.2, as well as Enterprise 2024.2

Closes scylladb/scylladb#20788

* github.com:scylladb/scylladb:
  test: test tombstone GC disabled on pending replica
  tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
  database::table: add tombstone_gc_enabled(locator::tablet_id)
2024-10-09 21:49:49 +03:00
Avi Kivity
bb1867c7c7 Merge 'sstables: Add digest checking in the validation path of the sstable layer' from Nikos Dragazis
This PR builds upon the PR for checksum validation (#20207) to further enhance scrub's corruption detection capabilities by validating digests as well. The digest (full checksum) is the checksum over the entire data, as opposed to per-chunk checksums which apply to individual chunks. Until now, digests were not examined on any code paths. This PR integrates digest checking into the compressed/checksummed data sources as an optional feature and enables it only through the validation path of the sstable layer (`sstable::validate()`). The validation path is used by the following tools:

* scrub in validate mode
* `sstable validate`

All other reads, including normal user reads, are unaffected by this change.

The PR consists of:
* Extensions to the compressed and checksummed data sources to support digest checking. The data sources receive the expected digest as a parameter and calculate the actual digest incrementally across multiple get() calls. The check happens on the get() call that reaches EOF and results in an exception if the digest is invalid. A digest check requires reading the whole file range. Therefore, a partial read or skip() is treated as an internal error.
* A new shareable digest component loaded on demand by the validation code. No lifecycle management.
* Grouping of old scrub/validate tests for compressed and uncompressed SSTables to reduce code duplication.
* scrub/validate tests for SSTables with valid checksums but invalid digests, and SSTables with no digests at all.
* scrub/validate tests with 3.x Cassandra SSTables to ensure compatibility.
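
The incremental digest check in the data sources can be sketched like this. `DigestCheckingSource` is a hypothetical class, and CRC32 is used purely for illustration; the PR description does not say which checksum algorithm backs the digest.

```python
import zlib


class DigestCheckingSource:
    """Yield chunks and verify a whole-file digest once EOF is reached."""

    def __init__(self, chunks, expected_digest: int):
        self._chunks = iter(chunks)
        self._expected = expected_digest
        self._running = 0  # digest accumulated over the chunks read so far

    def get(self) -> bytes:
        chunk = next(self._chunks, None)
        if chunk is None:
            # EOF: the whole range has been read, so the digest is complete.
            if self._running != self._expected:
                raise ValueError(
                    f"digest mismatch: expected {self._expected:#x}, "
                    f"got {self._running:#x}"
                )
            return b""
        # zlib.crc32 accepts the previous value, so the digest can be
        # extended chunk by chunk without buffering the whole file.
        self._running = zlib.crc32(chunk, self._running)
        return chunk
```

The check only makes sense after a full sequential read, which is why the description treats a partial read or skip() as an internal error.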

Refs #19058.

New feature, no backport is needed.

Closes scylladb/scylladb#20720

* github.com:scylladb/scylladb:
  test: Test scrub/validate with SSTables from Cassandra
  compaction: Make quarantine optional for perform_sstable_scrub()
  test: Make random schema optional in scrub_test_framework
  test: Add tests for invalid digests
  test: Merge scrub/validate tests for compressed and uncompressed cases
  sstables: Verify digests on validation path
  sstables: Check if digest component exists
  sstables: Add digest in the SSTable components
  sstables: Add digest check in compressed data source
  sstables: Add digest check in checksummed data source
2024-10-09 21:33:08 +03:00
Nadav Har'El
a1999cd5d5 cql-pytest: fix run-cassandra on systems with default Java 8
The test/cql-pytest/run-cassandra prefers to use Java 11 if installed on
the system because this is the only version of Java that all modern
versions of Cassandra run on (Cassandra 3 and 4 can run on Java 8 and 11,
Cassandra 5 can run on Java 11 and 17).

However, in our search order we tried the "java" in the user's path
first, before trying Java 11. This means that if the user for some
reason had the ancient Java 8 (which is now a decade old) as their
default "java", they got that instead of Java 11, and couldn't run
Cassandra 5.

While at it, update the comments to reflect the new reality that
Cassandra 5 needs Java 17 or 11 - *not* 11 or 8 as the older Cassandra.
We should eventually change the code logic as well (searching for
versions that depend on the Cassandra version - not always Java 8 and
11), but let's do it later. This patch already fixes a real bug for
developers that did install Java 11 but their default "java" pointed to
Java 8.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21001
2024-10-09 20:51:56 +03:00
Botond Dénes
3e468608e7 Merge 'Collect sstables on boot from all datadirs (and don't collect from S3 twice)' from Pavel Emelyanov
There's a long-pending issue in the distributed loader. When it populates sstables on boot it loops over table.config.all_datadirs, but ignores the loop cursor (the datadir itself), instead loading sstables from table.config.dir, which is the 0th element of all_datadirs. There's a test for that, but it's also broken. Effectively collection happens from table.config.dir several times. For local sstables that's just wasted work and potentially lost sstables (but nobody seems to configure more than 1 datadir anyway). For S3 sstables it's also wasted work and incorrectness.

The fix is for both -- populator and test. The former is to use all_datadirs to construct sstable_directory. To make it happen, creation of sstable_directory now depends on the storage options, and the loop is moved into the branch that creates sstable_directory for local storage type. The test fix is to make sure that some sstables are in a non-default datadir before running the population code.
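
The shape of the bug and of the fix can be shown in a few lines. Both functions are hypothetical illustrations; `list_sstables` stands for whatever collects sstables from one directory.

```python
def collect_buggy(all_datadirs, list_sstables):
    """Loop over every datadir but always read the 0th one (the bug)."""
    found = []
    for _datadir in all_datadirs:
        found += list_sstables(all_datadirs[0])  # loop cursor is ignored
    return found


def collect_fixed(all_datadirs, list_sstables):
    """Read each datadir exactly once, using the loop cursor."""
    found = []
    for datadir in all_datadirs:
        found += list_sstables(datadir)
    return found
```

With two datadirs, the buggy variant returns the first directory's sstables twice and never sees the second directory, which matches the "wasted work and potentially lost sstables" described above.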

Closes scylladb/scylladb#20819

* github.com:scylladb/scylladb:
  test: Fix test_multiple_data_dirs
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Use correct datadir to collect local sstable
  distributed_loader: Move all-datadirs loop to local storage collecting
  distributed_loader: Collect table subdirs based on its storage options
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Squash loop of collect_subdir into one method
  distributed_loader: Convert map of directories into a vector
  distributed_loader: Make start_subdir() method work with directory
  distributed_loader: Drop local reference variable
  distributed_loader: Split start_subdir()
  distributed_loader: Remove allow-offstrategy argument
  distributed_loader: Make populate() method work with directory
  distributed_loader: Remove check for sstable_directory presense
  distributed_loader: Out-line table_populator() methods
  distributed_loader: Print storage options, not datadir
  distributed_loader: Print prepared message
  sstable_directory: Add sstable_state argument ot one of constructors
  sstable_directory: Add state() method
2024-10-09 14:43:34 +03:00
Lakshmi Narayanan Sreethar
69c385f540 compaction: make drain wait for compactions to stop during shutdown
During shutdown, the compaction_manager starts stopping ongoing
compaction tasks through `really_do_stop()` method as soon as it
receives a signal from the abort source. Later, when the database object
shuts down, it calls `compaction_manager::drain` to ensure that all
compaction tasks have stopped. However, `compaction_manager::drain` is
currently implemented in such a way that, during shutdown, it
effectively becomes a no-op because the compaction_manager has already
initiated the stopping of tasks. As a result the caller assumes that all
the compaction tasks have stopped and proceeds to close all the tables.
This can lead to race conditions where table closures overlap with
compaction tasks that are still running, resulting in exceptions like:

```
exception during mutation write to 127.0.0.1:
utils::internal::nested_exception<std::runtime_error> (Could not write
mutation system:compaction_history
(pk{0010b70d31705e0411efb2edf6467f094c8b}) to commitlog):
seastar::gate_closed_exception (gate closed)
```

This commit fixes the issue by updating `compaction_manager::drain` to
invoke `stop_ongoing_compactions` even during shutdown to ensure that it
waits for the ongoing compaction tasks to complete. The
`stop_ongoing_compactions` method will also send a stop request to these
tasks before waiting, but the request will be ignored by the tasks as
they would have already received one earlier from `really_do_stop()`.

Fixes #20197

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20715
2024-10-09 12:08:32 +03:00
Pavel Emelyanov
17ec416178 Merge 'Make sure S3 upload completion parses possible error' from Ernest Zaslavsky
Fixes #20517
Adds `aws_error`, which can hold errors found in the S3 response body. Adds a check for such an error to the multipart upload completion, and issues a retry if the error is retryable.
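
As a rough illustration of why the body has to be parsed: S3 can report a failure of the multipart upload completion inside the response body, as an `<Error>` XML document. The sketch below is not the `aws_error` code from this series; `parse_s3_error`, `should_retry` and the set of retryable codes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed set of retryable error codes, for illustration only.
RETRYABLE_CODES = {"InternalError", "SlowDown", "ServiceUnavailable", "RequestTimeout"}


def parse_s3_error(body):
    """Return the error code if the body is an S3 <Error> document, else None."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # empty or non-XML body: nothing to report
    if root.tag != "Error":
        return None  # e.g. a normal completion result
    return root.findtext("Code")


def should_retry(body):
    """True if the body carries an error whose code is considered retryable."""
    return parse_s3_error(body) in RETRYABLE_CODES
```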

Closes scylladb/scylladb#20518

* github.com:scylladb/scylladb:
  test: add complete_multipart_upload completion tests
  code: s3 client error handling
  code: add response parsing and error handling to the complete_multipart_upload
  code: Introduce AWS errors parsing
2024-10-09 12:01:27 +03:00
Piotr Smaron
e0c1a51642 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240

Closes scylladb/scylladb#21007
2024-10-09 10:51:18 +02:00
Emil Maskovsky
0c9308cf48 raft: add the check for the group0 tables
Added the runtime check to ensure that all the tables that are used with
the group0 commands are marked as group0 tables.
2024-10-08 21:08:11 +02:00
Emil Maskovsky
a03e98d6e8 raft: fast tombstone GC for group0-managed tables
Set the tombstone GC time for group0-managed tables to the minimal state
id of the group0 nodes.

The check is being done based on a timer, iterating through each node
(according to the group0 topology configuration) and taking the minimum
across all nodes.

This minimum timestamp is then used to set the tombstone GC time
for all the group0-managed tables.
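
The "minimum across all nodes" step could be sketched as below. This is an illustration only: `group0_gc_time` is a hypothetical name, state ids are modelled as plain integer timestamps, and returning `None` when some node's state is unknown is an assumption of the sketch rather than something stated in this message.

```python
def group0_gc_time(state_id_timestamps):
    """Return the minimum state-id timestamp across all group0 nodes.

    `state_id_timestamps` maps node id -> timestamp of that node's group0
    state id, or None when it is not known. In that case (or when the map
    is empty) the sketch returns None, meaning "do not advance the GC time".
    """
    values = list(state_id_timestamps.values())
    if not values or any(ts is None for ts in values):
        return None
    return min(values)
```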

Fixes: scylladb/scylla#15607
2024-10-08 21:07:30 +02:00
Emil Maskovsky
fa45fdf5f7 raft/test: add test for the group0 tombstone GC
Test that the group0 fast tombstone GC works correctly.
2024-10-08 20:53:54 +02:00
Emil Maskovsky
a840949ea0 treewide: code cleanup and refactoring
Fix the clang-tidy warnings, code cleanup and improvements.

Applied the clang format to the updated places.
2024-10-08 20:53:54 +02:00
Nadav Har'El
b4df07df71 Merge 'cql3: Print arguments and return type without frozen when describing UDF' from Dawid Mędrek
Scylla doesn't allow for the types of arguments or the return type of a UDF
to be frozen. As a result, before these changes, create statements
produced to restore UDFs as part of `DESCRIBE` statements could not
be executed.

Fixes scylladb/scylladb#20256

Backport: necessary as the restore process may not work correctly without these changes. The affected versions span from 5.2 to the current master, but we only want to apply the fix to the live versions, so 6.0, 6.1, and 6.2.

Closes scylladb/scylladb#20816

* github.com:scylladb/scylladb:
  cql3/functions/user_function: Print arguments and return type without frozen
  cql3/functions/user_function: Use fmt to format create statement
2024-10-08 16:05:28 +03:00
Kamil Braun
2d9b8f269f Merge 'cql: improve validating RF's change in ALTER tablets KS' from Piotr Smaron
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.

Fixes: #20039

Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.

Closes scylladb/scylladb#20208

* github.com:scylladb/scylladb:
  cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
  cql: join new and old KS options in ALTER tablets KS
  cql: fix validation of ALTERing RFs in tablets KS
  cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
  cql: validate RF change for new DCs in ALTER tablets KS
  cql: extend test_alter_tablet_keyspace_rf
  cql: refactor test_tablets::test_alter_tablet_keyspace
  cql: remove unused helper function from test_tablets
2024-10-08 14:33:45 +02:00
Avi Kivity
48ea51029f Merge 'time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value' from Benny Halevy
Currently, `estimated_pending_compactions` uses a precalculated value calculated by `update_estimated_compaction_by_tasks`, which, in turn, is called by `get_compaction_candidates`.  That means that, if `estimated_pending_compactions` is called, e.g. right after major compaction, it will return an outdated value that was calculated prior to major compaction, and so, it is no longer relevant.

Instead, just recalculate the value in `estimated_pending_compactions` and drop `update_estimated_compaction_by_tasks`.
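
The stale-cache problem can be shown with a toy model. It is only a sketch: `ToyStrategy` and its methods are hypothetical, and the single `pending` counter is a placeholder for the real per-window estimation logic, which is not described here.

```python
class ToyStrategy:
    def __init__(self, pending):
        self.pending = pending  # placeholder for the real per-window state
        self._cached = 0

    def get_compaction_candidates(self):
        # Old code refreshed the cached estimate only on this path.
        self._cached = self.pending
        return self.pending

    def estimated_pending_old(self):
        # Returns whatever was cached by the last candidates lookup.
        return self._cached

    def estimated_pending_new(self):
        # Recomputes from the current state on every call.
        return self.pending
```

If the state changes without a call to `get_compaction_candidates()` (as after a major compaction), the old estimate stays at its previous value while the new one reflects the current state.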

* Enhancement, no backport required

Closes scylladb/scylladb#20892

* github.com:scylladb/scylladb:
  test: cql-pytest: test_compaction: add test_compactionstats_after_major_compaction
  test/cql-pytest: rename test_compaction{_tombstone_gc,}
  time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value
2024-10-08 13:29:51 +03:00
Dawid Mędrek
8582ed513b cql3/functions/user_function: Print arguments and return type without frozen
Scylla doesn't allow for the types of arguments or the return type
to be frozen. As a result, before these changes, create statements
produced to restore UDFs as part of `DESCRIBE` statements could not
be executed.

We fix that and add a reproducer test and another one to verify that
the implementation is correct.
2024-10-07 20:53:10 +02:00
Nadav Har'El
45ccceb137 alternator: add "dc" and "rack" options to "/localnodes" request
Before this patch, the "/localnodes" HTTP request to the Alternator server
lists all the live nodes of the current DC. This patch adds two optional
parameters to this query:

  dc: allows listing the live nodes of a specific named DC instead of the
      current DC of the server.

  rack: allows restricting the results to just the nodes belonging to a
      specific named rack.

For both options, if no live node exists in the given dc or rack (in
particular, if such a dc or rack doesn't even exist), an empty list is
returned - it's not an error.

The default, if dc or rack is not specified, remains exactly as it is
today: look at the current DC (the one of the node receiving the request),
and do not restrict the list to any specific rack.

We expect the new options that we added here to be useful for two use cases:

1. A client that knows of *some* Scylla node (belonging to an unknown DC),
   but wants to list the nodes in *its* DC, which it knows by name.

2. A client in a multi-rack DC (e.g., multi-AZ region in AWS) that wants
   to send requests to nodes in its own rack (which it knows by name),
   to avoid cross-rack networking costs.

Note that in both cases, this requires clients to know the names of DCs
and AZs via some out-of-band means. The client can also get a list of DCs
and racks using the system.local system table, as the tests included in
this patch demonstrate.
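
The selection rules described above can be summarized in a small Python model. This is a sketch of the semantics only, not the Alternator implementation: `localnodes` and the node dictionaries are hypothetical.

```python
def localnodes(live_nodes, current_dc, dc=None, rack=None):
    """Return addresses of live nodes matching the optional dc/rack filters.

    live_nodes: list of dicts with "address", "dc" and "rack" keys.
    current_dc: DC of the node that received the request (the default).
    """
    target_dc = current_dc if dc is None else dc
    return [
        node["address"]
        for node in live_nodes
        if node["dc"] == target_dc and (rack is None or node["rack"] == rack)
    ]
```

An unknown dc or rack simply matches no node, so the result is an empty list rather than an error.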

This patch includes two sets of tests for these new options: One in the
single-node test/alternator framework that has a single dc and rack but
can still check the case of an unknown dc or rack (in which case an empty
list is returned). The second test is in the topology framework, and runs
an 8-node cluster with two DCs, two racks, and two nodes in each, and
checks all the combinations of "/localnodes" requests with and without
dc and rack options. This test also resolves a longstanding TODO that
asked for such a multi-DC test for "/localnodes" to be written.

Fixes #12147

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20915
2024-10-07 20:53:47 +03:00
Pavel Emelyanov
8bfbc563cc test: Remove sstable factory from test_min_max_clustering_key()
The helper makes sstables from env directly. Callers need not create the
factory after that. The less code the better.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20983
2024-10-07 20:08:05 +03:00
Piotr Smaron
ee56bbfe61 cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absolute differences of RFs specified in the statement is
not greater than 1.
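
The rule can be sketched in a few lines of Python. This is an illustration, not the validation code in `alter_keyspace_statement.cc`: `check_rf_change` is a hypothetical helper, and it assumes `new_rf` is the complete per-DC mapping that would be in effect after the ALTER, with a DC missing from either mapping counted as RF=0.

```python
def check_rf_change(old_rf, new_rf):
    """Raise ValueError if the total absolute RF change across DCs exceeds 1."""
    dcs = set(old_rf) | set(new_rf)
    total = sum(abs(new_rf.get(dc, 0) - old_rf.get(dc, 0)) for dc in dcs)
    if total > 1:
        raise ValueError(
            f"sum of absolute RF differences is {total}, at most 1 is allowed"
        )
    return total
```

For example, changing `{"dc1": 3, "dc2": 3}` to `{"dc1": 4, "dc2": 3}` passes, while changing it to `{"dc1": 4, "dc2": 2}` is rejected because two replicas would be pending at once.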
2024-10-07 17:02:50 +02:00
Piotr Smaron
2aabe7f09c cql: join new and old KS options in ALTER tablets KS
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was omitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.
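
The merging behaviour can be sketched as follows. This is a simplified model, not the code in the topology coordinator: `merge_ks_options` is a hypothetical helper, and the dictionary layout (a `"replication"` map of per-DC RFs and a `"tablets"` map with an `"initial"` key) is an assumption made for the example.

```python
def merge_ks_options(old, new):
    """Join old and new keyspace options instead of replacing old with new."""
    # DCs omitted from the ALTER statement keep their old RF.
    replication = {**old["replication"], **new.get("replication", {})}

    # Tablets options: take the new ones if given, but keep the old
    # initial tablets value when the statement does not specify it.
    tablets = dict(new.get("tablets", old["tablets"]))
    if "initial" not in tablets and "initial" in old["tablets"]:
        tablets["initial"] = old["tablets"]["initial"]

    return {"replication": replication, "tablets": tablets}
```

With the old update logic, the omitted DC in such a statement would have ended up with RF=0; here it keeps its previous value.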
2024-10-07 17:02:45 +02:00