Commit Graph

49252 Commits

Author SHA1 Message Date
Dawid Mędrek
d2c5268196 cql3: Produce CREATE MATERIALIZED VIEW statement when describing MV of index
Before this change, executing `DESCRIBE MATERIALIZED VIEW` on the underlying
materialized view of a secondary index would produce a `CREATE INDEX` statement.
It was not only confusing, but it also prevented the user from learning
the definition of the view. The only way to do so was to query system tables.

We change that behavior and produce a `CREATE MATERIALIZED VIEW` statement
instead. The statement is printed as a comment to implicitly convey that
the user should not attempt to execute it to restore the view. A short comment
is provided to make it clearer.

Before this commit:

```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;

CREATE INDEX i ON ks.t(v);
```

After this commit:

```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;

/* Do NOT execute this statement! It's only for informational purposes.
   This materialized view is the underlying materialized view of a secondary
   index. It can be restored via restoring the index.

CREATE MATERIALIZED VIEW ks.i_index [...];

*/
```

Note that describing the base table has not been affected and still works
as follows:

```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE TABLE ks.t;

CREATE TABLE ks.t (
    p int,
    v int,
    PRIMARY KEY (p)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'IncrementalCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};

CREATE INDEX i ON ks.t(v);
```

We also provide two reproducers of scylladb/scylladb#24610.

Fixes scylladb/scylladb#24610

Closes scylladb/scylladb#25697
2025-09-03 15:21:37 +02:00
Piotr Dulikowski
f95808cbe7 Merge 'cdc/generation: Clone topology_description asynchronously' from Dawid Mędrek
An instance of `cdc::topology_description` can be quite big. The vector
it consists of stores as many `token_range_description`s as there are
vnodes, and the size of each `token_range_description` is O(#shards).

Because of that, copying an instance of the type can lead to reactor
stalls. To prevent that, we introduce an asynchronous function copying
the contents of the object.
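
The idea of copying a large container without stalling can be sketched outside Seastar. The Python snippet below is only an analogy that uses `asyncio` in place of the reactor; `clone_async` and `chunk_size` are hypothetical names chosen for illustration, not the function introduced by this series.

```python
import asyncio
import copy


async def clone_async(descriptions, chunk_size=64):
    """Deep-copy a large list, yielding to the event loop between chunks."""
    result = []
    for index, entry in enumerate(descriptions, start=1):
        result.append(copy.deepcopy(entry))
        if index % chunk_size == 0:
            # Preemption point: let other tasks run so the copy
            # never monopolizes the loop for the whole container.
            await asyncio.sleep(0)
    return result
```

A synchronous `copy.deepcopy(descriptions)` would produce the same result, but it would hold the event loop for the entire copy.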

Reactor stalls were detected in the call to `map_reduce` in
`generation_service::legacy_do_handle_cdc_generation`, so let's start
using the new function there.

A similar scenario occurs in `generation_service::handle_cdc_generation`,
so we modify it too.

Unfortunately, it doesn't seem viable to provide a reproducer of said
problem.

Fixes scylladb/scylladb#24522

Backport: none. Reactor stalls are not critical.

Closes scylladb/scylladb#25730

* github.com:scylladb/scylladb:
  cdc/generation: Delete copy constructors of topology_description
  cdc/generation: Clone topology_description asynchronously
2025-09-03 13:41:58 +02:00
Botond Dénes
6116f9e11b Merge 'Compaction tasks progress' from Aleksandra Martyniuk
Determine the progress of compaction tasks that have
children.

The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)

If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.

All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if the set of SSTables for the keyspace/table/compaction group changes.

Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.
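
The two cases described above can be sketched as follows. This is a minimal illustration in Python, not the task manager code; `Progress` and `parent_progress` are hypothetical names, and the fallback branch assumes that the total is simply the sum of the totals of the children created so far.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Progress:
    completed: float
    total: float


def parent_progress(children: Sequence[Progress],
                    expected_total_workload: Optional[float] = None) -> Progress:
    """Aggregate the progress of child tasks into the parent's progress."""
    completed = sum(child.completed for child in children)
    if expected_total_workload is not None:
        # Known workload: the total is fixed up front.
        return Progress(completed, expected_total_workload)
    # Unknown workload (e.g. reshape): the total grows as children are created.
    return Progress(completed, sum(child.total for child in children))
```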

Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.

New feature, no backport needed

Closes scylladb/scylladb#15158

* github.com:scylladb/scylladb:
  test: add compaction task progress test
  compaction: set progress unit for compaction tasks
  compaction: find expected workload for reshard tasks
  compaction: find expected workload for global cleanup compaction tasks
  compaction: find expected workload for global major compaction tasks
  compaction: find expected workload for keyspace compaction tasks
  compaction: find expected workload for shard compaction tasks
  compaction: find expected workload for table compaction tasks
  compaction: return empty progress when compaction_size isn't set
  compaction: update compaction_data::compaction_size at once
  tasks: do not check expected workload for done task
2025-09-03 13:23:42 +03:00
Pavel Emelyanov
b0aa2d61d9 Merge 'cql3: add default replication factor to create_keyspace_statement' from Dario Mirovic
When creating a new keyspace, replication factor must be stated.
For example:
`CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };`

This patch changes it in the following way - if there is no
replication factor specified, use default replication factor.
Default replication factor is equal to the number of racks that
are not arbiter-only, i.e. racks that have at least one non-arbiter node.
The following syntax is now valid:
`CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };`
`CREATE KEYSPACE ks WITH REPLICATION { };`
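
The rule for the default can be sketched as follows. This is an illustrative Python model, not the code from the patch; `default_replication_factor` and the `(rack, is_arbiter)` input shape are assumptions made for the example.

```python
def default_replication_factor(nodes):
    """Count racks that contain at least one non-arbiter node.

    `nodes` is an iterable of (rack_name, is_arbiter) pairs.
    """
    racks_with_data_nodes = {rack for rack, is_arbiter in nodes if not is_arbiter}
    return len(racks_with_data_nodes)
```

For a cluster with nodes `("r1", False)`, `("r1", True)`, `("r2", False)` and `("r3", True)`, rack `r3` is arbiter-only, so only `r1` and `r2` are counted.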

Fixes #16028

Backport is not needed. This is an enhancement for future releases.

Closes scylladb/scylladb#25570

* github.com:scylladb/scylladb:
  docs/cql: update documentation for default replication factor
  test/cqlpy: add keyspace creation default replication factor tests
  cql3: add default replication factor to `create_keyspace_statement`
2025-09-03 12:31:53 +03:00
Pavel Emelyanov
c0808c90b0 api: Use validate_table() helper in /storage_service/tokens_endpoint handler
The handler validates if the given ks:cf pair exists in the database,
then finds the table id to process further. There's a helper that does
both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25669
2025-09-03 11:44:50 +03:00
Pavel Emelyanov
b5610050a1 api: Make GET/storage_service/drain handler work on storage service
POSTing on the same URL launches storage_service::drain(), so GETing on
it should (not that it's restricted somehow, but still) work on the same
service. This change removes one more user of http_context::database
which in turn will allow removing the database reference from the context
eventually.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25677
2025-09-03 11:40:39 +03:00
Radosław Cybulski
7b3d42f83e Remove unused boost macro definitions
Closes scylladb/scylladb#25742
2025-09-03 10:06:33 +03:00
Radosław Cybulski
c242234552 Revert "build: add precompiled headers to CMakeLists.txt"
This reverts commit 01bb7b629a.

Closes scylladb/scylladb#25735
2025-09-03 09:46:00 +03:00
Calle Wilund
bc20861afb system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for nonexistent
tables (which should patch any existing servers with lingering data
as well).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699
2025-09-03 07:25:34 +03:00
Sergey Zolotukhin
13392a40d4 gossiper: check for a race condition in do_apply_state_locally
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.

Fixes scylladb/scylladb#25702
Fixes scylladb/scylladb#25621
Ref scylladb/scylla-enterprise#5613

Closes scylladb/scylladb#25727
2025-09-02 20:44:21 +02:00
Piotr Dulikowski
78ef334333 Merge 'Move "cache" API endpoints registration closer to column_family ones ' from Pavel Emelyanov
These two "blocks" of endpoints have different URL prefixes, but work with the same "service", which is sharded<replica::database>. The latter block had already been fixed to carry the sharded<database>& around (#25467), now it's the "cache" turn. However, since these endpoints also work with the database, there's no need for dedicated top-level set/unset machinery (similarly, gossiper has two API set/unset blocks that come together, see #19425), it's enough to just set/unset them next to each other.

Ongoing http_context dependency cleanup, no need to backport

Closes scylladb/scylladb#25674

* github.com:scylladb/scylladb:
  api: Capture and use db in cache_service handlers
  api: Add sharded<database>& arg to set_cache_service()
  api: Squash (un)set_cache_service into ..._column_family
  api: Coroutinize set_server_column_family()
2025-09-02 13:59:02 +02:00
Avi Kivity
7ed261fc52 Merge 'Inital GCP object storage support' from Calle Wilund
Adds infrastructure and client for interaction with GCP object storage services.

Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later.

This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some.

Test is added, though could be more comprehensive (feel free to suggest a test vector).
Test can run in either local mock server mode (default), or against actual GCP.
See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars.
Default is to run the test against a temporary Docker daemon.

Closes scylladb/scylladb#24629

* github.com:scylladb/scylladb:
  test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
  proc-utils: Re-export waiting types from seastar
  proc-utils: Inherit environment from current process
  utils::gcp::object_storage: Add client for GCP object storage
  utils::http: Add optional external credentials to dns_connection_factory init
  utils::rest: Break out request wrapper and send logic
  encryption::gcp_host: Use shared gcp credentials + REST helpers
  utils::gcp: Move/add gcp credentials management to shared file
  utils::rest::client: Add formatter for seastar::http::reply
  utils::rest::client: Add helper routines for simple REST calls
  utils::http: Make shared system trust certificates public
2025-09-02 14:38:09 +03:00
Avi Kivity
fe308de8df Merge 'treewide: Add missing #pragma once' from Ernest Zaslavsky
Add missing #pragma once and license boilerplate to include headers.

Consider adding a CI step to catch missing header guards early. It can be done easily by running `cpplint` like below
```
 find . -path ./seastar -prune -o -path ./venv -prune -o -path ./idl -prune -o -type f \( -name "*.h" -o -name "*.hh" -o -name "*.hpp" \) -print0 | xargs -0 cpplint 2>&1 | grep "header guard found"
```

No backport is needed, the change is not "functional"

Closes scylladb/scylladb#25768

* github.com:scylladb/scylladb:
  treewide: Add missing license boilerplate
  treewide: Add missing `#pragma once`
2025-09-02 13:18:04 +03:00
Piotr Dulikowski
762d9ef68f Merge 'cdc: Set tombstone_gc when creating log table' from Dawid Mędrek
Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the
schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately,
that didn't happen when creating CDC log tables, and so we might have missed
some of the properties that would normally be set to some value, even if the
default one.

One particular example of that phenomenon was `tombstone_gc`. For better or
worse, it's not a "standalone property" of a table, but rather part of
`extensions`. [Somewhat related issue: scylladb/scylladb#9722]

That may have and did cause trouble. Consider this scenario:

1. A CDC log table is created.
2. The table does NOT have any value of `tombstone_gc` set.
3. The user edits the table via `ALTER TABLE`. That statement treats the log
   table just like any other one (at least as far as the relevant portion of the
   logic is concerned). Among other things, it uses
   `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc`
   property is set to some value:
   * the default one if the user doesn't specify it in the statement,
   * a custom one if they do.

Why is that a problem?

First of all, it's confusing. When we perform a schema backup and a table uses
CDC, we include an ALTER statement for its corresponding CDC log table (for more
context, see issue scylladb/scylladb#18467 or commit
scylladb/scylladb@f12edbdd95).

There are two consequences for the user here:
1. If the log table had NOT been altered ever since it was created, the
   statement will miss the `tombstone_gc` property as if it couldn't be set for
   it at all. That's confusing!
2. If the log table HAD in fact been altered after its creation, the statement
   will include the `tombstone_gc` property. That's even more confusing (why was
   it not present the first time, but it is now?).

The `tombstone_gc` property should always be set to avoid confusion and
problematic edge cases in tests and to simply be consistent with how other
schema entities work.

The solution we employ is that we always set the property to the default
value. That includes the case when we reattach the log table to the base;
consider the following scenario:

1. Create a table with CDC enabled.
2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`.
3. Change the `tombstone_gc` property of the log table.
4. Reattach the log table to the base in the same way as in step 2.

The expected result would be that the new value of `tombstone_gc` would be
preserved after reattaching the log table. However, that's not what will
happen. We decide to stay consistent with how other properties of a log
table behave, and we reset them after every reattachment. We might change that
in the future: see issue scylladb/scylladb#25523.

Two reproducer tests of scylladb/scylladb#25187 are included in the changes.

Backport: The problem is not critical, so it may not be necessary to backport the changes.
That's to be discussed.

Closes scylladb/scylladb#25521

* github.com:scylladb/scylladb:
  cdc: Set tombstone_gc when creating log table
  tombstone_gc: Add overload of get_default_tombstone_gc_mode
  tombstone_gc: Rename get_default_tombstonesonte_gc_mode
2025-09-02 10:20:11 +02:00
Tomasz Grabiec
a7f10b585e Merge 'drop table: fix crash on drop table with concurrent cleanup' from Ferenc Szili
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash

Fixes: #25706

This needs to be backported to all supported versions with tablets

Closes scylladb/scylladb#25708

* github.com:scylladb/scylladb:
  test: reproducer and test for drop with concurrent cleanup
  truncate: check for closed storage group's gate in discard_sstables
2025-09-02 00:02:14 +02:00
Calle Wilund
21adfd8a60 test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
Allows testing using either local mock server (installed or using docker),
or real GCP project (not tested as of writing this).

v2: Try podman if docker unavail
v3: Ensure we check log output on fake-gcs, because when using podman, the
    published port will be connectible even though the actual server is not
    up yet.
v4: Use ephemeral port forward in docker/podman to allow us running parallel
    instances. Also adjust credentials and port finding in test.
v5: Re-ensure no parallel tests for this: We seem to time out in podman
    trying to fetch image for X parallel tests
v6: Remove the ephemeral port stuff. Because of course this does not work
    with our podman-in-podman. Do brute-force port speculation instead.
v7: Up timeout for server start to allow docker pull.
v8: Fix string check error
v9: Add explicit docker image version
2025-09-01 18:14:20 +00:00
Calle Wilund
5ead6ec420 proc-utils: Re-export waiting types from seastar
Just to make directly accessible from wrapper type
2025-09-01 18:03:44 +00:00
Calle Wilund
8169327553 proc-utils: Inherit environment from current process
In most cases, when launching a process from tests, we will want to
inherit our own env. Add option (default true) to do so.
2025-09-01 18:03:44 +00:00
Calle Wilund
4a5b547a86 utils::gcp::object_storage: Add client for GCP object storage
Adds a minimal client for GCP object storage operations:

* Create buckets
* Delete buckets
* List bucket content
* Copy/move bucket content
* Delete bucket content
* Upload bucket content
* Download bucket content
2025-09-01 18:03:44 +00:00
Calle Wilund
8f54b709ce utils::http: Add optional external credentials to dns_connection_factory init
Also allow creating the object using an endpoint expression.
Note: this moves code to the .cc file, because it introduces a few
more lines, and I feel we have too much stuff in headers as is.
2025-09-01 18:03:44 +00:00
Calle Wilund
0e9e1f7738 utils::rest: Break out request wrapper and send logic
Allows sharing some of the wrapping and logic outside the
single-call object/routine paths, using it also with an external
seastar::http::client, i.e. caching resources across several calls.
2025-09-01 18:03:44 +00:00
Calle Wilund
fe4ab7f7bf encryption::gcp_host: Use shared gcp credentials + REST helpers
Removes code in favour of transplanted shared util code.
2025-09-01 18:03:44 +00:00
Calle Wilund
2b7ad605b3 utils::gcp: Move/add gcp credentials management to shared file
Copied from encryption::gcp_host. Light-weight impl of gcp credentials
management.
2025-09-01 18:03:44 +00:00
Calle Wilund
f6d7c7e300 utils::rest::client: Add formatter for seastar::http::reply
2025-09-01 18:03:44 +00:00
Calle Wilund
cc1e659abd utils::rest::client: Add helper routines for simple REST calls
Packing headers and unpacking response to json. Usable for esp. gcp
interaction.
2025-09-01 18:03:43 +00:00
Calle Wilund
886fcf1759 utils::http: Make shared system trust certificates public
So other clients/factories can share.
2025-09-01 18:03:43 +00:00
Karol Nowacki
3086d15999 cql3: Fix crash on ANN OF query when TRACING ON is enabled
Executing a vector search (SELECT with ANN OF ordering) query with `TRACING ON` enabled
caused a node to crash due to a null pointer dereference.

This occurred because a vector index does not have an associated view
table, making its `_view_schema` member null. The implementation
attempted to enable tracing on this null view schema, leading to the
crash.

The fix adds a null check for `_view_schema` before attempting to
enable tracing on the view (index) table.

A regression test is included to prevent this from happening again.

Fixes: VECTOR-179

Closes scylladb/scylladb#25500
2025-09-01 17:26:54 +03:00
Avi Kivity
41880bc893 cql3: statement_restrictions: forbid querying a single-column inequality restriction on a multi-column restriction
CQL supports multi-column inequality restrictions in the form

  (ck1, ck2, ck3) >= (:v1, :v2, :v3)

This restriction shape is only allowed on clustering keys, and
is translated into a partition_slice allowing the primary index
to efficiently select the part of the partition that satisfies the
restriction.

The possible_lhs_values() function allows extracting
single-column restrictions from this and similar tuple restrictions.
For example, the multi-column restriction

  (ck1, ck2, ck3) = (:v1, :v2, :v3)

implies that ck2 = :v2. If we have an index on ck2, and if we don't
further have a restriction on the partition key, then it is
advantageous to use the index to select rows, and then filter
on ck1 and ck3 to satisfy the full restriction.

For the inequality restriction, we can only infer a restriction on the
first column due to lexicographical comparison. We can see that, given

  (ck1, ck2, ck3) >= (:v1, :v2, :v3)

then

  ck1 >= :v1
  ck2 = unbounded
  ck3 = unbounded

and possible_lhs_values() indeed computes this.
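
The lexicographic argument can be checked with a small Python experiment, since Python tuples also compare lexicographically. The bound `(2, 5, 7)` and the value domain `0..9` are arbitrary choices for illustration.

```python
import itertools

bound = (2, 5, 7)
domain = range(0, 10)

# All (ck1, ck2, ck3) rows over the domain that satisfy the tuple restriction.
rows = [ck for ck in itertools.product(domain, repeat=3) if ck >= bound]

# Values each column takes among the matching rows.
ck1_values = {ck[0] for ck in rows}
ck2_values = {ck[1] for ck in rows}
ck3_values = {ck[2] for ck in rows}
```

Only `ck1_values` is bounded from below by the first component of the bound. As soon as `ck1` exceeds it, `ck2` and `ck3` can take any value in the domain.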

However, this is never used in practice, and it makes further refactoring
difficult. If we want to convert a boolean factor of the where clause
to a predicate on a column or tuple of columns, we cannot do so because
we can actually generate two predicates: one on the tuple and one on the
first column.

Since it's not used, remove it.

This code was first introduced in d33053b841 ("cql3/restrictions: Add
free functions over new classes")
(search for "if (column_index_on_lhs > 0) {").

It does not directly correspond to pre-expression code.

Closes scylladb/scylladb#25757
2025-09-01 17:21:26 +03:00
Artsiom Mishuta
5910ad3c6d test.py: apply the nightly label on test_topology_recovery_basic
This test is for the old gossip-based recovery procedure, which is an almost obsolete feature that won't change anymore.

Closes scylladb/scylladb#25694
2025-09-01 14:16:29 +02:00
Emil Maskovsky
5dac4b38fb test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721

Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.

Closes scylladb/scylladb#25685
2025-09-01 13:59:47 +02:00
Ernest Zaslavsky
0e4292adb4 treewide: Add missing license boilerplate
Add missing license boilerplate to include headers
2025-09-01 14:58:32 +03:00
Ernest Zaslavsky
19345e539f treewide: Add missing #pragma once
Add missing `#pragma once` to include headers
2025-09-01 14:58:21 +03:00
Petr Gusev
2e757d6de4 cas: pass timeout_if_partially_accepted := write to accept_proposal()
Write requests cannot be safely retried if some replicas respond with
accepts and others with rejects. In this case, the coordinator is
uncertain about the outcome of the LWT: a subsequent LWT may either
complete the Paxos round (if a quorum observed the accept) or overwrite it
(if a quorum did not). If the original LWT was actually completed by
later rounds and the coordinator retried it, the write could be applied
twice, potentially overwriting effects of other LWTs that slipped in
between. Read requests do not have this problem, so they
can be safely retried.

Before this commit, handler->accept_proposal was called with
timeout_if_partially_accepted := true. This caused both read and write
requests to throw an "uncertainty" timeout to the user in the case
of the contention described above. After this commit, we throw an
"uncertainty" timeout only for write requests, while read requests
are instead retried in the loop in sp::cas.
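
The resulting control flow can be modeled roughly as below. This is a simplified Python sketch, not the storage proxy code; `cas`, `UncertaintyTimeout` and the per-round outcome strings are hypothetical names used only to show how reads and writes diverge on a partial accept.

```python
class UncertaintyTimeout(Exception):
    """The coordinator cannot tell whether the write was applied."""


def cas(is_write, accept_outcomes):
    """Return the round number in which the proposal was accepted.

    `accept_outcomes` yields 'accepted', 'rejected' or 'partial' per round.
    """
    timeout_if_partially_accepted = is_write
    for attempt, outcome in enumerate(accept_outcomes, start=1):
        if outcome == "accepted":
            return attempt
        if outcome == "partial" and timeout_if_partially_accepted:
            # A write must not be retried: it could be applied twice.
            raise UncertaintyTimeout()
        # Rejected, or a partially accepted read: retry in the loop.
    raise RuntimeError("no more rounds to try")
```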

Closes scylladb/scylladb#25602
2025-09-01 14:31:04 +03:00
Pavel Emelyanov
840cdab627 api: Move /load and /metrics/load handlers code to column_family.cc
Both handlers need database to proceed and thus need to be registered
(and unregistered) in a group that captures database for its handlers.

Once moved, the used get_cf_stats() method can be marked local to
column_family.cc file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25671
2025-09-01 08:11:00 +02:00
Dawid Mędrek
fc50e9d0a4 test/perf: Require smp=1 in perf_cache_eviction
Trying to run the test with more than one shard results in a failure
when generating sharding metadata:

```
ERROR 2025-08-27 16:00:17,551 [shard  0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting
```

Let's require that the test be run with a single shard.

Closes scylladb/scylladb#25703
2025-09-01 08:59:35 +03:00
Nadav Har'El
6d1abc5b2c utils/base64: fix misleading code and comment (no functional change)
utils/base64.cc had some strange code with a strange comment in
base64_begins_with().

The code had

        base.substr(operand.size() - 4, operand.size())

The comment claims that this is "last 4 bytes of base64-encoded string",
but this comment is misleading - operand is typically shorter than base
(this is the whole point of base64_begins_with()), so the real
intention of the code is not to find the *last* 4 bytes of base, but rather
the *next* four bytes after the (operand.size() - 4) which we already copied.
These four bytes may need the full power of base64_decode_string()
because they may or may not contain padding.

But, if we really want the next 4 bytes, why pass operand.size() as the
length of the substring? operand.size() is at least 4 (it's a multiple of
4, and if it's 0 we returned earlier), but it could be more. We don't
need more, we just need 4. It's not really wrong to take more than 4 (so
this patch doesn't *fix* any bug), but it can be wasteful. So this code
should be:

        base.substr(operand.size() - 4, 4)

We already have in test/boost/alternator_unit_test.cc a test,
test_base64_begins_with, that takes encoded base64 strings up to 12
characters in length (corresponding to decoded strings up to 8 chars),
and substrings from length 0 to the base string's length, and checks
that test_base64_begins_with succeeds.
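
The prefix check described above can be sketched in Python. This is an illustrative reimplementation under the assumptions stated in the message (both strings are valid, padded base64), not a translation of utils/base64.cc; only the standard `base64` module is used.

```python
import base64


def base64_begins_with(base: str, operand: str) -> bool:
    """Check whether decoded `base` starts with decoded `operand`."""
    if len(base) < len(operand) or len(base) % 4 or len(operand) % 4:
        return False
    if not operand:
        return True
    head = len(operand) - 4
    # Everything before the last quad of operand has no padding,
    # so it can be compared without decoding.
    if base[:head] != operand[:head]:
        return False
    # Decode exactly the next 4 characters of base, not len(operand) of them.
    base_tail = base64.b64decode(base[head:head + 4])
    operand_tail = base64.b64decode(operand[head:])
    return base_tail.startswith(operand_tail)
```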

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25712
2025-09-01 08:57:50 +03:00
Andrei Chekun
e55c8a9936 test.py: modify run to use different junit output filenames
Currently, run executes pytest twice without modifying the path of the
JUnit XML report, so the second execution of pytest overwrites the
report from the first. This PR fixes the issue so that both reports are
stored.

Closes scylladb/scylladb#25726
2025-09-01 08:56:48 +03:00
Ernest Zaslavsky
05154e131a cleanup: Add missing #pragma once
Add missing `#pragma once` to include header

Closes scylladb/scylladb#25761
2025-09-01 06:41:57 +03:00
Botond Dénes
fbff8d3b2d Merge 'vector_store_client: disable Nagle's algorithm on the http client' from Pawel Pery
Nagle’s algorithm and Delayed ACK’s algorithm are enabled by default on sockets in Linux. As a result we can experience 40ms latency on simply waiting for ACK on the client side. Disabling the Nagle’s algorithm (using TCP_NODELAY) should fix the issue (client won’t wait 40ms for ACKs).

This change sets `TCP_NODELAY` on every socket created by the `http_client`.

Checking for dead peers or network is helpful in maintaining a lifetime of the http client. This change also sets TCP_KEEPALIVE option on the http client's socket.
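
For illustration, the equivalent socket options can be set from Python's standard `socket` module as shown below. This is a sketch of the general technique, not the Seastar-based code in this series; `configure_client_socket` is a hypothetical helper name.

```python
import socket


def configure_client_socket(sock: socket.socket) -> socket.socket:
    """Disable Nagle's algorithm and enable TCP keepalive probes."""
    # Send small writes immediately instead of waiting for outstanding ACKs.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Let the kernel probe idle connections to detect dead peers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock
```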

Fixes: VECTOR-169

Closes scylladb/scylladb#25401

* github.com:scylladb/scylladb:
  vector_store_client: set keepalive for the http client's socket
  vector_store_client: disable Nagle's algorithm on the http client
2025-09-01 06:26:06 +03:00
Jenkins Promoter
619b4102bd Update pgo profiles - x86_64
2025-09-01 05:08:56 +03:00
Jenkins Promoter
783f866bd3 Update pgo profiles - aarch64
2025-09-01 05:05:17 +03:00
Avi Kivity
dfc7957a73 Merge 'test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints' from Benny Halevy
Following up on 6129411a5e,
improve test_vnode_keyspace_describe_ring by verifying that the
endpoints listed by describe_ring match those returned by the
`natural_endpoints` api (for random tokens).
The latter are calculated using an independent code path
directly from the effective_replication_map.

* test exists currently only on master, no backport required

Closes scylladb/scylladb#25610

* github.com:scylladb/scylladb:
  test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints
  test/pylib/rest_client: add natural_endpoints function
2025-08-31 20:36:15 +03:00
Avi Kivity
bae66cc0d8 Merge 'types: add byte-comparable format support for collections' from Lakshmi Narayanan Sreethar
This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types.

This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md

Refs https://github.com/scylladb/scylladb/issues/19407

New feature - backport not required.

Closes scylladb/scylladb#25603

* github.com:scylladb/scylladb:
  types/comparable_bytes: add compatibility testcases for collection types
  types/comparable_bytes: update compatibility testcase to support collection types
  types/comparable_bytes: support empty type
  types/comparable_bytes: support reversed types
  types/comparable_bytes: support vector cql3 type
  types/comparable_bytes: support tuple and UDT cql3 type
  types/comparable_bytes: support map cql3 type
  types/comparable_bytes: support set and list cql3 types
  types/comparable_bytes: introduce encode/decode_component
  types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
2025-08-31 15:53:27 +03:00
Avi Kivity
600349e29a Merge 'tasks: return task::impl from make_and_start_task ' from Aleksandra Martyniuk
Currently, make_and_start_task returns a pointer to task_manager::task,
which hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we have to create and start the task explicitly, step by step.

Return task_manager::task::impl from make_and_start_task. Use it
where possible.

Fixes: https://github.com/scylladb/scylladb/issues/22146.

Optimization; no backport

Closes scylladb/scylladb#25743

* github.com:scylladb/scylladb:
  tasks: return task::impl from make_and_start_task
  compaction: use current_task_type
  repair: add new param to tablet_repair_task_impl
  repair: add new params to shard_repair_task_impl
  repair: pass argument by value
2025-08-31 15:44:37 +03:00
Nadav Har'El
ff91027eac utils, alternator: fix detection of invalid base-64
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.

The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.
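
The off-by-one can be illustrated with a minimal Python sketch. This is not the code from utils/base64.cc; the names `build_table`, `decode_sextet` and `INVALID` are hypothetical, and handling of the `=` padding is omitted for brevity. A table indexed by a byte value needs 256 entries so that byte 0xFF maps to an explicit "invalid" marker:

```python
# Hypothetical illustration only; the real decoder is C++ in utils/base64.cc.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
INVALID = -1

def build_table(size):
    """Map each byte value below `size` to its 6-bit value, or INVALID."""
    table = [INVALID] * size
    for value, char in enumerate(ALPHABET):
        table[ord(char)] = value
    return table

def decode_sextet(table, byte):
    """Return the 6-bit value for `byte`; raise ValueError if it is not base-64."""
    value = table[byte]  # with a 255-entry table, index 255 is out of range
    if value == INVALID:
        raise ValueError(f"invalid base-64 byte: {byte:#04x}")
    return value

good_table = build_table(256)   # covers every possible byte value 0..255
short_table = build_table(255)  # one entry short: index 255 does not exist
```

With `good_table`, byte 0xFF is rejected with a `ValueError`. With `short_table`, Python raises `IndexError`; the C++ equivalent has no such bounds check, which is why the original bug could yield a nonsense value or a crash.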

Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.

Fixes #25701

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25705
2025-08-31 15:38:01 +03:00
Avi Kivity
1f4c9b1528 Merge 'system_keyspace: add peers cache to get_ip_from_peers_table' from Petr Gusev
The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.

This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.
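
The idea of a short-lived cache can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `PeersCache`, the `load_peers` callback and the `ttl_seconds` parameter are hypothetical names, and the actual C++ implementation may be structured differently.

```python
import time

class PeersCache:
    """Cache a host_id -> ip mapping and reload it once it is older than the TTL."""

    def __init__(self, load_peers, ttl_seconds, clock=time.monotonic):
        self._load_peers = load_peers  # hypothetical callback that reads system.peers
        self._ttl = ttl_seconds
        self._clock = clock
        self._mapping = None
        self._loaded_at = None

    def get_ip(self, host_id):
        now = self._clock()
        # Reload on first use or when the cached copy has expired, so that
        # direct updates to the table become visible after at most one TTL.
        if self._mapping is None or now - self._loaded_at >= self._ttl:
            self._mapping = dict(self._load_peers())
            self._loaded_at = now
        return self._mapping.get(host_id)
```

Repeated lookups within one TTL window share a single read of the table, which is the property that removes the per-call storage access.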

This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.

Fixes scylladb/scylladb#25660

backport: this patch needs to be backported to all supported versions (2025.1/2/3).

Closes scylladb/scylladb#25658

* github.com:scylladb/scylladb:
  storage_service: move get_host_id_to_ip_map to system_keyspace
  system_keyspace: use peers cache in get_ip_from_peers_table
  storage_service: move get_ip_from_peers_table to system_keyspace
2025-08-31 15:34:35 +03:00
Piotr Wieczorek
5add43e15c alternator: streams: Address minor incompatibilities with DynamoDB in GetRecords response.
This commit adds missing fields to GetRecords responses: `awsRegion` and
`eventVersion`. We also considered changing `eventSource` from
`scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield
inside the `dynamodb` field.

We set `awsRegion` to the name of the datacenter of the node that received
the request. This is in line with the AWS documentation, except that
Scylla has no direct equivalent of a region, so we use the datacenter
name, which is the closest analogue of DynamoDB's concept of a region.

The field `eventVersion` determines the structure of a Record. It is
updated whenever the structure changes. We think that adding a field
`userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla
doesn't support this field (#11523), hence we use the older 1.0 version.

We have decided to leave `eventSource` as is, since it is easy to change
it to `aws:dynamodb`, the value used by DynamoDB, should problems arise.

Not setting `SizeBytes` subfield inside the `dynamodb` field was
dictated by the lack of apparent use cases. The documentation is unclear
about how `SizeBytes` is calculated and after experimenting a little
bit, I haven't found an obvious pattern.
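
The resulting top-level shape of a record, as described above, could be sketched like this in Python. The helper `make_record` and its parameters are hypothetical and exist only for illustration; the real response is built in Alternator's C++ code and contains further fields not shown here.

```python
def make_record(datacenter, dynamodb):
    """Build the top-level fields of a GetRecords record (illustrative only)."""
    return {
        "awsRegion": datacenter,               # datacenter name stands in for a region
        "eventVersion": "1.0",                 # 1.0, since userIdentity is not supported
        "eventSource": "scylladb:alternator",  # left unchanged by this commit
        "dynamodb": dynamodb,                  # SizeBytes is intentionally not set
    }

record = make_record("dc1", {"Keys": {"p": {"N": "1"}}})
```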

Fixes: #6931

Closes scylladb/scylladb#24903
2025-08-31 14:55:47 +03:00
Avi Kivity
bf9a963582 utils: mark crc barrett tables const
They're marked constinit, but constinit does not imply const. Since
they're not supposed to be modified, mark them const too.

Closes scylladb/scylladb#25539
2025-08-31 11:37:39 +03:00
Avi Kivity
bc5773f777 Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski
When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
      - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
      - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent new ones from starting
- reject incoming tablet migrations

The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drops below the threshold.
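
The enable/disable behaviour can be sketched as a threshold monitor that notifies subscribers only when the critical level is crossed. This Python sketch is an assumption-laden illustration: the class name `DiskSpaceMonitor` and the methods `subscribe` and `update` are hypothetical and do not mirror the actual C++ API.

```python
class DiskSpaceMonitor:
    """Notify subscribers when disk utilization crosses the critical level."""

    def __init__(self, critical_level=0.98):
        self.critical_level = critical_level  # 98% by default, as in the description
        self.critical = False
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callback invoked with True (enter) or False (leave)."""
        self._subscribers.append(callback)

    def update(self, used_bytes, total_bytes):
        above = used_bytes / total_bytes >= self.critical_level
        if above != self.critical:  # fire only on a state change
            self.critical = above
            for callback in self._subscribers:
                callback(above)
```

A subscriber such as a compaction or repair service would stop its work on `True` and resume on `False`; firing only on state changes avoids repeated notifications while utilization stays on one side of the threshold.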

Apart from that, the series adds tests that require mounted volumes to simulate running out of space.
The paths to the volumes can be provided using a pytest argument, `--space-limited-dirs`.
When it is not provided, the tests are skipped.

Test scenarios:

1. Start a cluster and write data until one of the nodes reaches 90% disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2

The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).

The test is successful if no node runs out of space, the **operation** from step 2 is
aborted/paused/timed out, and the **operation** from step 5 succeeds.

`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:

Read path (before):

```
instructions_per_op:
	mean=   39661.51 standard-deviation=34.53
	median= 39655.39 median-absolute-deviation=23.33
	maximum=39708.71 minimum=39622.61
```

Read path (after):

```
instructions_per_op:
	mean=   39691.68 standard-deviation=34.54
	median= 39683.14 median-absolute-deviation=11.94
	maximum=39749.32 minimum=39656.63
```

Write path (before):

```
instructions_per_op:
	mean=   50942.86 standard-deviation=97.69
	median= 50974.11 median-absolute-deviation=34.25
	maximum=51019.23 minimum=50771.60
```

Write path (after):

```
instructions_per_op:
	mean=   51000.15 standard-deviation=115.04
	median= 51043.93 median-absolute-deviation=52.19
	maximum=51065.81 minimum=50795.00
```
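
To put the four measurements above in perspective, the relative change in mean `instructions_per_op` can be computed with a short Python snippet (the helper name `relative_change` is ours, not part of perf-simple-query). It comes out at roughly 0.08% for reads and 0.11% for writes, and in both cases the difference of the means is smaller than the reported standard deviation.

```python
def relative_change(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0

# Mean instructions_per_op values copied from the results above.
read_overhead = relative_change(39661.51, 39691.68)
write_overhead = relative_change(50942.86, 51000.15)

print(f"read path:  {read_overhead:+.3f}%")
print(f"write path: {write_overhead:+.3f}%")
```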

Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871

No backport, as it is a new feature.

Closes scylladb/scylladb#23917

* github.com:scylladb/scylladb:
  tests/cluster: Add new storage tests
  test/scylla_cluster: Override workdir when passed via cmdline
  streaming: Reject incoming migrations
  storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
  repair_service: Add a facility to disable the service
  compaction_manager: Subscribe to out of space controller
  compaction_manager: Replace enabled/disabled states with running state
  database: Add critical_disk_utilization mode database can be moved to
  disk_space_monitor: add subscription API for threshold-based disk space monitoring
  docs: Add feature documentation
  config: Add critical_disk_utilization_level option
  replica/exceptions: Add a new custom replica exception
2025-08-30 18:47:57 +03:00
Petr Gusev
898531fe7c client_state: decoroutinize check_internal_table_permissions
This function is on a hot path, better avoid allocating
coroutine frames.

Fixes scylladb/scylladb#25501

Closes scylladb/scylladb#25689
2025-08-30 18:46:54 +03:00