Commit Graph

49863 Commits

Author SHA1 Message Date
Pavel Emelyanov
42657105a3 api: Coroutinize get_built_indexes handler code
"While at it". It looks much simpler this way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:50:58 +03:00
Pavel Emelyanov
f77f9db96c api: Remove system_keyspace ref from column_family API block
This reference was only needed to facilitate get_built_indexes handler
to work. Now it's gone and the sys.ks. reference is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:50:22 +03:00
Pavel Emelyanov
95b616d0e5 api: Move get_built_indexes from column_family to view_builder
The handler effectively works with the view_builder and should be
registerd in the block that has this service captured.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-03 13:49:33 +03:00
Botond Dénes
9310d3eff3 scylla-gdb.py: small-objects: don't cast free_object to void*
When walking the free-list of a pool or a span, the small-object code
casts the dereferenced `free_object*` to `void*`. This is unnecessary,
just use the `next` field of the `free_object` to look up the next free
object. I think this monkey business with `void*` was done to speed up
walking the free-list, but recently we've seen small-object --summarize
fail in CI, and it could be related.

Fixes: #25733

Closes scylladb/scylladb#26339
2025-10-02 13:36:49 +03:00
Botond Dénes
9d08a380db Merge 'Fix getendpoints command for compound keys containing ':'' from Taras Veretilnyk
Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key.

This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys.

The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility.

Backport is not required, since this change relies on recent Seastar updates

Fixes #16596

Closes scylladb/scylladb#26169

* github.com:scylladb/scylladb:
  docs: document --key-components option for getendpoints
  test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
  nodetool: Introduce new option --key-components to specify compound partition keys as array
  rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
  api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
  rest_api_mock: support duplicate query parameters
  test/rest_api: support multiple query values per key in RestApiSession.send()
  nodetool: add support of new seastar query_parameters_type to scylla_rest_client
2025-10-02 09:04:40 +03:00
Aleksandra Martyniuk
0e73ce202e test: wait for cql in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns ready cql. However, get_all_tablet_replicas
uses the cql reference from manager that isn't ready. Wait for cql.

Fixes: #26328

Closes scylladb/scylladb#26349
2025-10-02 06:41:36 +03:00
Avi Kivity
7230a04799 dht, sstables: replace vector with chunked_vector when computing sstable shards
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).

Fix this by changing the container to by chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.

Fixes #24198.

Closes scylladb/scylladb#26353
2025-10-02 00:47:42 +02:00
Michał Jadwiszczak
d92628e3bd test/cluster/test_view_building_coordinator: skip reproducer instead of xfail
The reproducer for issue scylladb/scylladb#26244 takes some time
and since the test is failing, there is no point in wasting resources on
it.
We can change the xfail mark to skip.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26350
2025-10-01 18:33:05 +02:00
Taras Veretilnyk
6d8224b726 docs: document --key-components option for getendpoints 2025-10-01 15:53:25 +02:00
Taras Veretilnyk
6381c63d65 test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
Adds a parameterized test to verify that multiple --key-components arguments
are handled correctly by nodetool's getendpoints command. Ensures the
constructed REST request includes all key_component values in the expected format.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
78888dd76c nodetool: Introduce new option --key-components to specify compound partition keys as array
Allows getendpoints to accept components of partition key using the --key-components option.
Key components are passed as an array and sent to the new /natural_endpoints/v2/{keyspace} endpoint.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
2456ebd7c2 rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
Adds a test case for the `/storage_service/natural_endpoints/v2/{keyspace}` endpoint,
verifying that it correctly resolves natural endpoints for a composite partition key
passed as multiple `key_component` query parameters.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
89d474ba59 api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
The original `/storage_service/natural_endpoints` endpoint uses colon-separated strings for composite keys,
which causes ambiguity when key components contained colons.

This commits adds a new `/storage_service/natural_endpoints/v2/{keyspace}` endpoint that accepts partition key components
via repeated `key_component` query parameters to avoid this issue.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
65ade28a9c rest_api_mock: support duplicate query parameters
Previously, only the last value of a repeated query parameter was captured,
which could cause inaccurate request matching in tests. This update ensures
that all values are preserved by storing duplicates as lists in the `params` dict.
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
b60afeaa46 test/rest_api: support multiple query values per key in RestApiSession.send()
Previously, the send() method in RestApiSession only supported one value per query parameter key.
This patch updates it to support passing lists of values, allowing the same key to appear multiple
times in the query string (e.g. ?key=value1&key=value2).
2025-10-01 15:53:25 +02:00
Taras Veretilnyk
53883958d6 nodetool: add support of new seastar query_parameters_type to scylla_rest_client 2025-10-01 15:52:18 +02:00
Avi Kivity
15fa1c1c7e Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The possible benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.

In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
and lazy_comparable_bytes_from_ring_position
so that they performs the key translation eagerly,
all components to a single bytes_ostream in one synchronous call.

perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type):
```
Before:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch        9233   109.930us     0.000ns   109.930us   109.930us    4356.000       0.000   2615394.3    614709.6

After:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch       50952    19.487us     0.000ns    19.487us    19.487us     198.000       0.000    603120.1    109042.9
```

Enhancement, backport not required.

Closes scylladb/scylladb#26302

* github.com:scylladb/scylladb:
  sstables/trie: BTI-translate the entire partition key at once
  sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
  sstables/trie: perform the BTI-encoding of position_in_partition eagerly
  types/comparable_bytes: add comparable_bytes_from_compound
  test/perf: add perf_bti_key_translation
2025-10-01 14:59:06 +03:00
Botond Dénes
bdca5600ef Merge 'Prevent stalls due to large tablet mutations' from Benny Halevy
Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet.
With enough tablets (10s of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls also when converting those mutation into canonical_mutation and back, as they are similar to frozen_mutation, and bit more expensive since they also save the column mappings.

This series takes a different approach than allowing freeze to yield.
`tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations, that when squashed together are equivalent to the previously large mutation.  Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutation to a vector for further processing, and/or process them inline by freezing or making a canonical mutation.

In addition, split the large mutations would also prevent hitting the commitlog maximum mutation size.

Closes scylladb/scylladb#18162

* github.com:scylladb/scylladb:
  schema_tables: convert_schema_to_mutations: simplify check for system keyspace
  tablets: read_tablet_mutations: use unfreeze_and_split_gently
  storage_service: merge_topology_snapshot: freeze snp.mutations gently
  mutation: async_utils: add unfreeze_and_split_gently
  mutation: add for_each_split_mutation
  tablets: tablet_map_to_mutations: maybe split tablets mutation
  tablets: tablet_map_to_mutations: accept process_func
  perf-tablets: change default tables and tablets-per-table
  perf-tablets: abort on unhandled exception
2025-10-01 07:04:09 +03:00
Ernest Zaslavsky
043d2dfb30 treewide: seastar module update
Seastar module update
```
9c07020a Merge 'http: Introduce retry strategy machinery for http client (take two)' from Ernest Zaslavsky
58404b81 http: check for abort at start of `make_request`
35a9e086 http: support per-call `retry_strategy` in `make_request`
96538b92 http: integrate `retry_strategy` into HTTP client
77c3ba14 http: initial implementation of `retry_strategy`
b9b9e7bf memory: Call finish_allocation() at the end of allocate_aligned()
2052c200 Merge 'file: coroutinize some functions' from Avi Kivity
7b65e50c file: reindent after coroutinization
837f64b5 file: coroutinize dma_read_impl()
9220607b file: coroutinize dma_read_exactly_impl()
d1414541 file: coroutinize set_lifetime_hint_impl()
94d8fd08 file: coroutinize get_lifetime_hint_impl()
392efff4 file: coroutinize maybe_read_eof()
e68a3173 file: do_dma_read_bulk: remove "rstate" local
14ac42cd file: coroutinize do_dma_read_bulk()
5446cbab net: Use future::then_unpack() helper to unpack tuples
9e88c4d8 posix-stack: Initialize unique_ptr-s with new result directly
51fb302e rpc: connection::process() use structured binding
be2c2b54 http: Explicitly deprecate request::content
```

Closes scylladb/scylladb#26342
2025-10-01 06:44:31 +03:00
Jenkins Promoter
f8c02a420d Update pgo profiles - aarch64 2025-10-01 05:32:35 +03:00
Jenkins Promoter
b45a57f65e Update pgo profiles - x86_64 2025-10-01 04:54:14 +03:00
Luis Freitas
884c584faf Update ScyllaDB version to: 2026.1.0-dev 2025-09-30 18:54:09 +03:00
Benny Halevy
1ceb49f6c1 schema_tables: convert_schema_to_mutations: simplify check for system keyspace
Currently, the function unfreezes each schema mutation partition
and then checks if it's for a system keyspace.
This isn't really needed since we can check the partition key
using the frozen_mutation, skip it if the partition is for a system keyspace.

Note that the constructed partition_key just copies
the frozen partition_key_view, without copying or deserializing the
actual key contents.

Also, reserve `results` capacity using the queried
partitions' size to prevent reallocations of the results
vector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
b17a36c071 tablets: read_tablet_mutations: use unfreeze_and_split_gently
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
25b68fd211 storage_service: merge_topology_snapshot: freeze snp.mutations gently
We don't need to store all snp.mutations in a vector
and then freeze the whole vector.  They can be frozen
one at a time and collected into a vector, while
maybe yielding between each mutation to prevent
stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
fd38cfaf69 mutation: async_utils: add unfreeze_and_split_gently
Unfreeze the frozen_mutation, possibly splitting it
based on max_rows.  The process_mutation function
is called for each split mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
faa0ee9844 mutation: add for_each_split_mutation
Allows processing of the split mutations one at a time.
This can reduce memory footprint as the caller
won't have to store a vector of the split mutations
and then convert it (e.g. freeze the mutations
or convert them to canonical mutations).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
d21984d0cc tablets: tablet_map_to_mutations: maybe split tablets mutation
Split the generated tablets mutation if we run out of task
quota to prevent stalls, both when preparing the mutations
and later on when freezing/unfreezing them or converting
them to canonical_mutation and back.

Note that this will convert large mutation to long
vectors of mutations.  A followup change is considered
to convert std::vector:s of mutations to chunked_vector
to prevent large allocations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
aaddff5211 tablets: tablet_map_to_mutations: accept process_func
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.

This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations or simililarly
convert them into canonical mutations.

Next patch will split large tablet mutations to prevent stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:38 +03:00
Avi Kivity
5d1846d783 dist: scylla_raid_setup: don't override XFS block size on modern kernels
In 6977064693 ("dist: scylla_raid_setup:
reduce xfs block size to 1k"), we reduced the XFS block size to 1k when
possible. This is because commitlog wants to write the smallest amount
of padding it can, and older Linux could only write a multiple of the
block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than
a filesystem block.

However, this doesn't play well with some SSDs that have 512 byte
logical sector size and 4096 byte physical sector size - it causes them
to issue read-modify-writes.

To improve the situation, if we detect that the kernel is recent enough,
format the filesystem with its default block size, which should be optimal.

Note that commitlog will still issue sub-4k writes, which can translate
to RMW. There, we believe that the amplification is reduced since
sequential sub-physical-sector writes can be merged, and that the overhead
from commitlog space amplification is worse than the RMW overhead.

Tested on AWS i4i.large. fsqual report:

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

The sub-block overwrite cases are GOOD.

In comparison, the fsqual report for 1k (similar):

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   1024
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

Fixes #25441.

[1] ed1128c2d0

Closes scylladb/scylladb#25445
2025-09-30 17:14:36 +03:00
Benny Halevy
3c07e0e877 perf-tablets: change default tables and tablets-per-table
tablets-per-table must be a power of 2, so round up 10000 to 16K.
also, reduce number of tables to have a total of about 100K
tablets, otherwise we hit the maximum commitlog mutation size
limit in save_tablet_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:07:06 +03:00
Benny Halevy
2c3fb341e9 perf-tablets: abort on unhandled exception
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:07:06 +03:00
Nadav Har'El
926089746b message: move RPC compression from utils/ to message/
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.

This patch moves 8 files out of utils/:

    utils/advanced_rpc_compressor.hh
    utils/advanced_rpc_compressor.cc
    utils/advanced_rpc_compressor_protocol.hh
    utils/stream_compressor.hh
    utils/stream_compressor.cc
    utils/dict_trainer.cc
    utils/dict_trainer.hh
    utils/shared_dict.hh

These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.

Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.

Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
   for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
   "netw::" namespace used by RPC. Also needed to change a bunch
   of callers for the new namespace. Also, had to add "utils::"
   explicitly in several places which previously assumed the
   current namespace is "utils::".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25149
2025-09-30 17:03:09 +03:00
Pavel Emelyanov
269aaee1b4 Merge 'test: dtest: test_limits.py: migrate from dtest' from Dario Mirovic
This PR migrates limits tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.

Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.

scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)

Fixes #25097

This is a migration of existing tests to this repository. No need for backport.

Closes scylladb/scylladb#26077

* github.com:scylladb/scylladb:
  test: dtest: limits_test.py: test_max_cells log level
  test: dtest: limits_test.py: make the tests work
  test: dtest: test_limits.py: remove test that are not being migrated
  test: dtest: copy unmodified limits_test.py
2025-09-30 16:57:32 +03:00
Piotr Dulikowski
5e5a3c7ec5 view_building_worker.cc: fix spelling (commiting -> committing)
The typo is reported by GitHub action on each PR, so let's fix it to
reduce the noise for everybody.

Closes scylladb/scylladb#26329
2025-09-30 16:47:03 +03:00
Emil Maskovsky
b0de054439 docs: fix typos and spelling errors
Corrected spelling mistakes, typos, and minor wording issues to improve
the developer documentation.

No backport: There is no functional change, and the doc is mostly
relevant to master, so it doesn't need to be backported.

Closes scylladb/scylladb#26332
2025-09-30 13:16:49 +02:00
Avi Kivity
72609b5f69 Merge 'mv: generate view updates on pending replica' from Michael Litvak
Generate view updates from a pending base replica if it's a reading
replica, i.e. it's in the last stage of transition write_both_read_new
before becoming the new base replica.

Previously we didn't generate view updates on a pending replica. The
problem with that is that when a base token is migrated from one replica
B1 to another B2, at one stage we generate view updates only from B1,
then at the next stage we generate view updates only from B2. During
this transition, it can happen that for some write neither B1 nor B2
generate view update, because each one sees the other as the base
replica.

We fix this by generating view updates from both base replicas in the
phase before the transition. We can generate view updates on the pending
replica in this case, even if it requires read-before-write, because
it's in a stage where it contains all data and serves reads.

Fixes https://github.com/scylladb/scylladb/issues/24292

backport not needed - the issue mostly affects MV with tablets which is still experimental

Closes scylladb/scylladb#25904

* github.com:scylladb/scylladb:
  test: mv: test view update during topology operations
  mv: generate view updates on both shards in intranode migration
  mv: generate view updates on pending replica
2025-09-30 13:17:16 +03:00
Piotr Wieczorek
4be0bdbc07 alternator: Don't emit a redundant REMOVE event in Alternator Streams for PutItem calls
Until now, every PutItem operation appeared in the Alternator Streams as
two events - a REMOVE and a MODIFY. DynamoDB Streams emits only INSERT
or MODIFY, depending on whether a row was replaced, or created anew. A
related issue scylladb#6918 concerns distinguishing the mutation type properly.

This was because each call to PutItem emitted the two CDC rows, returned
by GetRecords. Since this patch, we use a collection tombstone for the
`:attrs` column, and a separate tombstone for each regular column in the
table's schema. We don't expect that new tables would have any other
regular column, except for the `:attrs` and keys, but we may encounter
them in in upgraded tables which had old GSIs or LSIs.

Fixes: scylladb#6930.

Closes scylladb/scylladb#24991
2025-09-30 13:12:16 +03:00
Michał Hudobski
ae4d4908ba configure: increase SCHEDULING_GROUPS_COUNT to 20
We would like to have an additional service level
available for users of the Vector Store service,
which would allow us to de/prioritize vector
operations as needed. To allow that, we increase
the number of scheduling groups from 19 to 20
and adjust the related test accordingly.

Closes scylladb/scylladb#26316
2025-09-30 12:41:28 +03:00
Nadav Har'El
38002718a9 cqlpy: improve testing for "duration" column type
We had very rudimentary tests for the "duration" CQL type in the cqlpy
framework - just for reproducing issue #8001. But we left two
alternative formats, and a lot of corner cases, untested.
So this patch aims to add the missing tests - to exhaustively cover
the "duration" literal formats and their capabilities.

Some of the examples tested in the new test are inspired by Cassandra's
unit test test/unit/org/apache/cassandra/cql3/DurationTest.java and the
corner cases that this file covers. However, the new tests are not direct
translation of that file because DurationTest.java was not a CQL test -
it was a unit test of Cassandra's internal "Duration" type, so could not
be directly translated into a CQL-based test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25092
2025-09-30 12:25:02 +03:00
Botond Dénes
efd99bb0af Merge 'Return tablet ranges from range_to_endpoint_map API' from Pavel Emelyanov
The handler in question when called for tablets-enabled keyspace, returns ranges that are inconsistent with those from system.tablets. Like this:

system.tablets:
```
    TabletReplicas(last_token=-4611686018427387905, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 0)])
    TabletReplicas(last_token=-1, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)])
    TabletReplicas(last_token=4611686018427387903, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 1), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)])
    TabletReplicas(last_token=9223372036854775807, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 1), ('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0)])
```

range_to_endpoint_map:
```
    {'key': ['-9069053676502949657', '-8925522303269734226'], 'value': ['127.110.40.2', '127.110.40.3']}
    {'key': ['-8925522303269734226', '-8868737574445419305'], 'value': ['127.110.40.2', '127.110.40.3']}
    ...
    {'key': ['-337928553869203886', '-288500562444694340'], 'value': ['127.110.40.1', '127.110.40.3']}
    {'key': ['-288500562444694340', '105026475358661740'], 'value': ['127.110.40.1', '127.110.40.3']}
    {'key': ['105026475358661740', '611365860935890281'], 'value': ['127.110.40.1', '127.110.40.3']}
    ...
    {'key': ['8307064440200319556', '9117218379311179096'], 'value': ['127.110.40.2', '127.110.40.1']}
    {'key': ['9117218379311179096', '9125431458286674075'], 'value': ['127.110.40.2', '127.110.40.1']}
```

Not only the number of ranges differs, but also separating tokens do not match (e.g. tokens -2 and 0 belong to different tablets according to system.tablets, but fall into the same "range" in the API result).

The source of confusion is that despite storage_service::get_range_to_address_map() is given correct e.r.m. pointer from the table, it still uses token_metadata::sorted_token() to work with. The fix is -- when the e.r.m. is per-table, the tokens should be get from token_metadata's tablet_map (e.g. compare this to storage_service::effective_ownership() -- it grabs tokens differently for vnodes/tables cases).

This PR fixes the mentioned problem and adds validation test. The test also checks /storage_service/describe_ring endpoint that happens to return correct set of values.

The API is very ancient, so the bug is present in all versions with tablets

Fixes #26331

Closes scylladb/scylladb#26231

* github.com:scylladb/scylladb:
  test: Add validation of data returned by /storage_service endpoints
  test,lib: Add range_to_endpoint_map() method to rest client
  api: Indentation fix after previous patches
  storage_service: Get tablet tokens if e.r.m. is per-table
  storage_service,api: Get e.r.m. inside get_range_to_address_map()
  storage_service: Calculate tokens on stack
2025-09-30 11:20:35 +03:00
Michał Jadwiszczak
3bbbbf419b test/cluster/test_view_building_coordinator: add reproducer for staging sstables with tablet merge
The test verifies if staging sstables are processed correctly after
tablet merge.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26286
2025-09-30 09:05:31 +02:00
Avi Kivity
4d9271df98 Merge 'sstables: introduce sstable version ms' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.

This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.

(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).

The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.

This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.

Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                70       18.0     18        128
500001  1       1               647       19.0     19        132
0       1000000 1               748       15.0     15        116
0       500000  2               372       29.0     29        284
0       250000  4               227       56.0     56        504
0       125000  8               116      106.0    106        928
0       62500   16               67      195.0    195       1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                51        5.1      5         20
500001  1       1                64        5.3      5         20
0       1000000 1               679        4.0      4         16
0       500000  2               492        8.0      8         88
0       250000  4               804       16.0     16        232
0       125000  8               409       31.0     31        516
0       62500   16               97       54.0     54       1056
```

Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.

External tests:
I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.

New functionality, no backport needed.

Closes scylladb/scylladb#26215

* github.com:scylladb/scylladb:
  test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
  test/cluster: add test_bti_index.py
  test: prepare bypass_cache_test.py for `ms` sstables
  sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
  test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
  tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
  db/config: expose "ms" format to the users via database config
  test: in Python tests, prepare some sstable filename regexes for `ms`
  sstables: add `ms` to `all_sstable_versions`
  test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
  test/lib/index_reader_assertions: skip some row index checks for BTI indexes
  test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
  test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
  test/resource: add `ms` sample sstable files for relevant tests
  test/boost/sstable_compaction_test: prepare for `ms` sstables.
  test/boost/index_reader_test: prepare for `ms` sstables
  test/boost/bloom_filter_tests: prepare for `ms` sstables
  test/boost/sstable_datafile_test: prepare for `ms` sstables
  test/boost/sstable_test: prepare for `ms` sstables.
  sstables: introduce `ms` sstable format version
  tools/scylla-sstable: default to "preferred" sstable version, not "highest"
  sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
  sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
  sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
  sstables/mx: make Index and Summary components optional
  sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
  sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
  sstables: implement an alternative way to rebuild bloom filters for sstables without Index
  utils/bloom_filter: add `add(const hashed_key&)`
  sstables: adapt estimated_keys_for_range to sstables without Summary
  sstables: make `sstable::estimated_keys_for_range` asynchronous
  sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
  replica/database: add table::estimated_partitions_in_range()
  sstables/mx: implement sstable::has_partition_key using a regular read
  sstables: use BTI index for queries, when present and enabled
  sstables/mx/writer: populate BTI index files
  sstables: create and open BTI index files, when enabled
  sstables: introduce Partition and Rows component types
  sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
2025-09-30 09:40:02 +03:00
Piotr Dulikowski
4581c72430 Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev
`SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables.

We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now.

Fixes https://github.com/scylladb/scylladb/issues/26258

backports: not needed (a new feature)

Closes scylladb/scylladb#26284

* github.com:scylladb/scylladb:
  cql_test_env.cc: log exception when callback throws
  lwt: prohibit for tablet-based views and cdc logs
  tablets: disallow chains of colocated tables
  database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
2025-09-30 07:15:16 +02:00
Michał Chojnowski
771a82969e test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
Adds a test for the bloom filter rebuild mechanism in `ms` sstables.
2025-09-29 22:15:26 +02:00
Michał Chojnowski
c1e6cd58fa test/cluster: add test_bti_index.py
Add a test which checks that `ms` sstables can be enabled and disabled.
2025-09-29 22:15:25 +02:00
Michał Chojnowski
7a2dc9cbfa test: prepare bypass_cache_test.py for ms sstables
The test looks at metrics to confirm whether queries
hit the row cache, the index cache or the disk,
depending on various settings.

BIG index readers use a two level, read-through index cache,
where the higher layer stores parsed "index pages"
of Index.db, while the lower layer is a cache of
raw 4kiB file pages of Index.db.

Therefore, if we want to count index cache hits,
the appropriate metric to check in this case is
`scylla_sstables_index_page_hits",
which counts hits in the higher layer.
This is done by the test.

However, BTI index readers don't have an equivalent of the higher cache
layer. Their cache only stores the raw 4 kiB pages, and the hits are
counted in `scylla_sstables_index_page_cache_hits`. (The same metric
is incremented by the lower layer of the BIG index cache).

Before this commit, the test would fail with `ms` sstables,
because their reads don't increment `scylla_sstables_index_page_hits`.
In this commit we adapt the test so that it instead checks
`scylla_sstables_index_page_cache_hits` for `ms` sstables.
2025-09-29 22:15:25 +02:00
Michał Chojnowski
621bfbe6d9 sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test_column_family.py::test_sstables_by_key_reader_closed
injects a failure into `index_reader::advance_lower_and_check_if_present`.
To preserve this tests when BTI indexes are made the default,
we have to add a corresponding error injection to
`bti_index_reader::advance_lower_and_check_if_present`.
2025-09-29 22:15:25 +02:00
Michał Chojnowski
182c8ce87b test/cqlpy/test_sstable_validation.py: prepare the test for ms sstables
BIG sstables and BTI sstables use different code paths for
validating the Data file against the index. So we want to test
both types of indexes, not just the default one.

This patch changes the test so that it explicitly tests both
`me` and `ms` instead of only testing the default format.

Note that we disable some tests for BTI indexes:
the tests which check that validation detects mismatches
between the row index ("promoted index") and the Data file.
This is because currently iteration over the row
index in BTI isn't implemented at the moment,
so for BTI the validation behaves as if there was no row indexes.
2025-09-29 22:15:25 +02:00
Michał Chojnowski
aed1cb6f65 tools/scylla-sstable: add --sstable-version=? to scylla sstable write
A useful option in general, and I'll need it to test multiple versions
in `test_sstable_validation.py`.
2025-09-29 22:15:25 +02:00