Commit Graph

4179 Commits

Author SHA1 Message Date
Pavel Emelyanov
86b3e9b50b code: Move checked-file-impl.hh to util/
fixes: #22100

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23123
2025-03-06 10:22:05 +02:00
Pavel Emelyanov
e7d1ea3ab6 commitlog: Use shorter input stream creation overload
There's one that doesn't need the offset argument when it's 0

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23140
2025-03-06 08:06:42 +01:00
Botond Dénes
71d8b7aa9f querier: demote tombstone warning for range-scans to debug level
Range scans are expected to go though lots of tombstones, no need to
spam the logs about this. The tombstone warning log is demoted to debug
level, if somebody wants to see it they can bump the logger to debug
level.

Fixes: https://github.com/scylladb/scylladb/issues/23093

Closes scylladb/scylladb#23094
2025-03-04 10:38:06 +03:00
Botond Dénes
6f7a069bce Merge 'Label basic metrics' from Amnon Heiman
This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and the metrics collecting server (Prometheus or similar)

The idea in this series is to create an equivalent of levels with a label.
First, label a subset of the metrics used by the dashboards.
Second, the per-table metrics that are now off by default will be marked with a different label.
The following specific optional features: CDC, CAS, and Alternator have a dedicated label now.
This will allow users to disable all metrics of features that are not in use.

All the rest of the metrics are left unlabeled.

Without any changes, users would get the same metrics they are getting today.
But you could pass the `__level=1` and get only those metrics the dashboard needs. That reduces between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies).

The labels are not reported based on the seastar feature of hiding labels that start with an underscore.

Closes scylladb/scylladb#12246

* github.com:scylladb/scylladb:
  db/view/view.cc: label metrics with basic_level
  transport/server.cc: label metrics with basic_level
  service/storage_proxy.cc: label metrics with basic_level and cas
  main.cc: label metrics with basic_level
  streaming/stream_manager.cc: label metrics with basic_level
  repair/repair.cc: label metrics with basic_level
  service/storage_service.cc: label metrics with basic_level
  gms/gossiper.cc: label metrics with basic_level
  replica/database.cc: label metrics with basic_level
  cdc/log.cc: label metrics with basic_level and cdc
  alternator: label metrics with basic_level and alternator
  row_cache.cc: label metrics with basic_level
  query_processor.cc: label metrics with basic_level
  sstables.cc: label metrics with basic_level
  utils/logalloc.cc label metrics with basic_level
  commitlog.cc: label metrics with basic_level
  compaction_manager.cc: label metrics with basic_level
  Adding the __level and features labels
2025-03-04 09:32:11 +02:00
Calle Wilund
2f10205714 config: Enable optional TLS1.3 session ticket usage in cert setup
Refs #22916

Adds an "enable_session_tickets" option to TLS setup for our server
endpoints (not documented for internode RPC, as we don't handle it
on the client side there), allowing enabling of TLS3 client session
ticket, i.e. quicker reconnect.

Session tickets are valid within a time frame or until a node
restarts, whichever comes first.

v2:
Use "TLS1.3" in help message

Closes scylladb/scylladb#22928
2025-03-04 09:30:53 +02:00
Amnon Heiman
19a414598b db/view/view.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_view_builder_builds_in_progress

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
f40dc4e5c4 row_cache.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_cache_bytes_total
scylla_cache_bytes_used
scylla_cache_partition_evictions
scylla_cache_partition_hits
scylla_cache_partition_insertions
scylla_cache_partition_merges
scylla_cache_partition_misses
scylla_cache_partition_removals
scylla_cache_range_tombstone_reads
scylla_cache_reads
scylla_cache_reads_with_misses
scylla_cache_row_evictions
scylla_cache_row_hits
scylla_cache_row_insertions
scylla_cache_row_misses
scylla_cache_row_removals
scylla_cache_rows
scylla_cache_rows_merged_from_memtable
scylla_cache_row_tombstone_reads

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00
Amnon Heiman
6826b98c88 commitlog.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_commitlog_segments
scylla_commitlog_allocating_segments
scylla_commitlog_unused_segments
scylla_commitlog_alloc
scylla_commitlog_flush
scylla_commitlog_bytes_written
scylla_commitlog_pending_allocations
scylla_commitlog_requests_blocked_memory
scylla_commitlog_flush_limit_exceeded
scylla_commitlog_disk_total_bytes
scylla_commitlog_disk_active_bytes
scylla_commitlog_disk_slack_end_bytes
2025-03-03 16:58:38 +02:00
Kefu Chai
6e4cb20a69 tree: implement boost::accumulate with std::ranges library
Replace boost::accumulate() calls with std::ranges facilities. This
change reduces external dependencies and modernizes the codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23062
2025-02-26 23:22:02 +02:00
Botond Dénes
5d63ef4d15 Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund
Fixes #22314

Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.

Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.

Closes scylladb/scylladb#22327

* github.com:scylladb/scylladb:
  tools: Add standard extensions and propagate to schema load
  cql_test_env: Use add all extensions instead of inidividually
  main: Move extensions adding to function
  tomstone_gc: Make validate work for tools
2025-02-26 13:52:47 +02:00
Andrzej Jackowski
b4f0a5149a db: cql3: add comments regarding unsafe interval<clustering_key_prefix>
class clustering_range is a range of Clustering Key Prefixes implemented
as interval<clustering_key_prefix>. However, due to the nature of
Clustering Key Prefix, the ordering of clustering_range is complex and
does not satisfy the invariant of interval<>. To be more specific, as a
comment in interval<> implementation states: “The end bound can never be
smaller than the start bound”. As a range of CKP violates the invariant,
some algorithms, like intersection(), can return incorrect results.
For more details refer to scylladb#8157, scylladb#21604, scylladb#22817.

This commit:
 - Add a WARNING comment to discourage usage of clustering_range
 - Add WARNING comments to potentially incorrect uses of
   interval<clustering_key_prefix> non-trivial methods
 - Add a FIXME comment to incorrect use of
   interval<clustering_key_prefix_view>::deoverlap and WARNING comments
   to related interval<clustering_key_prefix_view> misuse.

Closes scylladb/scylladb#22913
2025-02-26 12:01:28 +01:00
Kefu Chai
9fdbe0e74b tree: Remove unused boost headers
This commit eliminates unused boost header includes from the tree.

Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22997
2025-02-25 10:32:32 +03:00
Kefu Chai
42335baec5 backup_task: Use INFO level for upload abort during shutdown
When a backup upload is aborted due to instance shutdown, change the log
level from ERROR to INFO since this is expected behavior. Previously,
 `abort_requested_exception` during upload would trigger an ERROR log, causing
test failures since error logs indicate unexpected issues.

This change:
- Catches `abort_requested_exception` specifically during file uploads
- Logs these shutdown-triggered aborts at INFO level instead of ERROR
- Aligns with how `abort_requested_exception` is handled elsewhere in the service

This prevents false test failures while still informing administrators
about aborted uploads during shutdown.

Fixes scylladb/scylladb#22391

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22995
2025-02-25 10:32:10 +03:00
Avi Kivity
d99df7af6c Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec
This series achieves two things:

1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards

    This will result in new tables having at least 10 tablet replicas per
    shard by default.

    We want this to reduce tablet load imbalance due to differences in
    tablet count per shard, where some shards have 1 tablet and some
    shards have 2 tablets. With higher tablet count per shard, this
    difference-by-one is less relevant.

    Fixes https://github.com/scylladb/scylladb/issues/21967

2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count

    The per-shard goal is enforced by controlling average per-shard tablet replica
    count in a given DC, which is controlled by per-table tablet
    count. This is effective in respecting the limit on individual shards
    as long as tablet replicas are distributed evenly between shards.
    There is no attempt to move tablets around in order to enforce limits
    on individual shards in case of imbalance between shards.

    If the average per-shard tablet count exceeds the limit, all tables
    which contribute to it (have replicas in the DC) are scaled down
    by the same factor. Due to rounding up to the nearest power of 2,
    we may overshoot the per-shard goal by at most a factor of 2.

    The scaling is applied after computing desired tablet count due to
    all other factors: per-table tablet count hints, defaults, average tablet size.

    If different DCs want different scale factors of a given table, the
    lowest scale factor is chosen for a given table.

    When creating a new table, its tablet count is determined by tablet
    scheduler using the scheduler logic, as if the table was already created.
    So any scaling due to per-shard tablet count goal is reflected immediately
    when creating a table. It may however still take some time for the system
    to shrink existing tables. We don't reject requests to create new tables.

    Fixes #21458

Closes scylladb/scylladb#22522

* github.com:scylladb/scylladb:
  config, tablets: Allow tablets_initial_scale_factor to be a fraction
  test: tablets_test: Test scaling when creating lots of tables
  test: tablets_test: Test tablet count changes on per-table option and config changes
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
  config: Make tablets_initial_scale_factor live-updateable
  tablets: load_balancer: Pick initial_scale_factor from config
  tablets, load_balancer: Fix and improve logging of resize decisions
  tablets, load_balancer: Log reason for target tablet count
  tablets: load_balancer: Move hints processing to tablet scheduler
  tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
  tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
  tablets: load_balancer: Determine desired count from size separately from count from options
  tablets: load_balancer: Determine resize decision from target tablet count
  tablets: load_balancer: Allow splits even if table stats not available
  tablets: load_balancer: Extract make_sizing_plan()
  tablets: Add formatter for resize_decision::way_type
  tablets: load_balancer: Simplify resize_urgency_cmp()
  tablets: load_balancer: Keep config items as instance members
  locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
  tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
  tablets: Set default initial tablet count scale to 10
  tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
  tablets: load_balancer: Extract get_schema_and_rs()
  tablets: load_balancer: Drop test_mode
2025-02-24 17:59:26 +02:00
Patryk Jędrzejczak
78c227c521 Merge 'raft topology: Add support for raft topology init to happen before group0 initialization' from Abhinav Kumar Jha
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.

This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
manner.

Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also
performs a refactoring to populate all init data to system tables even
before group0 creation(via `raft_initialize_discovery_leader` function).
Now because `raft_initialize_discovery_leader` happens before the group
0 creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.

Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.

By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#21114

Closes scylladb/scylladb#22484

* https://github.com/scylladb/scylladb:
  storage_service: Remove the variable _manage_topology_change_kind_from_group0
  storage_service: fix indentation after the previous commit
  raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
  service/raft: Refactor mutation writing helper functions.
2025-02-20 14:42:39 +01:00
Tomasz Grabiec
1a7023c85a config, tablets: Allow tablets_initial_scale_factor to be a fraction
We may want fewer than 1 tablets per shard in large clusters.

The per-table option is a fraction, so for consistency, this should be
too.
2025-02-19 16:29:08 +01:00
Tomasz Grabiec
3d01ce3707 config: Make tablets_initial_scale_factor live-updateable 2025-02-19 16:29:08 +01:00
Tomasz Grabiec
f1bda8d4c1 tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.

There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.

If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.

If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.

The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
f043c83ba5 tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
Currently the scale is applied post rounding up of tablet count so
that tablet count per shard is at least 1. In order to be able to use
the scale to increase tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot per-shard tablet
replica count.

Example:

  4 nodes, -c1, rf=3, initial_tablets_scale=10

  Before: initial_tablet_count=20, tablet-per-shard=15
  After: initial_tablet_count=14, tablets-per-shard=10.5
2025-02-19 14:38:50 +01:00
Tomasz Grabiec
2463e524ed tablets: Set default initial tablet count scale to 10
This will result in new tables having at least 10 tablet replicas per
shard by default.

We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.

Fixes #21967

In some tests, we explicity set the initial scale to 1 as some of the
existing tests assume 1 compaction group per shard.

test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. Scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.

To avoid this problem in tests, we use lower default for tablet count per
shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.

I observed that with 10 tablets per shard, debug-mode
topology_custom.mv/test_mv_topology_change starts to time-out during
removenode (30 s).
2025-02-19 14:38:50 +01:00
Kefu Chai
aa8c27b872 db: prevent accidental copies of result_set_row by making it move-only
result_set_row is a heavyweight object containing multiple cell types:
regular columns, partition keys, and static values. To prevent expensive
accidental copies, delete the copy constructor and replace it with:

1. A move constructor for efficient vector reallocation
2. An explicit copy() method when copies are actually needed

This change reduces overhead in some non-hot paths by eliminating implicit
deep copies. Please note, previously, in `create_view_from_mutation()`,
we kept a copy of `result_set_row`, and then reused `table_rs` for
holding the mutation for `scylla_tables`. Because we don't copy
the `result_set_row` in this change, in order to avoid invalidating
the `row` after reusing `table_rs` in the outer scope, we define a
new `table_rs` shadowing the one in the out scope.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22741
2025-02-17 09:48:08 +02:00
Kefu Chai
7ff0d7ba98 tree: Remove unused boost headers
This commit eliminates unused boost header includes from the tree.

Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22857
2025-02-15 20:32:22 +02:00
Abhinav Jha
e491950c47 raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
In the current scenario, topology_change_kind variable, was been handled using
 _manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)

Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.

Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.

By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.

Fixes: scylladb/scylladb#21114
2025-02-14 16:56:17 +05:30
Botond Dénes
f808f84a45 db/config: improve description of repair_multishard_reader_enable_read_ahead
The current description has a typo and in general not informative enough
on when this option should be used.

Closes scylladb/scylladb#21758
2025-02-11 22:16:09 +02:00
Pavel Emelyanov
529ff3efa5 Merge 'Alternator: implement UpdateTable operation to add or delete GSI' from Nadav Har'El
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages will explained, this required changing how Alternator stores materialized view keys - instead of insisting that these key must be real columns (that is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes #11567.

We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes #11471.

This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.

Closes scylladb/scylladb#21989

* github.com:scylladb/scylladb:
  alternator: fix view build on oversized GSI key attribute
  mv: clean up do_delete_old_entry
  test/alternator: unflake test for IndexStatus
  test/alternator: work around unrelated bug causing test flakiness
  docs/alternator: adding a GSI is no longer an unimplemented feature
  test/alternator: remove xfail from all tests for issue 11567
  alternator: overhaul implementation of GSIs and support UpdateTable
  mv: support regular_column_transformation key columns in view
  alternator: add new materialized-view computed column for item in map
  build: in cmake build, schema needs alternator
  build: build tests with Alternator
  alternator: add function serialized_value_if_type()
  mv: introduce regular_column_transformation, a new type of computed column
  alternator: add IndexStatus/Backfilling in DescribeTable
  alternator: add "LimitExceededException" error type
  docs/alternator: document two more unimplemented Alternator features
2025-02-11 10:02:01 +03:00
Avi Kivity
c212f5a296 db/config: forward-declare boost options_description_easy_init
Reduces large dependency pull from boost.

Closes scylladb/scylladb#22748
2025-02-10 15:08:11 +02:00
Kefu Chai
0185aa458b build: cmake: remove trailing comma in db/CMakeLists.txt source list
In c5668d99, a new source file row_cache.cc was added to the `db` target,
but with an extraneous trailing comma. In CMake's target_sources(),
source files should be space-separated - any comma is interpreted as
part of the filename, causing build failures like:
```
  CMake Error at db/CMakeLists.txt:2 (target_sources):
    Cannot find source file:
      row_cache.cc,
```
Fix the issue by removing the trailing comma.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22754
2025-02-09 17:28:47 +02:00
Avi Kivity
9712390336 Merge 'Add per-table tablet options in schema' from Benny Halevy
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.

* New feature, no backport required

Closes scylladb/scylladb#22090

* github.com:scylladb/scylladb:
  tablets: resize_decision: get rid of initial_decision
  tablet_allocator: consider tablet options for resize decision
  tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
  network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
  network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
  network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
  cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
  cql3/create_keyspace_statement: add deprecation warning for initial tablets
  test: cqlpy: test_tablets: add tests for per-table tablet options
  schema: add per-table tablet options
  feature_service: add TABLET_OPTIONS cluster schema feature
2025-02-08 20:32:19 +02:00
Kefu Chai
a6f703414a db: switch from boost::adaptors::indirected to std::views
replace boost::adaptors::indirected using std::views::transform for
less header dependency.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22731
2025-02-08 17:36:46 +02:00
Avi Kivity
861fb58e14 Merge 'vector: add support for vector type' from Dawid Pawlik
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.

The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
    - add `collection_constructor::style_type::vector`
    - rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation

Co-authored-by: @janpiotrlakomy

Fixes https://github.com/scylladb/scylladb/issues/19455

Closes scylladb/scylladb#22488

* github.com:scylladb/scylladb:
  docs: add vector type documentation
  cassandra_tests: translate tests covering the vector type
  type_codec: add vector type encoding
  boost/expr_test: add vector expression tests
  expression: adjust collection constructor list style
  expression: add vector style type
  test/boost: add vector type cql_env boost tests
  test/boost: add vector type_parser tests
  type_parser: support vector type
  cql3: add vector type syntax
  types: implement vector_type_impl
2025-02-06 20:36:50 +02:00
Kefu Chai
5c7ad745fd db: do not include unused headers
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

also, took this opportunity to remove an unused namespace alias. and
add an include which is used actually. please note,
`std::ranges::pop_heap()` and friends are actually provided by
`<algorithm>` not `<ranges>`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22716
2025-02-06 13:38:19 +02:00
Nadav Har'El
cae8a7222e alternator: fix view build on oversized GSI key attribute
Before this patch, the regular_column_transformation constructor, which
we used in Alternator GSIs to generates a view key from a regular-column
cell, accepted a cell of any size. As a reviewer (Avi) noticed, very
long cells are possible, well beyond what Scylla allows for keys (64KB),
and because regular_column_transformation stores such values in a
contiguous "bytes" object it can cause stalls.

But allowing oversized attributes creates an even more accute problem:
While view building (backfilling in DynamoDB jargon), if we encounter
an oversized (>64KB) key, the view building step will fail and the
entire view building will hang forever.

This patch fixes both problems by adding to regular_column_transformation's
constructor the check that if the cell is 64KB or larger, an empty value
is returned for the key. This causes the backfilling to silently skip
this item, which is what we expect to happen (backfilling cannot do
anything to fix or reject the pre-existing items in the best table).

A test test_gsi_updatetable.py::test_gsi_backfill_oversized_key is
introduced to reproduce this problem and its fix. The test adds a 65KB
attribute to a base table, and then adds GSIs to this table with this
attribute as its partition key or its sort key. Before this patch, the
backfilling process for the new GSIs hangs, and never completes.
After this patch, the backfilling completes and as expected contains
other base-table items but not the item with the oversized attribute.
The new test also passes on DynamoDB.

However, while implementing this fix I realized that issue #10347 also
exists for GSIs. Issue #10347 is about the fact that DynamoDB limits
partition key and sort key attributes to 2048 and 1024 bytes,
respectively. In the fix described above we only handled the accute case
of lengths above 64 KB, but we should actually skip items whose GSI
keys are over 2048 or 1024 bytes - not 64KB. This extra checking is
not handled in this patch, and is part of a wider existing issue:
Refs #10347

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:50 +01:00
Nadav Har'El
7a0027bacc mv: clean up do_delete_old_entry
The function do_delete_old_entry() had an if() which was supposedly for
the case of collection column indexing, and which our previous patch
that improved this function to support caller-specified deletion_ts
left behind.

As a reviewer noticed, the new tombstone-setting code was in an "else"
of that existing if(), and it wasn't clear what happens if we get to that
else in the collection column indexing. So I reviewed the code and added
breakpoints and realized that in fact, do_delete_old_entry() is never
called for the collection-indexing case, which has its own
update_entry_for_computed_column() which view_updates::generate_update()
calls instead of the do_delete_old_entry() function and its friends.
So it appears that do_delete_old_entry() doesn't need that special
case at all, which simplifies it.

We should eventually simplify this code further. In particular, the
function generate_update() already knows the key of the rows it
adds or deletes so do_delete_old_entry() and its friends don't need
to call get_view_rows() to get it again. But these simplifications
and other will need to come in a later patch series, this one is
already long enough :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:49 +01:00
Nadav Har'El
bc7b5926d2 mv: support regular_column_transformation key columns in view
In an earlier patch, we introduced regular_column_transformation,
a new type of computed column that does a computation on a cell in
regular column in the base and returns a potentially transformed cell
(value or deletion, timestamp and ttl). In *this* patch, we wire the
materialized view code to support this new kind of computed column that
is usable as a materialized-view key column. This new type of computed
column is not yet used in this patch - this will come in the next
patch, where we will use it for Alternator GSIs.

Before this patch, the logic of deciding when the view update needs
to create a new row or delete a new one, and which timestamp and ttl
to give to the new row, could depend on one (or two - in Alternator)
cells read from base-table regular columns. In this patch, this logic
is rewritten - the notion of "base table regular columns" is generalized
to the notion of "updatable view key columns" - these are view key
columns that an update may change - because they really are base regular
columns, or a computed function of one (regular_column_transformation).

In some sense, the new code is easier to understand - there is no longer
a separate "compute_row_marker()" function, rather the top-level
generate_update() is now in charge of finding the "updatable view key
columns" and calculate the row marker (timestamp and ttl) as part
of deciding what needs to be done.

But unfortunately the code still has separate code paths for "collection
secondary indexing", and also for old-style column_computation (basically,
only token_column_computation). Perhaps in the future this can be further
simplified.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:49 +01:00
Nadav Har'El
c8ea9f8470 mv: introduce regular_column_transformation, a new type of computed column
In the patches that follow, we want Alternator to be able to use as a
key for a materialized view (GSI) not a real column from the schema,
but rather an attribute value deserialized from a member of the ":attrs"
map.

For this, we need the ability for materialized view to define a key
column which is computed as function of a real column (":attrs").

We already have an MV feature which we called "computed column"
(column_computation), but it is wholy inadequate for this job:
column_computation can only take a partition key, and produce a value -
while we need it to take a regular column (one member of ":attrs"),
not just the partition key, and return a cell - value or deletion,
timestamp and TTL.

So in this patch we introduce a new type of computed column, which we
called "regular_column_transformation" since it intends to perform some
sort of transformation on a single column (or more accurately, a single
atomic cell). The limitation that this function transforms a single
column only is important - if we had a function of multiple columns,
we wouldn't know which timestamp or ttl it should use for the result
if the two columns had different timestamps or TTLs.

The new class isn't wired to anything yet: The MV code cannot handle
it yet, and the Alternator code will not use it yet.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:48 +01:00
Benny Halevy
c5668d99c9 schema: add per-table tablet options
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, it creates sstables which token-range is confined
to the tablet owning them.

On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).

Having too few tablets might limit performance due not
being served by all shards or by imbalance between shards
caused by quantization.  The number of tabelts per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small N+1/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.

Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead and having too many tablets
in the system with many tables and many tablets for each of them
may overwhelm the system's and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to acces, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptors limitations.

The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations.  This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:55:51 +02:00
Benny Halevy
ad8b0649ff feature_service: add TABLET_OPTIONS cluster schema feature
To be used for enabling per-table tablet options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:55:51 +02:00
Botond Dénes
3d12451d1f db/config: reader_concurrency_semaphore_cpu_concurrency: bump default to 2
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound reads to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.

Fixes: #22450

Closes scylladb/scylladb#22679
2025-02-05 21:52:20 +02:00
Ran Regev
edd56a2c1c moved cache files to db
As requested in #22097, moved the files
and fixed other includes and build system.

Fixes: #22097
Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#22495
2025-02-04 12:21:31 +03:00
Pavel Emelyanov
e47c7d5255 Merge 'config: Improve internode_compression option validation and documentation' from Kefu Chai
This PR enhances the internode_compression configuration option in two ways:

1. Add validation for option values
   Previously, we silently defaulted to 'none' when given invalid values. Now we
   explicitly validate against the three supported values (all, dc, none) and
   reject invalid inputs. This provides better error messages when users
   misconfigure the option.

2. Fix documentation rendering
   The help text for this option previously used C++ escape sequences which
   rendered incorrectly in Sphinx-generated HTML. We now use bullet points with
   '*' prefix to list the available values, matching our documentation style
   for other config options. This ensures consistent rendering in both CLI
   and HTML outputs.

Note: The current documentation format puts type/default/liveness information
in the same bullet list as option values. This affects other config options
as well and will need to be addressed in a separate change.

---

this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport.

Closes scylladb/scylladb#22548

* github.com:scylladb/scylladb:
  config: validate internode_compression option values
  config: start available options with '*'
2025-02-04 10:17:23 +03:00
Michael Litvak
6d34125eb7 view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493
2025-01-30 14:35:18 +02:00
Botond Dénes
af46894bb7 Merge 'Rack aware view pairing' from Benny Halevy
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map

We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
  is a multiple of the number of racks and the minimum number of nodes
  per rack in the dc is greater than or equal to rf / nr_racks.

  In this case (that includes the single rack case), all racks would
  have the same number of replicas, so we first filter all replicas
  by dc and rack, retaining their ordinality in the process, and
  finally, we pair between the base replicas and view replicas,
  that are in the same rack, using their original order in the
  tablet-map replica set.

  For example, nr_racks=2, rf=4:
  base_replicas = { N00, N01, N10, N11 }
  view_replicas = { N11, N12, N01, N02 }
  pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
  Note that we don't optimize for self-pairing if it breaks pairing ordinality.

- Complex rack-aware pairing: when the replication factor is not
  a multiple of nr_racks.  In this case, we attempt best-match
  pairing in all racks, using the minimum number of base or view replicas
  in each rack (given their global ordinality), while pairing all the other
  replicas, across racks, sorted by their ordinality.

  For example, nr_racks=4, rf=3:
  base_replicas = { N00, N10, N20 }
  view_replicas = { N11, N21, N31 }
  pairing would be: { N00, N31 }\*, { N10, N11 }, { N20, N21 }
  \* cross-rack pair

  If we'd simply stable-sort both base and view replicas by rack,
  we might end up with much worse pairing across racks:
  { N00, N11 }\*, { N10, N21 }\*, { N20, N31 }\*
  \* cross-rack pair

Fixes scylladb/scylladb#17147

* This is an improvement so no backport is required

Closes scylladb/scylladb#21453

* github.com:scylladb/scylladb:
  network_topology_strategy_test: add tablets rack_aware_view_pairing tests
  view: get_view_natural_endpoint: implement rack-aware pairing for tablets
  view: get_view_natural_endpoint: handle case when there are too few view replicas
  view: get_view_natural_endpoint: track replica locator::nodes
  locator: topology: consult local_dc_rack if node not found by host_id
  locator: node: add dc and rack getters
  feature_service: add tablet_rack_aware_view_pairing feature
  view: get_view_natural_endpoint: refactor predicate function
  view: get_view_natural_endpoint: clarify documentation
  view: mutate_MV: optimize remote_endpoints filtering check
  view: mutate_MV: lookup base and view erms synchronously
  view: mutate_MV: calculate keyspace-dependent flags once
2025-01-30 11:32:19 +02:00
Botond Dénes
8e89f2e88e Merge 'audit: make categories, tables, and keyspaces liveupdatable' from Andrzej Jackowski
This change:
 - Remove code that prevented audit from starting if audit_categories,
   audit_tables, and audit_keyspaces are not configured
 - Set liveness::LiveUpdate for audit_categories, audit_tables,
   and audit_keyspaces
 - Keep const reference to db::config in audit, so current config values
   can be obtained by audit implementation
 - Implement function audit::update_config to parse given string, update
   audit datastructures when needed, and log the changes.
 - Add observers to call audit::update_config when categories,
   tables, or keyspaces configuration changes

New functionality, so no backport needed.
Fixes https://github.com/scylladb/scylla-enterprise/issues/1789

Closes scylladb/scylladb#22449

* github.com:scylladb/scylladb:
  audit: make categories, tables, and keyspaces liveupdatable
  audit: move static parsing functions above audit constructors
  audit: move statement_category to string conversion to static function
  audit: start audit even with empty categories/tables/keyspaces
2025-01-30 11:28:49 +02:00
Kefu Chai
8d0cabb392 config: validate internode_compression option values
Previously, the internode_compression option silently defaulted to 'none'
for any unrecognized value instead of validating input. It only compared
against 'all' and 'dc', making it error-prone.

Add explicit validation for the three supported values:
- all
- dc
- none

This ensures invalid values are rejected both in command line and YAML
configuration, providing better error messages to users.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-29 14:52:35 +08:00
Kefu Chai
a81416862a config: start available options with '*'
use '*' prefix for config option values instead of escape sequences

The custom Sphinx extension that generates documentation from config.cc
help messages has issues with C++ escape sequences. For example,
"\tall: All traffic" renders incorrectly as "tall: All traffic" in HTML
output.

Instead of using escape sequences, switch to bullet-point style with '*'
prefix which works better in both CLI and HTML rendering. This matches
our existing documentation style for available option values in other
configs.

Note: This change puts type/default/liveness info in the same bullet
list as option values. This limitation affects other similar config
options and will need to be addressed comprehensively in a future
change.

Refs scylladb/scylladb#22423
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-29 14:52:35 +08:00
Dawid Pawlik
9954eb0ed7 type_parser: support vector type
This change is introduced due to lack of support for vector class name,
used by type_parser to create data_type based on given class name
(especially compound class name with inner types or other parameters).

Add function that parses vector type parameters from a class name.
2025-01-28 21:14:49 +01:00
Pavel Emelyanov
4d8d7f1f1d Merge 'backup_task: remove a component once it is uploaded ' from Kefu Chai
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).

This change removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.

This change only "Refs" https://github.com/scylladb/scylladb/issues/20655, because, we can further optimize
the backup process, consider:

- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.

Fixes https://github.com/scylladb/scylladb/issues/21799
Refs https://github.com/scylladb/scylladb/issues/20655

---

the backup API is not used in production yet, so no need to backport.

Closes scylladb/scylladb#22285

* github.com:scylladb/scylladb:
  backup_task: remove a component once it is uploaded
  backup_task: extract component upload logic into dedicated function
  snapshot-ctl: change snapshot_ctl::run_snapshot_modify_operation() to regular func
2025-01-28 14:27:50 +03:00
Kefu Chai
57b14220ce tree: remove unused "#include"s
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`,
use `using seastar::future` to include `future` in the global namespace,
this makes `readers/mutation_reader.hh` a header exposing `future<>`,
but this is not a good practice, because, unlike `seastarx.hh` or
`seastar/core/future.hh`, `reader/mutation_reader.hh`  is not
responsible for exposing seastar declarations. so, we trade the
using statement for `#include "seastarx.hh"` in that file to decouple
the source files including it from this header because of this statement.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22439
2025-01-28 14:12:06 +03:00
Andrzej Jackowski
5651cc49ed audit: make categories, tables, and keyspaces liveupdatable
This change:
 - Set liveness::LiveUpdate for audit_categories, audit_tables,
   and audit_keyspaces
 - Keep const reference to db::config in audit, so current config values
   can be obtained by audit implementation
 - Implement function audit::update_config to parse given string, update
   audit datastructures when needed, and log the changes.
 - Add observers to call audit::update_config when categories,
   tables, or keyspaces configuration changes

Fixes scylladb/scylla-enterprise#1789
2025-01-27 11:37:13 +01:00
Avi Kivity
a23a3110b5 utils: config_file: forward_declare boost::program_options classes
Avoid pulling in boost dependencies when all we need is the class name.

Closes scylladb/scylladb#22453
2025-01-27 10:45:43 +03:00