Range scans are expected to go though lots of tombstones, no need to
spam the logs about this. The tombstone warning log is demoted to debug
level, if somebody wants to see it they can bump the logger to debug
level.
Fixes: https://github.com/scylladb/scylladb/issues/23093Closesscylladb/scylladb#23094
This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and the metrics collecting server (Prometheus or similar)
The idea in this series is to create an equivalent of levels with a label.
First, label a subset of the metrics used by the dashboards.
Second, the per-table metrics that are now off by default will be marked with a different label.
The following specific optional features: CDC, CAS, and Alternator have a dedicated label now.
This will allow users to disable all metrics of features that are not in use.
All the rest of the metrics are left unlabeled.
Without any changes, users would get the same metrics they are getting today.
But you could pass the `__level=1` and get only those metrics the dashboard needs. That reduces between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies).
The labels are not reported based on the seastar feature of hiding labels that start with an underscore.
Closesscylladb/scylladb#12246
* github.com:scylladb/scylladb:
db/view/view.cc: label metrics with basic_level
transport/server.cc: label metrics with basic_level
service/storage_proxy.cc: label metrics with basic_level and cas
main.cc: label metrics with basic_level
streaming/stream_manager.cc: label metrics with basic_level
repair/repair.cc: label metrics with basic_level
service/storage_service.cc: label metrics with basic_level
gms/gossiper.cc: label metrics with basic_level
replica/database.cc: label metrics with basic_level
cdc/log.cc: label metrics with basic_level and cdc
alternator: label metrics with basic_level and alternator
row_cache.cc: label metrics with basic_level
query_processor.cc: label metrics with basic_level
sstables.cc: label metrics with basic_level
utils/logalloc.cc label metrics with basic_level
commitlog.cc: label metrics with basic_level
compaction_manager.cc: label metrics with basic_level
Adding the __level and features labels
Refs #22916
Adds an "enable_session_tickets" option to TLS setup for our server
endpoints (not documented for internode RPC, as we don't handle it
on the client side there), allowing enabling of TLS3 client session
ticket, i.e. quicker reconnect.
Session tickets are valid within a time frame or until a node
restarts, whichever comes first.
v2:
Use "TLS1.3" in help message
Closesscylladb/scylladb#22928
The following metrics will be marked with basic_level label:
scylla_commitlog_segments
scylla_commitlog_allocating_segments
scylla_commitlog_unused_segments
scylla_commitlog_alloc
scylla_commitlog_flush
scylla_commitlog_bytes_written
scylla_commitlog_pending_allocations
scylla_commitlog_requests_blocked_memory
scylla_commitlog_flush_limit_exceeded
scylla_commitlog_disk_total_bytes
scylla_commitlog_disk_active_bytes
scylla_commitlog_disk_slack_end_bytes
Replace boost::accumulate() calls with std::ranges facilities. This
change reduces external dependencies and modernizes the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23062
Fixes#22314
Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.
Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.
Closesscylladb/scylladb#22327
* github.com:scylladb/scylladb:
tools: Add standard extensions and propagate to schema load
cql_test_env: Use add all extensions instead of inidividually
main: Move extensions adding to function
tomstone_gc: Make validate work for tools
class clustering_range is a range of Clustering Key Prefixes implemented
as interval<clustering_key_prefix>. However, due to the nature of
Clustering Key Prefix, the ordering of clustering_range is complex and
does not satisfy the invariant of interval<>. To be more specific, as a
comment in interval<> implementation states: “The end bound can never be
smaller than the start bound”. As a range of CKP violates the invariant,
some algorithms, like intersection(), can return incorrect results.
For more details refer to scylladb#8157, scylladb#21604, scylladb#22817.
This commit:
- Add a WARNING comment to discourage usage of clustering_range
- Add WARNING comments to potentially incorrect uses of
interval<clustering_key_prefix> non-trivial methods
- Add a FIXME comment to incorrect use of
interval<clustering_key_prefix_view>::deoverlap and WARNING comments
to related interval<clustering_key_prefix_view> misuse.
Closesscylladb/scylladb#22913
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22997
When a backup upload is aborted due to instance shutdown, change the log
level from ERROR to INFO since this is expected behavior. Previously,
`abort_requested_exception` during upload would trigger an ERROR log, causing
test failures since error logs indicate unexpected issues.
This change:
- Catches `abort_requested_exception` specifically during file uploads
- Logs these shutdown-triggered aborts at INFO level instead of ERROR
- Aligns with how `abort_requested_exception` is handled elsewhere in the service
This prevents false test failures while still informing administrators
about aborted uploads during shutdown.
Fixesscylladb/scylladb#22391
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22995
This series achieves two things:
1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes https://github.com/scylladb/scylladb/issues/21967
2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count
The per-shard goal is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
The scaling is applied after computing desired tablet count due to
all other factors: per-table tablet count hints, defaults, average tablet size.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
When creating a new table, its tablet count is determined by tablet
scheduler using the scheduler logic, as if the table was already created.
So any scaling due to per-shard tablet count goal is reflected immediately
when creating a table. It may however still take some time for the system
to shrink existing tables. We don't reject requests to create new tables.
Fixes#21458Closesscylladb/scylladb#22522
* github.com:scylladb/scylladb:
config, tablets: Allow tablets_initial_scale_factor to be a fraction
test: tablets_test: Test scaling when creating lots of tables
test: tablets_test: Test tablet count changes on per-table option and config changes
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
config: Make tablets_initial_scale_factor live-updateable
tablets: load_balancer: Pick initial_scale_factor from config
tablets, load_balancer: Fix and improve logging of resize decisions
tablets, load_balancer: Log reason for target tablet count
tablets: load_balancer: Move hints processing to tablet scheduler
tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
tablets: load_balancer: Determine desired count from size separately from count from options
tablets: load_balancer: Determine resize decision from target tablet count
tablets: load_balancer: Allow splits even if table stats not available
tablets: load_balancer: Extract make_sizing_plan()
tablets: Add formatter for resize_decision::way_type
tablets: load_balancer: Simplify resize_urgency_cmp()
tablets: load_balancer: Keep config items as instance members
locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
tablets: Set default initial tablet count scale to 10
tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
tablets: load_balancer: Extract get_schema_and_rs()
tablets: load_balancer: Drop test_mode
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.
This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
manner.
Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also
performs a refactoring to populate all init data to system tables even
before group0 creation(via `raft_initialize_discovery_leader` function).
Now because `raft_initialize_discovery_leader` happens before the group
0 creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.
Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.
By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#21114Closesscylladb/scylladb#22484
* https://github.com/scylladb/scylladb:
storage_service: Remove the variable _manage_topology_change_kind_from_group0
storage_service: fix indentation after the previous commit
raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
service/raft: Refactor mutation writing helper functions.
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
Currently the scale is applied post rounding up of tablet count so
that tablet count per shard is at least 1. In order to be able to use
the scale to increase tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot per-shard tablet
replica count.
Example:
4 nodes, -c1, rf=3, initial_tablets_scale=10
Before: initial_tablet_count=20, tablet-per-shard=15
After: initial_tablet_count=14, tablets-per-shard=10.5
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes#21967
In some tests, we explicity set the initial scale to 1 as some of the
existing tests assume 1 compaction group per shard.
test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. Scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.
To avoid this problem in tests, we use lower default for tablet count per
shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.
I observed that with 10 tablets per shard, debug-mode
topology_custom.mv/test_mv_topology_change starts to time-out during
removenode (30 s).
result_set_row is a heavyweight object containing multiple cell types:
regular columns, partition keys, and static values. To prevent expensive
accidental copies, delete the copy constructor and replace it with:
1. A move constructor for efficient vector reallocation
2. An explicit copy() method when copies are actually needed
This change reduces overhead in some non-hot paths by eliminating implicit
deep copies. Please note, previously, in `create_view_from_mutation()`,
we kept a copy of `result_set_row`, and then reused `table_rs` for
holding the mutation for `scylla_tables`. Because we don't copy
the `result_set_row` in this change, in order to avoid invalidating
the `row` after reusing `table_rs` in the outer scope, we define a
new `table_rs` shadowing the one in the out scope.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22741
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22857
In the current scenario, topology_change_kind variable, was been handled using
_manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)
Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.
Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.
By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.
Fixes: scylladb/scylladb#21114
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages will explained, this required changing how Alternator stores materialized view keys - instead of insisting that these key must be real columns (that is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes#11567.
We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes#11471.
This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.
Closesscylladb/scylladb#21989
* github.com:scylladb/scylladb:
alternator: fix view build on oversized GSI key attribute
mv: clean up do_delete_old_entry
test/alternator: unflake test for IndexStatus
test/alternator: work around unrelated bug causing test flakiness
docs/alternator: adding a GSI is no longer an unimplemented feature
test/alternator: remove xfail from all tests for issue 11567
alternator: overhaul implementation of GSIs and support UpdateTable
mv: support regular_column_transformation key columns in view
alternator: add new materialized-view computed column for item in map
build: in cmake build, schema needs alternator
build: build tests with Alternator
alternator: add function serialized_value_if_type()
mv: introduce regular_column_transformation, a new type of computed column
alternator: add IndexStatus/Backfilling in DescribeTable
alternator: add "LimitExceededException" error type
docs/alternator: document two more unimplemented Alternator features
In c5668d99, a new source file row_cache.cc was added to the `db` target,
but with an extraneous trailing comma. In CMake's target_sources(),
source files should be space-separated - any comma is interpreted as
part of the filename, causing build failures like:
```
CMake Error at db/CMakeLists.txt:2 (target_sources):
Cannot find source file:
row_cache.cc,
```
Fix the issue by removing the trailing comma.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22754
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.
* New feature, no backport required
Closesscylladb/scylladb#22090
* github.com:scylladb/scylladb:
tablets: resize_decision: get rid of initial_decision
tablet_allocator: consider tablet options for resize decision
tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
cql3/create_keyspace_statement: add deprecation warning for initial tablets
test: cqlpy: test_tablets: add tests for per-table tablet options
schema: add per-table tablet options
feature_service: add TABLET_OPTIONS cluster schema feature
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.
The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
- add `collection_constructor::style_type::vector`
- rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation
Co-authored-by: @janpiotrlakomy
Fixes https://github.com/scylladb/scylladb/issues/19455Closesscylladb/scylladb#22488
* github.com:scylladb/scylladb:
docs: add vector type documentation
cassandra_tests: translate tests covering the vector type
type_codec: add vector type encoding
boost/expr_test: add vector expression tests
expression: adjust collection constructor list style
expression: add vector style type
test/boost: add vector type cql_env boost tests
test/boost: add vector type_parser tests
type_parser: support vector type
cql3: add vector type syntax
types: implement vector_type_impl
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
also, took this opportunity to remove an unused namespace alias. and
add an include which is used actually. please note,
`std::ranges::pop_heap()` and friends are actually provided by
`<algorithm>` not `<ranges>`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22716
Before this patch, the regular_column_transformation constructor, which
we used in Alternator GSIs to generates a view key from a regular-column
cell, accepted a cell of any size. As a reviewer (Avi) noticed, very
long cells are possible, well beyond what Scylla allows for keys (64KB),
and because regular_column_transformation stores such values in a
contiguous "bytes" object it can cause stalls.
But allowing oversized attributes creates an even more accute problem:
While view building (backfilling in DynamoDB jargon), if we encounter
an oversized (>64KB) key, the view building step will fail and the
entire view building will hang forever.
This patch fixes both problems by adding to regular_column_transformation's
constructor the check that if the cell is 64KB or larger, an empty value
is returned for the key. This causes the backfilling to silently skip
this item, which is what we expect to happen (backfilling cannot do
anything to fix or reject the pre-existing items in the best table).
A test test_gsi_updatetable.py::test_gsi_backfill_oversized_key is
introduced to reproduce this problem and its fix. The test adds a 65KB
attribute to a base table, and then adds GSIs to this table with this
attribute as its partition key or its sort key. Before this patch, the
backfilling process for the new GSIs hangs, and never completes.
After this patch, the backfilling completes and as expected contains
other base-table items but not the item with the oversized attribute.
The new test also passes on DynamoDB.
However, while implementing this fix I realized that issue #10347 also
exists for GSIs. Issue #10347 is about the fact that DynamoDB limits
partition key and sort key attributes to 2048 and 1024 bytes,
respectively. In the fix described above we only handled the accute case
of lengths above 64 KB, but we should actually skip items whose GSI
keys are over 2048 or 1024 bytes - not 64KB. This extra checking is
not handled in this patch, and is part of a wider existing issue:
Refs #10347
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function do_delete_old_entry() had an if() which was supposedly for
the case of collection column indexing, and which our previous patch
that improved this function to support caller-specified deletion_ts
left behind.
As a reviewer noticed, the new tombstone-setting code was in an "else"
of that existing if(), and it wasn't clear what happens if we get to that
else in the collection column indexing. So I reviewed the code and added
breakpoints and realized that in fact, do_delete_old_entry() is never
called for the collection-indexing case, which has its own
update_entry_for_computed_column() which view_updates::generate_update()
calls instead of the do_delete_old_entry() function and its friends.
So it appears that do_delete_old_entry() doesn't need that special
case at all, which simplifies it.
We should eventually simplify this code further. In particular, the
function generate_update() already knows the key of the rows it
adds or deletes so do_delete_old_entry() and its friends don't need
to call get_view_rows() to get it again. But these simplifications
and other will need to come in a later patch series, this one is
already long enough :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In an earlier patch, we introduced regular_column_transformation,
a new type of computed column that does a computation on a cell in
regular column in the base and returns a potentially transformed cell
(value or deletion, timestamp and ttl). In *this* patch, we wire the
materialized view code to support this new kind of computed column that
is usable as a materialized-view key column. This new type of computed
column is not yet used in this patch - this will come in the next
patch, where we will use it for Alternator GSIs.
Before this patch, the logic of deciding when the view update needs
to create a new row or delete a new one, and which timestamp and ttl
to give to the new row, could depend on one (or two - in Alternator)
cells read from base-table regular columns. In this patch, this logic
is rewritten - the notion of "base table regular columns" is generalized
to the notion of "updatable view key columns" - these are view key
columns that an update may change - because they really are base regular
columns, or a computed function of one (regular_column_transformation).
In some sense, the new code is easier to understand - there is no longer
a separate "compute_row_marker()" function, rather the top-level
generate_update() is now in charge of finding the "updatable view key
columns" and calculate the row marker (timestamp and ttl) as part
of deciding what needs to be done.
But unfortunately the code still has separate code paths for "collection
secondary indexing", and also for old-style column_computation (basically,
only token_column_computation). Perhaps in the future this can be further
simplified.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the patches that follow, we want Alternator to be able to use as a
key for a materialized view (GSI) not a real column from the schema,
but rather an attribute value deserialized from a member of the ":attrs"
map.
For this, we need the ability for materialized view to define a key
column which is computed as function of a real column (":attrs").
We already have an MV feature which we called "computed column"
(column_computation), but it is wholy inadequate for this job:
column_computation can only take a partition key, and produce a value -
while we need it to take a regular column (one member of ":attrs"),
not just the partition key, and return a cell - value or deletion,
timestamp and TTL.
So in this patch we introduce a new type of computed column, which we
called "regular_column_transformation" since it intends to perform some
sort of transformation on a single column (or more accurately, a single
atomic cell). The limitation that this function transforms a single
column only is important - if we had a function of multiple columns,
we wouldn't know which timestamp or ttl it should use for the result
if the two columns had different timestamps or TTLs.
The new class isn't wired to anything yet: The MV code cannot handle
it yet, and the Alternator code will not use it yet.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, it creates sstables which token-range is confined
to the tablet owning them.
On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).
Having too few tablets might limit performance due not
being served by all shards or by imbalance between shards
caused by quantization. The number of tabelts per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small N+1/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.
Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead and having too many tablets
in the system with many tables and many tablets for each of them
may overwhelm the system's and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to acces, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptors limitations.
The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations. This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound reads to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.
Fixes: #22450Closesscylladb/scylladb#22679
This PR enhances the internode_compression configuration option in two ways:
1. Add validation for option values
Previously, we silently defaulted to 'none' when given invalid values. Now we
explicitly validate against the three supported values (all, dc, none) and
reject invalid inputs. This provides better error messages when users
misconfigure the option.
2. Fix documentation rendering
The help text for this option previously used C++ escape sequences which
rendered incorrectly in Sphinx-generated HTML. We now use bullet points with
'*' prefix to list the available values, matching our documentation style
for other config options. This ensures consistent rendering in both CLI
and HTML outputs.
Note: The current documentation format puts type/default/liveness information
in the same bullet list as option values. This affects other config options
as well and will need to be addressed in a separate change.
---
this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport.
Closesscylladb/scylladb#22548
* github.com:scylladb/scylladb:
config: validate internode_compression option values
config: start available options with '*'
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.
A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.
The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.
To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
Fixesscylladb/scylladb#21829Closesscylladb/scylladb#22493
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map
We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
is a multiple of the number of racks and the minimum number of nodes
per rack in the dc is greater than or equal to rf / nr_racks.
In this case (that includes the single rack case), all racks would
have the same number of replicas, so we first filter all replicas
by dc and rack, retaining their ordinality in the process, and
finally, we pair between the base replicas and view replicas,
that are in the same rack, using their original order in the
tablet-map replica set.
For example, nr_racks=2, rf=4:
base_replicas = { N00, N01, N10, N11 }
view_replicas = { N11, N12, N01, N02 }
pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
Note that we don't optimize for self-pairing if it breaks pairing ordinality.
- Complex rack-aware pairing: when the replication factor is not
a multiple of nr_racks. In this case, we attempt best-match
pairing in all racks, using the minimum number of base or view replicas
in each rack (given their global ordinality), while pairing all the other
replicas, across racks, sorted by their ordinality.
For example, nr_racks=4, rf=3:
base_replicas = { N00, N10, N20 }
view_replicas = { N11, N21, N31 }
pairing would be: { N00, N31 }\*, { N10, N11 }, { N20, N21 }
\* cross-rack pair
If we'd simply stable-sort both base and view replicas by rack,
we might end up with much worse pairing across racks:
{ N00, N11 }\*, { N10, N21 }\*, { N20, N31 }\*
\* cross-rack pair
Fixesscylladb/scylladb#17147
* This is an improvement so no backport is required
Closesscylladb/scylladb#21453
* github.com:scylladb/scylladb:
network_topology_strategy_test: add tablets rack_aware_view_pairing tests
view: get_view_natural_endpoint: implement rack-aware pairing for tablets
view: get_view_natural_endpoint: handle case when there are too few view replicas
view: get_view_natural_endpoint: track replica locator::nodes
locator: topology: consult local_dc_rack if node not found by host_id
locator: node: add dc and rack getters
feature_service: add tablet_rack_aware_view_pairing feature
view: get_view_natural_endpoint: refactor predicate function
view: get_view_natural_endpoint: clarify documentation
view: mutate_MV: optimize remote_endpoints filtering check
view: mutate_MV: lookup base and view erms synchronously
view: mutate_MV: calculate keyspace-dependent flags once
This change:
- Remove code that prevented audit from starting if audit_categories,
audit_tables, and audit_keyspaces are not configured
- Set liveness::LiveUpdate for audit_categories, audit_tables,
and audit_keyspaces
- Keep const reference to db::config in audit, so current config values
can be obtained by audit implementation
- Implement function audit::update_config to parse given string, update
audit datastructures when needed, and log the changes.
- Add observers to call audit::update_config when categories,
tables, or keyspaces configuration changes
New functionality, so no backport needed.
Fixes https://github.com/scylladb/scylla-enterprise/issues/1789Closesscylladb/scylladb#22449
* github.com:scylladb/scylladb:
audit: make categories, tables, and keyspaces liveupdatable
audit: move static parsing functions above audit constructors
audit: move statement_category to string conversion to static function
audit: start audit even with empty categories/tables/keyspaces
Previously, the internode_compression option silently defaulted to 'none'
for any unrecognized value instead of validating input. It only compared
against 'all' and 'dc', making it error-prone.
Add explicit validation for the three supported values:
- all
- dc
- none
This ensures invalid values are rejected both in command line and YAML
configuration, providing better error messages to users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
use '*' prefix for config option values instead of escape sequences
The custom Sphinx extension that generates documentation from config.cc
help messages has issues with C++ escape sequences. For example,
"\tall: All traffic" renders incorrectly as "tall: All traffic" in HTML
output.
Instead of using escape sequences, switch to bullet-point style with '*'
prefix which works better in both CLI and HTML rendering. This matches
our existing documentation style for available option values in other
configs.
Note: This change puts type/default/liveness info in the same bullet
list as option values. This limitation affects other similar config
options and will need to be addressed comprehensively in a future
change.
Refs scylladb/scylladb#22423
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This change is introduced due to lack of support for vector class name,
used by type_parser to create data_type based on given class name
(especially compound class name with inner types or other parameters).
Add function that parses vector type parameters from a class name.
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).
This change removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.
This change only "Refs" https://github.com/scylladb/scylladb/issues/20655, because, we can further optimize
the backup process, consider:
- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.
Fixes https://github.com/scylladb/scylladb/issues/21799
Refs https://github.com/scylladb/scylladb/issues/20655
---
the backup API is not used in production yet, so no need to backport.
Closesscylladb/scylladb#22285
* github.com:scylladb/scylladb:
backup_task: remove a component once it is uploaded
backup_task: extract component upload logic into dedicated function
snapshot-ctl: change snapshot_ctl::run_snapshot_modify_operation() to regular func
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`,
use `using seastar::future` to include `future` in the global namespace,
this makes `readers/mutation_reader.hh` a header exposing `future<>`,
but this is not a good practice, because, unlike `seastarx.hh` or
`seastar/core/future.hh`, `reader/mutation_reader.hh` is not
responsible for exposing seastar declarations. so, we trade the
using statement for `#include "seastarx.hh"` in that file to decouple
the source files including it from this header because of this statement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22439
This change:
- Set liveness::LiveUpdate for audit_categories, audit_tables,
and audit_keyspaces
- Keep const reference to db::config in audit, so current config values
can be obtained by audit implementation
- Implement function audit::update_config to parse given string, update
audit datastructures when needed, and log the changes.
- Add observers to call audit::update_config when categories,
tables, or keyspaces configuration changes
Fixesscylladb/scylla-enterprise#1789