It has been observed to generate ~200 KiB allocations.
Since this is already a known issue, we can silence the warning
to clean up the logs.
Closes scylladb/scylladb#24360
Add a system:table_creation_time tag whose value is the table's creation timestamp in milliseconds.
If the tag is present, it is used to fill the creation timestamp value (when CreateTable or DescribeTable is called).
If the tag is missing, a timestamp of 0 is substituted (in other words, the table appears to have been created on January 1st, 1970).
Update the test to change how we verify the timestamp is actually used: we create two tables one after another and check that their creation timestamps are in the correct order.
Update the tests that work with tags to filter the system tags out.
Fixes #5013
Closes scylladb/scylladb#24007
This patch adds a couple of basic tests for system tables related to
secondary indexes - system."IndexInfo" and system_schema.indexes.
I wanted to understand these system tables better when writing
documentation for them, so I wrote these tests. These tests can also
serve as regression tests that verify that we don't accidentally lose
support for these system tables. I checked that these tests also pass
in Cassandra 3, 4 and 5.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#24137
This change prepares the ground for unifying state updates across raft-bound subsystems. It introduces schema_applier, which in the future will become a generic interface for applying mutations in raft.
Pulling `database::apply()` out of the schema merging code will allow batching changes to subsystems. Future generic code will first call `prepare()` on all implementations, then a single `database::apply()`, then `update()` on all implementations; then, on each shard, it will call `commit()` for all implementations without preemption, so that the change is observed as atomic across all subsystems, and finally `post_commit()`.
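For illustration, a minimal sketch of what such a phased interface could look like (hypothetical names and signatures; the real schema_applier differs):
```
#include <seastar/core/future.hh>
using seastar::future;

// One implementation per raft-bound subsystem (sketch).
class subsystem_applier {
public:
    virtual ~subsystem_applier() = default;
    virtual future<> prepare() = 0;      // gather diffs; may yield
    virtual future<> update() = 0;       // build the new state; may yield
    virtual void commit() noexcept = 0;  // per-shard, non-preemptive switch:
                                         // observed as atomic on the shard
    virtual future<> post_commit() = 0;  // notifications and cleanups
};
```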
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Closes scylladb/scylladb#20853
* github.com:scylladb/scylladb:
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during schema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
The existing `download_source` implementation optimizes performance
by keeping the connection to S3 open and draining data directly from
the socket. While this eliminates the overhead (60-100ms) of repeatedly
establishing new connections, it leads to rapid exhaustion of client-
side connections.
On a single shard, two `mx_readers` for load and stream are enough to
trigger this issue. Since each client typically holds two connections,
readers keeping index and data sources open can cause deadlocks where
processes stall due to unavailable connections.
Introduce `chunked_download_source`, a new S3 download method built on
`download_source`, to dynamically manage connections:
- Buffers data in 5MiB chunks using a producer-consumer model
- Closes connections once buffers reach capacity, returning them to
the pool for other clients
- Uses a filling fiber that resumes fetching once buffers are
consumed from the queue
Performance remains comparable to `download_source`, achieving
95MiB/s for sequential 1GiB downloads from S3. However, preloading
large chunks may cause read amplification.
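A minimal sketch of that producer-consumer shape (hedged: fetch_chunk() is a hypothetical stand-in for a ranged S3 GET that returns its connection to the pool before resolving; the real chunked_download_source differs in naming and error handling):
```
#include <algorithm>
#include <seastar/core/queue.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>
#include <seastar/core/coroutine.hh>

using namespace seastar;

// Hypothetical: ranged S3 GET that drains the body and releases the
// connection before the returned future resolves.
future<temporary_buffer<char>> fetch_chunk(uint64_t off, uint64_t len);

class chunked_source final : public data_source_impl {
    static constexpr uint64_t chunk_size = 5 << 20;  // 5MiB chunks
    queue<temporary_buffer<char>> _chunks{4};        // bounded buffer
    future<> _filler;                                // the filling fiber

    future<> fill(uint64_t off, uint64_t len) {
        while (len) {
            auto n = std::min(len, chunk_size);
            auto buf = co_await fetch_chunk(off, n); // connection held only here
            off += n;
            len -= n;
            co_await _chunks.push_eventually(std::move(buf)); // waits when full
        }
        co_await _chunks.push_eventually(temporary_buffer<char>()); // EOF
    }
public:
    chunked_source(uint64_t off, uint64_t len) : _filler(fill(off, len)) {}
    future<temporary_buffer<char>> get() override {
        // Popping frees a slot, resuming the filling fiber if it was blocked.
        return _chunks.pop_eventually();
    }
    future<> close() override {
        return std::move(_filler); // sketch only: real code must also unblock it
    }
};
```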
Fixes: https://github.com/scylladb/scylladb/issues/23785
Closes scylladb/scylladb#23880
This patch adds a new option to nodetool, extends the
load_new_ss_tables REST request with a new parameter, and
skips the reshape step in refresh if this flag is passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closes scylladb/scylladb#24409
Fixes: #24365
Fixes #24447
This factory type, which is really more of a data holder/connection producer
per connection instance, creates a new certificate_credentials on every
instance when using https. When used by the S3 client, that happens per
client and per scheduling group, which eventually means we do a
set_system_trust + "cold" handshake for every TLS connection created this
way. This causes both IO and cold/expensive certificate checking, hence
possible stalls and wasted CPU. Since the credentials object in question is
literally "just trust the system", it could very well be shared across the
shard.
This PR adds a thread-local static cached credentials object and uses it
instead. We could consider moving this to seastar, but maybe that is too much.
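A minimal sketch of the caching (hedged: simplified; the real patch caches inside the factory type, and a concurrent first call could build the credentials twice):
```
#include <seastar/net/tls.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/coroutine.hh>

using namespace seastar;

future<shared_ptr<tls::certificate_credentials>> system_trust_credentials() {
    // One credentials object per shard instead of one per connection.
    static thread_local shared_ptr<tls::certificate_credentials> cached;
    if (!cached) {
        auto creds = make_shared<tls::certificate_credentials>();
        co_await creds->set_system_trust(); // the expensive, one-time part
        cached = std::move(creds);
    }
    co_return cached;
}
```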
Closes scylladb/scylladb#24448
In test_cdc_generation_clearing we trigger events that update CDC
generations, verify the generations are updated as expected, and verify
the system topology and CDC generations are consistent on all nodes.
Before checking that all nodes are consistent and have the same CDC
generations, we need to consider that the changes are propagated through
raft and take some time to propagate to all nodes.
Currently, we wait for the change to be applied only on the first server,
which runs the CDC generation publisher fiber, and we read the CDC
generations from this single node. The consistency check that follows
could fail if the change has not propagated to some other node yet.
To fix that, before checking consistency with all nodes, we execute a
read barrier on all nodes so they all see the same state as the leader.
Fixes scylladb/scylladb#24407
Closes scylladb/scylladb#24433
The table::take_snapshot() method touches the snapshot directory, which is
good. It happens on all shards, which is not that good, because all
shards just step on each other's toes when doing it: the directory is not
sharded. The same goes for the post-snapshot directory sync -- it can happen
once, after all shards finish creating snapshot links.
Move both the touching and the syncing up one level. There's only one caller
of the method, so only one caller to update.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#24154
For reasons, we want to be able to disallow dictionary-aware compressors
in chosen deployments.
This patch adds a knob for that. When the knob is disabled,
dictionary-aware compressors will be rejected in the validation
stage of CREATE and ALTER statements.
Closes scylladb/scylladb#24355
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.
1) truncate gets RP mark X, truncated_at = second T
2) a new sstable is written during the snapshot or later, also at second T (differing only in milliseconds)
3) discard_sstables() picks up RP Y > saved RP X, since the creation time of the sstable
with RP Y is equal to truncated_at = second T.
So the problem is that truncate uses a clock of second granularity for
filtering out sstables written later; after we take the low mark and the truncate
time, an sstable can be flushed later within the same second, at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second to be filtered out. It's not perfect, but it's
extremely unlikely that a new write lands and gets flushed in the same
millisecond we recorded the truncated_at timepoint. In practice, truncate
will not be used concurrently with writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock, which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, whose creation
frequency is low enough to not have significant consequences; also,
db_clock should be cheap enough since it's usually syscall-less.
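A self-contained illustration of the granularity race (made-up timestamps; the real code compares an sstable's creation time against truncated_at):
```
#include <cassert>
#include <chrono>

int main() {
    using namespace std::chrono;
    // Truncation happens, then a new sstable is flushed 500ms later,
    // within the same wall-clock second.
    sys_time<milliseconds> truncated_at{seconds(1000) + milliseconds(200)};
    sys_time<milliseconds> flushed_at{seconds(1000) + milliseconds(700)};

    // Second granularity: the later sstable compares as "not later", so it
    // is not filtered out and gets wrongly discarded.
    assert(time_point_cast<seconds>(flushed_at) <=
           time_point_cast<seconds>(truncated_at));

    // Millisecond granularity (db_clock-like): it is seen as later and kept.
    assert(flushed_at > truncated_at);
}
```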
Fixes #23771.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#24426
This series introduces per-table metrics support for Alternator. It includes the following commits:
Add optional per-table metrics for Alternator
Introduces a shared_ptr-based mechanism that allows Alternator to register per-table metrics. These metrics follow the table's lifecycle, similar to how CQL metrics are handled. The use of shared_ptr ensures no direct dependency between table stats and Alternator.
Enable registration of stats objects per table
Adds support for registering a stats object using a keyspace and table name. Per-table metrics are prefixed with alternator_table to differentiate them from per-shard metrics. Metrics are reported once per node, and those not meaningful at the table level (e.g. create/delete) are excluded. All metrics use the skip_when_empty flag.
Update per-table metrics handling
Adds a helper function to retrieve the stats object from a table schema. Updates both per-shard and per-table metrics, resulting in some code duplication.
Add tests for per-table metrics
Extends existing tests to also validate the per-table metrics. These tests ensure that the new metrics are correctly registered and updated.
This series improves observability in Alternator by enabling fine-grained per-table metrics without disrupting existing per-shard metrics.
**No need to backport**
Fixes #19824
Closes scylladb/scylladb#24046
* github.com:scylladb/scylladb:
alternator/test_metrics.py: Test the per-table metrics
alternator/executor.cc: Update per-table metrics
alternator/stats: Add per-table metrics
replica/database.hh: Add alternator per-table metrics
alternator/stats.hh: Introduce a per-table stats container
The helper in question converts an iterable collection to a vector of fmt::to_string()-s of the collection elements.
Patch the caller to use standard library and remove the helper.
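For illustration, the standard-library equivalent looks roughly like this (assuming C++23's std::ranges::to; element type chosen arbitrarily):
```
#include <ranges>
#include <set>
#include <string>
#include <vector>
#include <fmt/format.h>

int main() {
    std::set<int> src{1, 2, 3};
    // Stringify every element and collect into a vector, no helper needed.
    auto strings = src
        | std::views::transform([] (const auto& e) { return fmt::to_string(e); })
        | std::ranges::to<std::vector>();
    // strings == {"1", "2", "3"}
}
```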
Closes scylladb/scylladb#24357
* github.com:scylladb/scylladb:
api: Drop no longer used container_to_vec helper
api: Use std::ranges to stringify collections
api: Use std::ranges to convert std::set<sstring> to std::vector<string>
api: Use db::config::data_file_directories()'s vector directly
api: Coroutinize get_live_endpoint()
The handler:
- gets host IDs from the local token metadata
- for each ID, gets the host IP and generates an IP:ID std::pair
- converts the sequence of generated pairs into an std::unordered_map
- converts the unordered map into a vector of jsonable key:value objects
This patch removes the 3rd step and makes the needed jsonable object in
step 2 directly, thus eliminating the intermediate unordered_map
creation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#24354
There are two places in the API that want to get the list of keyspace
names. For that they call database::get_keyspaces() and then extract
the keys from the returned name-to-keyspace map.
There's a database::get_all_keyspaces() method that does exactly that.
Remove the map_keys helper from api/api.hh, which becomes unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#24353
Since we now abort on failure during schema commit,
there is no need for cleanup, as it only manages in-memory
state.
An explicit cf.stop was added to code paths outside of schema
merging to avoid unnecessary regressions.
The same order of creation/destruction is preserved as in the
original code, from a single shard's point of view.
create_types() is called on each shard separately, while in theory
we should be able to reuse results similarly to diff_rows(). But we
don't introduce this optimization yet.
Now all keyspace-related schema changes are observable
on a given shard as if they were applied atomically.
This is achieved by the commit_on_shard() function being
non-preemptive (no futures, no co_awaits).
In the future we'll extend this to the whole schema
and also to other subsystems.
In this commit we make use of the split functions introduced before.
The pattern is as follows:
- in merge_tables_and_views we call some preparatory functions
- in schema_applier::update we call the non-yielding step
- in schema_applier::post_commit we call cleanups and other finalizing async
functions
Additionally, we introduce frozen_schema_diff, because converting
schema_ptr to global_schema_ptr triggers schema registration, and
with atomic changes we need to place registration only in the commit
phase. Schema freezing is the same method global_schema_ptr uses
to transport schemas across shards (via the schema_registry cache).
Before, for views and indexes, it fetched the base schema from the db (and
a couple of other properties). This is a problem once we introduce atomic
table and view deletion (in the following commit),
because once we delete a table it can no longer be fetched from the db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.
Now the whole relevant schema is fetched via the global_table_ptr
(table_shards) object.
It's not a good usage, as there is only one non-empty implementation.
Also, we need to change it further in the following commit, which
makes it incompatible with the listener code.
There is already an implicit logical dependency via migration_notifier,
but in the next commits we'll be moving store_service out of it,
as we need better control (i.e. the ability to return a value from the call).
- remove load_tablet_metadata(); instead we add a wake_up_load_balancer flag
to update_tablet_metadata(). This reduces the number of public functions and
also serves as documentation (a comment with very similar meaning was removed)
- reimplement the code to not use mutate_token_metadata(). This way
it's more readable, and it's also needed because we'll split
update_tablet_metadata() in the following commits so that we can have a
subroutine which doesn't yield (for ensuring atomicity)
This is similar to the work done for drop_table in the previous commit.
add_column_family_and_make_directory() behaves exactly the same
as before, but calls to it in schema_applier will be replaced by
direct calls to the split steps. Other usages will remain intact, as
they don't need atomicity (like creating system tables at startup).
This is done so that the actual dropping can be
an atomic step which can be composed with other
schema operations, and eventually with all subsystems modified
via raft, so that we can introduce atomic changes which
span different subsystems.
We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()
prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.
We add legacy_drop_table_on_all_shards(), which
behaves exactly like the old drop_table_on_all_shards(), to stay
compatible with code which doesn't need to play with atomicity.
Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to the split functions in the following
commits - that's where we will take advantage of drop_table not
yielding (as it returns void now).
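A hypothetical sketch of how the split steps compose (invented signatures, for illustration only; the real functions take more context):
```
future<> drop_table_atomically(replica::database& db, sstring ks, sstring cf) {
    // Lock the tables metadata once for the whole change set.
    auto lock = co_await prepare_tables_metadata_change_on_all_shards(db);
    // Gather the per-shard state needed for the switch.
    auto work = co_await prepare_drop_table_on_all_shards(db, ks, cf);
    // The atomic part: returns void, does not yield, cannot be preempted.
    drop_table(work);
    // Asynchronous teardown: sstables, directories, notifications.
    co_await cleanup_drop_table_on_all_shards(std::move(work));
}
```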
This will be the place for all atomic schema switching
operations.
Note that atomicity is observed only from a single shard's
point of view. All shards may switch at slightly different times,
as global locking for this is not feasible.
Once we create types atomically, the code that runs before commit
may depend on newly added types, so it has to access both the old and
the new types. A new storage called in_progress_types_storage was added.
The Alternator tests should pass on Alternator (of course), and almost always also on DynamoDB, to verify that the tests themselves are correct and don't just enshrine Alternator's incorrect behavior. Although much less important, it is sometimes useful to be able to check whether the tests also pass on other DynamoDB clones, especially "DynamoDB Local" - Amazon's DynamoDB mock written in Java.
In issue https://github.com/scylladb/scylladb/issues/7775 we noted that some of our tests don't actually pass on DynamoDB Local, for different reasons, but at the time that issue was created most of the tests did work. However, checking now on a newer version of DynamoDB Local (2.6.1), I noticed that _all_ tests failed for some silly reasons that are easy to fix - and this is what the two patches in this series fix. After these fixes, most of the Alternator tests pass on DynamoDB Local. But not all of them - #7775 is still open.
No backport needed - these are just test framework improvements for developers.
Closes scylladb/scylladb#24361
* github.com:scylladb/scylladb:
test/alternator: any response from healthcheck means server is alive
test/alternator: fall back to legal-looking access key id
Both ScyllaDB's and Datastax's documentation suggest that when creating a
view with CREATE MATERIALIZED VIEW, its SELECT clause doesn't need to list
the view's primary key columns because those are selected automatically.
For example, our documentation has an example in
https://docs.scylladb.com/manual/stable/features/materialized-views.html
```
CREATE MATERIALIZED VIEW building_by_city2 AS
SELECT meters FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);
```
Note how the primary key columns - city and name - are not explicitly
SELECTed.
I just discovered that while this behavior was indeed true in Cassandra
3 (and still true in ScyllaDB), it actually got broken in Cassandra 4 and 5.
I reported this apparent regression to Cassandra (CASSANDRA-20701), and
I propose the regression test in this patch to ensure that Scylla can't
suffer a similar regression in the future.
The new test passes on ScyllaDB and Cassandra 3, but fails on Cassandra
4 and 5 (and is therefore tagged with "cassandra_bug").
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#24399
Support for TTL-based data removal when using tablets.
The essence of this commit is a separate code path for finding token
ranges owned by the current shard for the case when tablets are used
rather than vnodes. At the same time, the vnodes case is not touched, so
as not to cause any regressions.
The TTL-caused data removal is normally performed by the primary
replica (both when using vnodes and tablets). For the tablets case,
the already-existing method tablet_map::get_primary_replica(tablet_id)
is used to know whether a shard executing the TTL-related data removal is
the primary replica for each tablet.
A new method tablet_map::get_secondary_replica(tablet_id) has been
added. It is needed by the data invalidation procedure to remove data
when the primary replica node is down - the data is then removed by the
secondary replica node. The mechanism is the same as in the vnodes case.
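Schematically, the replica selection looks like this (hedged sketch: is_alive(), my_replica() and expire_range() are hypothetical stand-ins; only get_primary_replica()/get_secondary_replica() come from the commit):
```
future<> expire_on_this_shard(const locator::tablet_map& tmap) {
    for (auto tid : tmap.tablet_ids()) {
        auto replica = tmap.get_primary_replica(tid);
        if (!is_alive(replica.host)) {
            // The primary's node is down: the secondary takes over.
            replica = tmap.get_secondary_replica(tid);
        }
        if (replica == my_replica()) {
            co_await expire_range(tmap.get_token_range(tid));
        }
    }
}
```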
Since Alternator now supports TTL, the test
`test_ttl_enable_error_with_tablets` has been removed.
Also, the tests in test_ttl.py have been made to run twice, once with
vnodes and once with tablets. When run with tablets, due to the lack of
support for LWT with tablets (#18068), the tests use a
'system:write_isolation' of 'unsafe_rmw'. This approach allows early
regression testing with tablets and is meant only as a tentative
solution.
Fixes scylladb/scylladb#16567
Closes scylladb/scylladb#23662
This patch adds tests for the newly added per-table metrics. It mainly
redoes existing tests, but verifies that the per-table metrics are
updated correctly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds support for updating per-table metrics. It introduces a
helper function that retrieves the stats object from a table schema.
The code uses an lw_shared_ptr for the stats object to ensure safe updates
even if the table holding it has been deleted.
There is some duplication in the updated code, as both per-shard and
per-table metrics are updated.
The rmw_operation::execute function now accepts two stats objects: one
for the global metrics and one for the per-table metrics. The use of
execute was also modified: rather than modifying the WCU directly, a
parameter is used so both global and per-table stats can be updated.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch allows registering a stats object per table. The per-table
stats object needs its metrics registry to be part of the table's
lifecycle, but there could be a scenario in which a table is already
deleted while some Alternator operations are still in progress. To
handle this, the patch separates the registry from the metrics holder.
It is safe to update a stats object even when its metrics are not registered.
Metrics registration is performed via functions instead of the
constructor.
The registration accepts a keyspace and table name as parameters.
The per-table metrics use an alternator_table prefix to distinguish them
from their per-shard equivalents.
The metrics are aggregated and reported once per node. Metrics that do
not make sense to report per table (such as create and delete) are not
registered. All metrics are marked with skip_when_empty.
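A minimal sketch of that split between values and registration (hypothetical names; the real stats object has many more counters):
```
#include <seastar/core/metrics.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>

namespace sm = seastar::metrics;

struct table_counters {
    uint64_t reads = 0;
};

class per_table_stats {
    // The values live behind an lw_shared_ptr, so in-flight operations can
    // keep updating them after the table (and the registry) are gone.
    seastar::lw_shared_ptr<table_counters> _c = seastar::make_lw_shared<table_counters>();
    sm::metric_groups _metrics; // empty until register_metrics() is called
public:
    // Registration is separate from construction and keyed by table name.
    void register_metrics(const seastar::sstring& ks, const seastar::sstring& cf) {
        _metrics.add_group("alternator_table", {
            sm::make_counter("reads", [c = _c] { return c->reads; },
                             sm::description("Reads against this table"),
                             {sm::label("ks")(ks), sm::label("cf")(cf)})
                .aggregate({sm::shard_label})   // reported once per node
                .set_skip_when_empty()
        });
    }
    table_counters& counters() { return *_c; } // safe even when unregistered
};
```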
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds optional per-table metrics for Alternator.
Like CQL, some of Alternator's statistics should be per-table. The
shared_ptr allows Alternator to register such metrics in a way that
makes them part of the table's lifecycle.
Using a shared_ptr does not create dependencies between the table_stats
and Alternator.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
A per-table stats container will be used to safely hold Alternator's
per-table stats.
It is built in a way that even if the metrics it holds are no longer
registered, it is still safe to use.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Register the current space_source_fn in an RAII
object that resets monitor._space_source to the
previous function when the RAII object is destroyed.
Use space_source_registration in database_test::
mutation_dump_generated_schema_deterministic_id_version
to prevent use-after-stack-return in the test.
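A minimal sketch of the guard (hypothetical types; the real space_source_registration wraps the sstables resource monitor):
```
#include <functional>
#include <utility>

using space_source_fn = std::function<long()>;

struct monitor { space_source_fn _space_source; };

class space_source_registration {
    monitor& _mon;
    space_source_fn _prev;
public:
    space_source_registration(monitor& m, space_source_fn fn)
        : _mon(m)
        , _prev(std::exchange(m._space_source, std::move(fn))) {}
    space_source_registration(const space_source_registration&) = delete;
    ~space_source_registration() {
        // Restore the previous function so no reference to the test's stack
        // frame outlives the test body.
        _mon._space_source = std::move(_prev);
    }
};
```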
Fixes #24314
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#24342
This commit adds the upgrade guide from version 2025.1 to 2025.2.
Also, it removes the upgrade guides existing for the previous version
that are irrelevant in 2025.2 (upgrade from OSS 6.2 and Enterprise 2024.x).
Note that the new guide does not include the "Enable Consistent Topology Updates" page,
as users upgrading to 2025.2 have consistent topology updates already enabled.
Fixes https://github.com/scylladb/scylladb/issues/24133
Fixes https://github.com/scylladb/scylladb/issues/24265
Closes scylladb/scylladb#24266
In parallelized aggregation functions, the super-coordinator (the node performing the final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually the responses are spread over time and the actual merging is atomic.
However, sometimes partial results are received at around the same time, and if an aggregate function (e.g. a Lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores its merging results in its own context and overwrites the accumulator atomically, only after it has been fully merged.
Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway.
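A schematic of the safe pattern under Seastar's cooperative scheduler (hypothetical types; the real merging function differs, as noted above). The key point is that the yield-prone work never touches the shared accumulator, whose read-modify-write contains no co_await and therefore cannot interleave with another coroutine on the shard:
```
#include <seastar/core/coroutine.hh>
#include <seastar/core/loop.hh>
#include <vector>

using namespace seastar;

struct partial {};               // hypothetical partial result
struct result {};                // hypothetical accumulator
future<result> reduce(partial);  // may yield (e.g. a Lua UDA)
result combine(result, result);  // cheap and non-yielding

future<> accumulate(result& acc, partial p) {
    // The yield-prone step runs on the coroutine's own state.
    result r = co_await reduce(std::move(p));
    // No co_await between reading and writing acc: atomic on the shard.
    acc = combine(std::move(acc), std::move(r));
}

future<result> merge_partials(std::vector<partial> partials) {
    result acc;
    co_await parallel_for_each(partials, [&acc] (partial& p) {
        return accumulate(acc, std::move(p));
    });
    co_return acc;
}
```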
Fixes #20662
Closes scylladb/scylladb#24106