The `generate_unavailable_localhost_port` function is not robust because it
can suffer from a race condition. It finds an available port but does not
keep it occupied, meaning another process could bind to it before the test
can use it.
The `unavailable_server` helper is a more robust solution. It creates a
server that listens on a port for its entire lifetime and immediately
closes any incoming connections. This guarantees the port remains
unavailable, making the test more reliable.
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.
The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.
Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
- new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.
Remove nightly mark.
Fixes: #26148.
Closesscylladb/scylladb#26209
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.
Enabled only for rf_rack_valid_keyspaces because such setups guarantee
that we will not need (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.
Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to reballance the cluster (plan making
only, sum of all iterations, no streaming):
Before: 16 min 25 s
After: 0 min 25 s
Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):
testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)
The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.
After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.
Fixes#26016Closesscylladb/scylladb#26017
* github.com:scylladb/scylladb:
test: perf: perf-load-balancing: Add parallel-scaleout scenario
test: perf: perf-load-balancing: Convert to tool_app_template
tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.
It may also surprise consumers of the plan, like we saw in #25912.
So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.
Fixes#26038Closesscylladb/scylladb#26048
`test_long_query_timeout_erm` is slow because it has many parameterized
variants, and it verifies timeout behavior during ERM operations, which
are slow by nature.
This change speeds the test up by roughly 3× (319s -> 114s) by:
- Removing two of the five scenarios that were near duplicates.
- Shortening timeout values to reduce waiting time.
- Parallelizing waiting on server_log with asyncio.TaskGroup().
The two removed scenarios (`("SELECT", True, False)`,
`("SELECT_WHERE", True, False)`) were near duplicates to
`("SELECT_COUNT_WHERE", True, False)` scenario, because all three
scenarios use non-mapreduce query and triggers basically the same
system behavior. It is sufficient to keep only one of them, so the test
verifies three cases:
- One with nodes shutdown
- One with mapreduce query
- One with non-mapreduce query
Fixes: scylladb/scylla#24127Closesscylladb/scylladb#25987
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.
This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.
This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.
This patch also includes a test for this exact scenario, which is enforced by an injection.
Fixes https://github.com/scylladb/scylladb/issues/21492Closesscylladb/scylladb#24396
* github.com:scylladb/scylladb:
mv: handle mismatched base/view replica count caused by RF change
mv: save the nodes used for pairing calculations for later reuse
mv: move the decision about simple rack-aware pairing later
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.
To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.
A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.
Fixesscylladb/scylladb#24806Closesscylladb/scylladb#26057
Simulates reblancing on a single scale-out involving simultaneous
addition of multiple nodes per rack.
Default parameters create a cluster with 2 racks, 70 tables, 256
tablets/table, 10 nodes, 88 shards/node.
Adds 6 nodes in parallel (3 per rack).
Current result on my laptop:
testlog - Rebalance took 21.874 [s] after 82 iteration(s)
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.
Enabled only for rf_rack_valid_keyspaces because such setups guarantee
that we will not need (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.
Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):
Before: 16 min 25 s
After: 0 min 25 s
Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):
Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)
The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.
After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.
Fixes#26016
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).
All three tests perform two rolling restarts, but the latter ones
already have the workaround.
Fixes#26005Closesscylladb/scylladb#26056
`line_to_row` is a test function that converts `syslog` audit log to
the format of `table` audit log so tests can use the same checks
for both types of audit. Because `syslog` audit doesn't have `date`
information, the field was filled with the current date. This behavior
broke the tests running at 23:59:59 because `line_to_row` returned
different results on different days.
Fixes: scylladb/scylladb#25509Closesscylladb/scylladb#26101
Issue #26079 noted that multiple Alternator tests are failing when run against DynamoDB. This pull request fixes many of them, in several small patches. In one case we need to avoid a DynamoDB bug that wasn't even the point of the original test (and we create a new test specifically for that DynamoDB bug). Another test exposed a real incompatibility with Alternator (#26103) but didn't need to be exposed in this specific test so again we split the test to one that passes, and another one that xfails on Alternator (not on DynamoDB). A bigger changed had to be done to the tags feature test - since August 2024, the TagResource operation became asynchronous which broke our tests, so we fix this.
Each of these changes are described in more detail in the individual patches.
Refs #26079. It doesn't fix it completely because there are some tests which remain flaky, and some tests which, surprisingly, pass on us-east-1 but fail on eu-north-1. We'll need to address the rest later.
No backports needed, we only run tests against DynamDB from master (when we rarely do...), not on old branches.
Closesscylladb/scylladb#26114
* github.com:scylladb/scylladb:
test/alternator: fix test_list_tables_paginated on DynamoDB
test/alternator: fix tests in test_tag.py on DynamoDB
test/alternator: fix test_health_only_works_for_root_path on DynamoDB
test/alternator: reproducer tests for faux GSI range key problem
test/alternator: fix test "test_17119a" to pass on DynamoDB
test/alternator: fix test to pass on DynamoDB
The boost/test/included/... directory is apparently internal and not
intended for user consumption.
Including it caused a One-Definition-Rule violation, due to
boost/test/impl/unit_test_parameters.ipp containing code like this:
```c++
namespace runtime_config {
// UTF parameters
std::string btrt_auto_start_dbg = "auto_start_dbg";
std::string btrt_break_exec_path = "break_exec_path";
std::string btrt_build_info = "build_info";
std::string btrt_catch_sys_errors = "catch_system_errors";
std::string btrt_color_output = "color_output";
std::string btrt_detect_fp_except = "detect_fp_exceptions";
std::string btrt_detect_mem_leaks = "detect_memory_leaks";
std::string btrt_list_content = "list_content";
```
This is defining variables in a header, and so can (and in fact does)
create duplicate variable definitions, which later cause trouble.
So far, we were protected from this trouble by -fvisibility=hidden, which
caused the duplicate definitions to be in fact not duplicate.
Fix this by correcting the include path away from <boost/test/included/>.
Closesscylladb/scylladb#26161
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.
This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.
This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.
This patch also includes a test for this exact scenario, which is enforced by an injection.
Fixes https://github.com/scylladb/scylladb/issues/21492
Vector search related implementation moved to a new module vector_search.
As the vector search functionality is going to be extended, it is better to keep it in a separate module.
The DNS resolution logic and its background task are moved out of the `vector_store_client` and into a new, dedicated class `vector_search::dns`.
This refactoring is the first step towards supporting DNS hostnames that resolve to multiple IP addresses.
References: VECTOR-187
No backport needed as this is refactoring.
Closesscylladb/scylladb#26052
* github.com:scylladb/scylladb:
vector_store_client_test: Verify DNS is not refreshed when disabled
vector_store_client: Extract DNS logic into a dedicated class
vector_search: Apply clang-format
vector_store_client: Move to vector_search module
Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with
CDC enabled fails with an error if the table already exists. Instead,
the query should succeed and be a no-op.
This regression was introduced by commit fed1048059. Previously, when
executing the query, we would first check if the table exists in
do_prepare_new_column_families_announcement. If it did, we would throw
an already_exists_exception, which was handled correctly; otherwise, we
would continue and create the CDC table in the
before_create_column_families notification.
The order of operations was changed in fed1048059, causing the
regression. Now, we first create the CDC schema and add it to the schema
list for creation, and then check for each of them if they already
exist. The problem is that when we create the CDC schema in
on_pre_create_column_families, it also checks if the CDC table already
exists. If it does, it throws an invalid_request_exception, which is not
caught and handled as expected.
This patch restores the previous order of operations: we first check if
the tables exist, and only then add the CDC schema in pre_create.
Fixes https://github.com/scylladb/scylladb/issues/26142
no backport - recent regression, not released yet
Closesscylladb/scylladb#26155
* github.com:scylladb/scylladb:
test: add test for creating table with CDC enabled if not exists
cdc: fix create table with cdc if not exists
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixes https://github.com/scylladb/scylladb/issues/22989
backport not needed - the issue is probably not common and there's a workaround
Closesscylladb/scylladb#25790
* github.com:scylladb/scylladb:
test: mv: add a test for view build interrupt during registration
view_builder: register view on all shards atomically
Vector search related implementation moved to a new module vector_search.
As the vector search functionality is going to be extended, it is
better to keep it in a separate module.
In 789a4a1ce7, we adjusted the test file
to work with the configuration option `rf_rack_valid_keyspaces`. Part of
the commit was making the two tables used in the test replicate in
separate data centers.
Unfortunately, that destroyed the point of the test because the tables
no longer competed for resources. We fix that by enforcing the same
replication factor for both tables.
We still accept different values of replication factor when provided
manually by the user (by `--rf1` and `--rf2` commandline options). Scylla
won't allow for creating RF-rack-invalid keyspaces, but there's no reason
to take away the flexibility the user of the test already has.
Fixesscylladb/scylladb#26026Closesscylladb/scylladb#26115
In multi DC setup, tablet load balancer might generate multiple migrations of the same tablet_id but only one is actually commited to the `system.tablets` table.
This PR moved abortion of view building tasks from the same start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started.
The PR also includes a reproducer test.
Fixesscylladb/scylladb#25912
View building coordinator hasn't been released yet, so no backport is needed.
Closesscylladb/scylladb#26144
* github.com:scylladb/scylladb:
test/test_view_building_coordinator: add reproducer
topology_coordinator: abort view building a bit later in case of tablet migration
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.
This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.
This change must be included before the next release.
Otherwise, users will be affected by a breaking change.
References: VECTOR-187
Closesscylladb/scylladb#26033
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.
The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.
One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.
There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.
By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
Fixesscylladb/scylladb#22989
This patch corrects the column name formatting whenever
an "Undefined column name" exception is thrown.
Until now we used the `name()` function which
returns a bytes object. This resulted in a message
with a garbled ascii bytes column name instead of
a proper string. We switch to the `text()` function
that returns a sstring instead, making the message
readable.
Tests are adjusted to confirm this behavior.
Fixes: VECTOR-228
Closesscylladb/scylladb#26120
Adds a test which reproduces the issue described
in scylladb/scylladb#25912.
The test creates a situation where a single tablet is replicated across
multiple DCs / racks, and all those tablet replicas are eligible for
migration. The tablet load balancer is unpaused at that moment which
currently causes it to attempt to generate multiple migrations for
different tablet replicas of the same tablet. Before the fix for #25912,
this used to confuse the view build coordinator which would react to
each migration attempt, pausing view building work for each tablet
replica for which there was an attempt to migrate but only unpausing it
for the tablet replica that was actually migrated. After the fix, the
view build coordinator only reacts to the migration that has "won" so
the test successfully passes.
Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26136
test_streaming_deadlock_removenode starts 10240 inserts at once,
overloading a node. Due to that test fails with timeout.
Limit inserts concurrency.
Fixes: #25945.
Closesscylladb/scylladb#26102
when cluster repair is run for an entire keyspace, nodetool makes a
repair api request for each table.
if the keyspace contains colocated tables, then the api request for the
colocated tables will fail, because currently scylla doesn't allow making
repair requests for specific colocated tables, but only for base tables.
if the request is to repair an entire keyspace then we can ignore this,
because we will make a repair request for all base tables, and this in
turn will repair also all the colocated tables in the keyspace.
however if specific tables are requested and some of them are colocated
then we should propagate the error to let the user know the request is
invalid.
Refs https://github.com/scylladb/scylladb/issues/24816
no backport - no colocated tablets in previous releases
Closesscylladb/scylladb#26051
* github.com:scylladb/scylladb:
nodetool: ignore repair request error of colocated tables
storage_service: improve error message on repair of colocated tables
Previously, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, we use top-level computed columns instead of regular ones.
This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all attributes in a row except for keys in new tables.
Refs https://github.com/scylladb/scylladb/pull/24991
Refs https://github.com/scylladb/scylladb/issues/6930Closesscylladb/scylladb#25796
* github.com:scylladb/scylladb:
alternator: Store LSI keys in :attrs for newly created tables
alternator/test: Add LSI tests based mostly on the existing GSI tests
Alternator, when creating gsi, adds artificially columns, that user
had not ask for. This patch prevents those columns from showing up in
DescribeTable's output.
Fixes#5320Closesscylladb/scylladb#25978
In the current scenario, the shard receiving the remove node REST api request performs condional lock depending on whether raft is enabled or not. Since non-zero shard returns false for `raft_topology_change_enabled()`, the requests routed to non zero shards are prone to this lock which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft enabled nodes.
This pr modifies the conditional lock logic and orchestrates the remove node execution logic directly to the shard0, hence the `raft_topology_change_enabled()` is now checked on the shard0 and execution is performed accordingly.
Earlier, `storage_service::find_raft_nodes_from_hoeps` code threw error upon observing any non topology member present in ignore_nodes. Since we are performing concurrent remove node operations, the timing can lead to one node being fully removed before the other node remove op begins processing which can lead to runtime error in storage_service::find_raft_nodes_from_hoeps. This error throw was added to prevent users from putting random non existent nodes in ignore_nodes list. Hence made changes in that function to account for already removed nodes and ignore those nodes instead of throwing error.
A test is also added to confirm the new behaviour, where concurrent remove node operations are now being performed seamlessly.
This pr doesn't fix a critical bug. No need to backport it.
Fixes: scylladb/scylladb#24737Closesscylladb/scylladb#25713
* https://github.com/scylladb/scylladb:
raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters.
storage_service: remove assumptions and checks for ignore_nodes to be normal.
Our list_tables() utility function, used by the test
test_table.py::test_list_tables_paginated, asserts that empty pages
cannot be returned by ListTables - and in fact neither DynamoDB nor
Alternator returns them. But it turns out this is only true on
DynamoDB's us-east-1 region, and in the eu-north-1 region, ListTables
when using Limit=1 can actually return an empty last page.
So let's just drop that unnecessary assertion as being wrong. In any
case, this assert was in a utility function, not a test, which probably
wasn't a great idea in the first place.
Until August 2024, DynamoDB's "TagResource" operation was synchronous -
when it returned the tags were available for read. This is no longer
true, as the new documentation says and we see in practice with many
test_tag.py failing on DynamoDB. Not only do we can't read the new tags
without waiting, we're not allowed to change other tags or even to
delete the table without waiting.
We don't need to fix Alternator for this new behavior - there is
(surprisingly!) no new API to check if the tag change took affect,
and it's perfectly fine that in Alternator the tags take affect
immediately (when TagResource returns) and not a few seconds later.
But we don't need to fix most test_tag.py tests to work with the
new asynchronous API.
The fix introduces convenience functions tag_resource() and
untag_resource() which performs the TagResource or UntagResource
operation, but also waits until the change took affect by trying
ListTagsOfResources until the desired change took affect. This
will make failed tests wait until the timeout (60 seconds), but
that's fine - we don't expect to have failed test.
After this test all tests in test/altrnator/test_tag.py pass on
DynamoDB (and continue passing on Alternator).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When the monitor is started, the first disk utilization value is
obtained from the actual host filesystem and not from the fake
space source function.
Thus, register a fake space source function before the monitor
is started.
Fixes: https://github.com/scylladb/scylladb/issues/26036
Backport is not required. The test has been added recently.
Closesscylladb/scylladb#26054
initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136Closesscylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
test_health.py::test_health_only_works_for_root_path checks that while
http://ourserver/ is a valid health-check URL, taking other silly
strings at the end, like http://ourserver/abc - is NOT valid and results
in an error.
It turns out that for one of the silly strings we chose to test,
"/health", DynamoDB started recently NOT to return an error, and
instead return an empty but successful response. In fact, it does this
for every string starting with /health - including "/healthz". Perhaps
they did this for some sort of Kubernetes compatibility, but in any
case this behavior isn't documented and we don't need to emulate it.
For now, let's just remove the string "/health" from our test so the
test doesn't fail on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In issue #5320 we noticed that when we have a GSI with a hash key only
(no range key) but the implementation's MV needs to add a clustering
key for the original base key columns, the output of DescribeTable
wrongly lists that exta "range key" - which isn't a real range key of
the GSI.
It turns out that the fact that the extra attribute is not a real GSI
range key has another implication: It should not be allowed in
KeyConditions or KeyConditionExpression - which should allow only real
key columns of the GSI.
This patch adds two reproducing tests for this issue (issue #2601),
both pass on DynamoDB but xfail on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: introducing the new components, Partitions.db and Rows.db
This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller.
This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details.
The changes are for the sake of new and unreleased code. No backporting should be done.
Closesscylladb/scylladb#26075
* github.com:scylladb/scylladb:
sstables/mx/reader: remove mx::make_reader_with_index_reader
test/boost/bti_index_test: fix indentation
sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
sstables/trie: support reader_permit and trace_state properly
sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached
sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
sstables/trie/bti_index_reader: support BYPASS CACHE
test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate
sstables/trie: change the signature of bti_partition_index_writer::finish
sstables/bti_index: improve signatures of special member functions in index writers
streaming/stream_transfer_task: coroutinize `estimate_partitions()`
types/comparable_bytes: add a missing implementation for date_type_impl
sstables: remove an outdated FIXME
storage_service: delete `get_splits()`
sstables/trie: fix some comment typos in bti_index_reader.cc
sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
As noticed in issue #26079 the Alternator test test_gsi.py::test_17119a
fails on DynamoDB. The problem was that the test added to KeyConditions
reading from a GSI an unnecessary attribute - one which was accidentally
allowed by Alternator (Refs #26103) but not allowed by DynamoDB.
This is easy to fix - just remove the unnecessary attribute from
KeyConditions, and the test still works properly and passes on both
DynamoDB and Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
when cluster repair is run for an entire keyspace, nodetool makes a
repair api request for each table.
if the keyspace contains colocated tables, then the api request for the
colocated tables will fail, because currently scylla doesn't allow making
repair requests for specific colocated tables, but only for base tables.
if the request is to repair an entire keyspace then we can ignore this,
because we will make a repair request for all base tables, and this in
turn will repair also all the colocated tables in the keyspace.
however if specific tables are requested and some of them are colocated
then we should propagate the error to let the user know the request is
invalid.
Refs scylladb/scylladb#24816
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411
Before this change, new connections were handled in a default
scheduling group (`main`), because before the user is authenticated
we do not know which service level should be used. With the new
`sl:driver` service level, creation of new connections can be moved to
`sl:driver`.
We switch the service level as early as possible, in `do_accepts`.
There is a possibility, that `sl:driver` will not exist yet, for
instance, in specific upgrade cases, or if it was removed. Therefore,
we also switch to `sl:driver` after a connection is accepted.
Refs: scylladb/scylladb#24411