In the patches that follow, we want Alternator to be able to use as a
key for a materialized view (GSI) not a real column from the schema,
but rather an attribute value deserialized from a member of the ":attrs"
map.
For this, we need the ability for a materialized view to define a key
column which is computed as a function of a real column (":attrs").
We already have an MV feature which we called "computed column"
(column_computation), but it is wholly inadequate for this job:
column_computation can only take a partition key, and produce a value -
while we need it to take a regular column (one member of ":attrs"),
not just the partition key, and return a cell - value or deletion,
timestamp and TTL.
So in this patch we introduce a new type of computed column, which we
called "regular_column_transformation" since it intends to perform some
sort of transformation on a single column (or more accurately, a single
atomic cell). The limitation that this function transforms a single
column only is important - if we had a function of multiple columns,
we wouldn't know which timestamp or ttl it should use for the result
if the two columns had different timestamps or TTLs.
The new class isn't wired to anything yet: The MV code cannot handle
it yet, and the Alternator code will not use it yet.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds the missing IndexStatus and Backfilling fields for the
GSIs listed by a DescribeTable request. These fields allow an application
to check whether a GSI has been fully built (IndexStatus=ACTIVE) or
is currently being built (IndexStatus=CREATING, Backfilling=true).
This feature is necessary when a GSI can be added to an existing table
so its backfilling might take time - and the application might want to
wait for it.
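As an illustration of how an application might wait for a GSI using these fields, here is a minimal sketch. The helper name `wait_for_gsi_active` and its parameters are hypothetical; `describe_table` is assumed to be any callable shaped like boto3's `client.describe_table(TableName=...)`, returning the standard DescribeTable response.

```python
import time

def wait_for_gsi_active(describe_table, table_name, index_name,
                        timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll DescribeTable until the named GSI reports IndexStatus=ACTIVE.

    Returns True once the index is ACTIVE, or False if the timeout
    expires first. (Hypothetical helper, not part of this patch.)
    """
    deadline = time.monotonic() + timeout
    while True:
        table = describe_table(TableName=table_name)['Table']
        for gsi in table.get('GlobalSecondaryIndexes', []):
            if gsi['IndexName'] == index_name and gsi.get('IndexStatus') == 'ACTIVE':
                return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With boto3 this could be called as `wait_for_gsi_active(client.describe_table, 'mytable', 'myindex')`.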
One test - test_gsi.py::test_gsi_describe_indexstatus - begins to pass
with this fix, so the xfail tag is removed from it.
Fixes #11471.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to Alternator's api_error type yet another type of
error, api_error::limit_exceeded (error code "LimitExceededException").
DynamoDB returns this error code in certain situations where certain
low limits were exceeded, such as the case we'll need in a following
patch - an UpdateTable that tries to create more than one GSI at once.
The LimitExceededException error type should not be confused with
other similarly-named but different error messages like
ProvisionedThroughputExceededException or RequestLimitExceeded.
In general, we make an attempt to return the same error code that
DynamoDB returns for a given error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Two new features were added to DynamoDB this month - MultiRegionConsistency
and WarmThroughput. Document them as unimplemented - and link to the
relevant issue in our bug tracker - in docs/alternator/compatibility.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For example, nodes which are being decommissioned should not be
considered as available capacity for new tables. We don't allocate
tablets on such nodes.
Counting them would result in higher per-shard load than planned.
Closes scylladb/scylladb#22657
in order to reduce the external header dependency, let's switch to
the standardized std::ranges::min_element().
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22572
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound read to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU-bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.
Fixes: #22450
Closes scylladb/scylladb#22679
One of the design goals of the Alternator test suite (test/alternator)
is that developers should be able to run the tests against some already
running installation by running `cd test/alternator; pytest [--url ...]`.
Some of our presentations and documents recommend running Alternator
via docker as:
docker run --name scylla -d -p 8000:8000 scylladb/scylla:latest
--alternator-port=8000 --alternator-write-isolation=always
This only makes port 8000 available to the host - the CQL port is
blocked. We had a bug in conftest.py's get_valid_alternator_role()
which caused it to fail (and fail every single test) when CQL is
not available. What we really want is that when CQL is not available
and we can't figure out a correct secret key to connect to Alternator,
we just try a connect with a fake key - and hope that the option
alternator-enforce-authorization is turned off. In fact, this is what
the code comments claim was already happening - but we failed to
handle the case that CQL is not available at all.
After this patch, one can run Alternator with the above docker
command, and then run tests against it.
By the way, this provides another way for running any old release of
Scylla and running Alternator tests against it. We already supported
a similar feature via test/alternator/run's "--release" option, but
its implementation doesn't use docker.
Fixes #22591
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#22592
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that the entire tombstone suffix of
the last page will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.
Fixes: #22620
Closes scylladb/scylladb#22622
The `which` command is typically not installed on cloud OS images
and so requires the user to remember to install it (or to be prompted
by a failure to install it).
Replace it with the built-in `type` that is always there. Wrap it
in a function to make it clear what it does.
Closes scylladb/scylladb#22594
Since mid December, tests started failing with ENOMEM while
submitting I/O requests.
Logs of failed tests show IO uring was used as backend, but we
never deliberately switched to IO uring. Investigation pointed
to it happening accidentally in commit 1bac6b75dc,
which turned on IO uring for allowing native tool in production,
and picked linux-aio backend explicitly when initializing Scylla.
But it missed that seastar-based tests would pick the default
backend, which is io_uring once enabled.
There's a reason we never made io_uring the default, which is
that it's not stable enough. It turns out we made the right
choice back then, as it apparently continues to be unstable,
causing flakiness in the tests.
Let's undo that accidental change in tests by explicitly
picking the linux-aio backend for seastar-based tests.
This should hopefully bring back stability.
Refs #21968.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#22695
Clean the code validating if a replication strategy can be used.
This PR consists of a bunch of unmerged https://github.com/scylladb/scylladb/pull/20088 commits - the solution to the problem that the linked PR tried to solve has been accomplished in another PR, leaving the refactor commits unmerged. The commits introduced in this PR have already been reviewed in the old PR.
No need to backport, it's just a refactor.
Closes scylladb/scylladb#22516
* github.com:scylladb/scylladb:
cql: restore validating replication strategies options
cql: change validating NetworkTopologyStrategy tags to internal_error
cql: inline abstract_replication_strategy::validate_replication_strategy
cql: clean redundant code validating replication strategy options
Currently, the session ID under which the truncate for tablets request is
running is created during the request creation and queuing. This is a problem
because this could overwrite the session ID of any ongoing operation on
system.topology#session
This change moves the creation of the session ID for truncate from the request
creation to the request handling.
Fixes #22613
Closes scylladb/scylladb#22615
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.
Fixes: #22588
Closes scylladb/scylladb#22624
As of right now, materialized views (and consequently secondary
indexes), lwt and counters are unsupported or experimental with tablets.
Since tablets are enabled by default, training cases using those
features are currently broken.
The right thing to do here is to disable tablets in those cases.
Fixes https://github.com/scylladb/scylladb/issues/22638
Closes scylladb/scylladb#22661
`validate_options` needs to be extended with a
`topology` parameter, because NetworkTopologyStrategy needs to validate that every
explicitly listed DC really exists. I did cut a corner a bit and
trimmed the message thrown when it's not the case, just to avoid passing
an extra parameter (ks name) to the `validate_options`
function, as I find the longer message to be a bit redundant (the driver will
receive info which KS modification failed).
The tests that have been commented out in the previous commit have been
restored.
The check for `replication_factor` tag in
`network_topology_strategy::validate_options` is redundant for 2 reasons:
- before we reach this part of the code, the `replication_factor` tag
is replaced with specific DC names
- we actually do allow for `replication_factor` tag in
NetworkTopologyStrategy for keyspaces that have tablets disabled.
This code is unreachable, hence changing it to an internal error, which
means this situation should never occur.
The place that unrolls `replication_factor` tag checked for presence of
this tag ignoring the case, which led to unexpected behaviour:
- `replication_factor` tag (note the lowercase) was unrolled, as
explained above,
- the same tag but written in any other case resulted in throwing a vague
message: "replication_factor is an option for SimpleStrategy, not
NetworkTopologyStrategy".
So we're changing this validation to accept and unroll only the
lowercase version of this tag. We can't ignore the case here, as this
tag is present inside a json, and json is case-sensitive, even though the
CQL itself is case insensitive.
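A minimal sketch of the case-sensitive unrolling described above. The function name, the plain-dict representation of the options, and the choice to let explicitly listed DCs keep their own value are all assumptions made for illustration; the real code is C++ and may differ.

```python
def unroll_replication_factor(options, datacenters):
    """Expand a lowercase 'replication_factor' option into one entry per DC.

    Keys are compared case-sensitively, so 'Replication_Factor' is NOT
    unrolled and is left for later validation to reject.
    Assumption: a DC listed explicitly keeps its own value.
    """
    unrolled = dict(options)
    if 'replication_factor' in unrolled:
        rf = unrolled.pop('replication_factor')
        for dc in datacenters:
            unrolled.setdefault(dc, rf)
    return unrolled
```

For example, `unroll_replication_factor({'replication_factor': '3'}, ['dc1', 'dc2'])` yields one entry per DC, while a differently-cased key passes through untouched.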
Added a test that passes for both scylla and cassandra.
Fixes: #15336
task_stats contains short info about a task. To get a list of task_stats
in the module, one needs to request /task_manager/list_module_tasks/{module}.
To make identification and navigation between tasks easier, extend
task_stats to contain shard, start_time, and end_time.
Closes scylladb/scylladb#22351
tablet_repair_task_impl is run as a part of tablet repair. Make it
a child of tablet repair virtual task.
tablet_repair_task_impl started by /storage_service/repair_async API
(vnode repair) does not have a parent, as it is the top-level task
in that case.
No backport needed; new functionality
Closes scylladb/scylladb#22372
* github.com:scylladb/scylladb:
test: add test to check tablet repair child
service: add child for tablet repair virtual task
Currently, when the tablet repair is started, info regarding
the operation is kept in the system.tablets. The new tablet states
are reflected in memory after load_topology_state is called.
Before that, the data in the table and the memory aren't consistent.
To check the supported operations, tablet_virtual_task uses in-memory
tablet_metadata. Hence, it may not see the operation, even though
its info is already kept in system.tablets table.
Run read barrier in tablet_virtual_task::contains to ensure it will
see the latest data. Add a test to check it.
Fixes: #21975.
Closes scylladb/scylladb#21995
This was originally an attempt to reduce the compile time of this
translation unit, but apparently it doesn't work. Still, it has
the effect of converting stack traces that say "set_storage_service"
and refer to some lambda to stack traces that refer to the operation
being performed, so it's a net positive.
To facilitate the change, we introduce new functions rest_bind(),
similar to (and in fact wrapping) std::bind_front(), that capture
references like the lambdas did originally. We can't use
std::bind_front directly since the call to
seastar::httpd::path_description::set() cannot be disambiguated
after the function is obscured by the template returned by
std::bind_front. The new function rest_bind() has constraints
to understand which overload is in use.
Closes scylladb/scylladb#22526
This PR enhances the internode_compression configuration option in two ways:
1. Add validation for option values
Previously, we silently defaulted to 'none' when given invalid values. Now we
explicitly validate against the three supported values (all, dc, none) and
reject invalid inputs. This provides better error messages when users
misconfigure the option.
2. Fix documentation rendering
The help text for this option previously used C++ escape sequences which
rendered incorrectly in Sphinx-generated HTML. We now use bullet points with
'*' prefix to list the available values, matching our documentation style
for other config options. This ensures consistent rendering in both CLI
and HTML outputs.
Note: The current documentation format puts type/default/liveness information
in the same bullet list as option values. This affects other config options
as well and will need to be addressed in a separate change.
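The validation described in item 1 can be sketched as follows. This is an illustrative Python model only (the actual change is in the C++ config code); the function name and the error text are hypothetical, while the three accepted values come from the description above.

```python
VALID_INTERNODE_COMPRESSION = ('all', 'dc', 'none')

def validate_internode_compression(value):
    """Reject anything other than the three supported values.

    Previously an invalid value silently behaved like 'none'; the point
    of the change is to fail loudly instead.
    """
    if value not in VALID_INTERNODE_COMPRESSION:
        raise ValueError(
            f"invalid internode_compression value {value!r}; "
            f"expected one of {', '.join(VALID_INTERNODE_COMPRESSION)}")
    return value
```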
---
this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport.
Closes scylladb/scylladb#22548
* github.com:scylladb/scylladb:
config: validate internode_compression option values
config: start available options with '*'
Bug https://bugs.python.org/issue26789 is resolved in python 3.10.
The frozen toolchain uses Python 3.12. Since it is the supported and
recommended work environment, remove the workaround and bump the
required Python version.
Closes scylladb/scylladb#22627
Following the work done in ed4bfad5c3, the action is failing with the
following error:
```
Error: Input required and not supplied: token
```
It is due to missing permissions in the workflow; add them.
Closes scylladb/scylladb#22630
tablet_repair_task_impl is run as a part of tablet repair. Make it
a child of tablet repair virtual task.
tablet_repair_task_impl started by /storage_service/repair_async API
(vnode repair) does not have a parent, as it is the top-level task
in that case.
* seastar 71036ebcc0...5b95d1d798 (3):
> rpc stream: do not abort stream queue if stream connection was closed without error
> resource: fallback to sysconf when failed to detect memory size from hwloc
> Merge 'scheduling_group: improve scheduling group creation exception safety' from Michael Litvak
scylla-gdb.py adjusted for scheduling_group_specific data structure
changes in Seastar. As part of that, a gratuitous dereference of
std::unique_ptr, which fails for std::unique_ptr<void*, ...>, was
removed.
The test expects and asserts that after wait_for_view is completed we
read the view_build_status table and get a row for each node and view.
But this is wrong because wait_for_view may have read the table on one
node, and then we query the table on a different node that didn't insert
all the rows yet, so the assert could fail.
To fix it we change the test to retry and check that eventually all
expected rows are found and then eventually removed on the same host.
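The "retry until eventually true" pattern can be sketched like this. The helper `eventually` is hypothetical and is not the helper the test itself uses; it only illustrates re-running an assertion until it passes or a deadline expires.

```python
import time

def eventually(check, timeout=30.0, interval=0.1, sleep=time.sleep):
    """Call check() repeatedly until it stops raising AssertionError.

    Re-raises the last AssertionError if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            check()
            return
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
        sleep(interval)
```

A test would wrap the query-and-assert step in a function and pass it to `eventually`, always querying the same host so that the rows seen do not depend on which node answered.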
Fixes scylladb/scylladb#22547
Closes scylladb/scylladb#22585
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.
A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.
The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.
To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
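A toy model of the loop may make the failure mode easier to see. Everything here is a simplification for illustration: tokens are small integers, `MIN_TOKEN`/`MAX_TOKEN` and the function name are made up, and the real view builder works on token ranges and reader fragments rather than a sorted list.

```python
MIN_TOKEN, MAX_TOKEN = 0, 100  # toy token ring, not real token values

def passes_until_built(partition_tokens, first_token, fixed, max_passes=5):
    """Return the pass on which the view is detected as built, or None.

    fixed=False checks 'current_token >= first_token' only when a
    partition is consumed or when wrapping to the minimum token.
    fixed=True additionally advances current_token to the end of the
    range when the reader's stream ends, and checks again.
    """
    wrapped = False
    start = first_token
    for n in range(1, max_passes + 1):
        for token in sorted(t for t in partition_tokens if t >= start):
            current = token
            if wrapped and current >= first_token:
                return n
        if fixed:
            current = MAX_TOKEN  # end of stream: simulate a partition at range end
            if wrapped and current >= first_token:
                return n
        wrapped = True
        start = current = MIN_TOKEN  # go back to the minimum token
        if current >= first_token:
            return n
    return None  # never detected within max_passes: the "infinite loop" case
```

With partitions at tokens 10 and 20 and a build that started at token 50, the unfixed model never finishes, while the fixed model finishes on the second pass.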
Fixes scylladb/scylladb#21829
Closes scylladb/scylladb#22493
Add two cqlpy tests that reproduce a bug where a secondary index query
returns more rows than the specified limit. This occurs when the indexed
column is a partition key column or the first clustering key column,
the query result spans multiple partitions, and the last partition
causes the limit to be exceeded.
`test/cqlpy/run --release ...` shows that the tests fail for Scylla
versions all the way back to 4.4.0. Older Scylla versions fail with a
syntax error in CQL query which suggests some incompatibility in the
CQL protocol. That said, this bug is not a regression.
The tests pass in Cassandra 5.0.2.
Refs #22158.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#22513
std::any_of was included by C++11, and boost::algorithm::any_of() is
provided by Boost for users stuck in the pre-C++11 era. in our case,
we've moved into C++23, where the ranges variant of this algorithm
is available.
in order to reduce the header dependency, let's switch to
`std::ranges::any_of()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22503
Materialized views with tablets are not stable yet, but we want
them available as an experimental feature, mainly for testing.
The feature was added in scylladb/scylladb#21833,
but currently it has no effect. All tests have been updated to use the
feature, so we should finally make it work.
This patch prevents users from creating materialized views in keyspaces
using tablets when the VIEWS_WITH_TABLETS feature is not enabled - such
requests will now get rejected.
Fixes scylladb/scylladb#21832
Closes scylladb/scylladb#22217
This commit addresses issue #21825, where invalid PERCENTILE values for
the `speculative_retry` setting were not properly handled, causing potential
server crashes. The valid range for PERCENTILE is between 0 and 100, as defined
in the documentation for speculative retry options, where values above 100 or
below 0 are invalid and should be rejected.
The added validation ensures that such invalid values are rejected with a clear
error message, improving system stability and user experience.
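A minimal sketch of such a range check, assuming the option is written as a number followed by `PERCENTILE` (for example `99.0PERCENTILE`). The function name and error messages are hypothetical; the actual validation lives in the C++ schema code.

```python
def parse_percentile(option):
    """Parse an 'XPERCENTILE' speculative_retry value, requiring 0 <= X <= 100."""
    suffix = 'PERCENTILE'
    text = option.strip().upper()
    if not text.endswith(suffix):
        raise ValueError(f"not a PERCENTILE value: {option!r}")
    try:
        value = float(text[:-len(suffix)])
    except ValueError:
        raise ValueError(f"invalid number in {option!r}") from None
    # NaN fails both comparisons, so it is rejected here as well.
    if not (0.0 <= value <= 100.0):
        raise ValueError(f"PERCENTILE must be between 0 and 100, got {option!r}")
    return value
```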
Fixes #21825
Closes scylladb/scylladb#21879
Moving a PR out of draft is only allowed for users with write access.
Add a GitHub action to switch the PR to `ready for review` once the
`conflicts` label is removed.
Closes scylladb/scylladb#22446
This patch adds an Alternator test for the case of UpdateItem attempting
to insert an invalid B (bytes) value into an item. Values of type B
use base64 encoding, and an attempt to insert a value which isn't
valid base64 should be rejected, and this is what this test verifies.
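To illustrate what "valid base64" means here, a strict decoder can be sketched with Python's standard library. The function name is hypothetical and this is not Alternator's implementation, only a model of the expected accept/reject behaviour.

```python
import base64

def decode_b_value(encoded):
    """Strictly decode a base64 string, rejecting anything malformed.

    validate=True makes b64decode refuse characters outside the base64
    alphabet instead of silently skipping them.
    """
    try:
        return base64.b64decode(encoded, validate=True)
    except ValueError as e:  # binascii.Error is a subclass of ValueError
        raise ValueError(f"invalid base64 in B value: {encoded!r}") from e
```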
The new tests reproduce issue #17539, which claimed we have a bug in
this area. However, test/alternator/run with the "--release" option
shows that this bug existed in Scylla 5.2, but was fixed long ago, in
5.3, and doesn't exist in master. But we never had a regression test for
this issue, so now we do.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#22029
Enabled with the tablets_rack_aware_view_pairing cluster feature,
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map.
We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
is a multiple of the number of racks and the minimum number of nodes
per rack in the dc is greater than or equal to rf / nr_racks.
In this case (that includes the single rack case), all racks would
have the same number of replicas, so we first filter all replicas
by dc and rack, retaining their ordinality in the process, and
finally, we pair between the base replicas and view replicas,
that are in the same rack, using their original order in the
tablet-map replica set.
For example, nr_racks=2, rf=4:
base_replicas = { N00, N01, N10, N11 }
view_replicas = { N11, N12, N01, N02 }
pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
Note that we don't optimize for self-pairing if it breaks pairing ordinality.
- Complex rack-aware pairing: when the replication factor is not
a multiple of nr_racks. In this case, we attempt best-match
pairing in all racks, using the minimum number of base or view replicas
in each rack (given their global ordinality), while pairing all the other
replicas, across racks, sorted by their ordinality.
For example, nr_racks=4, rf=3:
base_replicas = { N00, N10, N20 }
view_replicas = { N11, N21, N31 }
pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 }
* cross-rack pair
If we'd simply stable-sort both base and view replicas by rack,
we might end up with much worse pairing across racks:
{ N00, N11 }*, { N10, N21 }*, { N20, N31 }*
* cross-rack pair
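Both cases can be modelled with one small sketch: pair within each rack by order, then pair whatever is left over across racks, again by order. The function name and the `rack_of` callback are hypothetical, and the sketch handles a single dc only; it reproduces the two examples above but is not the actual get_view_natural_endpoint logic.

```python
from collections import defaultdict

def pair_replicas(base_replicas, view_replicas, rack_of):
    """Return a dict mapping each base replica to its paired view replica."""
    base_by_rack = defaultdict(list)
    view_by_rack = defaultdict(list)
    for node in base_replicas:          # filtering keeps the original order
        base_by_rack[rack_of(node)].append(node)
    for node in view_replicas:
        view_by_rack[rack_of(node)].append(node)

    pairing = {}
    # Same-rack pairing, limited by the smaller side in each rack.
    for rack, bases in base_by_rack.items():
        for base, view in zip(bases, view_by_rack.get(rack, [])):
            pairing[base] = view

    # Remaining replicas are paired across racks by their ordinality.
    paired_views = set(pairing.values())
    leftover_base = [n for n in base_replicas if n not in pairing]
    leftover_view = [n for n in view_replicas if n not in paired_views]
    for base, view in zip(leftover_base, leftover_view):
        pairing[base] = view
    return pairing
```

In the examples, node names have the form N<rack><index>, so `rack_of` can simply be `lambda n: n[1]`.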
Fixes scylladb/scylladb#17147
* This is an improvement so no backport is required
Closes scylladb/scylladb#21453
* github.com:scylladb/scylladb:
network_topology_strategy_test: add tablets rack_aware_view_pairing tests
view: get_view_natural_endpoint: implement rack-aware pairing for tablets
view: get_view_natural_endpoint: handle case when there are too few view replicas
view: get_view_natural_endpoint: track replica locator::nodes
locator: topology: consult local_dc_rack if node not found by host_id
locator: node: add dc and rack getters
feature_service: add tablet_rack_aware_view_pairing feature
view: get_view_natural_endpoint: refactor predicate function
view: get_view_natural_endpoint: clarify documentation
view: mutate_MV: optimize remote_endpoints filtering check
view: mutate_MV: lookup base and view erms synchronously
view: mutate_MV: calculate keyspace-dependent flags once
When a replica gets a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.
Mark schema as synced after the table is added to database::_tables_metadata.
Fixes: #22347.
Closes scylladb/scylladb#22348
If start_time/end_time is unspecified for a task, task_manager API
returns epoch. Nodetool prints the value in task status.
Fix nodetool tasks commands to print empty string for start_time/end_time
if it isn't specified.
Modify nodetool tasks status docs to show empty end_time.
Fixes: #22373.
Closes scylladb/scylladb#22370
Fixes #22401
In the fix for scylladb/scylla-enterprise#892, the extraction and check for sstable component encryption mask was copied
to a subroutine for description purposes, but a very important 1 << <value> shift was somehow
left on the floor.
Without this, the check for whether we actually contain a component encrypted can be wholly
broken for some components.
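The effect of the missing shift can be shown with a small model. The function names and the component numbering are hypothetical; the only thing taken from the description above is that the mask holds one bit per component, selected by `1 << value`.

```python
def component_is_encrypted(mask, component_type):
    """Correct check: test the bit selected by 1 << component_type."""
    return bool(mask & (1 << component_type))

def component_is_encrypted_buggy(mask, component_type):
    """Check with the shift dropped: ANDs the mask with the raw value."""
    return bool(mask & component_type)
```

With only component 0 marked as encrypted (`mask = 1 << 0`), the buggy check reports it as not encrypted; with only component 1 marked (`mask = 1 << 1`), the buggy check wrongly reports component 2 as encrypted.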
Closes scylladb/scylladb#22398