Commit Graph

48267 Commits

Author SHA1 Message Date
Michael Litvak
602fa84907 storage service: generate tablet migration updates on base tables
When writing transition updates to a tablet map we must do so on a base
table. A table that is co-located with a base table doesn't have it's
own tablet map in the tablets table, but it only points to the base
table map. By writing to the base table, the tablet migration will be
applied for the entire co-location group.

We add a small helper in storage_service that creates a tablet mutation
builder for the base table, and use it whenever we need to write tablet
mutations.
2025-07-01 13:20:18 +03:00
Michael Litvak
ddf02c9489 tablets: replace all_tables method
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.

Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, doesn't matter if they
are part of a co-located group and have a base table.

We replace all_tables with new methods that can be used for each of the
cases.
2025-07-01 13:20:18 +03:00
Michael Litvak
255ca569e3 tablets: split when all co-located tablets are ready
For a group of co-located tablets, they must be split together
atomically, so finalize tablet split only when all tablets in the group
are ready.
2025-07-01 13:20:18 +03:00
Michael Litvak
0dcb9f2ed6 tablets: load balancer: sizing plan for table groups
We update the sizing plan to work with table groups instead of single
tables, using the base table as a representative of a table group.

The resize decision is made based on the combined per-table tablet
hints, and considering the size of all tables in the group. We calculate
the average tablet size of all tablets in the group and compare it with
the target tablet size.

The target tablet size is changed to be some function of the group size,
because we may want to have a lower target tablet size when we have
multiple co-located tablets, in order to reduce the migration size.
2025-07-01 13:20:18 +03:00
Michael Litvak
ac5f4da905 tablets: load balancer: handle co-located tablets
Tablets of co-located tables are always co-located and migrated
together, so they are considered as an atomic unit for the tablets load
balancer.

We change the load balancer to work with table groups as migration
candidates instead of single tables, using the base table of a group as
a representative of the group.

For the purpose of load calculations, a group of co-located tablets is
considered like a single tablet, because their combined target tablet
sizes is the same as a single tablet.
2025-07-01 13:20:18 +03:00
Michael Litvak
3db8f6fd37 tablets: allocate co-located tablets
When allocating tablets for a new table, add the option to create a
co-located tablet map with an existing base table.

The co-located tablet map is created with the base_table value set.
2025-07-01 13:20:18 +03:00
Michael Litvak
6bed9d3cfe tablets: handle migration of co-located tablets
When handling tablet transition for a group of co-located tables,
maintain co-location by applying each transition operation (streaming,
cleanup, repair) on all tablets in the group in a synchronized way.

handle_tablet_migration is changed to work on groups of co-located
tablets instead of single tablets. Each transition step is handled by
applying its operation on all the tablets in the group.

The tablet map of co-located tablets is shared, so we need to read and
write only the tablet map of the base table.
2025-07-01 13:20:18 +03:00
Michael Litvak
11f045bb7c storage service: add repair colocated tablets rpc
add a new RPC repair_colocated_tablets which is similar to the RPC
tablet_repair, but instead of repairing a single tablet it takes a set
of co-located tablets, repairs them and returns a shared repair_time
result.

This is useful because the way co-located tablets are represented
doesn't allow to repair tablets independently but only as a group
operation, and the repair_time which is stored in the tablet map is
shared with the entire co-location group.

But when repairing a group of co-located tablets we may require a
different behavior, especially considering that co-located tablets are
derived tablets of a special type. For example, we may want to skip
running repair on CDC tablets when repairing the base table.

The new RPC and the storage service function repair_colocated_tablets
allow the flexibility to implement different strategies when repairing
co-located groups.

Currently the implementation is simply to repair each tablet and return
the minimum repair_time as the shared repair time.
2025-07-01 13:20:18 +03:00
Michael Litvak
c74cbca7cb tablets: save and read tablet metadata of co-located tables
Update the tablet metadata save and read methods to work with tablet
maps of co-located tables.

The new function colocated_tablet_map_to_mutation is used to generate a
mutation of a co-located table to system.tablets. It creates a static
row with the base_table column set with the base table id. The function
save_tablet_metadata is updated to use this function for co-located
tables.

When reading tablet metadata from the table, we handle the new case of
reading a co-located table. We store the co-located tables relationships
in the tablet_metadata_builder's `colocated_tables` map, and process it
in on_end_of_stream. The reason we defer the processing is that we want
to set all normal tablet maps first, to ensure the base tablet map is
found when we process a co-located table.
2025-07-01 10:29:59 +03:00
Michael Litvak
ddfe5dfb6b tablets: represent co-located tables in tablet metadata
Modify tablet_metadata to be able to represent co-located tables.

The new method set_colocated_table adds to tablet_metadata a table which
is co-located with another table. A co-located table shares the tablet
map object with the base table, so we just create a copy of the shared
tablet map pointer and store it as the co-located table's tablet map.

Whenever a tablet map is modified we update the pointer for all the
co-located tables accordingly, so the tablet map remains shared.

We add some data structures to tablet_metadata to be able to work with
co-located table groups efficiently:
* `_table_groups` maps every base table to all tables in its
  co-location group. This is convenient for iterating over all table
  groups, or finding all tables in some group.
* `_base_table` maps a co-located table to its base table.
2025-07-01 10:29:59 +03:00
Michael Litvak
4777444024 tablets: add base_table column to system.tablets
Add a new column base_table to the system.tablets table.

It can be set to point to another table to indicate that the tablets of
this table are co-located with the tablets of the base table.

When it's set, we don't store other tablet information in system.tablets
and in the in-memory tablet map object for this table, and we need to
refer instead to the base table tablet information. The method
get_tablet_map always returns the base tablet map.
2025-07-01 10:29:59 +03:00
Michael Litvak
4e2742a30b docs: update system.tablets schema
The schema of system.tablets in the docs is outdated. replace it with
the current schema.
2025-07-01 10:29:59 +03:00
Anna Stuchlik
b61641cf57 doc: remove support for Ubuntu 20.04
Fixes https://github.com/scylladb/scylladb/issues/24564

Closes scylladb/scylladb#24565
2025-06-30 12:33:29 +02:00
Nadav Har'El
7db5e9a3e9 test/cqlpy: reproducer for decimal parsing with very high exponent
This patch adds tests reproducing issue #24581, where Scylla incorrectly
parsed "decimal"-type literals in CQL with very high exponents, near or
above the 32-bit limit.

For example, 1.1234e-2147483647 was incorrectly read as 1.1234E+2147483649,
while it should be (as we explain in comments in the test) an error.

The tests in this patch failed (in multiple checks) before #24581 was
fixed, and pass after it was fixed.

These tests all pass on Cassandra 3, confirming our understanding on the
limits of "decimal" to be correct. But they fail on Cassandra 4 and 5 due
to a regression https://issues.apache.org/jira/browse/CASSANDRA-20723
in Cassandra, that mistakenly limited "decimal" exponents to just 309.

Refs #24581

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24646
2025-06-30 10:37:13 +03:00
Anna Stuchlik
b7683d0eba doc: remove duplicated content
This commit removes the Non-Reserved CQL Keywords and Reserved CQL Keywords pages-keyword
as that content is already covered on the Appendices page.
Redirections are added to avoid 404s for the removed pages.

In addition, the Appendices page title is extended with "Reserved CQL Keywords and Types"
to help users understand what those appendices are about.

Fixes https://github.com/scylladb/scylladb/issues/24319

Closes scylladb/scylladb#24320
2025-06-30 10:30:13 +03:00
Botond Dénes
ee6d7c6ad9 test/boost/memtable_test: only inject error for test table
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.

Fixes: https://github.com/scylladb/scylladb/issues/24637

Closes scylladb/scylladb#24638
2025-06-30 10:08:49 +03:00
Avi Kivity
07c5edcc30 tools: add patchelf utility
We use patchelf to rewrite the dynamic loader (known as the interpreter)
of the binaries we ship, so we can point to our shipped dynamic loader,
which is compatible with our binaries, rather than rely on the distribution's
dynamic loader, which is likely to be incompatible.

Upstream patchelf losing compatibity [1] with Linux 5.17 and below.
This change was also picked up by Fedora 42, so we cannot update the
toolchain to that distribution until we have an alternative.

Here we add a minimal patchelf alternative. It was mostly written by
Claude. It is minimal in that it only supports --set-interpreter and
--print-interpreter, and works well enough for our needs. We still use
the original patchelf for --remove-rpath; this reduces our maintenance
needs.

[1] 43b75fbc9f
[2] 4b015255d1

Closes scylladb/scylladb#24695
2025-06-30 07:24:05 +03:00
Avi Kivity
e2cda38b0f Merge 'alternator: improve, document and test table/index name lengths' from Nadav Har'El
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations:
 1. A table name must be up to 222 characters.
 2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters.
 3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters.

The first patch documents these existing limitations, improves their testing, and fixes a tiny bug found by one of the tests (where UpdateTable adding a GSI's limit testing is off by one).

The second patch unfortunately shows with a reproducer (issue #24598) this limit of 222 is problematic and we may need to lower it: If a user creates a table of length 222 and then enables Alternator streams, Scylla shuts down on an IO error. This will need to be fixed later, but at least this patch properly documents the existing behavior.

No need to backport this patch - it is a very minor improvement that it is unlikely users care about and there is no potential for harm.

Closes scylladb/scylladb#24597

* github.com:scylladb/scylladb:
  test/alternator: reproducer for streams bug with long table name
  alternator: improve, document and test table/index name lengths
2025-06-29 18:53:48 +03:00
Avi Kivity
b33dd2bd7d Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

Closes scylladb/scylladb#24492

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-06-29 18:18:36 +03:00
Avi Kivity
48d9f3d2e3 Merge 'mutation: check key of inserted rows' from Botond Dénes
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.

Fixes: https://github.com/scylladb/scylladb/issues/24506

Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.

Closes scylladb/scylladb#24497

* github.com:scylladb/scylladb:
  mutation: check key of inserted rows
  compound: optimize is_full() for single-component types
2025-06-29 18:10:17 +03:00
Pavel Emelyanov
ef396ecf7a api: Reserve resulting vector with schema versions
The get_schema_versions handler gets unordered_map from storage service,
then converts it to API returning type, which is a vector. This vector
can be reserved, the final number of elements is known in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24715
2025-06-29 14:37:45 +03:00
Nadav Har'El
50d370f06e test/alternator: reproducer for streams bug with long table name
The two tests in this patch reproduce issue #24598: When enabling
Alternator streams on an Alternator table with a very long name,
such as the maximum allowed name length 222, the result is an
I/O error and a Scylla shutdown.

The two tests are currently marked "skip", otherwise they would
crash the Scylla being tested.

Refs #24598

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 11:40:55 +03:00
Nadav Har'El
0ce0b2934f alternator: improve, document and test table/index name lengths
Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255
characters each, Alternator currently has different (and lower)
limitations:
 1. A table name must be up to 222 characters.
 2. For a GSI, the sum of the table's and GSI's name length, plus 1,
    must be up to 222 characters.
 3. For an LSI, the sum of the table's and LSI's name length, plus 2,
    must be up to 222 characters.

These specific limitations were never documented, so in this patch we
add this information to docs/alternator/compatibility.md.

Moreover, these limitations where only partially tested, so in this patch
we add testing for more cases that we forgot to check - such as length
of LSI names (only GSI were checked before this patch), or adding a
GSI to an existing table. It is important to check all these corner
cases because there is a risk that if we attempt to create a table
without checking its length, we can end up with an I/O error that brings
down Scylla.

In one case - UpdateTable adding a GSI to an existing table - the new
test exposed a trivial bug: Because UpdateTable wants to verify the new
GSI doesn't have the same name as an existing LSI, it mistakenly applied
the LSI's length name limit instead of the GSI's name length limit,
which is one byte less than it should be. So this patch fixes this
trivial bug as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-29 11:40:55 +03:00
Emil Maskovsky
c6307aafd5 test.py: handle cancellation gracefully to avoid TypeError
Previously, if test execution was cancelled, `run_all_tests()` could
return `None`. This caused a `TypeError` when the result was
unconditionally unpacked into `total_tests_pytest, failed_pytest_tests`.

This commit updates the code to handle the cancellation appropriately,
preventing the confusing `TypeError` exception and ensuring clean
cancellation behavior.

Closes scylladb/scylladb#24624
2025-06-27 20:14:35 +03:00
Pavel Emelyanov
23d86ede72 Merge 'audit: introduce debug level logs on happy path' from Dario Mirovic
Audit component defines `audit` logger which it uses only for `error` and `info` logs,
regarding `audit` module initialization and errors during audit log writing.
This change introduces `debug` level logs on the happy path of audit log writes.

Fixes: https://github.com/scylladb/scylladb/issues/23773

No backport needed - this is a small quality-of-life improvement.

Closes scylladb/scylladb#24658

* github.com:scylladb/scylladb:
  audit: change audit test logger level to `debug`
  audit: introduce debug level logs on happy path
2025-06-27 20:10:54 +03:00
Anna Stuchlik
2367330513 doc: remove OSS mention from the SI notes
This commit removes a confusing reference to an Open Source version
form the Local Secondary Indexes page.

Fixes https://github.com/scylladb/scylladb/issues/24668

Closes scylladb/scylladb#24673
2025-06-27 20:07:51 +03:00
Anna Stuchlik
7537f5f260 doc: fix the headings in the Admin Guide
This commit fixes incorrect headings in the Admin Guide and the files
that are included in that guide.

The purpose is to properly organize the content and improve the search,
as well as prevent potential build problems caused by a poor heading organization.

Fixes https://github.com/scylladb/scylladb/issues/24441

Closes scylladb/scylladb#24700
2025-06-27 20:07:09 +03:00
Dario Mirovic
ec6249b581 audit: change audit test logger level to debug
Audit module tests should show the `debug` level messages.
This change makes audit_test.py `audit` module log level to `debug`.

Closes scylladb/scylladb#23773
2025-06-27 16:27:33 +02:00
Dario Mirovic
666364f651 audit: introduce debug level logs on happy path
Audit component defines `audit` logger which it uses only for `error` and `info` logs,
regarding `audit` module initialization and errors during audit log writing.
This change introduces `debug` level logs on the happy path of audit log writes.

Ref: scylladb/scylladb#23773
2025-06-27 16:27:27 +02:00
Botond Dénes
495f607e73 test/cluster/test_read_repair: write 100 rows in trace test
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write a 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).

Fixes: #24330

Closes scylladb/scylladb#24403
2025-06-27 16:23:08 +03:00
Pavel Emelyanov
4c0154f156 Merge 'test.py: enhance allure reporting' from Andrei Chekun
Add run ID for process output file to be not overwritten in the next case: first run failed, second passed. They are using the same name, so the second run will overwrite and delete the file. This will help to investigate in case of C++ test fails
Add attaching Scylla log files to allure report in case test failed. This is an alternative for link in JUnit report that exists in CI. That change will help to investigate the cluster tests fails. Example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).

Backport is not needed, this is only framework enhancements

Closes scylladb/scylladb#24677

* github.com:scylladb/scylladb:
  test.py: Attach node logs in allure report in case of fail
  test.py: Add run id to the boost output file
2025-06-27 16:22:03 +03:00
Botond Dénes
e715a150b9 tools/scylla-nodetool: backup: add --move-files parameter
Allow opting in for backup to move the files instead of copying them.

Fixes: https://github.com/scylladb/scylladb/issues/24372

Closes scylladb/scylladb#24503
2025-06-27 16:21:39 +03:00
Piotr Dulikowski
9d70e7a067 Merge 'docs: document the new recovery procedure' from Patryk Jędrzejczak
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

The new recovery procedure requires the Raft-based topology to be
enabled, so to remove the old procedure from the documentation,
we must assume users have the Raft-based topology enabled.
We can do it in 2025.2 because the upgrade guides to 2025.1 state that
enabling the Raft-based topology is a mandatory step of the upgrade.
Another reminder is the upgrade guides to 2025.2.

Since we rely on the Raft-based topology being enabled, we remove the
obsolete parts of the documentation.

We will make the Raft-based topology mandatory in the code in the
future, hopefully in 2025.3. For this reason, we also don't touch the
dev docs in this PR.

Fixes scylladb/scylladb#24530

Requires backport to 2025.2 because 2025.2 contains the new recovery
procedure.

Closes scylladb/scylladb#24583

* github.com:scylladb/scylladb:
  docs: rely on the Raft-based topology being enabled
  docs: handling-node-failures: document the new recovery procedure
2025-06-26 17:07:37 +02:00
Gleb Natapov
5f953eb092 storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless if write back succeeded or not. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651
2025-06-26 17:06:02 +02:00
Andrei Chekun
2c726c5074 test.py: Attach node logs in allure report in case of fail
Currently, allure report have no nodes logs in case of fail, this will
allow to view the logs in one place without going anywhere else.
2025-06-26 15:37:33 +02:00
Piotr Dulikowski
2f7ed8b1d4 Merge 'Fix for cassandra role gets recreated after DROP ROLE' from Marcin Maliszkiewicz
This patchset fixes regression introduced by 7e749cd848 when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user.

Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.

If server is started without cluster quorum we'll skip creating role or password.

Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2

Closes scylladb/scylladb#24451

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for password reset procedure
  auth: cache roles table scan during startup
  test: auth_cluster: add test for replacing default superuser
  test: pylib: add ability to specify default authenticator during server_start
  test: pylib: allow rolling restart without waiting for cql
  auth: split auth-v2 logic for adding default superuser password
  auth: split auth-v2 logic for adding default superuser role
  auth: ldap: fix waiting for underlying role manager
  auth: wait for default role creation before starting authorizer and authenticator
2025-06-26 14:36:25 +02:00
Lakshmi Narayanan Sreethar
279253ffd0 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640
2025-06-26 15:29:28 +03:00
Patryk Jędrzejczak
203ea5d8f9 docs: rely on the Raft-based topology being enabled
In 2025.2, we don't force enabling the Raft-based topology in the code,
but we stated in the upgrade guides that it's a mandatory step of the
upgrade to 2025.1. We also remind users to enable the Raft-based
topology in the upgrade guides to 2025.2. Hence, we can rely in the
the documentation on the Raft-based topology being enabled. If it is
still disabled, we can just send the user to the upgrade guides. Hence:
- we remove all documentation related to enabling the Raft-based
  topology, enabling the Raft-based schema (enabled Raft-based topology
  implies enabled Raft-based schema), and the gossip-based topology,
- we can replace the documentation of the old manual recovery procedure
  with the documentation of the new manual recovery procedure (done in
  the previous commit).
2025-06-26 14:17:54 +02:00
Patryk Jędrzejczak
4e256182a0 docs: handling-node-failures: document the new recovery procedure
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

We can get rid of the old procedure from the documentation because
we requested users to enable the Raft-based topology during upgrades to
2025.1 and 2025.2.

We leave the note that enabling the Raft-based topology is required to
use the new recovery procedure just in case, since we didn't force
enabling the Raft-based topology in the code.
2025-06-26 14:17:50 +02:00
Andrei Chekun
156e7d2e7a test.py: Add run id to the boost output file
To avoid overwriting the output tests adding the run id to it.
Previously, when first repeat failed and the second passes, because the
are using the same name for the output, it will be overwritten and
deleted since the second repeat passed
2025-06-26 12:51:15 +02:00
Marcin Maliszkiewicz
5e7ac34822 test: auth_cluster: add test for password reset procedure 2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
0ffddce636 auth: cache roles table scan during startup
It may be particularly beneficial during connection
storms on startup. In such cases, it can happen that
none of the user's read requests succeed, preventing
the cache from being populated. This, in turn, makes
it more difficult for subsequent reads to
succeed, reducing resiliency against such storms.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
67a4bfc152 test: auth_cluster: add test for replacing default superuser
This test demonstrates creating custom superuser guide:
https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
a3bb679f49 test: pylib: add ability to specify default authenticator during server_start
Sometimes we may not want to use default cassandra role for
control connection, especially when we test dropping default role.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
d9ec746c6d test: pylib: allow rolling restart without waiting for cql
Waiting for CQL requires default superuser being present
in db. In some cases we may delete it and still want to do
rolling restart. Additionally if we need CQL we may want to
wait after restart is complete (once, and not for each node).
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
f85d73d405 auth: split auth-v2 logic for adding default superuser password
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
- old code may be less tested now so it's best to not change it
- new code path avoids quorum selects in a typical flow (passwords set)

There may be a case when user deletes a superuser or password
right before restarting a node, in such case we may ommit
updating a password but:
- this is a trade-off between quorum reads on startup
- it's far more important to not update password when it shouldn't be
- if needed password will be updated on next node restart

If there is no quorum on startup we'll skip creating password
because we can't perform any raft operation.

Additionally this fixes a problem when password is created despite
having non default superuser in auth-v2.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
2e2ba84e94 auth: split auth-v2 logic for adding default superuser role
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
  - old code may be less tested now so it's best to not change it
  - new code path avoids quorum selects in a typical flow (roles set)

This fixes a problem when superuser role is created despite
having non default superuser in auth-v2.

If there is no quorum on startup we'll skip creating role
because we can't perform any raft operation.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
c96c5bfef5 auth: ldap: fix waiting for underlying role manager
ldap_role_manager depends on standard_role_manager,
therefore it needs to wait for superuser initialization.
If this is missing, the password authenticator will start
checking the default password too early and may fail to
create the default password if there is no default
role yet.

Currently password authenticator will create password
together with the role in such case but in following
commits we want to separate those responsibilities correctly.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
68fc4c6d61 auth: wait for default role creation before starting authorizer and authenticator
There is a hidden dependency: the creation of the default superuser role
is split between the password authenticator and the role manager.
To work correctly, they must start in the right order: role manager first,
then password authenticator.
2025-06-26 12:28:08 +02:00
Piotr Dulikowski
62efe6616a Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is <host, shard> pair) processes two tablets at a time.

The new algorithm can be summarized as follows:
 1. Prepare a tablet_replica -> partition_range mapping where the values     cover the entire space.
 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each
artition range (in such cases, the partition range is assigned to the appropriate tablet_replica).

In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.

In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.

In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.

Additionally, shard_id is added to mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensure that the balance is preserved, even during events such as tablet splits.

This patch series:
 - Refactors a bit mapreduce service, to facilitate having two algorithm versions (one for vnodes and one for tablets).
 - Implements tablet-aware dispatching algorithm.
 - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
 - Adds test_long_query_timeout_erm to verify the new functionality.

Fixes: scylladb#21831

No backport, as it is rather new feature than a bugfix.

Closes scylladb/scylladb#24383

* github.com:scylladb/scylladb:
  mapreduce: add missing comma and space in mapreduce_request operator<<
  mapreduce: add shard_id_hint to mapreduce request
  test: add test_long_query_timeout_erm
  mapreduce: add tablet-aware dispatching algorithm
  storage_proxy: make storage_proxy::is_alive public
  mapreduce: remove _shared_token_metadata from mapreduce_service
  mapreduce: move dispatching logic to dispatch_to_vnodes
  mapreduce: remove underscores from variable names
  mapreduce: move req_with_modified_pr handling to a new function
  mapreduce: change next_vnode lambda to get_next_partition_range function
2025-06-26 12:25:39 +02:00