Fixes: #25045
added the ability to supply the list of files to
restore from the a given file.
mainly required for local testing.
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
Closesscylladb/scylladb#25077
Currently when refreshing submodule, the script puts a plain list of
non-merge commits into commit message. The resulting summary contains
everything, but is hard to understand. E.g. if updating seastar today
the summary would start with
* seastar 26badcb1...86c4893b (55):
> util: make SEASTAR_ASSERT() failure generate SIGABRT
> core: fix high CPU use at idle on high core count machines
> http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
> reactor: Relax friendship with file_data_source_impl
> fstream: Use direct io_stats reference
> thread_pool: Relax coupling with reactor
> reactor: Mark some IO classes management methods private
> http: Deprecate json_exception
> fair_queue: Move io_throttler to io_queue.hh
> fair_queue: Move metrics from to io_queue::stream
> fair_queue: Remove io_throttler from tests
> fair_queue_test: Remove io-throttler from fair-queue
> fair_queue: Remove capacity getters
> fair_queue: Move grab_result into io_queue::stream too
> fair_queue: Move throtting code to io_queue.cc
> fair_queue: Move throttling code to io_queue::stream class
> fair_queue: Open-code dispatch_requests() into users
> fair_queue: Split dispatch_requests() into top() and pop_front()
> fair_queue: Swap class push back and dispatch
> fair_queue: Configure forgiving factor externally
...
That's not very informative, because the update includes several large
"merges" that have their summary which is missing here. This update
changes the way summary is generated to include merges and their
summaries and all merged commits are listed as sub-lines, like this
* seastar 26badcb1...86c4893b (26):
> util: make SEASTAR_ASSERT() failure generate SIGABRT
> core: fix high CPU use at idle on high core count machines
> Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov
fair_queue: Move io_throttler to io_queue.hh
fair_queue: Move metrics from to io_queue::stream
fair_queue: Remove io_throttler from tests
fair_queue_test: Remove io-throttler from fair-queue
fair_queue: Remove capacity getters
fair_queue: Move grab_result into io_queue::stream too
fair_queue: Move throtting code to io_queue.cc
fair_queue: Move throttling code to io_queue::stream class
fair_queue: Open-code dispatch_requests() into users
fair_queue: Split dispatch_requests() into top() and pop_front()
fair_queue: Swap class push back and dispatch
fair_queue: Configure forgiving factor externally
fair_queue: Move replenisher kick to dispatch caller
io_queue: Introduce io_queue::stream
fair_queue: Merge two grab_capacity overloads
fair_queue: Detatch outcoming capacity grabbing from main dispatch loop
fair_queue: Move available tokens update into if branch
io_queue: Rename make_fair_group_config into configure_throttler
io_queue: Rename get_fair_group into get_throttler
fair_queue: Rename fair_group -> io_throttler
> http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
> Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov
reactor: Relax friendship with file_data_source_impl
fstream: Use direct io_stats reference
> thread_pool: Relax coupling with reactor
> reactor: Mark some IO classes management methods private
...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24834
Do not use default, instead list all fall-through components
explicitly, so if we add a new one, the developer doing that
will be forced to consider what to do here.
Eliminate the `default` case from the switch in
`encryption_file_io_extension::wrap_sink`, and explicitly
handle all `component_type` values within the switch statement.
fixes: https://github.com/scylladb/scylladb/issues/23724Closesscylladb/scylladb#24987
As requested in #22114, moved the files and fixed other includes and build system.
Moved files:
- interval.hh
- Map_difference.hh
Fixes: #22114
This is a cleanup, no need to backport
Closesscylladb/scylladb#25095
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.
In this commit, we're restricting those operations. We also provide two
validation tests.
One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.
Fixesscylladb/scylladb#24643
Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.
Closesscylladb/scylladb#25008
* github.com:scylladb/scylladb:
cdc: Forbid altering columns of inactive CDC log table
cdc: Forbid altering columns of CDC log tables directly
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.
This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.
There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.
Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.
In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.
Testing
Build: `release`
Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)
Test arguments:
```
max_attempts=100'000
num_parallel=10
```
Throwing `protocol_exception` results:
```
real=1:26.97 user=10:00.27 sys=2:34.55 cpu=867%
real=1:26.95 user=9:57.10 sys=2:32.50 cpu=862%
real=1:26.93 user=9:56.54 sys=2:35.59 cpu=865%
real=1:26.96 user=9:54.95 sys=2:32.33 cpu=859%
real=1:26.96 user=9:53.39 sys=2:33.58 cpu=859%
real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862% # average
```
Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46 user=9:12.21 sys=2:19.08 cpu=881%
real=1:18.44 user=9:04.03 sys=2:17.91 cpu=869%
real=1:18.47 user=9:12.94 sys=2:19.68 cpu=882%
real=1:18.49 user=9:13.60 sys=2:19.88 cpu=883%
real=1:18.48 user=9:11.76 sys=2:17.32 cpu=878%
real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879% # average
```
This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.
Refs: #24567
This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.
Closesscylladb/scylladb#24738
* github.com:scylladb/scylladb:
test/cqlpy: add cpp exception metric test conditions
transport/server: replace protocol_exception throws with returns
utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
transport/server: avoid exception-throw overhead in handle_error
test/cqlpy: add protocol_exception tests
When CDC becomes disabled on the base table, the CDC log table
still exsits (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old olg table will be *re-attached*.
Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.
Validation tests are provided.
Quit from the repeats if the test is under the pytest runner directory and has some typos or is absent. This allows not going several times through the discovery and stopping execution.
Print a warning at the end of the run when no tests were selected by provided name.
Fixes: scylladb/scylladb#24892Closesscylladb/scylladb#24918
* github.com:scylladb/scylladb:
test.py: print warning in case no tests were found
test.py: break the loop when there is no tests for pytest
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.
When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.
However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.
If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.
Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
while it's being dropped.
Both cases might as well fail with an error because the column is not
found in the base table.
Fixesscylladb/scylladb#24952
backport needed - simple fix for a node crash
Closesscylladb/scylladb#24986
* github.com:scylladb/scylladb:
test: cdc: add test_cdc_with_alter
cdc: throw error if column doesn't exist
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.
When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.
However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.
If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.
Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
while it's being dropped.
Both cases might as well fail with an error because the column is not
found in the base table.
Fixesscylladb/scylladb#24952
Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions`
metric is used. This is a global metric. To address potential test flakiness,
each test runs multiple times:
- `run_count = 100`
- `cpp_exception_threshold = 10`
If a change in the code introduced an exception, expectation is that the number
of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs.
In which case the test fails.
Replace throwing protocol_exception with returning it as a result
or an exceptional future in the transport server module. This
improves performance, for example during connection storms and
server restarts, where protocol exceptions are more frequent.
In functions already returning a future, protocol exceptions are
propagated using an exceptional future. In functions not already
returning a future, result_with_exception is used.
Notable change is checking v.failed() before calling v.get() in
process_request function, to avoid throwing in case of an
exceptional future.
Refs: #24567
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.
Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.
This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.
Refs: #24567
Previously, connection::handle_error always called f.get() inside a try/catch,
forcing every failed future to throw and immediately catch an exception just to
classify it. This change eliminates that extra throw/catch cycle by first checking
f.failed(), getting the stored std::exception_ptr via f.get_exception(), and
then dispatching on its type via utils::try_catch<T>(eptr).
The error-response logic is not changed - cassandra_exception, std::exception,
and unknown exceptions are caught and processed, and any exceptions thrown by
write_response while handling those exceptions continues to escape handle_error.
Refs: #24567
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.
Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.
The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch which tests both supported
and unsupported protocol version
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets
Refs: #24567
This reverts commit 45f5efb9ba.
The load_and_repair_paxos_state function was introduced in
scylladb/scylladb#24478, but it has never been tested or proven useful.
One set of problems stems from its use of local data structures
from a remote shard. In particular, system_keyspace and schema_ptr
cannot be directly accessed from another shard — doing so is a bug.
More importantly, load_paxos_state on different shards can't ever
return different values. The actual shard from which data is read is
determined by sharder.shard_for_reads, and storage_proxy will jump
back to the appropriate shard if the current one doesn't match. This
means load_and_repair_paxos_state can't observe paxos state from
write-but-not-read shard, and therefore will never be able to
repair anything.
We believe this explicit Paxos state read-repair is not needed at all.
Any paxos state read which drives some paxos round forward is already
accompanied by a paxos state write. Suppose we wrote the state to the
old shard but not to the new shard (because of some error) while
streaming is already finished. The RPC call (prepare or accept) will
return error to the coordinator, such replica response won't affect
the current round. This write won't affect any subsequent paxos rounds
either, unless in those rounds the write actually succeeds on both
shards, effectively 'auto-repairing' paxos state.
Same if we managed to write to the new shard but not to the old shard.
Any subsequent reads will observe either the old state or the new
state (if the tablet already switched reads to the new shard). In any
case, we'll have to write the state to all relevant shards
from sharder.shard_for_writes (one or two) before sending rpc
response, making this state visible for all subsequent reads.
Thus, the monotonicity property ("once observed, the state must always
be observed") appears to hold without requiring explicit read-repair
and load_and_repair_paxos_state is not needed.
Closesscylladb/scylladb#24926
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).
In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface.
Later, we will add BTI indexes which will also implement this interface.
This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.
Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.
No backports needed, this is a preparation for new functionality.
Closesscylladb/scylladb#25000
* github.com:scylladb/scylladb:
sstables: add sstable::make_index_reader() and use where appropriate
sstables/mx: in readers, use abstract_index_reader instead of index_reader
sstables: in validate(), use abstract_index_reader instead of index_reader where possible
test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
sstables/index_reader: introduce abstract_index_reader
sstables/index_reader: extract a prefetch_lower_bound() method
This PR adds a way for custom indexes to decide whether a view should be created for them, as for the vector_index the view is not needed, because we store it in the external service. To allow this, custom logic for describing indexes using custom classes was added (as it used to depend on the view corresponding to an index).
Fixes: VECTOR-10
Closesscylladb/scylladb#24438
* github.com:scylladb/scylladb:
custom_index: do not create view when creating a custom index
custom_index: refactor describe for custom indexes
custom_index: remove unneeded duplicate of a static string
If we add multiple index implementations, users of index readers won't
easily know which concrete index reader type is the right one to construct.
We also don't want pieces of code to depend on functionality specific to
certain concrete types, if that's not necessary.
So instead of constructing the readers by themselves, they can use a helper
function, which will return an abstract (virtual) index reader.
This patch adds such a function, as a method of `sstable`.
After we add a second index implementation, we will probably want to
adjust validate() to work with either implementation.
Some validations will be format-specific, but some will be common.
For now, let's use abstract_index_reader for the validations which
can be done through that interface, and let's have downcast-specific
codepaths for the others.
Note: we change a `get_data_file_position()` call to `data_file_positions().start`.
The call happens at the beginning of a partition, and at this points
these two expressions are supposed to be equivalent.
We don't want tests to create the concrete `index_reader` directly. We
would like them to be able to test both sstables which use
`index_reader`, and those which will use the planned new index implementation.
So we will let the tests construct an abstract_index_reader and pass it
to the index_reader_assertions, which will be able to assert the requested
properties on various implementations as it wants.
We want to implement BTI indexes in Scylla.
After we do that, some sstables will use a BTI index reader,
while others will use the old BIG index reader.
To handle that, we can expose a common virtual "index reader"
interface to sstable readers. This is what this patch does.
This interface can't be quite fully implemented by a BTI index,
because some methods returns keys which a BIG index stores,
but a BTI index doesn't. So it will be further restricted in future
patches. But for now, we only extract *all* methods currently
used by the readers to a virtual interface.
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.
The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches.
Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.
In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running
test/alternator/run --runveryslow \
test_query.py::test_query_large_page_small_rows
reports in the log:
oversized allocation: 573440 bytes.
After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.
The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.
Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.
Fixes#23535
The stalls caused by large allocations was seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the nomal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport.
Closesscylladb/scylladb#24480
* github.com:scylladb/scylladb:
alternator: clean up by co-routinizing
alternator: avoid spamming the log when failing to write response
alternator: clean up and simplify request_return_type
alternator: avoid oversized allocation in Query/Scan
Fixes#24998
Helper routine translating input_stream buffers to single lines
did not loop over current buffer state, leading to only the first
line being sent to end listener.
Rewrote to use range iteration instead. Nicer.
Closesscylladb/scylladb#24999
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes#24807Fixes#24925Closesscylladb/scylladb#24929
* github.com:scylladb/scylladb:
streaming: Avoid deadlock by running view checks in a separate scheduling group
service: migration_manager: Run group0 barrier in gossip scheduling group
repair: Speed up ranges calculation when small table optimization is on
Normally, during bootstrap, in repair_service::bootstrap_with_repair, we
need to calculate which range to sync data from carefully for the new
node. With small table optimization on, we pass a single full range and
all peer nodes to row level repair to sync data with. Now that we only
need to pass a single range and full peers, there is no need to calculate
the ranges and peers in repair_service::bootstrap_with_repair and drop
it later. The calculation takes time which slows down bootstrap, e.g.,
```
Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - bootstrap_with_repair: started with
keyspace=system_distributed_everywhere, nr_ranges=23809
Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]:
sync data for keyspace=system_distributed_everywhere, status=started,
reason=bootstrap, small_table_optimization=true
```
The range calculation took 15 seconds for system_distributed_everywhere
table.
To fix, the ranges calculation is skipped if small table optimization is
on for the keyspace.
Before:
cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s
After:
cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s
A 15% improvement to bootstrap 30 node cluster was observed.
Fixes#24817Closesscylladb/scylladb#24901
* github.com:scylladb/scylladb:
repair: Speed up ranges calculation when small table optimization is on
test: Add test_boot_nodes.py
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.
In this commit, we're restricting those operations. We also provide two
validation tests.
One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.
Fixesscylladb/scylladb#24643
Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling these parameters: `--random-seed` and `--x-log2-compaction-groups`
Since this code affected with this issue in 2025.3 and this is only framework change, backport for that version needed.
Fixes: https://github.com/scylladb/scylladb/issues/24927Closesscylladb/scylladb#24928
* https://github.com/scylladb/scylladb:
test.py: add bypassing x_log2_compaction_groups to boost tests
test.py: add bypassing random seed to boost tests
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.
This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.
Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.
Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.
Fixes https://github.com/scylladb/scylladb/issues/24524
Backport not needed, as it is a new feature.
Closesscylladb/scylladb#24924
* github.com:scylladb/scylladb:
main: utils: add thread names to alien workers
auth: move passwords::check call to alien thread
test: wait for 3 clients with given username in test_service_level_api
auth: refactor password checking in password_authenticator
auth: make SHA-512 the only password hashing scheme for new passwords
auth: whitespace change in identify_best_supported_scheme()
auth: require scheme as parameter for `generate_salt`
auth: check password hashing scheme support on authenticator start
Normally, during bootstrap, in repair_service::bootstrap_with_repair, we
need to calculate which range to sync data from carefully for the new
node. With small table optimization on, we pass a single full range and
all peer nodes to row level repair to sync data with. Now that we only
need to pass a single range and full peers, there is no need to calculate
the ranges and peers in repair_service::bootstrap_with_repair and drop
it later. The calculation takes time which slows down bootstrap, e.g.,
```
Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - bootstrap_with_repair: started with
keyspace=system_distributed_everywhere, nr_ranges=23809
Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]:
[shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]:
sync data for keyspace=system_distributed_everywhere, status=started,
reason=bootstrap, small_table_optimization=true
```
The range calculation took 15 seconds for system_distributed_everywhere
table.
To fix, the ranges calculation is skipped if small table optimization is
on for the keyspace.
Before:
cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s
After:
cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s
A 15% improvement to bootstrap 30 node cluster was observed.
Fixes#24817
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).
The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the superuser is attempted to be
created, and if it succeeds then the promise is _unconditionally_ set.
While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.
Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.
Fixes: scylladb/scylladb#24975Closesscylladb/scylladb#24976
The sstable reader reaches directly for a `clustered_index_cursor`.
But a BTI index reader won't be able to implement
`clustered_index_cursor`, because a BTI index doesn't store
full clustering keys, only some trie-encoded prefixes.
So we want to weaken the dependency. Instead of reaching
for `clustered_index_cursor`, we add a method which expresses
our intent, and we let `index_reader` touch the cursor internally.
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
Analysis of customer stalls showed that the `detail::hash_with_salt`
function, called from `passwords::check`, often blocks the reactor.
This function internally uses the `crypt_r` function from an external
library to compute password hashes, which is a CPU-intensive operation.
To prevent such reactor stalls, this commit moves the
`passwords::check` call to a dedicated alien thread. This thread is
created at system startup and is shared by all shards.
Within the alien thread, an `std::mutex` synchronizes access between
the thread and the shards. While this could theoretically cause
frequent lock contentions, in practice, even during connection storms,
the number of new connections per second per shard is limited
(typically hundreds per second). Additionally, the
`_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not
too many connections are processed at once.
Fixesscylladb/scylladb#24524
test_service_level_api tests create a new session and wait for all
clients to authenticate. However, the check that all connections are
authenticated is done by verifying that there are no connections
with the username 'anonymous', which is insufficient if new connections
have not yet been listed.
To avoid test failures, this commit introduces an additional check that
verifies all expected clients are present in the system.clients table
before proceeding with the test.
This commit splits an if statement to two ifs, to make it possible
to call `password::check` function from another (alien) thread in
the next commit of this patch series.
Ref. scylladb/scylladb#24524
Before this change, the following hashing schemes were supported by
`identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512,
SHA-256, and MD5. The reason for this was that the `crypt_r` function
used for password hashing comes from an external library (currently
`libxcrypt`), and the supported hashing algorithms vary depending
on the library in use.
However:
- The bcrypt algorithms do not work because their scheme
prefix lacks the required round count (e.g., it is `$2y$` instead of
`$2y$05$`). We suspect this never worked as intended. Moreover,
bcrypt tends to be slower than SHA-512, so we do not want to fix the
prefix and start using it.
- SHA-256 and SHA-512 are both part of the SHA-2 family, and libraries
that support one almost always support the other. It is not expected
to find a library that supports only SHA-256 but not SHA-512.
- MD5 is not considered secure for password hashing.
Therefore, this commit removes support for bcrypt_y, bcrypt_a, SHA-256,
and MD5 for hashing new passwords to ensure that the correct hashing
function (SHA-512) is used everywhere.
This commit does not change the behavior of `passwords::check`, so
it is still possible to use passwords hashed with the removed
algorithms.
Ref. scylladb/scylladb#24524
Remove tabs in `identify_best_supported_scheme()` to facilitate
reuse of those lines after the for loop is removed. This change is
motivated by the upcoming removal of support for obsolete password
hashing schemes and removal of `identify_best_supported_scheme()`
function.
Ref. scylladb/scylladb#24524
This is a refactoring commit that changes the `generate_salt` function
to require a password hashing scheme as a parameter. This change is
motivated by the upcoming removal of support for obsolete password
hashing schemes and removal of `identify_best_supported_scheme()`
function.
Ref. scylladb/scylladb#24524
This commit adds a check to the `password_authenticator` to ensure
that at least one of the available password hashing schemes is
supported by the current environment. It is better to fail at system
startup rather than on the first attempt to use the password
authenticator. This change is motivated by the upcoming removal
of support for obsolete password hashing schemes and removal of
`identify_best_supported_scheme()` function.
Ref. scylladb/scylladb#24524
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance
* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`
No backport needed since it enhances functionality which has not been released yet
fixes: https://github.com/scylladb/scylladb/issues/22458Closesscylladb/scylladb#23695
* github.com:scylladb/scylladb:
sstables: Start using `make_data_or_index_source` in `sstable`
sstables: refactor readers and sources to use coroutines
sstables: coroutinize futurized readers
sstables: add `make_data_or_index_source` to the `storage`
encryption: refactor key retrieval
encryption: add `encrypted_data_source` class
To discover what tests are included into combined_tests, pytest check this at
the very beginning. In the case if combined_tests binary is missing, it will
fail discovery and will not run test, even when it was not included into
combined_tests. This PR changes behavior, so it will not fail when
combined_tests is missing and only fail in case someone tries to run test from
it.
Closesscylladb/scylladb#24761
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.
For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24835