Commit Graph

592 Commits

Author SHA1 Message Date
Evgeniy Naydanov
2cb640f95c test.py: manager: add server_get_returncode() method
The method returns the returncode, or None if the Scylla process is still running.
If no Scylla process has been launched, a NoSuchProcess exception is raised.
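
A hypothetical usage sketch (how the server is identified and stopped is an assumption):

    async def test_returncode_example(manager) -> None:
        server = await manager.server_add()
        assert await manager.server_get_returncode(server.server_id) is None      # still running
        await manager.server_stop(server.server_id)
        assert await manager.server_get_returncode(server.server_id) is not None  # exited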
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
d874beb17f test.py: manager: change CLI and env options on a node start
Add parameters to the server_start() method to allow changing Scylla's
CLI and env options on node start.

Also, add an `expected_server_up_state` parameter, matching the one in the
server_add() method.
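
A hypothetical usage sketch (the parameter names for the CLI and env options are assumptions; `expected_server_up_state` works as in server_add()):

    async def test_restart_with_options_example(manager) -> None:
        server = await manager.server_add()
        await manager.server_stop_gracefully(server.server_id)
        await manager.server_start(
            server.server_id,
            cmdline=['--logger-log-level', 'raft=trace'],   # assumed name for the extra CLI options
            append_env={'EXAMPLE_VAR': '1'},                # assumed name for the extra env options
        )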
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
5d3b54aa9b test.py: REST API: add set_trace_probability() method 2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
a16a4b6171 test.py: REST API: add get_tokens() method
Get the list of tokens for the specified node.
An optional `endpoint` parameter can be provided.
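
A hypothetical usage sketch (parameter names are assumptions):

    async def test_tokens_example(manager) -> None:
        servers = await manager.servers_add(2)
        tokens = await manager.api.get_tokens(servers[0].ip_addr)
        peer_tokens = await manager.api.get_tokens(servers[0].ip_addr, endpoint=servers[1].ip_addr)
        assert tokens and peer_tokens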
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
f6e3fdd778 test.py: rework log_browsing for dtest migration
Rework the `ScyllaLogFile.wait_for()` method to make it easier
to add the required methods to the ScyllaNode class of the ccm-like shim.

Also, add a `ScyllaLogFile.grep_for_errors()` method and
rework `ScyllaLogFile.grep()`.
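
A hypothetical usage sketch of the reworked API (exact signatures and the grepped pattern are assumptions):

    async def test_log_browsing_example(manager) -> None:
        server = await manager.server_add()
        log = await manager.server_open_log(server.server_id)
        mark = await log.mark()                               # remember the current end of the log
        await manager.server_restart(server.server_id)
        await log.wait_for('init - serving', from_mark=mark)  # pattern is illustrative
        assert not await log.grep_for_errors()                # no errors logged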
2025-05-19 11:50:55 +00:00
Andrei Chekun
c33c0d62e1 test.py: change pattern for cleaning .log files in testlog directory
Currently, test.py recursively deletes all .log files under the
testlog directory instead of cleaning only the testlog directory itself.
With this change it will no longer descend into subdirectories to delete
log files. We still have a method for cleaning the log files in the mode
directories.
The downside of this approach is that we need to explicitly list all
directories that we want to clean.
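
Roughly the difference in the cleanup pattern (a sketch; the actual code may differ):

    from pathlib import Path

    testlog = Path('testlog')
    old_targets = testlog.rglob('*.log')   # recursive: also matched testlog/<mode>/... log files
    new_targets = testlog.glob('*.log')    # non-recursive: only files directly under testlog/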

Fixes: https://github.com/scylladb/scylladb/issues/24001

Closes scylladb/scylladb#24004
2025-05-13 13:58:36 +03:00
Dawid Mędrek
5d1bb8ebc5 test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
with a higher RF. The rationale is that we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.
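
A sketch of the described assignment (the property-file keys and helper names are assumptions):

    async def create_cluster_example(manager, rf: int, num_nodes: int):
        assert rf <= 3, "RF > 3 is disallowed to avoid an RF-rack-invalid keyspace"
        property_files = [{'dc': 'dc1', 'rack': f'rack{i % rf + 1}'} for i in range(num_nodes)]
        return await manager.servers_add(num_nodes, property_files=property_files)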
2025-05-10 16:30:37 +02:00
Tomasz Grabiec
fadfbe8459 Merge 'transport: storage_proxy: release ERM when waiting for query timeout' from Andrzej Jackowski
Before this change, if a read executor had just enough targets to
achieve the query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.

This change implements a mechanism that throws a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by the CQL server, which performs the waiting after
the ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch it (such as the mapreduce
service) should handle it just like a regular read_failure.
However, when the CQL server catches the exception, it returns a
read_timeout_exception to the client, because after the additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).

This change:
- Rewrites cql_server::connection::process_request_one to use
  seastar::futurize_invoke and try_catch<> instead of utils::result_try
- Adds a new read_failure_exception_with_timeout and throws it in storage_proxy
- Adds a sleep in the CQL server when the new exception is caught
- Catches local exceptions in the mapreduce service and converts them
  to std::runtime_error
- Adds get_cql_exclusive to manager_client.py
- Adds test_long_query_timeout_erm
No backport needed - minor issue fix.

Closes scylladb/scylladb#23156

* github.com:scylladb/scylladb:
  test: add test_long_query_timeout_erm
  test: add get_cql_exclusive to manager_client.py
  mapreduce: catch local read_failure_exception_with_timeout
  transport: storage_proxy: release ERM when waiting for query timeout
  transport: remove redundant references in process_request_one
  transport: fix the indentation in process_request_one
  transport: add futures in CQL server exception handling
2025-05-08 12:45:49 +02:00
Pavel Emelyanov
6389099dfb Merge 'test/cluster/test_read_repair.py: improve trace logging test (again)' from Botond Dénes
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. It turns out that node restart + trace-level logging + debug mode is too much, and even with a 1-minute timeout the read repair sometimes times out. Refactor the test to use an injection point instead of a restart. To make sure the test still tests what it is supposed to, use tracing to assert that the read repair did indeed happen.
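
A sketch of the refactored approach (the injection name and the shape of the trace returned by execute_with_tracing() are assumptions):

    async def test_read_repair_with_trace_logging_example(manager) -> None:
        servers = await manager.servers_add(2)
        await manager.api.enable_injection(servers[0].ip_addr, 'example_read_repair_injection', one_shot=True)
        trace = await execute_with_tracing(manager.get_cql(), "SELECT * FROM ks.t WHERE pk = 1")
        assert any('read repair' in event.activity.lower() for event in trace)  # the repair did happen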

Fixes: scylladb/scylladb#23968

Needs backport to 2025.1 and 6.2, both have the flaky test

Closes scylladb/scylladb#23989

* github.com:scylladb/scylladb:
  test/cluster/test_read_repair.py: improve trace logging test (again)
  test/cluster: extract execute_with_tracing() into pylib/util.py
2025-05-07 10:32:45 +03:00
Botond Dénes
51025de755 test/cluster: extract execute_with_tracing() into pylib/util.py
To allow reuse in other tests.
2025-05-02 01:53:35 -04:00
Andrei Chekun
22ef09489d test.py: add awareness of extra_scylla_cmdline_options
test_config.yaml can have an extra_scylla_cmdline_options field that
previously was not added to the command line used to start Scylla. Now any
extra options will be added to the command line when starting tests.
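
A sketch of the described behavior (the config key comes from the commit; the surrounding code is illustrative):

    import yaml

    with open('test_config.yaml') as f:
        test_config = yaml.safe_load(f) or {}

    scylla_cmdline = ['--smp', '2', '--developer-mode', 'true']
    scylla_cmdline += test_config.get('extra_scylla_cmdline_options', [])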
2025-04-24 14:05:50 +02:00
Andrei Chekun
2758c4a08e test.py: increase timeout for C++ tests in pytest
The current timeouts are not enough; tests fail randomly by hitting the
timeout. This change allows tests to finish normally. As a downside, if a
process hangs we will wait longer. These adjustments will be revisited once
we have metrics on how long tests take to pass in
each mode.
2025-04-24 14:05:50 +02:00
Andrei Chekun
f5c88e1107 test.py: switch method of finding the root repo directory
Switch to using the constant defined in the __init__ file instead of getting
the root directory from pytest's config. This allows a single source of
truth for defining the root directory of the project and avoids cases where
the root directory is defined incorrectly. This change also
simplifies potential changes in the future.
2025-04-24 14:05:50 +02:00
Andrei Chekun
06eca04370 test.py: move get_combined_tests to the correct facade
Since the get_combined_tests method is used only for Boost tests and not all C++ tests, move it to the correct place.
2025-04-24 14:05:49 +02:00
Andrei Chekun
8cc9c0a53a test.py: add common directory for reports
When test.py executes a Python test, it runs it per mode and per file,
so it can place the report under the mode directory. With the new approach,
pytest executes the tests for all modes itself, and we can
have only one report per pytest invocation. That's why we need a common
directory for reports, not one under the mode directory. It can later be
used for simplification, so any report should go there.
2025-04-24 14:05:49 +02:00
Andrei Chekun
b791af1f16 test.py: add the possibility to provide additional env vars
This allows injecting any environment variable into the test; previously
only the environment variables of the process were used.
Also inject the ASAN and UBSAN variables into the tests.
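
A sketch of the described injection (the variable values are illustrative):

    import os

    extra_env = {
        'ASAN_OPTIONS': 'disable_coredump=0',
        'UBSAN_OPTIONS': 'print_stacktrace=1',
    }
    test_env = dict(os.environ) | extra_env   # process environment plus the injected variables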
2025-04-24 14:05:49 +02:00
Andrei Chekun
3cb5838619 test.py: move setup cgroups to the generic method
This change is needed for the later integration of pytest executing the C++
tests, so that it can gather resource metrics.
2025-04-24 14:05:49 +02:00
Andrei Chekun
ca615af407 test.py: refactor resource_gather.py
Refactor resource_gather.py to not create the initial cgroup when the process is already in it. This avoids nesting deeper and re-creating the same cgroup on every test.py execution while the terminal stays open.
Also create our own event loop in case one does not exist. This is needed to work both with
test.py, which creates a loop, and with pytest, which does not.
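
A sketch of the described event-loop fallback:

    import asyncio

    try:
        loop = asyncio.get_running_loop()   # test.py already created and runs a loop
    except RuntimeError:
        loop = asyncio.new_event_loop()     # under bare pytest there is no running loop yet
        asyncio.set_event_loop(loop)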
2025-04-24 14:05:49 +02:00
Andrzej Jackowski
1f1e4f09cd test: add get_cql_exclusive to manager_client.py
This commit adds to ManagerClient a get_cql_exclusive function that
allows creating a CQL connection with WhiteListRoundRobinPolicy for
a single server. Such a connection is useful in tests that kill nodes, to
make sure that the live node handles the queries. Before this commit,
some tests used cluster_con from test/cluster/conftest.py; after
this commit tests can start using a method from ManagerClient.

This change:
 - Extends the ManagerClient con_gen type to allow a LoadBalancingPolicy arg
 - Implements get_cql_exclusive()
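
A hypothetical usage sketch (the exact signature is an assumption):

    async def test_exclusive_connection_example(manager) -> None:
        servers = await manager.servers_add(2)
        await manager.server_stop(servers[1].server_id)
        cql = await manager.get_cql_exclusive(servers[0])    # WhiteListRoundRobinPolicy for servers[0] only
        await cql.run_async("SELECT * FROM system.local")    # guaranteed to be handled by the live node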
2025-04-23 09:29:47 +02:00
Andrei Chekun
57b66e6b2e test.py: move the readme file for LDAP tests to the correct location
The README file was created in an incorrect location; it is now moved to the
directory with the source files where it was intended to be.
2025-04-22 19:03:28 +02:00
Andrei Chekun
cf4747c151 test.py: eliminate deprecation warning for xml.etree.ElementTree.Element
Testing the truth value of an Element emits a DeprecationWarning. The check is now done correctly.
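
The gist of the fix (illustrative):

    import xml.etree.ElementTree as ET

    root = ET.fromstring('<testsuite><testcase name="t"/></testsuite>')
    elem = root.find('testcase')
    if elem is not None:      # explicit comparison: no DeprecationWarning
        print(elem.get('name'))
    # "if elem:" would emit the DeprecationWarning (and is False for elements without children)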
2025-04-22 19:03:21 +02:00
Andrei Chekun
5c3501e4bf test.py: fix typo in toxiproxy name parameter
Fix a typo in the toxiproxy name parameter. No functional changes, just a
cosmetic fix.
2025-04-22 19:02:12 +02:00
Andrei Chekun
2c37a793d1 test.py: add locking to the sqlite writer for resource gather
SQLite locks the DB during writes, so it's not possible to write from
several threads at once. To be able to gather metrics in several threads, we need a
locking mechanism so that one thread does not try to
write metrics while another thread is performing a write.
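
A sketch of the described locking (illustrative):

    import sqlite3
    import threading

    _write_lock = threading.Lock()
    _conn = sqlite3.connect('metrics.db', check_same_thread=False)
    _conn.execute('CREATE TABLE IF NOT EXISTS metrics (name TEXT, value REAL)')

    def write_metric(name: str, value: float) -> None:
        with _write_lock:                  # one writing thread at a time
            _conn.execute('INSERT INTO metrics (name, value) VALUES (?, ?)', (name, value))
            _conn.commit()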
2025-04-22 19:01:30 +02:00
Andrei Chekun
800710dc2c test.py: add sqlite datetime adapter for resource gather
Add a sqlite datetime adapter for resource gather, since the default adapters are deprecated as of Python 3.12.
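
A sketch of registering such an adapter (illustrative):

    import sqlite3
    from datetime import datetime

    # the built-in datetime adapters/converters are deprecated since Python 3.12
    sqlite3.register_adapter(datetime, lambda dt: dt.isoformat())
    sqlite3.register_converter('timestamp', lambda raw: datetime.fromisoformat(raw.decode()))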
2025-04-22 18:59:49 +02:00
Andrei Chekun
bf2a9e267e test.py: change the parameter for get_modes_to_run()
Change the parameter of get_modes_to_run() from session to config to
narrow the scope, and prepare it for later use in methods that do not have
access to the session but do have access to the config object.
2025-04-22 18:58:33 +02:00
Andrei Chekun
441cee8d9c test.py: fix gathering logs in case of fail
Currently, log files contain the run_id twice:
cluster.object_store_test_backup.10.test_abort_restore_with_rpc_error.dev.10_cluster.log
However, sometimes the first run_id can be incorrect:
cluster.object_store_test_backup.1.test_abort_restore_with_rpc_error.dev.10_cluster.log
Remove the first run_id from the name to avoid this issue, and because
it's actually redundant.
Also remove the creation of an empty file for the scylla manager log, since it is
redundant and was based on an incorrect assumption about the root cause of the failure.
Add an extension to the stacktrace file, so that Jenkins opens it in a new
browser tab instead of downloading it.

Fixes: https://github.com/scylladb/scylladb/issues/23731

Closes scylladb/scylladb#23797
2025-04-21 13:12:35 +03:00
Nadav Har'El
fbcf77d134 raft: make group0 Raft operation timeout configurable
A recent commit 370707b111 (re)introduced
a timeout for every group0 Raft operation. This timeout was set to 60
seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody".

However, one of the things we do as a group0 operation is schema
changes, and we already noticed a few years ago, see commit
0b2cf21932, that in some extremely
overloaded test machines where tests run hundreds of times (!) slower
than usual, a single big schema operation - such as Alternator's
DeleteTable deleting a table and multiple of its CDC or view tables -
sometimes takes more than 60 seconds. The above fix changed the
client's timeout to wait for 300 seconds instead of 60 seconds,
but now we also need to increase our Raft timeout, or the server can
time out. We've seen this happening recently making some tests flaky
in CI (issue #23543).

So let's make this timeout configurable, as a new configuration option
group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e.,
60 seconds), the same as the existing default. The test framework
overrides this default with a higher 300-second timeout, matching
the client-side timeout.
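
A hypothetical way a test could override the new option for a single node (the use of the config parameter is an assumption):

    async def test_slow_group0_op_example(manager) -> None:
        server = await manager.server_add(config={'group0_raft_op_timeout_in_ms': 300000})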

Before this patch, this timeout was already configurable in a strange
way, using injections. But this was a misstep: we already have more
than a dozen timeouts configurable through the normal configuration,
and this one should have been configured in the same way. There is
nothing "holy" about the default of 60 seconds we chose, and who
knows, maybe in the future we might need to tweak it in the field,
just like we made the other timeouts tweakable. Injections cannot
be used in release mode, but configuration options can.

Fixes #23543

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23717
2025-04-15 10:57:39 +03:00
Andrei Chekun
8e33d7ab81 test.py: Make the testpy log files in pytest follow the same format
Fix the inconsistent log file names between conftest and scylla_manager.
This regression was introduced in #22960.

Currently, scylla manager outputs its logs to a file with the
following pattern:
suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log
At the same time, pytest tries to find this log under the following name:
suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log

This inconsistency leads to a situation where, when the test fails, the scylla
manager log file is not copied to the failed_test directory and the
test raises an exception on teardown.

Closes scylladb/scylladb#23596
2025-04-14 12:52:48 +03:00
Evgeniy Naydanov
d6b64642c5 test.py: print out path to Scylla log for Python test suites
Test suites with `type: Python` use a single Scylla node
created by test.py, but it's handy to also print the path to its log
file in the pytest log, to make it easier to find the file
on failures.

Closes scylladb/scylladb#23683
2025-04-14 11:15:37 +03:00
Patryk Jędrzejczak
07a7a75b98 Merge 'raft: implement the limited voters feature' from Emil Maskovsky
Currently, if raft is enabled, all nodes are voters in group0. However, it is not necessary for all nodes to be voters - it only slows down raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes alongside 1 DC with 10 nodes will lose the majority if the large DC is isolated).

The topology coordinator will now maintain a state where there is only a limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked).
* When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in the number of data centers (DCs) or racks, the rebalance action might become broader (there are some special rules applying to 1 vs 2 vs more DCs, and changing the number of racks might cause similar effects in the voter distribution).

Special conditions for various numbers of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both had the same number of voters we'd lose the majority if either DC were lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be such that every DC has strictly less than half of the total voters (so losing any DC cannot lead to losing the majority). Again, DCs with more racks are preferred in the voter distribution.

At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and it will work out of the box. If a preference for zero-token nodes is eventually needed/requested, it will be added separately from this PR.

The maximum of 5 voters has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different DCs and racks) die independently in a short time span. With fewer than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures.

Currently the limit will not be configurable (we might introduce configurable limits later if needed/requested).

Tests added:
* boost/group0_voter_registry_test.cc: run time on CI: ~3.5s
* topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total

Fixes: scylladb/scylladb#18793

No backport: This is a new feature that will not be backported.

Closes scylladb/scylladb#21969

* https://github.com/scylladb/scylladb:
  raft: distribute voters by rack inside DC
  raft/test: fix lint warnings in `test_raft_no_quorum`
  raft/test: add the upgrade test for limited voters feature
  raft topology: handle on_up/on_down to add/remove node from voters
  raft: fix the indentation after the limited voters changes
  raft: implement the limited voters feature
  raft: drop the voter removal from the decommission
  raft/test: disable the `stop_before_becoming_raft_voter` test
  raft/test: stop the server less gracefully in the voters test
2025-04-10 15:29:15 +02:00
Botond Dénes
e5afd9b5fb test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the returned hosts list refers to the same
underlying Scylla instance as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
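
A hypothetical usage sketch relying on the new ordering guarantee (the import path is assumed):

    import time
    from test.pylib.util import wait_for_cql_and_get_hosts

    async def test_host_ordering_example(manager) -> None:
        servers = await manager.servers_add(3)
        hosts = await wait_for_cql_and_get_hosts(manager.get_cql(), servers, time.time() + 60)
        # hosts[i] now refers to the same Scylla instance as servers[i]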
2025-04-08 00:11:36 -04:00
Emil Maskovsky
1d06ea3a5a raft: implement the limited voters feature
Currently, if raft is enabled, all nodes are voters in group0. However, it
is not necessary for all nodes to be voters - it only slows down
raft group operation (since the quorum is large) and makes
deployments with asymmetrical DCs problematic (2 DCs with 5 nodes alongside
1 DC with 10 nodes will lose the majority if the large DC is isolated).

The topology coordinator will now maintain a state where there is only a
limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and
rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the
  current distribution of voters - either if there are still some voter
  "slots" available, or if the new node is a better candidate than some
  existing voter (in which case the existing node voter status might be
  revoked).
* When a voter node is removed or stopped (shut down), its voter status
  is revoked and another node might become a voter instead (this can also
  depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in the number of datacenters
  (DCs) or racks, the rebalance action might become broader (there are
  some special rules applying to 1 vs 2 vs more DCs, and changing the
  number of racks might cause similar effects in the voter distribution).

Special conditions for various numbers of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible),
  meaning that we can tolerate a loss of the DC with the smaller number
  of voters (if both had the same number of voters we'd lose the
  majority if either DC were lost).
  For example, if we have 2 DCs with 2 nodes each, one of them will only
  have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has
  more racks than the other and the node count allows it, the DC with
  more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be such that every
  DC has strictly less than half of the total voters (so a loss of any
  of the DCs cannot lead to losing the majority). Again, DCs with more
  racks are preferred in the voter distribution.

At the moment we will be handling the zero-token nodes in the same way
as the regular nodes (i.e. the zero-token nodes will not take any
priority in the voter distribution). Technically it doesn't make much
sense to have a zero-token node that is not a voter (when there are
regular nodes in the same DC being voters), but currently the intended
purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs,
creating a third DC with zero-token nodes only), so for that intended
purpose no special handling is needed and will work out of the box.
If a preference for zero-token nodes is eventually needed/requested,
it will be added separately from this PR.

Currently the voter limits will not be configurable (we might introduce
configurable limits later if needed/requested).

The feature is enabled by the `group0_limited_voters` feature flag
to avoid issues with cluster upgrade (the feature will only be enabled
once all nodes in the cluster are upgraded to the version supporting
the feature).

Fixes: scylladb/scylladb#18793
2025-04-07 12:31:18 +02:00
Tomasz Grabiec
fe8187e594 Merge 'repair: release erm in repair_writer_impl::create_writer when possible' from Aleksandra Martyniuk
Currently, repair_writer_impl::create_writer keeps the erm to ensure that the sharder stays valid. If we repair a tablet, the erm blocks the state machine and no operation on any tablet of this table can be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#23455

* github.com:scylladb/scylladb:
  \test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
2025-04-03 11:15:08 +02:00
Aleksandra Martyniuk
bae6711809 \test: add test to check concurrent migration and repair of two different tablets 2025-04-02 15:30:17 +02:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed.

This PR depends on #22951; the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Avi Kivity
69684e16d8 Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski
This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes:

- We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards.
- We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options).
- We add a cluster feature which guards the creation of dictionary-compressed SSTables.
- We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards.
- We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries.
- We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that).
- We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio.
- We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement.

Known imperfections:
- The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first.

New feature, no backporting involved.

Closes scylladb/scylladb#23025

* github.com:scylladb/scylladb:
  docs: add user-facing documentation for SSTable compression with shared dicts
  docs/dev: add sstable-compression-dicts.md
  test: add test_sstable_compression_dictionaries_autotrain.py
  test: add test_sstable_compression_dictionaries_basic.py
  test/pylib/rest_client: add `keyspace_upgrade_sstables` helper
  main: run a sstable_dict_autotrainer
  api: add the estimate_compression_ratios API call
  dict_autotrainer: introduce sstable_dict_autotrainer
  db/system_keyspace: add query_dict_timestamp
  compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor
  main: clean up sstable compression dicts after table drops
  sstables/compress: discard hidden compression options after the decompressor is created
  compress: change compressor_ptr from shared_ptr to unique_ptr
  api: add the retrain_dict API call
  storage_service: add some dict-related routines
  main: in compression_dict_updated_callback, recognize and use SSTable compression dicts
  storage_service: add do_sample_sstables()
  messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs
  db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict
  raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback
  database: add sample_data_files()
  database: add take_sstable_set_snapshot()
  compress: teach `lz4_processor` about dictionaries
  compress: teach `zstd_processor` about dictionaries
  sstables: delegate compressor creation to the compressor factory
  sstables: plug an `sstable_compressor_factory` into `sstables_manager`
  sstables: introduce sstable_compressor_factory
  utils/hashers: add get_sha256()
  gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature
  compress: add hidden dictionary options
  compress: remove `compression_parameters::get_compressor()`
  sstables/compress: remove get_sstable_compressor()
  sstables/compress: move ownership of `compressor` to `sstable::compression`
  compress: remove compressor::option_names()
  compress: clean up the constructor of zstd_processor
  compress: squash zstd.cc into compress.cc
  sstables/compress: break the dependency of `compression_parameters` on `compressor`
  compress.hh: switch compressor::name() from an instance member to a virtual call
  bytes: adapt fmt_hex to std::span<const std::byte>
2025-04-01 12:47:34 +03:00
Botond Dénes
0fdf2a2090 Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy
So that a multi-dc/multi-rack cluster can be populated
in a single call.
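
A hypothetical usage sketch (the property-file keys are assumptions):

    async def test_multidc_example(manager) -> None:
        property_files = [
            {'dc': 'dc1', 'rack': 'rack1'},
            {'dc': 'dc1', 'rack': 'rack2'},
            {'dc': 'dc2', 'rack': 'rack1'},
        ]
        servers = await manager.servers_add(3, property_files=property_files)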

* Enhancement, no backport required

Closes scylladb/scylladb#23341

* github.com:scylladb/scylladb:
  test/pylib: servers_add: add auto_rack_dc parameter
  test/pylib: servers_add: support list of property_files
2025-04-01 09:14:20 +03:00
Michał Chojnowski
7b0eeefd79 test/pylib/rest_client: add keyspace_upgrade_sstables helper 2025-04-01 00:07:30 +02:00
Michał Chojnowski
a19d6d95f7 api: add the estimate_compression_ratios API call
Add an API call which estimates the effectiveness of possible
compression config changes.

This can be used to make an informed decision about whether to
change the compression method, without actually recompressing
any SSTables.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
58ae278d10 api: add the retrain_dict API call
Add an API call which will retrain the SSTable compression dictionary
for a given table.

Currently, it needs all nodes to be alive to succeed. We can relax this later.
2025-04-01 00:07:29 +02:00
Robert Bindar
e3a3508960 Move object_storage.yaml endpoints to scylla.yaml
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 13:39:39 +03:00
Benny Halevy
a4aa4d74c1 test/pylib: servers_add: add auto_rack_dc parameter
To quickly populate nodes in a single dc,
each node in its own rack.
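
A hypothetical usage sketch (the parameter name comes from the commit; semantics as described):

    async def test_single_dc_racks_example(manager) -> None:
        # three nodes in dc1, each in its own rack
        servers = await manager.servers_add(3, auto_rack_dc='dc1')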

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:23:40 +03:00
Benny Halevy
c4dbb11c87 test/pylib: servers_add: support list of property_files
So that a multi-dc/multi-rack cluster can be populated
in a single call.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:12:39 +03:00
Evgeniy Naydanov
1a0c14aa50 test.py: async_cql: remove unused event_loop fixture
A newer version of pytest-asyncio (0.24.0) allows controlling the scope
of the async loop per fixture. This workaround is no longer needed.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
9bba59631f test.py: add xdist worker ID to log filenames
When running tests in parallel we need to ensure that filenames
are unique, by adding the xdist worker ID to them.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
9cb0ec2b42 test.py: topology: run tests using bare pytest command
Run ScyllaClusterManager using a pytest fixture if the `--manager-api`
option is not provided.

At this stage we're trying to stay as close to test.py as possible.
test.py runs tests file-by-file, so, effectively, the `session`,
`package`, and `module` scopes are pretty much the same. Also, test.py starts
a ScyllaClusterManager for every test module, which is the reason
why the `manager_api_sock_path` fixture has scope=`module`. As a
result, we need to change the scope of the `manager_internal` fixture too.
2025-03-30 03:19:29 +00:00
Evgeniy Naydanov
42075170d1 test.py: add fixtures for current test suite and test
Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`.
To build a TestSuite object we need to discover the corresponding `suite.xml`
file. Do this by walking up through the fs tree, starting from the current
test file.
2025-03-30 03:19:29 +00:00
Evgeniy Naydanov
c4ae4e247a test.py: refactor paths constants and options
Add path constants to the `test` module and use them in the different test suites
instead of each suite's own duplicate of the same code:

 - TOP_SRC_DIR : ScyllaDB's source code root directory
 - TEST_DIR : the directory with test.py tests and libs
 - BUILD_DIR : directory with ScyllaDB's build artefacts

Add a TestSuite.log_dir attribute, a ScyllaDB build-mode subdirectory of the path
provided using the `--tmpdir` CLI argument. Don't use the `tmpdir` name because it
gets mixed up with pytest's built-in fixture and the `--tmpdir` option itself.

Change the default value of `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog`.

Refactor the `ResourceGather*` classes to use the path from the `test` object instead of
providing it separately.

Move the mode constants to the `test` module and remove duplication.

Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to
`pylib.suite.base` to avoid circular imports (with a little refactoring to
use `pathlib.Path` instead of `str` for paths).

Also, in some places refactor to use f-strings for formatting.
2025-03-30 03:19:29 +00:00
Avi Kivity
b292b5800b Merge 'test.py: move starting LDAP service to dedicate method' from Andrei Chekun
Move starting LDAP to the method where the rest of the services are started. This unifies the way the 3rd-party services are started.
Fix LDAP test flakiness caused by not being able to connect to the LDAP server.
Add catching of the stdout and stderr of toxiproxy-cli in case of errors.

Related: https://github.com/scylladb/scylladb/pull/23333

This PR is based on https://github.com/scylladb/scylladb/pull/23221, so #23221 should be merged first.

Closes scylladb/scylladb#23235

* github.com:scylladb/scylladb:
  test.py: Refactor nodetool/conftest
  test.py: Refactor test/pylib/cpp/ldap
  test.py: move starting LDAP service to dedicate method
2025-03-26 15:31:00 +02:00
Artsiom Mishuta
8bb6414037 test.py: reuse clusters in Python suite
PR https://github.com/scylladb/scylladb/pull/22274 was introduced due to
CI instability and the wish to mark the cluster dirty after each test for topology suites.
But in fact it affects only Python suites, which are quite stable, and CI was
stabilized by PR https://github.com/scylladb/scylladb/pull/22252.

This PR brings back cluster reuse in Python test suites.

Closes scylladb/scylladb#23179
2025-03-23 20:08:36 +02:00