Make it possible to use logging from within tests in the topology
suites. The tests are executed using `pytest`, which uses a `pytest.ini`
file for logging configuration.
Also cleanup the `pytest.ini` files a bit.
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.
This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.
Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
the borrower is then responsible for cleaning up the object.
Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.
Closes#11558
- Raise on response not HTTP 200 for `.get_text()` helper
- Fix API paths
- Close and start a fresh driver when restarting a server and it's the only server in the cluster
- Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions)
Closes#11496
* github.com:scylladb/scylladb:
test.py: handle duplicate result from driver
test.py: log server restarts for topology tests
test.py: log actions for topology tests
Revert "test.py: restart stopped servers before...
test.py: ManagerClient API fix return text
test.py: ManagerClient raise on HTTP != 200
test.py: ManagerClient fix paths to updated resource
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:
sharded<service> s;
co_await s.invoke_on_all([] {
...
barrier.abort();
});
...
co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subseqent stop
will queue its stopping lambdas after barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11553
The test is supposed to give a helpful error message when the user forgets to
run --populate before the benchmark. But this must have become broken at some
point, because execute_cql() terminates the program with an unhelpful
("unconfigured table config") message, which doesn't mention --populate.
Fix that by catching the exception and adding the helpful tip.
Closes#11533
Sometimes the driver calls twice the callback on ready done future with
a None result. Log it and avoid setting the local future twice.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundary, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction can also not fulfill its promise as the file storing the large partition can only be released once exhausted.
The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading.
The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run.
The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold.
What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned.
Closes#11233
* github.com:scylladb/scylladb:
test: Add test for large partition splitting on compaction
compaction: Add support to split large partitions
sstable: Extend sstable_run to allow disjointness on the clustering level
sstables: simplify will_introduce_overlapping()
test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
test: lib: Fix inefficient merging of mutations in make_sstable_containing()
sstables: Keep track of first partition's first pos and last partition's last pos
sstables: Rename min/max position_range to a descriptive name
sstables_manager: Add sstable metadata reader concurrency semaphore
sstables: Add ability to find first or last position in a partition
teardown..."
This reverts commit df1ca57fda.
In order to prevent timeouts on teardown queries, the previous commit
added functionality to restart servers that were down. This issue is
fixed in fc0263fc9b so there's no longer need to restart stopped servers
on test teardown.
For ManagerClient request API, don't return status, raise an exception.
Server side errors are signaled by status 500, not text body.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Halted background fibers render raft server effectively unusable, so
report this explicitly to the clients.
Fix: #11352Closes#11370
* github.com:scylladb/scylladb:
raft server, status metric
raft server, abort group0 server on background errors
raft server, provide a callback to handle background errors
raft server, check aborted state on public server public api's
Pool.get() might have waiting callers, so if an item is not returned
to the pool after use, tell the pool to add a new one and tell the pool
an entry was taken (used for total running entries, i.e. clusters).
Use it when a ScyllaCluster is dirty and not returned.
While there improve logging and docstrings.
Issue reported by @kbr-.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11546
When a server is down, the driver expects multiple schema timeouts
within the same request to handle it properly.
Found by @kbr-
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11544
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
make_sstable_containing() was absurdly slow when merging thousands of
mutations belonging to the same key, as it was unnecessarily copying
the mutation for every merge, producing bad complexity.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With first partition's first position and last partition's last
partition, we'll be able to determine which fragments composing a
sstable run store a large partition that was split.
Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.
Fixes#10637.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new method allows sstable to load the first row of the first
partition and last row of last partition.
That's useful for incremental reading of sstable run which will
be split at clustering boundary.
To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`_request` performed a GET request and extracted a text body out of the
response.
Split it into `_get`, which only performs the request, and `_get_text`,
which calls `_get` and extracts the body as text.
Also extract a `_resource_uri` function which will be used for other
request types.
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.
The suite will be used to test the Raft upgrade procedure.
The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.
Closes#11487
* github.com:scylladb/scylladb:
test.py: PythonTestSuite: sum default config params with user-provided ones
test: add a topology suite with Raft disabled
test: pylib: use Python dicts to manipulate `ScyllaServer` configuration
test: pylib: store `config_options` in `ScyllaServer`
Due to slow debug machines timing out, bump up all timeouts
significantly.
The cause was ExecutionProfile request_timeout. Also set a high
heartbeat timeout and bump already set timeouts to be safe, too.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11516
This series introduces two configurable options when working with TWCS tables:
- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.
Refs: #6923Fixes: #9029Closes#11445
* github.com:scylladb/scylladb:
tests: cql_query_test: add mixed tests for verifying TWCS guard rails
tests: cql_query_test: add test for TWCS window size
tests: cql_query_test: add test for TWCS tables with no TTL defined
cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
cql: add max window restriction for TimeWindowCompactionStrategy
time_window_compaction_strategy: reject invalid window_sizes
cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
`feature_service` provided two sets of features: `known_feature_set` and
`supported_feature_set`. The purpose of both and the distinction between
them was unclear and undocumented.
The 'supported' features were gossiped by every node. Once a feature is
supported by every node in the cluster, it becomes 'enabled'. This means
that whatever piece of functionality is covered by the feature, it can
by used by the cluster from now on.
The 'known' set was used to perform feature checks on node start; if the
node saw that a feature is enabled in the cluster, but the node does not
'know' the feature, it would refuse to start. However, if the feature
was 'known', but wasn't 'supported', the node would not complain. This
means that we could in theory allow the following scenario:
1. all nodes support feature X.
2. X becomes enabled in the cluster.
3. the user changes the configuration of some node so feature X will
become unsupported but still known.
4. The node restarts without error.
So now we have a feature X which is enabled in the cluster, but not
every node supports it. That does not make sense.
It is not clear whether it was accidental or purposeful that we used the
'known' set instead of the 'supported' set to perform the feature check.
What I think is clear, is that having two sets makes the entire thing
unnecessarily complicated and hard to think about.
Fortunately, at the base to which this patch is applied, the sets are
always the same. So we can easily get rid of one of them.
I decided that the name which should stay is 'supported', I think it's
more specific than 'known' and it matches the name of the corresponding
gossiper application state.
Closes#11512
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.
The suite will be used to test the Raft upgrade procedure.
The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.
Previously we used a formattable string to represent the configuration;
values in the string were substituted by Python's formatting mechanism
and the resulting string was stored to obtain the config file.
This approach had some downsides, e.g. it required boilerplate work to
extend: to add a new config options, you would have to modify this
template string.
Instead we can represent the configuration as a Python dictionary. Dicts
are easy to manipulate, for example you can sum two dicts; if a key
appears in both, the second dict 'wins':
```
{1:1} | {1:2} == {1:2}
```
This makes the configuration easy to extend without having to write
boilerplate: if the user of `ScyllaServer` wants to add or override a
config option, they can simply add it to the `config_options` dict and
that's it - no need to modify any internal template strings in
`ScyllaServer` implementation like before. The `config_options` dict is
simply summed with the 'base' config dict of `ScyllaServer`
(`config_options` is the right summand so anything in there overrides
anything in the base dict).
An example of this extensibility is the `authenticator` and `authorizer`
options which no longer appear in `scylla_cluster.py` module after this
change, they only appear in the suite.yaml file.
Also, use "workdir" option instead of specifying data dir, commitlog
dir etc. separately.
Previously the code extracted `authenticator` and `authorizer` keys from
the config options and stored them.
Store the entire dict instead. The new code is easier to extend if we
want to make more options configurable.
"
The test in question plays with snitches to simulate the topology
over which tokens are spread. This set replaces explicit snitch
usage with temporary topology object.
Some snitch traces are still left, but those are for token_metadata
internal which still call global snitch for DC/RACK.
"
* 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla:
network_topology_strategy_test: Use topology instead of snitch
network_topology_strategy_test: Populate explicit topology
dirty_memory_manager tracks lsa regions (memtables) under region_group:s,
in order to be able to pick up the largest memtable as a candidate for
flushing.
Just as region_group:s contain regions, they can also contain other
region_group:s in a nested structure. It also tracks the nested region_group
that contains the largest region in a binomial heap.
This latter facility is no longer used. It saw use when we had the system
dirty_memory_manager nested under the user dirty_memory_manager, but
that proved too complicated so it was undone. We still nest a virtual
region_group under the real region_group, and in fact it is the
virtual region_group that holds the memtables, but it is accessed
directly to find the largest memtable (region_group::get_largest_region)
and so all the mechanism that sorts region_group:s is bypassed.
Start to dismantle this house of cards by removing the subgroup
sorting. Since the hierarchy has exactly one parent and one child,
it's clearly useless. This is seen by the fact that we can just remove
everything related.
We still need the _subgroups member to hold the virtual region_group;
it's replaced by a vector. I verified that the non-intrusive vector
is exception safe since push_back() happens at the very end; in any
case this is early during setup where we aren't under memory pressure.
A few tests that check the removed functionality are deleted.
Closes#11515
Compaction group can be defined as a set of files that can be compacted together. Today, all sstables belonging to a table in a given shard belong to the same group. So we can say there's one group per table per shard. As we want to eventually allow isolation of data that shouldn't be mixed, e.g. data from different vnodes, then we want to have more than one group per table per shard. That's why compaction groups is being introduced here.
Today, all memtables and sstables are stored in a single structure per table. After compaction groups, there will be memtables and sstables for each group in the table.
As we're taking an incremental approach, table still supports a single group. But work was done on preparing table for supporting multiple groups. Completing that work is actually the next step. Also, a procedure for deriving the group from token is introduced, but today it always return the single group owned by the table. Once multiple groups are supported, then that procedure should be implemented to map a token to a group.
No semantics was changed by this series.
Closes#11261
* github.com:scylladb/scylladb:
replica: Move memtables to compaction_group
replica: move compound SSTable set to compaction group
replica: move maintenance SSTable set to compaction_group
replica: move main SSTable set to compaction_group
replica: Introduce compaction_group
replica: convert table::stop() into coroutine
compaction_manager: restore indentation
compaction_manager: Make remove() and stop_ongoing_compactions() noexcept
test: sstable_compaction_test: Don't reference main sstable set directly
test: sstable_utils: Set data size fields for fake SSTable
test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.
At first it will support repair and compaction tasks, and possibly more in the future.
Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.
Task manager's tasks are implemented in two two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overriden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.
Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.
User can access task manager rest api interface to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish
To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.
Fixes: #9809Closes#11216
* github.com:scylladb/scylladb:
test: task manager api test
task_manager: test api layer implementation
task_manager: add test specific classes
task_manager: test api layer
task_manager: api layer implementation
task_manager: api layer
task_manager: keep task_manager reference in http_context
start sharded task manager
task_manager: create task manager object
This patch adds set of 10 cenarios that have been unveiled during additional testing.
In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled -
may break the guardrails safe-mode. The situations covered are:
- STCS->TWCS with no TTL defined
- STCS->TWCS with small TTL
- STCS->TWCS with large TTL value
- TWCS table with small to large TTL
- No TTL TWCS to large TTL and then small TTL
- twcs_max_window_count LiveUpdate - Decrease TTL
- twcs_max_window_count LiveUpdate - Switch CompactionStrategy
- No TTL TWCS table to STCS
- Large TTL TWCS table, modify attribute other than compaction and default_time_to_live
- Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined
This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy
with an incorrect number of compaction windows.
The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the
test in order to ensure that users can effectively disable the enforcement, should they want.
This patch adds a testcase for TimeWindowCompactionStrategy tables created with no
default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_default_ttl
parameter in order to determine whether TWCS tables without TTL should be forbidden or not.
The test replays all 3 possible variations of the tri_mode_restriction and verifies tables
are correctly created/altered according to the current setting on the replica which receives
the request.
Now memtables live in compaction_group. Also introduced function
that selects group based on token, but today table always return
the single group managed by it. Once multiple groups are supported,
then the function should interpret token content to select the
group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Preparatory change for main sstable set to be moved into compaction
group. After that, tests can no longer direct access the main
set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
So methods that look at data size and require it to be higher than 0
will work on fake SSTables created using set_values().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>