Rather than maximum per-node shard overcommit. Global shard overcommit
is a better metric since we want to equalize global load not just
per-node load.
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.
We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".
This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.
Fixes#19911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19914
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.
But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!
So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).
The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19916
There's a test in object_store suite that verifies the contents of a bucket. It does with the plain http request, but unfortunately this doesn't work -- even local minio uses restricted bucket and using plain http request results in 403(Forbidden) error code. Test doesn't check it and continues working with empty list of objects which, in turn, is what it expects to see.
The fix is in using boto3. With it, the acc/secret pair is picked up and listing the bucket finally works.
Closesscylladb/scylladb#19889
* github.com:scylladb/scylladb:
test/object_store: Use boto3.resource to list bucket
test/object_store: Add get_s3_resource() helper
Instead of plain http request, use the power of boto3 package. The
recently added get_s3_resource() facilitates creating one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It creates boto3.resource object that points to endpoint maintained
by s3_server argument (that tests obtain via fixture). This allows
using boto3 to access S3 bucket from local minio server.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
single tablet replica
In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.
Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes. This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.
Refs scylladb/scylladb#18862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19816
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.
Fixes#19879
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#19883
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.
As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.
Fixes: scylladb/scylladb#19439
This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.
Closesscylladb/scylladb#19765
* github.com:scylladb/scylladb:
test: regression test for MV crash with tablets during decommission
db/view: drop view updates to replaced node marked as left
The admission test has a section which tests admission when the
semaphore has inactive reads. This section (and therefore the enire
test) became flaky lately, after a seemingly unrelated seastar upgrade,
which improved timers.
The cause of the flakyness is the permit which is made inactive later:
this permit is created with 0 timeout (times out immediately). For some
time now, when the timeout timer of a permit fires, if the permit is
inactive, it is evicted. This is what makes the test fail: the inactive
read times out and ends up evicting this permit, which is not expected
for the test. The reason this was not a problem before, is that the test
finishes very quickly, usually, before the timer could even be polled by
the reactor. The recent seastar changes changed this and now the timer
sometimes get polled and fires, failing the test.
Fixes: #19801Closesscylladb/scylladb#19859
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.
The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.
Fixesscylladb/scylladb#19735Closesscylladb/scylladb#19740
Introduce virtual tasks - task manager tasks which cover
cluster-wide operations.
Virtual tasks aren't kept in memory, instead their statuses
are retrieved from associated service when user requests
them with task manager API. From API users' perspective,
virtual tasks behave similarly to regular tasks, but they can
be queried from any node in a cluster.
Virtual tasks cannot have a parent task. They can have
children on each node in a cluster, but do not keep references
to them. So, if a direct child of a virtual task is unregistered
from task manager, it will no longer be shown in parent's
children vector.
virtual_task class corresponds to all virtual tasks in one
group. If users want to list all tasks in a module, a virtual_task
returns all recent supported operations; if they request virtual
task's status - info about the one specified operation is
presented. Time to live, number of tracked operations etc.
depend on the implementation of individual virtual_task.
All virtual_tasks are kept only on shard 0.
Refs: https://github.com/scylladb/scylladb/issues/15852
New feature, no backport needed.
Closesscylladb/scylladb#16374
* github.com:scylladb/scylladb:
docs: describe virtual tasks
db: node_ops: filter topology request entries
test: add a topology suite for testing tasks
node_ops: service: create streaming tasks
node_ops: register node_ops_virtual_task in task manager
service: node_ops: keep node ops module in storage service
node_ops: implement node_ops_virtual_task methods
db: service: modify methods to get topology_requests data
db: service: add request type column to topology_requests
node_ops: add task manager module and node_ops_virtual_task
tasks: api: add virtual task support to get_task_status_recursively
tasks: api: add virtual task support
tasks: api: add virtual tasks support to get_tasks
tasks: add task_handler to hide task and virtual_task differences from user
tasks: modify invoke_on_task
tasks: implement task_manager::virtual_task::impl::get_children
tasks: keep virtual tasks in task manager
tasks: introduce task_manager::virtual_task
We modify `ScyllaCluster.server_start` so that it changes seeds of the
starting node to all currently running nodes. This allows writing tests like
```python
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
However, it disallows writing tests that start multiple clusters. To fix this,
we add the `seeds` parameter to `server_start`.
We also improve the logic in `ScyllaCluster.add_server` to allow writing
tests like
```python
await manager.server_add(expected_error="...")
await manager.server_add()
```
This PR only adds improvements to the `test.py` framework, no need
to backport it.
Closesscylladb/scylladb#19847
* github.com:scylladb/scylladb:
test: scylla_cluster: improve expected_error in add_server
test: scylla_cluster: support more test scenarios
test: scylla_cluster: correctly change seeds in server_start
We make two changes:
- we lease the IP address of a node that failed to boot because of
an expected error,
- we don't log "Cluster ... added ..." when a node fails to boot
because of an expected error.
Here are some examples of tests that don't work with no initial
nodes, but they should work:
1.
```
await manager.server_add(expected_error="...")
await manager.server_add()
```
2.
```
await manager.servers_add(2, expected_error="...")
await manager.servers_add(2)
```
3.
```
s1 = await manager.server_add(start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_add()
```
4.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_start(s2.server_id, expected_error="...")
await manager.servers_add(2)
```
5.
```
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
6.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.servers_add(2)
await manager.server_start(s1.server_id)
await manager.server_start(s2.server_id)
```
In this patch, we make a few improvements to make tests like the ones
presented above work. I tested all the examples above manually.
From now on, servers receive correct seeds if the first servers added
in the test didn't start or failed to boot.
Also, we remove the assertion preventing the creation of a second
cluster. This assertion failed the tests presented above. We could
weaken it to make these tests pass, but it would require some work.
Moreover, we have tests that intentionally create two clusters.
Therefore, we go for the easiest solution and accept that a single
`ScyllaCluster` may not correspond to a single Scylla cluster.
We change seeds in `ScyllaCluster.server_start` to all currently
running nodes. The previous code only pretended that it did it.
After doing this change, writing tests that create multiple clusters
is impossible. To allow it, we add the `seeds` parameter to
`ManagerClient.server_start`. We use it to fix and simplify the only
test that creates two clusters - `test_different_group0_ids`.
system_keyspace::get_topology_request_entries returns entries for
requests which are running or have finished after specified time.
In task manager node ops task set the time so that they are shown
for task_ttl seconds after they have finished.
Add topology_tasks test suite for testing task manager's node ops
tasks. Add TaskManagerClient to topology_tasks for an easy usage
of task manager rest api.
Write a test for bootstrap, replace, rebuild, decommission and remove
top level tasks using the above.
Keep task manager node ops module in storage service. It will be
used to create and manage tasks related to topology changes.
The module is created and registered in storage service constructor.
In storage_service::stop() the module is stopped and so all the remaining
tasks would be unregistered immediately after they are finished.
Virtual tasks are supported by get_task_status, abort_task and
wait_task.
Task status returned by get_task_status and wait_task:
- contains task_kind to indicate whether it's virtual (cluster) or
regular (node) task;
- children list apart from task_id contains node address of the task.
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.
The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.
The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().
But the hard part of this test is the testing:
1. An existing multi-node test for "/localnodes" assummed that right after
a new node was created, it appears on "/localnodes". But after this
patch, it may take a bit more time for the bootstrapping to complete
and the new node to appear in /localnodes - so I had to add a retry loop.
2. I added a test that reproduces the bug fixed here, and verifies its
fix. The test is in the multi-node topology framework. It adds an
injection which delays the bootstrap, which leaves a new node in JOINING
state for a long time. The test then verifies that the new node is
alive (as checked by the REST API), but is not returned by "/localnodes".
3. The new injection for delaying the bootstrap is unfortunately not
very pretty - I had to do it in three places because we have several
code paths of how bootstrap works without repair, with repair, without
Raft and with Raft - and I wanted to delay all of them.
Fixes#19694.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19725
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes#16287
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.
Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.
Fixes: #19351.
Closesscylladb/scylladb#19834
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.
Not a regression, so a backport isn't strictly necessary.
Closesscylladb/scylladb#19792
* github.com:scylladb/scylladb:
config: specialize from-string conversion for bool
config: wrap boost::lexical_cast<> when converting from strings
If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.
Related to #15820
The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc.
Closesscylladb/scylladb#15971
* github.com:scylladb/scylladb:
commitlog: Make max data lifetime runtime-configurable
db::config: Expose commitlog_max_data_lifetime_in_s parameter
commitlog: Add optional max lifetime parameter to cl instance
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722Closesscylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
`filesystem_storage` methods frequently call `sync_directory()`, for the sake of flushing (sync'ing) a directory. `sync_directory()` always brackets the sync with open and close, and given that most `sync_directory()` calls target the sstable base directory, those repeated opens and closes are considered wasteful. Rework the `filesystem_storage::_dir` member (from a mere pathname) so that it stand for an `opened_directory` object, which keeps the sstable base directory open, for the purpose of repeated sync'ing.
Resolves#2399.
Closesscylladb/scylladb#19624
* github.com:scylladb/scylladb:
sstables/storage: synch "dst_dir" more leanly in create_links_common()
sstables/storage: close previous directory asynchronously upon dir change
sstables/storage: futurize change_dir_for_test()
sstables/storage: sync through "opened_directory" in filesystem...::move()
sstables/storage: sync through "opened_directory" in the "easy" cases
sstables/storage: introduce "opened_directory" class
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.
Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.
This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.
Fixesscylladb/scylladb#19772
Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.
Closesscylladb/scylladb#19787
Users outside of the token module don't
need to mess with the token::kind.
They can only create key tokens.
Never, minimum or maximum tokens, with a particular
datya value.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sizeof(dht::token) is only 16 bytes and therefore
it can be passed with 2 registers.
There is no sense in defining minimum_token
and maximum_token out of line, returning a token&
to statically allocated values that require memory
access/copy, while the only call sites that needs
to point to the static min/max tokens are in
dht::ring_position_view.
Instead, they can be defined inline as constexpr
functions and return their const values.
Respectively, define token ctors and methods
as constexpr where applicable (and noexcept while at it
where applicable)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the rolling restart to avoid spurious driver reconnects.
This can be eventually reverted once the scylladb/python-driver#295 is fixed.
Fixesscylladb/scylladb#19154Closesscylladb/scylladb#19771
* github.com:scylladb/scylladb:
test: raft: fix the flaky `test_raft_recovery_stuck`
test: raft: code cleanup in `test_raft_recovery_stuck`
In v4 of scylladb/scylladb#19598 the last commit of the patch was replaced but this change missed merge so submitting it in a separate patch.
In the current patch, the original functions class correctly marks methods as const where appropriate, and the instance() method now returns a const object. This ensures protection against accidental modifications, as all changes must go through the change_batch object.
Since the functions_changer class was intended to serve the same purpose, it is now redundant. Therefore, we are reverting the commit that introduced it.
Relates scylladb/scylladb#19153Closesscylladb/scylladb#19647
* github.com:scylladb/scylladb:
cql3: functions: replace template with std::function in with_udf_iter()
cql3: functions: improve functions class constness handling
Revert "cql3: functions: make modification functions accessible only via batch class"
Replaced the old `read_barrier` helper from "test/pylib/util.py"
by the new helper from "test/pylib/rest_client.py" that is calling
the newly introduced direct REST API.
Replaced in all relevant tests and decommissioned the old helper.
Introduced a new helper `get_host_api_address` to retrieve the host API
address - which in come cases can be different from the host address
(e.g. if the RPC address is changed).
Fixes: scylladb/scylladb#19662Closesscylladb/scylladb#19739
Currently change_dir_for_test() is synchronous. Make it return a future,
so that we can use async operations in change_dir_for_test() overrides.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
Refs scylladb/scylladb#17636Fixesscylladb/scylladb#17660Closesscylladb/scylladb#19741
* github.com:scylladb/scylladb:
db/hints: Verify that Scylla limits the concurrency of written hints
db/hints: Coroutinize `hint_endpoint_manager::store_hint()`
db/hints: Move a constant value to the TU it's used in
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes#19722
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.
Closesscylladb/scylladb#19721
* github.com:scylladb/scylladb:
sstables: use _origin in write path
sstable::open_sstable: pass and store origin
This PR adds support for aborting index reads from within `index_consume_entry_context::consume_input` when the server is being stopped. The abort source is now propagated down to the `index_consume_entry_context`, making it available for `consume_input` to check if an abort has been requested. If an abort is detected, `consume_input` will throw an exception to stop the index read operation.
Closesscylladb/scylladb#19453
* github.com:scylladb/scylladb:
test/boost: test abort behaviour during index read
sstables/index_reader: stop consuming index when abort has been requested
sstables::index_consume_entry_context: store abort_source
sstable: drop old filter only after the new filter is built during rebuild
sstables/sstables_manager: store abort_source in sstable_manager
replica/database: pass abort_source to database constructor
The yaml/json representation for bool is true/false, but boost::lexical_cast
is 1/0. Specialize bool conversion to accept true/false (for yaml/json
compatibilty) and 1/0 (for backward compatibility). This provides
round-trip conversion for bool configs in system.config.
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.
Should be backported to all supported versions.
Closesscylladb/scylladb#19699
* github.com:scylladb/scylladb:
test: add test to check that coordinator lwt semaphore continues functioning after locking failures
paxos: do not signal semaphore if it was not acquired