rewrite the function as coroutine to make it easier to read and maintain, following lifetime issues we had and fixed in this function.
The second commit adds a test that drops a table while there is a counter update operation ongoing in the table.
The test reproduces issue https://github.com/scylladb/scylla-enterprise/issues/4475 and verifies it is fixed.
Follow-up to https://github.com/scylladb/scylladb/pull/19948
Doesn't require backport because the fix to the issue was already done and backported. This is just cleanup and a test.
Closesscylladb/scylladb#19982
* github.com:scylladb/scylladb:
db: test counter update while table is dropped
db: coroutinize do_apply_counter_update
Recently, some users have seen "Key size too large" errors in various
places. Cassandra and Scylla impose a 64KB length limit on keys, and
we have known about bugs in this area for a long time - and even had
some translated Cassandra unit tests that cover some of them. But these
tests did not cover all the corner cases and left us with partial and
fragmented knowledge of this problem, spread over many test files and
many issues.
In this patch, we add a single test file, test/cql-pytest/test_key_length.py
which attempts to rigourously explore the various bugs we have with
CQL key length limits. These test aim to reproduce all known bugs in
this area:
* Refs #3017 - CQL layer accepts set values too large to be written to
an sstable
* Refs #10366 - Enforce Key-length limits during SELECT
* Refs #12247 - Better error reporting for oversized keys during INSERT
* Refs #16772 - Key length should be limited to exactly 65535, not less
The following less interesting bug is already covered by many tests so
I decided not to test it again:
* Refs #7745 - Length of map keys and set items are incorrectly limited
to 64K in unprepared CQL
There's also a situation in materialized views and secondary indexes,
where a column that was _not_ a key, now becomes a key, and a length
limit needs to be enforced on it. We already have good test coverage
for this (in test/cql-pytest/test_secondary_index.py and in
test/cql-pytest/test_materialized_view.py), and we have an issue:
* Refs #8627 - Cleanly reject updates with indexed values where value > 64k
All 16 tests added here pass on Cassandra 5 except one that fails on
https://issues.apache.org/jira/browse/CASSANDRA-19270, but 11 of the
tests currently fail on Scylla (6 on #12247, 2 on #10366, 3 on #16772).
It is possible that our decision in #16772 will not be to fix Scylla
to match Cassandra but rather to declare that strict compatibility isn't
needed in this case or even that Cassandra is wrong. But even then,
having these tests which demonstrate the behavior of both Cassandra
and Scylla will be important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16779
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.
Fixesscylladb/scylladb#8199Closesscylladb/scylladb#19338
* github.com:scylladb/scylladb:
mv: skip reading rows when generating partition tombstone update
mv: delete a partition in a single operation when applicable
cql-pytest: move ScyllaMetrics to util file to allow reuse
Add a test that drops a table while there is a counter update operation
ongoing in the table.
The test reproduces issue scylladb/scylla-enterprise#4475 and verifies
it is fixed.
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.
The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.
Fixes https://github.com/scylladb/scylladb/issues/16824Closesscylladb/scylladb#19779
* github.com:scylladb/scylladb:
test: perf: tablet_load_balancing: Test with higher shard and tablet counts
tablets: load_balancer: Avoid quadratic complexity when finding best candidate
tablets: load_balancer: Maintain load sketch properly during intra-node migration
tablets: load_balancer: Use "drained" flag
test: perf: tablet_load_balancing: Report load balancer stats
tablets: load_balancer: Move load_balancer_stats_manager to header file
tablets: load_balancer: Split evaluate_candidate() into src and dst part
tablets: load_balancer: Optimize evaluate_candidate()
tablets: load_balancer: Add more statistics
tablets: load_balancer: Track load per table on cluster level
tablets: load_balancer: Track load per table on node level
tablets: load_balancer: Use a single load sketch for tracking all nodes
locator: load_sketch: Introduce populate_dc()
tablets: load_balancer: Modify target load sketch only when emitting migration
locator: load_sketch: Introduce get_most_loaded_shard()
locator: load_sketch: Introduce get_least_loaded_shard()
locator: load_sketch: Optimize pick()/unload()
locator: load_sketch: Introduce load_type
test: perf: tablet_load_balancing: Report total tablet counts
test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
test: perf: tablet_load_balancing: Report time it took to schedule migrations
tablets: load_balancer: Log table load stats after each migration
tablets: load_balancer: Log per-shard load distribution in debug level
tablets: load_balancer: Improve per-table balance
tablets: load_balancer: Extract check_convergence()
tablets: load_balancer: Extract nodes_by_load_cmp
tablets: load_balancer: Maintain tablet count per table
tablets: load_balancer: Reuse src_node_info
test: perf: tablet_load_balancing: Print warnings about bad overcommit
test: perf: tablet_load_balancing: Allow running a single simulation
test: perf: tablet_load_balancing: Report best possible shard overcommit
test: perf: tablet_load_balancing: Report global shard overcommit
It's only needed to start hints via proxy, but proxy can do it without gossiper argument
Closesscylladb/scylladb#19894
* github.com:scylladb/scylladb:
storage_service: Remote gossiper argument from join_cluster()
proxy: Use remote gossiper to start hints resource manager
hints: Const-ify gossiper references and anchor pointers
A user complained that ScyllaDB is incompatible with Cassandra when it
requires ALLOW FILTERING on a restriction like WHERE x=1 AND y=1 where
x and y are two columns with secondary indexes.
In the tests added in this patch we show that:
1. Scylla *is* compatible with Cassandra when the traditional "CREATE
INDEX" is used - ALLOW FILTERING *is* required in this case in both
Cassandra and Scylla.
2. If SAI is used in Cassandra (CREATE CUSTOM INDEX USING 'SAI'),
indeed ALLOW FILTERING becomes optional. I believe this is incorrect
so I opened CASSANDRA-19795.
These two tests combined show that we're not incompatible with Cassandra,
rather Cassandra's two index implementations are incompatible between
themselves, and Scylla is in fact compatible in this case with Cassadra's
traditional index and not with SAI.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19909
Rather than maximum per-node shard overcommit. Global shard overcommit
is a better metric since we want to equalize global load not just
per-node load.
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.
We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".
This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.
Fixes#19911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19914
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.
But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!
So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).
The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19916
There's a test in object_store suite that verifies the contents of a bucket. It does with the plain http request, but unfortunately this doesn't work -- even local minio uses restricted bucket and using plain http request results in 403(Forbidden) error code. Test doesn't check it and continues working with empty list of objects which, in turn, is what it expects to see.
The fix is in using boto3. With it, the acc/secret pair is picked up and listing the bucket finally works.
Closesscylladb/scylladb#19889
* github.com:scylladb/scylladb:
test/object_store: Use boto3.resource to list bucket
test/object_store: Add get_s3_resource() helper
Instead of plain http request, use the power of boto3 package. The
recently added get_s3_resource() facilitates creating one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It creates boto3.resource object that points to endpoint maintained
by s3_server argument (that tests obtain via fixture). This allows
using boto3 to access S3 bucket from local minio server.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
single tablet replica
In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.
Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes. This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.
Refs scylladb/scylladb#18862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19816
This pointer was only needed to pull all the way down the hints resource
manager start() method. It's no longer needed for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.
Fixes#19879
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#19883
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.
As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.
Fixes: scylladb/scylladb#19439
This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.
Closesscylladb/scylladb#19765
* github.com:scylladb/scylladb:
test: regression test for MV crash with tablets during decommission
db/view: drop view updates to replaced node marked as left
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.
Fixesscylladb/scylladb#8199
ScyllaMetrics is a useful generic component for retrieving metrics in a
pytest.
The commit moves the implementation from test_shedding.py to util.py to
make it reusable in other tests in cql-pytest.
The admission test has a section which tests admission when the
semaphore has inactive reads. This section (and therefore the enire
test) became flaky lately, after a seemingly unrelated seastar upgrade,
which improved timers.
The cause of the flakyness is the permit which is made inactive later:
this permit is created with 0 timeout (times out immediately). For some
time now, when the timeout timer of a permit fires, if the permit is
inactive, it is evicted. This is what makes the test fail: the inactive
read times out and ends up evicting this permit, which is not expected
for the test. The reason this was not a problem before, is that the test
finishes very quickly, usually, before the timer could even be polled by
the reactor. The recent seastar changes changed this and now the timer
sometimes get polled and fires, failing the test.
Fixes: #19801Closesscylladb/scylladb#19859
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.
The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.
Fixesscylladb/scylladb#19735Closesscylladb/scylladb#19740
Introduce virtual tasks - task manager tasks which cover
cluster-wide operations.
Virtual tasks aren't kept in memory, instead their statuses
are retrieved from associated service when user requests
them with task manager API. From API users' perspective,
virtual tasks behave similarly to regular tasks, but they can
be queried from any node in a cluster.
Virtual tasks cannot have a parent task. They can have
children on each node in a cluster, but do not keep references
to them. So, if a direct child of a virtual task is unregistered
from task manager, it will no longer be shown in parent's
children vector.
virtual_task class corresponds to all virtual tasks in one
group. If users want to list all tasks in a module, a virtual_task
returns all recent supported operations; if they request virtual
task's status - info about the one specified operation is
presented. Time to live, number of tracked operations etc.
depend on the implementation of individual virtual_task.
All virtual_tasks are kept only on shard 0.
Refs: https://github.com/scylladb/scylladb/issues/15852
New feature, no backport needed.
Closesscylladb/scylladb#16374
* github.com:scylladb/scylladb:
docs: describe virtual tasks
db: node_ops: filter topology request entries
test: add a topology suite for testing tasks
node_ops: service: create streaming tasks
node_ops: register node_ops_virtual_task in task manager
service: node_ops: keep node ops module in storage service
node_ops: implement node_ops_virtual_task methods
db: service: modify methods to get topology_requests data
db: service: add request type column to topology_requests
node_ops: add task manager module and node_ops_virtual_task
tasks: api: add virtual task support to get_task_status_recursively
tasks: api: add virtual task support
tasks: api: add virtual tasks support to get_tasks
tasks: add task_handler to hide task and virtual_task differences from user
tasks: modify invoke_on_task
tasks: implement task_manager::virtual_task::impl::get_children
tasks: keep virtual tasks in task manager
tasks: introduce task_manager::virtual_task
We modify `ScyllaCluster.server_start` so that it changes seeds of the
starting node to all currently running nodes. This allows writing tests like
```python
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
However, it disallows writing tests that start multiple clusters. To fix this,
we add the `seeds` parameter to `server_start`.
We also improve the logic in `ScyllaCluster.add_server` to allow writing
tests like
```python
await manager.server_add(expected_error="...")
await manager.server_add()
```
This PR only adds improvements to the `test.py` framework, no need
to backport it.
Closesscylladb/scylladb#19847
* github.com:scylladb/scylladb:
test: scylla_cluster: improve expected_error in add_server
test: scylla_cluster: support more test scenarios
test: scylla_cluster: correctly change seeds in server_start
We make two changes:
- we lease the IP address of a node that failed to boot because of
an expected error,
- we don't log "Cluster ... added ..." when a node fails to boot
because of an expected error.
Here are some examples of tests that don't work with no initial
nodes, but they should work:
1.
```
await manager.server_add(expected_error="...")
await manager.server_add()
```
2.
```
await manager.servers_add(2, expected_error="...")
await manager.servers_add(2)
```
3.
```
s1 = await manager.server_add(start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_add()
```
4.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_start(s2.server_id, expected_error="...")
await manager.servers_add(2)
```
5.
```
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
6.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.servers_add(2)
await manager.server_start(s1.server_id)
await manager.server_start(s2.server_id)
```
In this patch, we make a few improvements to make tests like the ones
presented above work. I tested all the examples above manually.
From now on, servers receive correct seeds if the first servers added
in the test didn't start or failed to boot.
Also, we remove the assertion preventing the creation of a second
cluster. This assertion failed the tests presented above. We could
weaken it to make these tests pass, but it would require some work.
Moreover, we have tests that intentionally create two clusters.
Therefore, we go for the easiest solution and accept that a single
`ScyllaCluster` may not correspond to a single Scylla cluster.
We change seeds in `ScyllaCluster.server_start` to all currently
running nodes. The previous code only pretended that it did it.
After doing this change, writing tests that create multiple clusters
is impossible. To allow it, we add the `seeds` parameter to
`ManagerClient.server_start`. We use it to fix and simplify the only
test that creates two clusters - `test_different_group0_ids`.
system_keyspace::get_topology_request_entries returns entries for
requests which are running or have finished after specified time.
In task manager node ops task set the time so that they are shown
for task_ttl seconds after they have finished.
Add topology_tasks test suite for testing task manager's node ops
tasks. Add TaskManagerClient to topology_tasks for an easy usage
of task manager rest api.
Write a test for bootstrap, replace, rebuild, decommission and remove
top level tasks using the above.
Keep task manager node ops module in storage service. It will be
used to create and manage tasks related to topology changes.
The module is created and registered in storage service constructor.
In storage_service::stop() the module is stopped and so all the remaining
tasks would be unregistered immediately after they are finished.
Virtual tasks are supported by get_task_status, abort_task and
wait_task.
Task status returned by get_task_status and wait_task:
- contains task_kind to indicate whether it's virtual (cluster) or
regular (node) task;
- children list apart from task_id contains node address of the task.
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.
The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.
The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().
But the hard part of this test is the testing:
1. An existing multi-node test for "/localnodes" assummed that right after
a new node was created, it appears on "/localnodes". But after this
patch, it may take a bit more time for the bootstrapping to complete
and the new node to appear in /localnodes - so I had to add a retry loop.
2. I added a test that reproduces the bug fixed here, and verifies its
fix. The test is in the multi-node topology framework. It adds an
injection which delays the bootstrap, which leaves a new node in JOINING
state for a long time. The test then verifies that the new node is
alive (as checked by the REST API), but is not returned by "/localnodes".
3. The new injection for delaying the bootstrap is unfortunately not
very pretty - I had to do it in three places because we have several
code paths of how bootstrap works without repair, with repair, without
Raft and with Raft - and I wanted to delay all of them.
Fixes#19694.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19725
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes#16287
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.
Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.
Fixes: #19351.
Closesscylladb/scylladb#19834
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.
Not a regression, so a backport isn't strictly necessary.
Closesscylladb/scylladb#19792
* github.com:scylladb/scylladb:
config: specialize from-string conversion for bool
config: wrap boost::lexical_cast<> when converting from strings
If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.
Related to #15820
The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc.
Closesscylladb/scylladb#15971
* github.com:scylladb/scylladb:
commitlog: Make max data lifetime runtime-configurable
db::config: Expose commitlog_max_data_lifetime_in_s parameter
commitlog: Add optional max lifetime parameter to cl instance
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722Closesscylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
`filesystem_storage` methods frequently call `sync_directory()`, for the sake of flushing (sync'ing) a directory. `sync_directory()` always brackets the sync with open and close, and given that most `sync_directory()` calls target the sstable base directory, those repeated opens and closes are considered wasteful. Rework the `filesystem_storage::_dir` member (from a mere pathname) so that it stand for an `opened_directory` object, which keeps the sstable base directory open, for the purpose of repeated sync'ing.
Resolves#2399.
Closesscylladb/scylladb#19624
* github.com:scylladb/scylladb:
sstables/storage: synch "dst_dir" more leanly in create_links_common()
sstables/storage: close previous directory asynchronously upon dir change
sstables/storage: futurize change_dir_for_test()
sstables/storage: sync through "opened_directory" in filesystem...::move()
sstables/storage: sync through "opened_directory" in the "easy" cases
sstables/storage: introduce "opened_directory" class
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.
Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.
This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.
Fixesscylladb/scylladb#19772
Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.
Closesscylladb/scylladb#19787