Soon, we want to log a warning on too many cell tombstones as well.
Extract the logging code to allow reuse between the row and cell
tombstone warnings.
Rewrite the function as a coroutine to make it easier to read and maintain, following lifetime issues we had and fixed in this function.
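A minimal sketch of what coroutinization buys here, assuming a generic Seastar-style function (the names read_value/write_value and the body are hypothetical stand-ins, not the actual do_apply_counter_update code): locals live in the coroutine frame across co_await points, avoiding the lambda-capture lifetime pitfalls of continuation-chained code.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>

// Hypothetical async steps standing in for the real counter-update stages.
seastar::future<int> read_value() { return seastar::make_ready_future<int>(41); }
seastar::future<> write_value(int) { return seastar::make_ready_future<>(); }

seastar::future<> apply_update_sketch() {
    int v = co_await read_value();   // 'v' lives in the coroutine frame...
    co_await write_value(v + 1);     // ...and stays valid across this suspension point
}
```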
The second commit adds a test that drops a table while there is a counter update operation ongoing in the table.
The test reproduces issue https://github.com/scylladb/scylla-enterprise/issues/4475 and verifies it is fixed.
Follow-up to https://github.com/scylladb/scylladb/pull/19948
Doesn't require backport because the fix to the issue was already done and backported. This is just cleanup and a test.
Closes scylladb/scylladb#19982
* github.com:scylladb/scylladb:
db: test counter update while table is dropped
db: coroutinize do_apply_counter_update
Recently, some users have seen "Key size too large" errors in various
places. Cassandra and Scylla impose a 64KB length limit on keys, and
we have known about bugs in this area for a long time - and even had
some translated Cassandra unit tests that cover some of them. But these
tests did not cover all the corner cases and left us with partial and
fragmented knowledge of this problem, spread over many test files and
many issues.
In this patch, we add a single test file, test/cql-pytest/test_key_length.py
which attempts to rigorously explore the various bugs we have with
CQL key length limits. These tests aim to reproduce all known bugs in
this area:
* Refs #3017 - CQL layer accepts set values too large to be written to
an sstable
* Refs #10366 - Enforce Key-length limits during SELECT
* Refs #12247 - Better error reporting for oversized keys during INSERT
* Refs #16772 - Key length should be limited to exactly 65535, not less
The following less interesting bug is already covered by many tests so
I decided not to test it again:
* Refs #7745 - Length of map keys and set items are incorrectly limited
to 64K in unprepared CQL
There's also a situation in materialized views and secondary indexes,
where a column that was _not_ a key, now becomes a key, and a length
limit needs to be enforced on it. We already have good test coverage
for this (in test/cql-pytest/test_secondary_index.py and in
test/cql-pytest/test_materialized_view.py), and we have an issue:
* Refs #8627 - Cleanly reject updates with indexed values where value > 64k
All 16 tests added here pass on Cassandra 5 except one that fails on
https://issues.apache.org/jira/browse/CASSANDRA-19270, but 11 of the
tests currently fail on Scylla (6 on #12247, 2 on #10366, 3 on #16772).
It is possible that our decision in #16772 will not be to fix Scylla
to match Cassandra but rather to declare that strict compatibility isn't
needed in this case or even that Cassandra is wrong. But even then,
having these tests which demonstrate the behavior of both Cassandra
and Scylla will be important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#16779
When debugging coredumps, some (small but useful) information is hidden
in the notes of the core ELF file. Add some words about the fact that it exists,
what it includes, and the thing that is always forgotten -- the way to get one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#19962
Commit ad0e6b79 (replica: Remove all_datadir from keyspace config) removed all_datadirs from keyspace config; now it's datadir's turn. After this change, keyspace no longer references any on-disk directories, only the sstables storage driver attached to the keyspace's tables does.
refs #12707
Closes scylladb/scylladb#19866
* github.com:scylladb/scylladb:
replica: Remove keyspace::config::datadir
sstables/storage: Evaluate path for keyspace directory in storage
sstables/storage: Add sstables_manager arg to init_keyspace_storage()
This commit adds OS support for version 6.1 and removes OS support for 5.4
(according to our support policy for versions).
Closes scylladb/scylladb#19992
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
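A minimal sketch of the idea (assumed, not the actual contents of utils/assert.hh): an assert-style macro whose check is compiled in unconditionally, independent of NDEBUG.

```cpp
#include <cstdio>
#include <cstdlib>

// Always-on assertion: evaluated in all build modes, unlike assert() under NDEBUG.
#define SCYLLA_ASSERT(cond)                                              \
    do {                                                                 \
        if (!(cond)) {                                                   \
            std::fprintf(stderr, "Assertion failed: %s (%s:%d)\n",       \
                         #cond, __FILE__, __LINE__);                     \
            std::abort();                                                \
        }                                                                \
    } while (0)
```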
[1] 66ef711d68
Closes scylladb/scylladb#20006
The for_tests constructor has a metrics parameter defaulted to
register_metrics::no, but when delegating to the other constructor, a
hard-coded register_metrics::no is passed. This makes no difference
currently, because all callers use the default and the hard-coded value
corresponds to it. Let's fix it nevertheless to avoid any future
surprises.
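A simplified illustration of the bug class, with hypothetical names (the real code is the for_tests constructor mentioned above):

```cpp
struct config {};
struct for_tests_tag {};

struct service {
    enum class register_metrics { no, yes };

    service(config, register_metrics) {}   // primary constructor

    // Before: the defaulted parameter was ignored because a literal was forwarded:
    //   service(for_tests_tag, register_metrics m = register_metrics::no)
    //       : service(config{}, register_metrics::no) {}
    // After: forward the parameter, so a future caller passing ::yes is honored.
    service(for_tests_tag, register_metrics m = register_metrics::no)
        : service(config{}, m) {}
};
```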
Closes scylladb/scylladb#20007
repair_service::insert_repair_meta gets a reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.
Use find_column_family at each table occurrence in insert_repair_meta
instead.
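A self-contained sketch of the lookup-per-use pattern, using hypothetical stand-in types rather than ScyllaDB's replica::database / find_column_family API:

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct table { void apply() {} };

struct database {
    std::map<std::string, std::shared_ptr<table>> tables;

    // Throws if the table was dropped in the meantime, like find_column_family does.
    table& find_table(const std::string& name) {
        auto it = tables.find(name);
        if (it == tables.end()) { throw std::runtime_error("table dropped: " + name); }
        return *it->second;
    }
};

void long_running_operation(database& db, const std::string& name) {
    db.find_table(name).apply();   // use #1: resolve just before use
    // ... a yield/await would happen here; the table may be dropped concurrently ...
    db.find_table(name).apply();   // use #2: resolve again instead of keeping a stale reference
}
```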
Closes scylladb/scylladb#19953
In fabab2f4, we introduced preemption_source, and added the
`SCYLLA_ENABLE_PREEMPTION_SOURCE` preprocessor macro to opt in to
the pluggable preemption check.
But the CMake build system was not updated accordingly.
So, in this change, let's sync the CMake build system with
`configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#19951
This reverts commit c3bea539b6.
Since it breaks the offline-installer artifact-tests. Also, it seems that we should not have merged it in the first place, since we don't need scylla-housekeeping checks for the offline-installer
Closes scylladb/scylladb#19976
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as in the base, possibly in a
different order, this can be done more efficiently: the whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone, then for the views which have the same partition key
we generate a partition tombstone update and skip the individual
row tombstone updates.
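A heavily simplified sketch of the decision (all types and helpers below are hypothetical stand-ins; the real code operates on ScyllaDB's mutation and view-update machinery):

```cpp
#include <optional>
#include <vector>

struct tombstone { long timestamp = 0; };

struct base_update {
    std::optional<tombstone> partition_tombstone;  // set when the whole base partition is deleted
    std::vector<int> rows;                         // stand-in for the affected base rows
};

struct view_update {
    std::optional<tombstone> partition_tombstone;
    std::vector<int> row_tombstones;
};

// pk_matches: the view partition key consists of exactly the base partition key
// columns (possibly reordered), so base partition == view partition.
view_update make_view_update(const base_update& u, bool pk_matches) {
    view_update out;
    if (u.partition_tombstone && pk_matches) {
        out.partition_tombstone = u.partition_tombstone;  // one tombstone covers the whole view partition
        return out;                                       // skip per-row tombstones (and reading the rows)
    }
    for (int r : u.rows) {
        out.row_tombstones.push_back(r);  // previous behavior: one row tombstone per base row
    }
    return out;
}
```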
Fixes scylladb/scylladb#8199
Closes scylladb/scylladb#19338
* github.com:scylladb/scylladb:
mv: skip reading rows when generating partition tombstone update
mv: delete a partition in a single operation when applicable
cql-pytest: move ScyllaMetrics to util file to allow reuse
Add a test that drops a table while there is a counter update operation
ongoing in the table.
The test reproduces issue scylladb/scylla-enterprise#4475 and verifies
it is fixed.
The tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not rely on randomness
but rather to evaluate candidates with respect to their impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destination candidate. By limiting ourselves to a
single shard, we increase the chance that we'll overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequence of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on the test configuration, i.e., on topology and tablet distribution.
The algorithm is not perfect; it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reasons why the current algorithm doesn't achieve the best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may be only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works well if we started in good shape.
Fixes https://github.com/scylladb/scylladb/issues/16824
Closes scylladb/scylladb#19779
* github.com:scylladb/scylladb:
test: perf: tablet_load_balancing: Test with higher shard and tablet counts
tablets: load_balancer: Avoid quadratic complexity when finding best candidate
tablets: load_balancer: Maintain load sketch properly during intra-node migration
tablets: load_balancer: Use "drained" flag
test: perf: tablet_load_balancing: Report load balancer stats
tablets: load_balancer: Move load_balancer_stats_manager to header file
tablets: load_balancer: Split evaluate_candidate() into src and dst part
tablets: load_balancer: Optimize evaluate_candidate()
tablets: load_balancer: Add more statistics
tablets: load_balancer: Track load per table on cluster level
tablets: load_balancer: Track load per table on node level
tablets: load_balancer: Use a single load sketch for tracking all nodes
locator: load_sketch: Introduce populate_dc()
tablets: load_balancer: Modify target load sketch only when emitting migration
locator: load_sketch: Introduce get_most_loaded_shard()
locator: load_sketch: Introduce get_least_loaded_shard()
locator: load_sketch: Optimize pick()/unload()
locator: load_sketch: Introduce load_type
test: perf: tablet_load_balancing: Report total tablet counts
test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
test: perf: tablet_load_balancing: Report time it took to schedule migrations
tablets: load_balancer: Log table load stats after each migration
tablets: load_balancer: Log per-shard load distribution in debug level
tablets: load_balancer: Improve per-table balance
tablets: load_balancer: Extract check_convergence()
tablets: load_balancer: Extract nodes_by_load_cmp
tablets: load_balancer: Maintain tablet count per table
tablets: load_balancer: Reuse src_node_info
test: perf: tablet_load_balancing: Print warnings about bad overcommit
test: perf: tablet_load_balancing: Allow running a single simulation
test: perf: tablet_load_balancing: Report best possible shard overcommit
test: perf: tablet_load_balancing: Report global shard overcommit
The gossiper is only needed to start hints via the proxy, but the proxy can do that without a gossiper argument
Closes scylladb/scylladb#19894
* github.com:scylladb/scylladb:
storage_service: Remove gossiper argument from join_cluster()
proxy: Use remote gossiper to start hints resource manager
hints: Const-ify gossiper references and anchor pointers
When a table is dropped, we should wait for all pending operations on the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.
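A sketch of the lifetime pattern (the guard type below is hypothetical; this is not the actual counter-write code): seastar::do_with keeps its arguments alive until the future returned by the lambda resolves, so the in-progress-write object is guaranteed to outlive the whole asynchronous operation, and the table's drop path can wait for it.

```cpp
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>

// Hypothetical guard the table's drop path waits on.
struct write_in_progress {
    write_in_progress() { /* register the write with the table */ }
    ~write_in_progress() { /* unregister; drop can complete only after this runs */ }
};

seastar::future<> apply_counter_update_sketch() {
    return seastar::do_with(write_in_progress{}, [] (write_in_progress&) {
        // The guard lives until this future resolves, covering the entire update.
        return seastar::make_ready_future<>();
    });
}
```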
Fixes scylladb/scylla-enterprise#4475
Closes scylladb/scylladb#19948
A user complained that ScyllaDB is incompatible with Cassandra when it
requires ALLOW FILTERING on a restriction like WHERE x=1 AND y=1 where
x and y are two columns with secondary indexes.
In the tests added in this patch we show that:
1. Scylla *is* compatible with Cassandra when the traditional "CREATE
INDEX" is used - ALLOW FILTERING *is* required in this case in both
Cassandra and Scylla.
2. If SAI is used in Cassandra (CREATE CUSTOM INDEX USING 'SAI'),
indeed ALLOW FILTERING becomes optional. I believe this is incorrect
so I opened CASSANDRA-19795.
These two tests combined show that we're not incompatible with Cassandra,
rather Cassandra's two index implementations are incompatible between
themselves, and Scylla is in fact compatible in this case with Cassandra's
traditional index and not with SAI.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#19909
If the source and destination shards picked for migration based on
global tablet balance do not have a good candidate in terms of effect
on per-table balance, the algorithm explores other source shards and
destinations. This has quadratic complexity in terms of shard count in
the worst case, when there are no good candidates.
Since we can have up to ~200 shards, this can slow down scheduling
significantly. I saw a total scheduling time of 5 min in the following run:
scylla perf-load-balancing -c1 -m1G --iterations=8 \
--nodes=4 --tablets1=1024 --tablets2=8096 \
--rf1=2 --rf2=3 --shards=256
To improve, change the approach to first find the best source shard and
then the best target shard, sequentially. So the search is now linear in
terms of shard count.
After the change, the total scheduling time in that run is down to 4s.
Minimizing the source and destination metrics piece-wise also minimizes
the combined metric, so the badness of the best candidate doesn't suffer
after this change.
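A toy sketch of the piece-wise selection (the types and the "badness" metric are hypothetical): instead of scoring every (source, destination) pair, which is quadratic in the shard count, pick the best source by its own metric and then the best destination, which is linear.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct shard_candidate { unsigned shard; double badness; };  // lower badness is better

unsigned pick_best(const std::vector<shard_candidate>& shards) {
    return std::min_element(shards.begin(), shards.end(),
        [] (const shard_candidate& a, const shard_candidate& b) { return a.badness < b.badness; })->shard;
}

// Source and destination are chosen independently; under the assumption that the
// combined metric is monotone in each part, this also minimizes the combination.
std::pair<unsigned, unsigned> pick_migration(const std::vector<shard_candidate>& sources,
                                             const std::vector<shard_candidate>& destinations) {
    return { pick_best(sources), pick_best(destinations) };
}
```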
This affects only intra-node migration. The code was recording the
destination shard as taken and did not un-take it when we skipped the
migration due to lack of candidates.
Noticed during code review. Impact is minor, since even if this leads
to suboptimal balance, the next scheduling round should fix it.
Also, the source shard was not unloaded, but that should have no
impact on decisions. But to be future-proof, better to maintain the
load accurately in case the algorithm is extended with more steps.
Some of the calls inside the `raft_group0_client::start_operation()` method were missing the abort source parameter. This caused repair tests to get stuck in the shutdown phase: the abort source had been triggered, but the operations were not checking it.
This was in particular the case for operations that try to take ownership of the raft group semaphore (`get_units(semaphore)`); these waits should be cancelled when the abort source is triggered, as sketched after the test list below.
This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
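A sketch of the idea (not the actual group0 client code; it assumes Seastar's get_units() overload that takes an abort_source): passing the abort source makes a fiber waiting on the semaphore fail promptly when shutdown fires, instead of hanging.

```cpp
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

seastar::future<> start_operation_sketch(seastar::semaphore& group0_sem,
                                         seastar::abort_source& as) {
    // Before: co_await seastar::get_units(group0_sem, 1);        // not abortable
    auto units = co_await seastar::get_units(group0_sem, 1, as);  // fails with an abort exception on shutdown
    // ... perform the group0 operation while holding the units ...
    co_return;
}
```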
Fixes scylladb/scylladb#19223
Closes scylladb/scylladb#19860
* github.com:scylladb/scylladb:
raft: fix the shutdown phase being stuck
raft: use the abort source reference in raft group0 client interface
They are executed frequently during tablet scheduling. Currently, they
have time complexity of O(N*log(N)) in terms of shard count. With
large shard counts, that has significant overhead.
This patch optimizes them down to O(log(N)).
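One way to get there, sketched with hypothetical types (the actual load_sketch code may differ): keep a node's shards in a container ordered by load, so the least/most loaded shard is read off the ends and a load update costs O(log(N)), instead of sorting all shards on every query.

```cpp
#include <cstddef>
#include <set>

struct shard_load {
    unsigned shard;
    std::size_t tablets;
    bool operator<(const shard_load& o) const {
        return tablets != o.tablets ? tablets < o.tablets : shard < o.shard;
    }
};

struct node_load_index {
    std::set<shard_load> by_load;  // ordered by (tablets, shard)

    unsigned least_loaded() const { return by_load.begin()->shard; }
    unsigned most_loaded() const { return by_load.rbegin()->shard; }

    // Adjust one shard's load in O(log(N)).
    void update(shard_load old_entry, std::size_t new_tablets) {
        by_load.erase(old_entry);
        by_load.insert({old_entry.shard, new_tablets});
    }
};
```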