Even though we switched to an Ubuntu-based container image, housekeeping is
still using the yum repository.
It should be switched to the apt repository.
Fixes #9144
Closes #9147
* seastar ce3cc2687f...07758294ef (12):
> perftune.py: change hwloc-calc parameters order
Fixes perftune on Fedora 34 based hwloc
> resource: pass configuration to nr_processing_units()
> semaphore: semaphore_timed_out: derive from timed_out_error
> Merge "resource: use hwloc_topology_holder" from Benny
> Merge "file: ioctl, fcntl and lifetime_hint interfaces in seastar::file" from Arun George
> pipe: mark pipe_reader and pipe_writer ctors as noexcept
> test: pipe: add simple unit test
> test: source_location_test: relax function name check for gcc 11
> http: add 429 too_many_requests status code
> Added [[nodiscard]] to abort-source's subscribe
> io_queue: Use on_internal_error in io_queue
> reactor: Remove unused epoll poller from reactor
read_schema_partition_for_keyspace() copies some parameters to capture them
in a coroutine, but the same can be achieved more cleanly by changing the
reference parameters to value parameters, so do that.
Test: unit (dev)
Closes #9154
This trial patch set moves compaction_strategy.hh and compaction_garbage_collector.hh to the compaction directory and drops the unused compact_for_mutation_query_state and compact_for_data_query_state.
Closes #9156
* github.com:scylladb/scylla:
compaction: Move compaction_garbage_collector.hh to compaction dir
compaction: Move compaction_strategy.hh to compaction dir
mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state
The cluster would forget its configuration when taking a snapshot,
making it unable to reelect a leader.
We fix the problem and introduce a regression test.
The last commit introduces some additional assertions for safety.
* kbr/snapshot-preserve-config-v4:
raft: sanity checking of apply index
test: raft: regression test for storing cluster configuration when taking snapshots
raft: store cluster configuration when taking snapshots
Before the fix introduced in the previous patch, the cluster would
forget its configuration when taking a snapshot, making it unable to
reelect a leader. This regression test catches that.
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.
This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.
In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
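The lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual `fsm` code: the `Log` and `LogEntry` types, their field names, and the choice to treat a configuration entry located exactly at the requested index as already in effect are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    index: int
    # Set only for configuration entries; None for ordinary commands.
    config: Optional[frozenset] = None


@dataclass
class Log:
    snapshot_index: int          # index covered by the last snapshot
    snapshot_config: frozenset   # configuration stored in that snapshot
    entries: list                # LogEntry objects with index > snapshot_index

    def last_conf_before(self, index: int) -> frozenset:
        """Return the configuration in effect at `index` (hypothetical helper)."""
        if index < self.snapshot_index:
            # The entries needed to answer the question were truncated.
            raise ValueError("index is older than the last snapshot")
        conf = self.snapshot_config
        for entry in self.entries:
            if entry.index > index:
                break
            if entry.config is not None:
                conf = entry.config
        return conf
```

If no configuration entry exists between the last snapshot and the requested index, the sketch falls back to the configuration stored in that snapshot, which is why the index must not be smaller than the last snapshot index.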
Calculating clustering ranges on a local index has been rewritten to use the new `expression` variant.
This allows us to finally remove the old `bounds_ranges` function.
Closes #9080
* github.com:scylladb/scylla:
cql3: Remove unused functions like bounds_ranges
cql3: Use expressions to calculate the local-index clustering ranges
statement_restrictions_test: tests for extracting column restrictions
expression: add a function to extract restrictions for a column
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
This applies to the case when pages are broken by replicas based on
memory limits (not row or partition limits).
If replicas stop pages in the following places:
replica1 = {
row 1,
<end-of-page>
row 2
}
replica2 = {
row 3
}
The coordinator will reconcile the first page as:
{
row 1,
row 3
}
and row 2 will not be emitted at all in the following pages.
The coordinator should notice that replica1 returned a short read and
ignore everything past row 1 from other replicas, but it doesn't.
There is logic to do this trimming, but it is done in
got_incomplete_information_across_partitions(), which is executed only
for the partition for which row limits were exhausted.
Fix by running the logic unconditionally.
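The trimming rule can be illustrated with a small sketch. It is a simplification under stated assumptions: rows are plain sortable keys in a single partition, and `ReplicaPage` and `reconcile_page` are hypothetical names that do not exist in the Scylla code base.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPage:
    rows: list          # sorted row keys returned by the replica
    short_read: bool    # True if the replica stopped early (memory limit)


def reconcile_page(pages):
    """Merge replica pages, trimming past the earliest short-read cut-off."""
    # A replica that stopped early may hold more rows right after its last
    # returned row, so nothing beyond that row can be trusted in this page.
    cutoffs = [p.rows[-1] for p in pages if p.short_read and p.rows]
    limit = min(cutoffs) if cutoffs else None
    merged = sorted({row for p in pages for row in p.rows})
    if limit is not None:
        merged = [row for row in merged if row <= limit]
    return merged
```

With the replica contents from the example above, replica1 returns row 1 with a short read and replica2 returns row 3, so the reconciled page contains only row 1, and rows 2 and 3 are left for the following pages.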
Fixes #9119
Tests:
- unit (dev)
- manual (2 node cluster, manual reproducer)
Message-Id: <20210802231539.156350-1-tgrabiec@scylladb.com>
In issue #9083 a user noted that whereas Cassandra's partition-count
estimation is accurate, Scylla's (rewritten in commit b93cc21) is very
inaccurate. The tests introduced here, which all xfail on Scylla, confirm
this suspicion.
The most important tests are the "simple" tests, involving a workload
which writes N *distinct* partitions and then asks for the estimated
partition count. Cassandra provides accurate estimates, which grow
more accurate with more partitions, so it passes these tests, while
Scylla provides bad estimates and fails them.
Additional tests demonstrate that neither Scylla nor Cassandra
can handle anything beyond the "simple" case of distinct partitions.
Two tests which xfail on both Cassandra and Scylla demonstrate that
if we write the same partitions to multiple sstables - or also delete
partitions - the estimated partition counts will be way off.
Refs #9083
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210726211315.1515856-1-nyh@scylladb.com>
Realized that the overall complexity of partition filtering in
cleanup is O(N * log(M)), where
N is # of tokens
M is # of ranges owned by the node
Assuming N=10,000,000 for a table and M=257, N*log(M) ~= 80,056,245
checks performed during the whole cleanup.
This can be optimized by taking advantage of the fact that owned ranges
are both sorted and non-wrapping, so an incremental iterator-oriented
checker is introduced to reduce complexity from O(N * log(M)) to
O(N + M) or just O(N).
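The difference between the two approaches can be sketched as follows. This is an illustrative Python model rather than the actual implementation: ranges are assumed to be inclusive `(start, end)` pairs of integer tokens, and the function and class names are made up for the example.

```python
import bisect


def owned_by_binary_search(tokens, ranges):
    """O(N * log(M)): binary-search the owned ranges for every token."""
    starts = [start for start, _ in ranges]
    result = []
    for token in tokens:
        i = bisect.bisect_right(starts, token) - 1
        result.append(i >= 0 and token <= ranges[i][1])
    return result


class IncrementalOwnedRangesChecker:
    """O(N + M) overall: tokens must be fed in non-decreasing order."""

    def __init__(self, ranges):
        self._ranges = ranges   # sorted, non-overlapping, non-wrapping
        self._pos = 0

    def belongs(self, token):
        # Skip ranges that end before this token; they can never match a
        # later (larger) token either, so the cursor only moves forward.
        while self._pos < len(self._ranges) and self._ranges[self._pos][1] < token:
            self._pos += 1
        return self._pos < len(self._ranges) and self._ranges[self._pos][0] <= token
```

Because an sstable is read in token order, the cursor advances at most M times over the whole run, so each check costs amortized constant time instead of a fresh binary search.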
BEFORE
240MB to 237MB (~98% of original) in 3239ms = 73MB/s. ~950016 total partitions merged to 949943.
719MB to 719MB (~99% of original) in 9649ms = 74MB/s. ~2900608 total partitions merged to 2900576.
1GB to 1GB (~100% of original) in 15231ms = 74MB/s. ~4536960 total partitions merged to 4536852.
1GB to 1GB (~100% of original) in 15244ms = 74MB/s. ~4536960 total partitions merged to 4536840.
1GB to 1GB (~100% of original) in 15263ms = 74MB/s. ~4536832 total partitions merged to 4536783.
1GB to 1GB (~100% of original) in 15216ms = 74MB/s. ~4536832 total partitions merged to 4536812.
AFTER
240MB to 237MB (~98% of original) in 3169ms = 74MB/s. ~950016 total partitions merged to 949943.
719MB to 719MB (~99% of original) in 9444ms = 76MB/s. ~2900608 total partitions merged to 2900576.
1GB to 1GB (~100% of original) in 14882ms = 76MB/s. ~4536960 total partitions merged to 4536852.
1GB to 1GB (~100% of original) in 14918ms = 76MB/s. ~4536960 total partitions merged to 4536840.
1GB to 1GB (~100% of original) in 14919ms = 76MB/s. ~4536832 total partitions merged to 4536783.
1GB to 1GB (~100% of original) in 14894ms = 76MB/s. ~4536832 total partitions merged to 4536812.
Fixes #6807.
test: mode(dev).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802213159.182393-1-raphaelsc@scylladb.com>
Finding clustering ranges has been rewritten to use the new
expression variant.
Old bounds_ranges() and other similar ones are no longer needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Removes the old code used to calculate the local-index clustering range
and replaces it with new code based on the expression variant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Not all calculate_natural_endpoints implementations respect the can_yield
flag, for example, everywhere_replication_strategy.
This patch adds a yield at the call site to fix stalls we saw in
do_get_ranges.
Fixes #8943
Closes #9139
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.
Fixes #9138
Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
In this patch we add another test case for a case where ALLOW FILTERING
should not be required (and Cassandra doesn't require it) but Scylla
does.
This problem was introduced by pull request #9122. The pull request
fixed an incorrect query (see issue #9085) involving both an index and
a multi-column restriction on a compound clustering key - and the fix is
using filtering. However, in one specific case involving a full prefix,
it shouldn't require filtering. This test reproduces this case.
The new test passes on Cassandra (and, in theory, should pass),
but fails on Scylla - the check_af_optional() call fails because Scylla
made the ALLOW FILTERING mandatory for that case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210803092046.1677584-1-nyh@scylladb.com>
We already have tests for Query's ExclusiveStartKey option, but we
only exercised it as a way for paging linearly through all the results.
Now we add a test that confirms that ExclusiveStartKey can be used not
just for paging through all the results, but also for jumping directly to
the middle of a partition after any clustering key (existing or
non-existing). The new test also for the first time verifies
that ExclusiveStartKey with a specific format works (previous tests just
copied LastEvaluatedKey to ExclusiveStartKey, so any opaque cookie could
have worked).
The test passes on both DynamoDB and Alternator so it did not find a new
bug. But it's useful to have as a regression test, in case in the future
we want to improve paging performance (see #6278) - and need to keep in
mind that ExclusiveStartKey is not just for paging.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210729114703.1609058-1-nyh@scylladb.com>
They were already correctly returned to the caller, but we had a
leftover discarded future that would sometimes end up with a
broken_promise exception. Ignore the exception explicitly.
Message-Id: <20210803122207.78406-1-kbraun@scylladb.com>
This adds support for the Azure snitch. The work is an adaptation of
AzureSnitch for Apache Cassandra by Yoshua Wakeham:
https://raw.githubusercontent.com/yoshw/cassandra/9387-trunk/src/java/org/apache/cassandra/locator/AzureSnitch.java
Also change `production_snitch_base` to protect against
a snitch implementation setting DC and rack to an empty string,
which Lubos says can happen on Azure.
Fixes #8593
Closes #9084
* github.com:scylladb/scylla:
scylla_util: Use AzureSnitch on Azure
production_snitch_base: Fallback for empty DC or rack strings
azure_snitch: Azure snitch support
Query pager was reusing query_uuid only when it had no local state (no
_last_pkey), so querier cache was not used when paging locally.
This bug affects performance of aggregate queries like count(*).
Fixes #9127
Message-Id: <20210803003941.175099-1-tgrabiec@scylladb.com>
Following conversion to coroutines in fc91e90c59, remove extra
indents and braces left to make the change clearer.
One variable had to be renamed since without the braces it
duplicated another variable in the same block.
Test: unit (dev)
Closes #9125
With commit 1924e8d2b6, compaction code was moved into a
top level dir as compaction is layered on top of sstables.
Let's continue this work by moving all compaction unit tests
into their own test file. This also makes things much more
organized.
sstable_datafile_test, as its name implies, will only contain
sstable data tests. Perhaps it should be renamed to only
sstable_data_test, as the test also contains tests involving
other components, not only the data one.
BEFORE
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
105
AFTER
$ cat test/boost/sstable_compaction_test.cc | grep TEST_CASE | wc -l
57
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
48
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>
"
Now reshape can be aborted on either boot or refresh.
The workflow is:
1) reshape starts
2) user notices it's taking too long
3) nodetool stop RESHAPE
The good thing is that completed reshape work isn't lost, allowing
the table to enjoy the benefits of all reshaping done up to the abort
point.
Fixes #7738.
"
* 'abort_reshape_v1' of https://github.com/raphaelsc/scylla:
compaction: Allow reshape to be aborted
api: make compaction manager api available earlier
Now reshape can be aborted on either boot or refresh.
The workflow is:
1) reshape starts
2) user notices it's taking too long
3) nodetool stop RESHAPE
The good thing is that completed reshape work isn't lost, allowing
the table to enjoy the benefits of all reshaping done up to the abort
point.
Fixes #7738.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When perfing cleanup, merging reader showed up as significant.
Given that cleanup is performed on a single sstable at a time,
merging reader becomes an extra layer doing useless work.
1.71% 1.71% scylla scylla [.] merging_reader<mutation_reader_merger>::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}::operator()
The mutation compactor, which gets rid of purgeable expired data
and so on, still consumes the data retrieved by the sstable
reader, so no semantic change is made.
With the overhead removed, cleanup becomes ~9% faster, see:
BEFORE
real 1m15.240s
user 0m2.648s
sys 0m0.128s
240MB to 237MB (~98% of original) in 3301ms = 71MB/s.
719MB to 719MB (~99% of original) in 9761ms = 73MB/s.
1GB to 1GB (~100% of original) in 15372ms = 73MB/s.
1GB to 1GB (~100% of original) in 15343ms = 74MB/s.
1GB to 1GB (~100% of original) in 15329ms = 74MB/s.
1GB to 1GB (~100% of original) in 15360ms = 73MB/s.
AFTER
real 1m9.154s
user 0m2.428s
sys 0m0.123s
240MB to 237MB (~98% of original) in 3010ms = 78MB/s.
719MB to 719MB (~99% of original) in 8997ms = 79MB/s.
1GB to 1GB (~100% of original) in 14114ms = 80MB/s.
1GB to 1GB (~100% of original) in 14145ms = 80MB/s.
1GB to 1GB (~100% of original) in 14106ms = 80MB/s.
1GB to 1GB (~100% of original) in 14053ms = 80MB/s.
With a 1TB data set, ~20m would have been saved instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210730190713.462135-1-raphaelsc@scylladb.com>
Add a function, which given an expression and a column,
extracts all restrictions involving this column.
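As a rough illustration of what such an extraction does, the sketch below walks a tiny expression tree. The `BinaryOp` and `Conjunction` node types are assumptions made for the example; the real `expression` variant is a C++ type with more alternatives, which this Python model does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryOp:
    column: str      # column on the left-hand side of the restriction
    op: str          # e.g. "=", "<", ">"
    value: object    # right-hand side


@dataclass(frozen=True)
class Conjunction:
    children: tuple  # nested BinaryOp or Conjunction nodes


def extract_column_restrictions(expr, column):
    """Collect every restriction in `expr` that mentions `column`."""
    if isinstance(expr, BinaryOp):
        return [expr] if expr.column == column else []
    found = []
    for child in expr.children:
        found.extend(extract_column_restrictions(child, column))
    return found
```

For a WHERE clause such as `a = 1 AND (b > 2 AND a < 5)`, asking for column `a` yields the two restrictions on `a` in their original order and skips the one on `b`.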
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We should not use the current term; we should use the term of the
snapshot's index, which may be lower.
* https://github.com/kbr-/scylla/tree/snapshot-right-term-fix:
test: raft: regression test for using the correct term when taking a snapshot
test: raft: randomized_nemesis_test: server configuration parameter
raft: use the correct term when storing a snapshot
Message changed to reflect what 'scylla_bootparam_setup' currently does
(setting a clock source at boot time) instead of what it used to do in
the past (setting huge pages).
Closes #9116.
When a WHERE clause contains a multi-column restriction and an indexed
regular column, we must filter the results. It is generally not
possible to craft the index-table query so it fetches only the
matching rows, because that table's clustering key doesn't match up
with the column tuple.
Fixes #9085.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9122
schema_tables is quite hairy, but can be easily simplified with coroutines.
In addition to switching future-returning functions to coroutines, we also
switch Seastar threads to coroutines. This is less of a clear-cut win; the
motivation is to reduce the chances of someone calling a function that
expects to run in a thread from a non-thread context. This sometimes works
by accident, but when it doesn't, it's pretty bad. So a uniform calling convention
has some benefit.
I left the extra indents in, since the indent-fixing patch is hard to rebase in case
a rebase is needed. I will follow up with an indent fix post merge.
Test: unit (dev, debug, release)
Closes #9118
* github.com:scylladb/scylla:
db: schema_tables: drop now redundant #includes
db: schema_tables: coroutinize drop_column_mapping()
db: schema_tables: coroutinize column_mapping_exists()
db: schema_tables: coroutinize get_column_mapping()
db: schema_tables: coroutinize read_table_mutations()
db: schema_tables: coroutinize create_views_from_schema_partition()
db: schema_tables: coroutinize create_views_from_table_row()
db: schema_tables: unpeel lw_shared_ptr in create_tables_from_tables_partition()
db: schema_tables: coroutinize create_tables_from_tables_partition()
db: schema_tables: coroutinize create_table_from_name()
db: schema_tables: coroutinize read_table_mutations()
db: schema_tables: coroutinize merge_keyspaces()
db: schema_tables: coroutinize do_merge_schema()
db: schema_tables: futurize and coroutinize merge_functions()
db: schema_tables: futurize and coroutinize user_types_to_drop::drop
db: schema_tables: futurize and coroutinize merge_types()
db: schema_tables: futurize and coroutinize merge_tables_and_views()
db: schema_tables: coroutinize store_column_mapping()
db: schema_tables: futurize and coroutinize read_tables_for_keyspaces()
db: schema_tables: coroutinize read_table_names_of_keyspace()
db: schema_tables: coroutinize recalculate_schema_version()
db: schema_tables: coroutinize merge_schema()
db: schema_tables: introduce and use with_merge_lock()
db: schema_tables: coroutinize update_schema_version_and_announce()
db: schema_tables: coroutinize read_keyspace_mutation()
db: schema_tables: coroutinize read_schema_partition_for_table()
db: schema_tables: coroutinize read_schema_partition_for_keyspace()
db: schema_tables: coroutinize query_partition_mutation()
db: schema_tables: coroutinize read_schema_for_keyspaces()
db: schema_tables: coroutinize convert_schema_to_mutations()
db: schema_tables: coroutinize calculate_schema_digest()
db: schema_tables: coroutinize save_system_schema()
This series improves the repair logging by removing the unused sub_ranges_nr counter, adding the peer node IP to the log, and removing redundant logs in case of error.
Closes #9120
* github.com:scylladb/scylla:
repair: Remove redundant error log in tracker::run
repair: Do not log errors in repair_ranges
repair: Move more repair single range code into repair_info::repair_range
repair: Use the same uuid from the repair_info
repair: Drop sub_ranges_nr counter
It was observed that since fce124bd90 ('Merge "Introduce
flat_mutation_reader_v2" from Tomasz') database_test takes much longer.
This is expected since it now runs the upgrade/downgrade reader tests
on all existing tests. It was also observed that in a similar time frame
database_test sometimes times out on test machines, taking much
longer than usual, even with the extra work for testing reader
upgrade/downgrade.
In an attempt to reproduce, I noticed it failing on EMFILE (too many
open file descriptors). I saw that tests usually use ~100 open file
descriptors, while the default limit is 1024.
I suspect we have runaway concurrency, but I was not able to pinpoint the
cause. It could be compaction lagging behind, or cleanup work for
deleting tables (the test
test_database_with_data_in_sstables_is_a_mutation_source creates and
deletes many tables).
As a stopgap solution to unblock the tests, this patch raises the file
descriptor limit in the way recommended by [1]. While tests shouldn't
use so many descriptors, I ran out of ideas about how to plug the hole.
Note that main() does something similar, though more elaborate since
it needs to communicate to users. See ec60f44b64 ("main: improve
process file limit handling").
[1] http://0pointer.net/blog/file-descriptor-limits.html
Closes #9121
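The approach recommended in [1] amounts to raising the soft RLIMIT_NOFILE limit to the hard limit at process start. The sketch below shows the idea using Python's standard `resource` module on a Unix system; the actual patch does this in the C++ test harness, and the function name here is made up for the example.

```python
import resource


def raise_nofile_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Raising only the soft limit needs no extra privileges, which is why it works as a stopgap inside a test binary.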