Fixes#9348
If we get exceptions in delete_segments, we can, and probably will, loose
track of footprint counters. We need to recompute the used disk footprint,
otherwise we will flush too often, and even block indefinately on new_seg
iff using hard limits.
Tests that we can handle exception-in-alloc cleanup if the file actually
does not exist. This however uncovers another weakness (addressed in next
patch) - that we can loose track of disk footprint here, and w. hard limits
end up waiting for disk space that never comes. Thus test does not use hard
limit.
Fixes#9343
If we fail in allocate_segment_ex, we should push the file opened/created
to the delete set to ensure we reclaim the disk space. We should also
ensure that if we did not recycle a file in delete_segments, we still
wake up any recycle waiters iff we made a file delete instead.
Included a small unit test.
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.
Fix by restoring the "remaining" count when adjusting the paging state
on short read.
Tests:
- index_with_paging_test
- secondary_index_test
Fixes#9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
(cherry picked from commit 1e4da2dcce)
This PR started by realizing that in the memtable reversing reader, it
never happened on tests that `do_refresh_state` was called with
`last_row` and `last_rts` which are not `std::nullopt`.
Changes
- fix memtable test (`tesst_memtable_with_many_versions_conforms_to_mutation_source`), so that there is a background job forcing state refreshes,
- fix the way rt_slice is computed (was `(last_rts, cr_range_snapshot.end]`, now is `[cr_range_snapshot.start, last_rts)`).
Fixes#9486Closes#9572
* github.com:scylladb/scylla:
partition_snapshot_reader: fix indentation in fill_buffer
range_tombstone_list: {lower,upper,}slice share comparator implementation
test: memtable: add full_compaction in background
partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set
range_tombstone_list: add lower_slice
Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source
in background. Without it, some paths in the partition snapshot reader
weren't covered, as the tests always managed to read all range
tombstones and rows which cover a given clustering range from just a
single snapshot. Now, when full_compaction happens in process of reading
from a clustering range, we can force state refresh with non-nullopt
positions of last row and last range tombstone.
Note: this inability to test affected only the reversing reader.
for symmetry, let's call it perform_* as it doesn't work like submission
functions which doesn't wait for result, like the one for minor
compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
there's no need for wrapping compaction_data in shared_ptr, also
let's kill unused params in create_compaction_data to simplify
its creation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Again, there's a sub-case with sequential time stamps that still
works by chance. This time it's because splitting 256 sstables
into buckets of maximum 8 ones is allowed to have the 1st and the
last ones with less than 8 items in it, e.g. 3, 8, ..., 8, 5. The
exact generation depends on the time-since-epoch at which it
starts.
When all the cases are run altogether this time luckily happens
to be well-aligned with 8-hours and the generated buckets are
filled perfectly. When this particular test-case is run all alone
(e.g. by --run_test or --parallel-cases) then the starting time
becomes different and it gets less than 4 sstables in its first
bucket.
The fix is in adjusting the starting time to be aligned with the
8 hours window.
Actually, the 8 hours appeared in the previous patch, before which
it was 24 hours. Nonetheless, the above reasoning applies to any
size of the time window that's less than 256, so it's still an
independent fix.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The options are std::move-d twice -- first into schema builder then
into compaction strategy. Surprisingly, but the 2nd move makes the
test work.
There's a sub-case in this case that checks sstables with incremental
timestamps with 1 hour step -- 0h, 1h, 2h, ... 255h. Next, the twcs
buckets generator obeys a minimal threshold of 4 sstables per bucket.
Those with less sstables in are not included in the job. Finally,
since the options used to create the twcs are empty after the 1st
move the default window of 24 hours is used. If they were configured
correctly with 1 hour window then all buckets would contain 1 sstable
and the generated job would become empty.
So the fix is both -- don't move after move and make the window size
large enough to fit more sstables than the mentioned minimum.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Test both methods: the "old" disk-based one and the recently added
in-memory one, with different configurations and also add additional
checks to ensure they don't loose data.
The partition based splitting writer (used by scrub) was found to be exception-unsafe, converting an `std::bad_alloc` to an assert failure. This series fixes the problem and adds a unit test checking the exception safety against `std::bad_alloc`:s fixing any other related problems found.
Fixes: https://github.com/scylladb/scylla/issues/9452Closes#9453
* github.com:scylladb/scylla:
test: mutation_writer_test: add exception safety test for segregate_by_partition()
mutation_writer: segregate_by_partition(): make exception safe
mutation_reader: queue_reader_handle: make abandoned() exception safe
mutation_writer: feed_writers(): make it a coroutine
mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement
sprint() uses the printf-style formatting language while most of our
code uses the Python-derived format language from fmt::format().
The last mass conversion of sprint() to fmt (in 1129134a4a)
missed some callers (principally those that were on multiple lines, and
so the automatic converter missed them). Convert the remainder to
fmt::format(), and some sprintf() and printf() calls, so we have just
one format language in the code base. Seastar::sprint() ought to be
deprecated and removed.
Test: unit (dev)
Closes#9529
* github.com:scylladb/scylla:
utils: logalloc: convert debug printf to fmt::print()
utils: convert fmt::fprintf() to fmt::print()
main: convert fprint() to fmt::print()
compress: convert fmt::sprintf() to fmt::format()
tracing: replace seastar::sprint() with fmt::format()
thrift: replace seastar::sprint() with fmt::format()
test: replace seastar::sprint() with fmt::format()
streaming: replace seastar::sprint() with fmt::format()
storage_service: replace seastar::sprint() with fmt::format()
repair: replace seastar::sprint() with fmt::format()
redis: replace seastar::sprint() with fmt::format()
locator: replace seastar::sprint() with fmt::format()
db: replace seastar::sprint() with fmt::format()
cql3: replace seastar::sprint() with fmt::format()
cdc: replace seastar::sprint() with fmt::format()
auth: replace seastar::sprint() with fmt::format()
With 062436829c,
we return all input sstables in strict mode
if they are dosjoint even if they don't need reshaping at all.
This leads to an infinite reshape loop when uploading sstables
with TWCS.
The optimization for disjoint sstables is worth it
also in relaxed mode, so this change first makes sorting of the input
sstables by first_key order independent of reshape_mode,
and then it add a check for sstable_set_overlapping_count
before trimming either the multi_window vector or
any single_window bucket such that we don't trim the list
if the candidates are disjoint.
Adjust twcs_reshape_with_disjoint_set_test accordingly.
And also add some debug logging in
time_window_compaction_strategy::get_reshaping_job
so one can figure out what's going on there.
Test: unit(dev)
DTest: cdc_snapshot_operation.py:CDCSnapshotOperationTest.test_create_snapshot_with_collection_list_with_base_rows_delete_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211025071828.509082-1-bhalevy@scylladb.com>
Currently we update the effective_replication_map
only on non-system keyspace, leaving the system keyspace,
that uses the local replication strategy, with the empty
replication_map, as it was first initialized.
This may lead to a crash when get_ranges is called later
as seen in #9494 where get_ranges was called from the
perform_sstable_upgrade path.
This change updates the effective_replication_map on all
keyspaces rather than just on the non-system ones
and adds a unit test that reproduces #9494 without
the fix and passes with it.
Fixes#9494
Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211020143217.243949-1-bhalevy@scylladb.com>
When an experimental feature graduates from being experimental, we want
to continue allow the old "--experimental-features=..." option to work,
in case some user's configuration uses it - just do nothing. The way
we do it is to map in db::experimental_features_t::map() the feature's
name to the UNUSED value - this way the feature's name is accepted, but
doesn't change anything.
When the CDC feature graduated from being experimental, a new bit
UNUSED_CDC was introduced to do the same thing. This separate bit was
not actually necessary - if we ever check for UNUSED_CDC bit anywhere
in the code it means the flag isn't actually unused ;-) And we don't
check it.
So simplify the code by conflating UNUSED_CDC into UNUSED.
This will also make it easy to build from db::experimental_features_t::map()
a list of current experimental features - now it will simply be those that
do not map to UNUSED.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013105107.123544-1-nyh@scylladb.com>
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.
Fixes#9302.
Closes#9304.
* 'prepare_options' of https://github.com/bhalevy/scylla:
cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy
abstract_replication_strategy: add to_qualified_class_name
After a4053dbb72, data segregation is postponed to offstrategy, so reshape
procedure is called with disjoint sstables which belong to different
windows, so let's extend the optimization for disjoint sstables which
span more than one window. In this way, write amplification is reduced
for offstrategy compaction, as all disjoint sstables will be compacted
at once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013203046.233540-2-raphaelsc@scylladb.com>
It was auto-expanded only if the strategy name
was the short "NetworkTopologyStrategy" name.
Fixes#9302
Test: cql_query_test.test_rf_expand(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The current api design of abstract_replication_strategy
provides a can_yield parameter to calls that may stall
when traversing the token metadata in O(n^2) and even
in O(n) for a large number of token ranges.
But, to use this option the caller must run in a seastar thread.
It can't be used if the caller runs a coroutine or plain
async tasks.
Rather than keep adding threads (e.g. in storage_service::load_and_stream
or storage_service::describe_ring), the series offers an infrastructure
change: precalculating the token->endpoints map once, using an async task,
and keeping the results in a `effective_replication_map` object.
The latter can be used for efficient and stall-free calls, like
get_natural_endpoints, or get_ranges/get_primary_range, replacing their
equivalents in abstract_replication_strategy, and dropping the public
abstract_replication_strategy::calculate_natural_endpoints and its
internal cached_endpoints map.
Other than the performance benefits of:
1. The current calls require running a thread to yield.
Precalculating the map (using async task) allows us to use synchronous calls
without stalling the rector.
2. The replication maps can and should be shared
between keyspaces that use the same replication strategy.
(Will be sent as a follow-up to the series)
The bigger benefits (courtesy of Avi Kivity) are laying the groundwork for:
1. atomic replication metadata - an operation can capture a replication map once, and then use consistent information from the map without worrying that it changes under its feet. We may even be able to s/inet_address/replica_ptr/ later.
2. establish boundaries on the use of replication information - by making a replication map not visible, and observing when its reference count drops to zero, we can tell when the new replication map is fully in use. When we start writing to a new node we'll be able to locate a point in time where all writes that were not aware of the new node were completed (this is the point where we should start streaming).
Notes:
* The get_natural_endpoints method that uses the effective_replication_map
is still provided as a abstract_replication_strategy virtual method
so that local_strategy can override it and privide natural endpoints
for any search token, even in the absence of token_metadata, when\
called early-on, before token_metadata has been established.
The effective_replication_map materializes the replication strategy
over a given replication strategy options and token_metadata.
Whenever either of those change for a keyspace, we make a new
effective_replication_map and keep it in the keyspace for latter use.
Methods that depend on an ad-hoc token_metadata (e.g. during
node operations like bootstrap or replace) are still provided
by abstract_replication_strategy.
TODO:
- effective_replication_map registry
- Move pending ranges from token_metadata to replication map
- get rid of abstract_replication_strategy::get_range_addresses(token_metadata&)
- calculate replication map and use it instead.
Test: unit(dev, debug)
Dtest: next-gating, bootstrap_test.py update_cluster_layout_tests.py alternator_tests.py -a 'dtest-full,!dtest-heavy' (release)
"
* tag 'effective_replication_strategy-v6' of github.com:bhalevy/scylla: (44 commits)
effective_replication_map: add get_range_addresses
abstract_replication_strategy: get rid of shared_token_metadata member and ctor param
abstract_replication_strategy: recognized_options: pass const topology&
abstract_replication_strategy: precacluate get_replication_factor for effective_replication_map
token_metadata: get rid of now-unused sync methods
abstract_replication_strategy: get rid of do_calculate_natural_endpoints
abstract_replication_strategy: futurize get_*address_ranges
abstract_replication_strategy: futurize get_range_addresses
abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr)
abstract_replication_strategy: move get_ranges and get_primary_ranges* to effective_replication_map
compaction_manager: pass owned_ranges via cleanup/upgrade options
abstract_replication_strategy: get rid of cached_endpoints
all replication strategies: get rid of do_get_natural_endpoints
storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints
abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map
storage_service: bootstrap: add log messages
storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings
shared_token_metadata: set: check version monotonicity
token_metadata: use static ring version
token_metadata: get rid of copy constructor and assignment operator
...
Make a reader that reads from memtable in reverse order.
This draft PR includes two commits, out of which only the second is
relevant for review.
Described in #9133.
Refs #1413.
Closes#9174
* github.com:scylladb/scylla:
partition_snapshot_reader: pop_range_tombstone returns reference (instead of value) when possible.
memtable: enable native reversing
partition_snapshot_reader: reverse ck_range when needed by Reversing
memtable, partition_snapshot_reader: read from partition in reverse
partition_snapshot_reader: rows_position and rows_iter_type supporting reverse iteration
partition_snapshot_reader: split responsibility of ck_range
partition_snapshot_reader: separate _schema into _query_schema and _partition_schema
query: reverse clustering_range
test: cql_query_test: fix test_query_limit for reversed queries
It is not used any more.
Methods either use the token_metadata_ptr in the
effective_replication_map, or receive an ad-hoc
token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide a sync get_ranges method by effective_replication_map
that uses the precalculated map to get all token ranges owned by or
replicated on a given endpoint.
Reuse do_get_ranges as common infrastructure for all
3 cases: get_ranges, get_primary_ranges, and get_primary_ranges_within_dc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So they can be easily computed using an async task
before constructing the compaction object
in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Serialize the metadata changes with
keyspace create, update, or drop.
This will become necessary in the following patch
when we update the effective_replication_map
on all keyspaces and we want instances on all shards
end up with the same replication map.
Note that storage_service::keyspace_changed is called
from the scheme_merge path so it already holds
the merge_lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And with that rename calculate_natural_endpoints(const token& search_token, const token_metadata&, can_yield)
to do_calculate_natural_endpoints and make it protected,
With this patch, all its external users call the async version, so
rename it back to calculate_natural_endpoints, and make
calculate_natural_endpoints_sync private since it's being called
only within abstract_replication_strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Our source base drifted away from gcc compatibility; this mostly
restores the ability to build with gcc. An important exception is
coroutines that have an initializer list [1]; this still doesn't work.
We aim to switch back to gcc 11 if/when this gives us better
C++ compatibility and performance.
Test: unit (dev)
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056Closes#9459
* github.com:scylladb/scylla:
test: radix_tree_printer: avoid template specialization in class context
test: raft: avoid ignored variable errors
test: reader_concurrency_semaphore_test: isolate from namespace of source_location
test: cql_query_test: drop unused lambda assert_replication_not_contains
test: commitlog_test: don't use deprecated seastar::unaligned_cast
test: adjust signed/unsigned comparisons in loops and boost tests
build: silence some gcc 11 warnings
sstables: processing_result_generator: make coroutine support palatable for C++20 compilers
managed_bytes: avoid compile-time loop in converting constructor
service: service_level_controller: drop unused variable sl_compare
raft: disambiguate promise name in raft::active_read
locator: azure_snitch: use full type name in definition of globals
cql3: statements: create_service_level_statement: don't ignore replace_defaults()
cql3: statement_restrictions: adjust call to std::vector deduction guide
types: remove recursive constraint in deserialize_value
cql3: restrictions: relax constraint on visitor_with_binary_operator_content
treewide: handle switch statements that return
cql3: expr: correct type of captured map value_type
cdc: adjust type of streams_count
alternator: disambiguate attrs_to_get in table_requests
Most of the methods are marked public, but only few of them should.
Test needs a bit more, however, so the distributed_loader_for_tests
is declared as friend class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit consists of changes, which need to reside in a single
commit, so that the tests pass on each of the commits.
1. Remove do_make_flat_reader which disabled reverse reads by making the
slice a forward one. Remove call to get_ranges which would do
superfluous reversal of clustering ranges.
2. test: cql_query_test: remove expectation that the test_query_limit
fails for reversed queries, since reversed queries no longer require
linear memory wrt. the result size, when paginated.
gcc complains about comparing a signed loop induction variable
with an unsigned limit, or comparing an expected value and measured
value. Fix by using unsigned throughout, except in one case where
the signed value was needed for the data_value constructor.
The test generates random mutations and eliminates mutations whose keys
tokenize to 0, in particular it eliminates mutations with empty
partition keys (which should not end up in sstables).
However it would do that after using the randomly generated mutations to
create their reversed versions. So the reversed versions of mutations
with empty partition keys would stay.
Fix by placing the workaround earlier in the test.
Closes#9447
I stumbled upon this failure in dev mode:
```
test/boost/sstable_conforms_to_mutation_source_test.cc(0): Entering test case "test_sstable_reversing_reader_random_schema"
sstable_conforms_to_mutation_source_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
Aborting on shard 0.
```
Since dev mode has no debug symbols I can't
decode the stack trace so I'm not 100% sure about
the root cause and I couldn't reproduce it in release
or debug modes yet.
One vulnerability in the current code is that r1
won't be closed if an exception is thrown before r1 and r2
are moved to `compare_readers` so this change adds
a deferred close of r1 in this case.
Test: sstable_conforms_to_mutation_source_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211006144009.696412-1-bhalevy@scylladb.com>
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.
Includes that were lost due to the change are restored in individual
translation units.
Closes#9434
base64.hh pulls in the huge rjson.hh, so if someone just wants
a base64 codec they have to pull in the entire rapidjson library.
Move the json related parts of base64.hh to rjson.hh and adjust
includes and namespaces.
In practice it doesn't make much difference, as all users of base64
appear to want json too. But it's cleaner not to mix the two.
Closes#9433
"
There's a run_mutation_source_tests lib helper that runs a bunch of
tests sequentially. The problem is that it does 4 different flavors
of it each being a certain decoration over provided reader. This
amplification makes some test cases run enormous amount of time
without any chance for parallelizm.
The simplest way to help running those cases in parallel is to teach
the slowest cases to run different flavors of mutation source tests
in dedicated cases. This patch makes it so.
The resulting timings are
dev debug
sequential run: 2m1s 53m50s
--parallel-cases (+ this patch): 1m3s 31m15s
tests: unit(dev, debug)
"
* 'br-parallel-mutation-source-tests' of https://github.com/xemul/scylla:
test: Split multishard combining reader case
test: Split database test case
test: Split run_mutation_source_tests
(Single-partition) reversed queries are no longer unlimited but some
places still treat them as such. This causes, for example, shorter pages
for such queries, which breaks a test that expects certain results to
come in a single page.
All the cases in this test also run mutation source tests and the
case with single-fragment buffer takes times more time to execute
than the others.
Splitting this single case so that it runs mutation source tests
flavours in different cases improves the test parallelizm.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test_database_with_data_in_sstables_is_a_mutation_source case runs
the mutation source tests in one go. The problem is that on each step
a whole new ks:cf is created which takes the majority of the tests time.
In the end of the day this case is the slowest one in the suite being
up to two times longer (depending on mode) than the #2 on this list.
This patch splits the case into 4 so that each mutation source flavor
is run in separate case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>