This pull request introduces global secondary-indexing for non-frozen collections.
The intent is to enable such queries:
```
CREATE TABLE test(int id, somemap map<int, int>, somelist<int>, someset<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);
SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```
We use here all-familiar materialized views (MVs). Scylla treats all the
collections the same way - they're a list of pairs (key, value). In case
of sets, the value type is dummy one. In case of lists, the key type is
TIMEUUID. When describing the design, I will forget that there is more
than one collection type. Suppose that the columns in the base table
were as follows:
```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```
The MV schema is as follows (the names of columns which are not the same
as in base might be different). All the columns here form the primary
key.
```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```
The reason for the last additional column is that the values from a collection might not be unique.
Fixes#2962Fixes#8745Fixes#10707
This patch does not implement **local** secondary indexes for collection columns: Refs #10713.
Closes#10841
* github.com:scylladb/scylladb:
test/cql-pytest: un-xfail yet another passing collection-indexing test
secondary index: fix paging in map value indexing
test/cql-pytest: test for paging with collection values index
cql, view: rename and explain bytes_with_action
cql, index: make collection indexing a cluster feature
test/cql-pytest: failing tests for oversized key values in MV and SI
cql: fix secondary index "target" when column name has special characters
cql, index: improve error messages
cql, index: fix default index name for collection index
test/cql-pytest: un-xfail several collecting indexing tests
test/cql-pytest/test_secondary_index: verify that local index on collection fails.
docs/design-notes/secondary_index: add `VALUES` to index target list
test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
test/boost/secondary_index_test: update for non-frozen indexes on collections
test/cql-pytest: Uncomment collection indexes tests that should be working now
cql, index: don't use IS NOT NULL on collection column
cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections
cql3, index: Use entries() indexes on collections for queries
cql3, index: Use keys() and values() indexes on collections for queries.
types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
db/view/view.cc: compute view_updates for views over collections
view info: has_computed_column_depending_on_base_non_primary_key
column_computation: depends_on_non_primary_key_column
schema, index/secondary_index_manager: make schema for index-induced mv
index/secondary_index_manager: extract keys, values, entries types from collection
cql3/statements/: validate CREATE INDEX for index over a collection
cql3/statements/create_index_statement,index_target: rewrite index target for collection
column_computation.hh, schema.cc: collection_column_computation
column_computation.hh, schema.cc: compute_value interface refactor
Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264Closes#11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
All users of `column_family_test_config()`, get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.
Upgrade sanity test was conducted as following:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.
Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instruction/op.
Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672Closes#11053
* github.com:scylladb/scylladb:
test/cql-pytest: add test for query tombstone page limit
query-result-writer: stop when tombstone-limit is reached
service/pager: prepare for empty pages
service/storage_proxy: set smallest continue pos as query's continue pos
service/storage_proxy: propagate last position on digest reads
query: result_merger::get() don't reset last-pos on short-reads and last pages
query: add tombstone-limit to read-command
service/storage_proxy: add get_tombstone_limit()
query: add tombstone_limit type
db/config: add config item for query tombstone limit
gms: add cluster feature for empty replica pages
tree: don't use query::read_command's IDL constructor
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large span of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.
To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold a effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).
Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.
Refs #11005
Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping)
Closes#11009
* github.com:scylladb/scylladb:
effective_replication_map: make get_range_addresses asynchronous
range_streamer: add_ranges and friends: get erm as param
storage_service: get_new_source_ranges: get erm as param
storage_service: get_changed_ranges_for_leaving: get erm as param
storage_service: get_ranges_for_endpoint: get erm as param
repair: use get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces
storage_service: coroutinize update_pending_ranges
effective_replication_map: add get_replication_strategy
effective_replication_map: get_range_addresses: use the precalculated replication_map
abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
abstract_replication_strategy: reindent
utils: sequenced_set: expose set and `contains` method
abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
utils: sequenced_set: templatize VectorType
utils: sanitize sequenced_set
utils: sequenced_set: delete mutable get_vector method
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes#11239Closes#11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
It is not type safe: has multiple limits passed to it as raw ints, as
well as other types that ints implicitly convert to. Furthermore the row
limit is passed in two separate fields (lower 32 bits and upper 32
bits). All this make this constructor a minefield for humans to use. We
have a safer constructor for some time but some users of the old one
remain. Move them to the safe one.
maybe_refresh() can only be called if the cursor is pointing at a row.
The code was calling it before the cursor was advanced, and was
thus relying on implementation detail.
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.
Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
Fixes#11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than defining generate_random,
and use respectively in unit tests.
(It was inherited from raft::internal::tagged_id.)
This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add include statements to satisfy dependencies.
Delete, now unneeded, include directives from the upper level
source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For generating #include directives in the generated files,
so we don't have to hand-craft include the dependencies
in the right order.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass an optional truncated_at time_point to
truncate_table_on_all_shards instead of the over-complicated
timestamp_func that returns the same time_point on all shards
anyhow, and was only used for coordination across shards.
Since now we synchronize the internal execution phase in
truncate_table_on_all_shards, there is no longer need
for this timestamp_func.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
timestamp_func
Since in the drop_table case we want to discard ALL
sstables in the table, not only those with `max_data_age()`
up until drop started.
Fixes#11232
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Following up on 1c26d49fba,
apply mutations on the correct db shard in all test cases
before we define and use database::truncate_table_on_all_shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
Fixes#11211Closes#11215
"
There are several helpers in this .cc file that need to get datacenter
for endpoints. For it they use global snitch, because there's no other
place out there to get that data from.
The whole dc/rack info is now moving to topology, so this set patches
the consistency_level.cc to get the topology. This is done two ways.
First, the helpers that have keyspace at hand may get the topology via
ks's effective_replication_map.
Two difficult cases are db::is_local() and db.count_local_endpoints()
because both have just inet_address at hand. Those are patched to be
methods of topology itself and all their callers already mess with
token metadata and can get topology from it.
"
* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
consistency_level: Remove is_local() and count_local_endpoints()
storage_proxy: Use topology::local_endpoints_count()
storage_proxy: Use proxy's topology for DC checks
storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
storage_proxy: Use topology local_dc_filter in its methods
storage_proxy: Mark some digest_read_resolver methods private
forwarding_service: Use topology local_dc_filter
storage_service: Use topology local_dc_filter
consistency_level: Use topology local_dc_filter
consitency-level: Call count_local_endpoints from topology
consistency_level: Get datacenter from topology
replication_strategy: Remove hold snitch reference
effective_replication_map: Get datacenter from topology
topology: Add local-dc detection shugar
When the strategy is constructed there's no place to get snitch from
so the global instance is used. However, after previous patch the
replication strategy no longer needs snitch, so this dependency can
be dropped
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently they are copied for the get_sstables function
so this change reduces copies.
Also, it will allow further decoupling of compaction_manager
from replica::database, by letting the caller of
perform_cleanup and perform_sstable_upgrade get the
owned token ranges from db and pass it to the perform_*
functions in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
over the rest api
Performing compaction scrub user did not know whether an operation
was aborted.
If compaction scrub is aborted, return status the user gets over
rest api is set to 1.
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if there
the bucket to compact is inflated, having too many
sstables. In that case we don't want to add fuel
to the fire by creating yet another sstable.
Fixes#4116Closes#10954
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
compaction_manager: add maybe_wait_for_sstable_count_reduction
time_window_compaction_strategy: get_sstables_for_compaction: clean up code
time_window_compaction_strategy: make get_sstables_for_compaction idempotent
time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
leveled_manifest: pass compaction_counter as const&
Currently logalloc::region is relying on boost binomial_heap handle to properly move listeners registration when the region (when derived from dirty_memory_manager_logalloc::size_tracked_region) is moved, like boost::intrusive link hooks do -
hence 81e20ceaab/dirty_memory_manager.cc (L89-L90) does nothing.
Unfortunately, this doesn't work as expected.
This series adds a unit test that verifies the move semantics
and a fix to size_tracked_region and region_group code to make it pass.
Also "logalloc: region: get_impl might be called on disengaged _impl when moved"
fixes a couple corner cases where the shared _impl could be dereferenced when disengaged, and
the change also adds a unit test for that too.
Closes#11141
* github.com:scylladb/scylla:
logalloc: region: properly track listeners when moved
logalloc: region_impl: add moved method
logalloc: region: merge: optimize getting other impl
logalloc: region: merge: call region_impl::unlisten
logalloc: region: call unlisten rather than open coding it
logalloc: region move-ctor: initialize _impl
logalloc: region: get_impl might be called on disengaged _impl when moved
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.
Consider a simplified mutation reversal scenario scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):
schema_ptr s = schema_builder{"ks", "cf"}
.with_column("pk", bytes_type, column_kind::partition_key)
.with_column("ck1", bytes_type, column_kind::clustering_key)
.build();
Input range tombstone positions:
{clustered, ckp{}, before}
{clustered, ckp{1}, after}
Clustering rows:
{clustered, ckp{2}, equal}
{clustered, ckp{}, after} // dummy row
During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). The read order in the example above is:
Reversed range tombstone positions:
1: {clustered, ckp{}, before}
2: {clustered, ckp{1}, before}
Clustering rows read backwards:
3: {clustered, ckp{}, after} // dummy row
4: {clustered, ckp{2}, equal}
Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
to be gt each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point we notice it's
dummy and should be skipped. Subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).
The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.
Fixes: https://github.com/scylladb/scylla/issues/11147Closes#11129
* github.com:scylladb/scylla:
mutation: Add test if mutations are consumed in order
test: Move validating_consumer to test/lib/mutation_assertions.hh
mutation: Ignore dummy rows when consuming clustering fragments
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.
Add unit test for these odd orner cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).
Closes#11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
The series unifies memtable flush error handling into table::seal_active_memtable
following up on f6d9d6175f.
The goal here is to prevent an infinite retry loop as in #10498
by aborting on any error that is not bad_alloc.
Fixes#10498Closes#10691
* github.com:scylladb/scylla:
test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
test: memtable_test: failed_flush_prevents_writes: extend error injection
table: seal_active_memtable: abort if retried for too long
table: seal_active_memtable: abort on unexpected error
table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
dirty_memory_manager: flush_permit: add has_sstable_write_permit
dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
dirty_memory_manager: flush_permit: make _sstable_write_permit optional
table: reindent seal_active_memtable
table: coroutinize seal_active_memtable
memtable_list: mark functions noexcept
commitlog: make discard_completed_segments and friends noexcept
dirty_memory_manager: flush_when_needed: target error handling at flush_one
database: delete unused seal_delayed_fn_type
dirty_memory_manager: mark functions noexcept
memtable: mark functions noexcept
memtable: memtable_encoding_stats_collector: mark functions noexcept
encoding_state: mark functions noexcept
logalloc: mark free functions noexcept
logalloc: allocating_section: mark functions noexcept
logalloc: allocating_section: guard: mark constructor noexcept
logalloc: reclaim_lock: mark functions noexcept
logalloc: tracker_reclaimer_lock: mark constructor noexcept
logalloc: mark shard_tracker noexcept
logalloc: region: mark functions const/noexcept
logalloc: basic_region_impl: mark functions noexcept
logalloc: region_impl: mark functions noexcept
utils: log_heap: mark functions noexcept
logalloc: region_impl: object_descriptor: mark functions noexcept
logalloc: region_group: mark functions noexcept
logalloc: tracker: mark functions const/noexcept
logalloc: tracker::impl: make region_occupancy and friends const
logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
logalloc: tracker::impl: mark functions noexcept
logalloc: segment: mark functions const / noexcept
logalloc: segment_pool: add const variant of descriptor method
logalloc: segment_pool: move descriptor method to class definition
logalloc: segment_pool: mark functions const/noexcept
logalloc: segment_pool: delete unused free_or_restore_to_reserve method
utils: dynamic_bitset: mark functions noexcept
utils: dynamic_bitset: delete unused members
logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
logalloc: segment_store: make can_allocate_more_segments const
logalloc: segment_store: mark functions noexcept
logalloc: segment_descriptor: mark functions noexcept
logalloc: occupancy_stats: mark functions noexcept
min_max_tracker: mark functions noexcept
gc_clock, db_clock: mark functions noexcept
dirty_memory_manager: region_group: mark functions noexcept
dirty_memory_manager: region_group: make simple constructor noexcept
dirty_memory_manager: region_group_reclaimer mark functions noexcept
logalloc: lsa_buffer: mark functions noexcept
We know that sstable_run is supposed to contain disjoint files only,
but this assumption can temporarily break when switching strategies
as TWCS, for example, can incorrectly pick the same run id for
sstables in different windows during segregation. So when switching
from TWCS to ICS, it could happen a sstable_run won't contain disjoint
files. We should definitely fix TWCS and any other strategy doing
that, but sstable_run should have disjointness as actual invariant,
not be relaxed on it. Otherwise, we cannot build readers on this
assumption, so more complicated logic have to be added to merge
overlapping files.
After this patch, sstable_run will reject insertion of a file that
will cause the invariant to break, so caller will have to check
that and push that file into a different sstable run.
Closes#11116