Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundaries, so a large partition is stored in a single file. That causes several problems, such as memory pressure (e.g. https://github.com/scylladb/scylladb/issues/4217), and it also prevents incremental compaction from fulfilling its promise, since the file storing the large partition can only be released once it is exhausted.
The first step was to add clustering range metadata for the first and last partition keys (retrieved from the promoted index), which is crucial for determining disjointness at the clustering level, as well as the order in which the disjoint files should be opened for incremental reading.
The second step was to extend sstable_run to look at the clustering dimension, so that a set of files storing disjoint ranges of the same partition can live in the same sstable run.
The final step was to introduce an option for compaction to split the large partition being written once it exceeds the size threshold.
What's next? Following this series, a reader will be implemented for sstable_run that incrementally opens the per-fragment readers. It can safely rely on the disjointness invariant established in the second step above.
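To illustrate the invariant the series relies on, here is a minimal, self-contained sketch (not ScyllaDB code; the fragment type and integer positions are simplified stand-ins for clustering positions) of checking that fragments storing the same split partition cover non-overlapping clustering ranges, which is also the order in which an incremental reader would open them:
```
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified stand-in for a run fragment: the clustering range it covers
// for one large partition, expressed as [first, last] positions.
struct fragment {
    int first_pos;
    int last_pos;
};

// Fragments are disjoint at the clustering level if, once sorted by their
// first position, each fragment ends before the next one begins.
bool clustering_disjoint(std::vector<fragment> frags) {
    std::sort(frags.begin(), frags.end(),
              [](const fragment& a, const fragment& b) { return a.first_pos < b.first_pos; });
    for (size_t i = 1; i < frags.size(); ++i) {
        if (frags[i - 1].last_pos >= frags[i].first_pos) {
            return false; // overlap: cannot live in the same run
        }
    }
    return true; // safe to open the fragments one after another
}

int main() {
    // A large partition split into three fragments at clustering boundaries.
    assert(clustering_disjoint({{0, 9}, {10, 19}, {20, 29}}));
    // Overlapping fragments would violate the run invariant.
    assert(!clustering_disjoint({{0, 15}, {10, 29}}));
}
```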
Closes #11233
* github.com:scylladb/scylladb:
test: Add test for large partition splitting on compaction
compaction: Add support to split large partitions
sstable: Extend sstable_run to allow disjointness on the clustering level
sstables: simplify will_introduce_overlapping()
test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
test: lib: Fix inefficient merging of mutations in make_sstable_containing()
sstables: Keep track of first partition's first pos and last partition's last pos
sstables: Rename min/max position_range to a descriptive name
sstables_manager: Add sstable metadata reader concurrency semaphore
sstables: Add ability to find first or last position in a partition
Halted background fibers render the raft server effectively unusable, so
report this explicitly to the clients.
Fix: #11352
Closes #11370
* github.com:scylladb/scylladb:
raft server, status metric
raft server, abort group0 server on background errors
raft server, provide a callback to handle background errors
raft server, check aborted state on public server public api's
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces partition-key overlap. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at the clustering dimension, so that fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With the first partition's first position and the last partition's last
position, we'll be able to determine which fragments composing an
sstable run store a large partition that was split.
Then the sstable run will be able to detect whether all fragments storing
a given large partition are disjoint at the clustering level.
Fixes #10637.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.
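As a rough illustration of the idea only — ScyllaDB uses Seastar's asynchronous semaphore rather than OS threads, and the names below are made up — bounding concurrent metadata reads with a counting semaphore looks roughly like this:
```
#include <cstdio>
#include <semaphore>
#include <thread>
#include <vector>

// Allow at most 4 sstable-metadata reads in flight at a time; the rest wait.
std::counting_semaphore<4> metadata_read_permits{4};

void read_metadata(int sstable_id) {
    metadata_read_permits.acquire();   // blocks once 4 reads are in flight
    std::printf("reading metadata of sstable %d\n", sstable_id);
    metadata_read_permits.release();
}

int main() {
    std::vector<std::thread> readers;
    for (int i = 0; i < 32; ++i) {      // startup opens many sstables at once
        readers.emplace_back(read_metadata, i);
    }
    for (auto& t : readers) {
        t.join();
    }
}
```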
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new method allows an sstable to load the first row of the first
partition and the last row of the last partition.
That's useful for incremental reading of an sstable run that was
split at a clustering boundary.
To get the first row, it consumes the first fragment (which can be
either a clustering row or a range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above, but in reverse
mode.
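A toy sketch of the idea (generic C++ with a hypothetical fragment type; the real method drives an sstable reader, optionally in reverse mode, rather than a vector):
```
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for a mutation fragment carrying a position in partition.
struct fragment {
    int position_in_partition;
};

// Consume the first fragment and return its position.
std::optional<int> first_position(const std::vector<fragment>& frags) {
    if (frags.empty()) {
        return std::nullopt;
    }
    return frags.front().position_in_partition;
}

// Same idea, but consuming in reverse mode: the first fragment seen when
// reading backwards carries the last position of the partition.
std::optional<int> last_position(const std::vector<fragment>& frags) {
    for (auto it = frags.rbegin(); it != frags.rend(); ++it) {
        return it->position_in_partition;
    }
    return std::nullopt;
}

int main() {
    std::vector<fragment> frags{{1}, {5}, {9}};
    assert(first_position(frags) == 1);
    assert(last_position(frags) == 9);
}
```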
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This series introduces two configurable options when working with TWCS tables:
- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - forbids the user from creating TWCS tables whose window count (number of buckets) exceeds a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.
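The window-count check boils down to simple arithmetic: the number of windows a table accumulates is roughly its default_time_to_live divided by its compaction window size. A hedged sketch of that check (hypothetical names, not the actual validation code):
```
#include <cassert>
#include <cstdint>

// Approximate number of TWCS windows: TTL divided by the window size.
// Both values are taken in seconds here for simplicity.
uint64_t estimated_window_count(uint64_t default_ttl_seconds, uint64_t window_size_seconds) {
    return window_size_seconds ? default_ttl_seconds / window_size_seconds : 0;
}

// Returns true if the table definition should be rejected.
bool violates_max_window_count(uint64_t default_ttl_seconds,
                               uint64_t window_size_seconds,
                               uint64_t twcs_max_window_count) {
    if (twcs_max_window_count == 0) {
        return false; // a setting of 0 disables the check
    }
    return estimated_window_count(default_ttl_seconds, window_size_seconds) > twcs_max_window_count;
}

int main() {
    // 90-day TTL with 1-day windows: 90 windows, above the default limit of 50.
    assert(violates_max_window_count(90 * 86400, 86400, 50));
    // 30-day TTL with 1-day windows: 30 windows, accepted.
    assert(!violates_max_window_count(30 * 86400, 86400, 50));
}
```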
Refs: #6923
Fixes: #9029
Closes #11445
* github.com:scylladb/scylladb:
tests: cql_query_test: add mixed tests for verifying TWCS guard rails
tests: cql_query_test: add test for TWCS window size
tests: cql_query_test: add test for TWCS tables with no TTL defined
cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
cql: add max window restriction for TimeWindowCompactionStrategy
time_window_compaction_strategy: reject invalid window_sizes
cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
"
The test in question plays with snitches to simulate the topology
over which tokens are spread. This set replaces explicit snitch
usage with a temporary topology object.
Some snitch traces are still left, but those are for token_metadata
internals, which still call the global snitch for DC/RACK.
"
* 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla:
network_topology_strategy_test: Use topology instead of snitch
network_topology_strategy_test: Populate explicit topology
dirty_memory_manager tracks lsa regions (memtables) under region_group:s,
in order to be able to pick up the largest memtable as a candidate for
flushing.
Just as region_group:s contain regions, they can also contain other
region_group:s in a nested structure. The manager also tracks, in a
binomial heap, the nested region_group that contains the largest region.
This latter facility is no longer used. It saw use when we had the system
dirty_memory_manager nested under the user dirty_memory_manager, but
that proved too complicated so it was undone. We still nest a virtual
region_group under the real region_group, and in fact it is the
virtual region_group that holds the memtables, but it is accessed
directly to find the largest memtable (region_group::get_largest_region)
and so all the mechanism that sorts region_group:s is bypassed.
Start to dismantle this house of cards by removing the subgroup
sorting. Since the hierarchy has exactly one parent and one child,
it's clearly useless. This is seen by the fact that we can just remove
everything related.
We still need the _subgroups member to hold the virtual region_group;
it's replaced by a vector. I verified that the non-intrusive vector
is exception safe since push_back() happens at the very end; in any
case this is early during setup where we aren't under memory pressure.
A few tests that check the removed functionality are deleted.
Closes #11515
This patch adds a set of 10 scenarios that were unveiled during additional testing.
In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled -
may break the guardrails safe mode. The situations covered are:
- STCS->TWCS with no TTL defined
- STCS->TWCS with small TTL
- STCS->TWCS with large TTL value
- TWCS table with small to large TTL
- No TTL TWCS to large TTL and then small TTL
- twcs_max_window_count LiveUpdate - Decrease TTL
- twcs_max_window_count LiveUpdate - Switch CompactionStrategy
- No TTL TWCS table to STCS
- Large TTL TWCS table, modify attribute other than compaction and default_time_to_live
- Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined
This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy
with an incorrect number of compaction windows.
The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the
test in order to ensure that users can effectively disable the enforcement, should they want to.
This patch adds a testcase for TimeWindowCompactionStrategy tables created with no
default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_default_ttl
parameter in order to determine whether TWCS tables without TTL should be forbidden or not.
The test replays all 3 possible variations of the tri_mode_restriction and verifies tables
are correctly created/altered according to the current setting on the replica which receives
the request.
Now memtables live in compaction_group. This also introduces a function
that selects a group based on the token, but today the table always returns
the single group managed by it. Once multiple groups are supported,
the function should interpret the token content to select the
group.
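A minimal sketch of the selection function described above (hypothetical types; today the token is ignored and the single group is returned, later the token would be mapped to one of several groups):
```
#include <cassert>
#include <vector>

struct compaction_group {
    int id;
};

struct table {
    // Today a table owns exactly one compaction group.
    std::vector<compaction_group> groups{{0}};

    // Select the group responsible for a token. Today the token is ignored;
    // once multiple groups exist, the token would be mapped to one of them,
    // e.g. by the token range it falls into.
    compaction_group& compaction_group_for_token(long token) {
        (void)token;
        return groups.front();
    }
};

int main() {
    table t;
    assert(t.compaction_group_for_token(42).id == 0);
    assert(t.compaction_group_for_token(-7).id == 0);
}
```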
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving the maintenance set into compaction_group.
Next, we'll introduce the compound set into it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Preparatory change for the main sstable set to be moved into the compaction
group. After that, tests can no longer directly access the main
set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
column_family_test::add_sstable will soon be changed to run in a thread,
and it's not needed in this procedure, so let's remove its usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.
The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.
In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.
In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing
Closes #11164
* github.com:scylladb/scylladb:
raft: broadcast_tables: add broadcast_kv_store test
raft: broadcast_tables: add returning query result
raft: broadcast_tables: add execution of intermediate language
raft: broadcast_tables: add compilation of cql to intermediate language
raft: broadcast_tables: add definition of intermediate language
db: system_keyspace: add broadcast_kv_store table
db: config: add BROADCAST_TABLES feature flag
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.
It is going to be used in the next patches to pass the gc state
from the compaction_strategy down to sstables and compaction.
table_state_for_test was modified to just keep a null
tombstone_gc_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Most of the test's cases use the rack-inferring snitch driver and get
DC/RACK from it via the test_dc_rack() helper. The helper was introduced
in one of the previous sets to populate token metadata with some DC/RACK,
as normal token manipulations required the respective endpoint to be in the topology.
This patch removes the usage of the global snitch and replaces it with the
pre-populated topology. The pre-population is done in a rack-inferring-snitch-like
manner, since token_metadata still uses the global snitch, and the
locations from the snitch and this temporary topology should match.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a test case that makes its own snitch driver that generates
pre-calculated DC/RACK data for test endpoints. This patch replaces this
custom snitch driver with a standalone topology object.
Note: to get DC/RACK info from this topology, get_location() is used,
since get_rack()/get_datacenter() are still wrappers around the global
snitch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add an experimental flag 'broadcast-tables' for enabling the BROADCAST_TABLES feature.
This feature requires raft group0, so enabling it without RAFT will cause an error.
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op; there is nothing to do there.
Also add two tests: one targeting this bug and another testing the
crawling reader with random mutations generated for a random schema.
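A schematic illustration of the bug and the fix (a hypothetical consumer, not the actual reader code): the end-of-stream hook used to emit its own partition-end even though the unconditional partition-end hook runs anyway, so the stream ended with two partition-end fragments.
```
#include <cassert>
#include <string>
#include <vector>

struct consumer {
    std::vector<std::string> emitted;

    // Called only when the stream ends on a trailing range tombstone change.
    void consume_end_of_stream() {
        // Before the fix this also emitted "partition-end" here,
        // duplicating the fragment. The fix makes it a no-op.
    }

    // Called unconditionally at the end of every partition.
    void consume_partition_end() {
        emitted.push_back("partition-end");
    }
};

int main() {
    consumer c;
    c.consume_end_of_stream();
    c.consume_partition_end();
    // Exactly one partition-end fragment is emitted per partition.
    assert(c.emitted == std::vector<std::string>{"partition-end"});
}
```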
Fixes: #11421
Closes #11422
"
The topology object maintains all sorts of node/DC/RACK mappings on
board. When new entries are added to it, the DC and RACK are taken
from the global snitch instance which, in turn, checks the gossiper,
system keyspace and its local caches.
This set makes the topology population API require DC and RACK via the
call argument. In most of the cases the populating code is the
storage service, which knows exactly where to get those from.
After this set it will be possible to remove the dependency knot
consisting of snitch, gossiper, system keyspace and messaging.
"
* 'br-topology-dc-rack-info' of https://github.com/xemul/scylla:
toplogy: Use the provided dc/rack info
test: Provide testing dc/rack infos
storage_service: Provide dc/rack for snitch reconfiguration
storage_service: Provide dc/rack from system ks on start
storage_service: Provide dc/rack from gossiper for replacement
storage_service: Provide dc/rack from gossiper for remotes
storage_service,dht,repair: Provide local dc/rack from system ks
system_keyspace: Cache local dc-rack on .start()
topology: Some renames after previous patch
topology: Require entry in the map for update_normal_tokens()
topology: Make update_endpoint() accept dc-rack info
replication_strategy: Accept dc-rack as get_pending_address_ranges argument
dht: Carry dc-rack over boot_strapper and range_streamer
storage_service: Make replacement info a real struct
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:
scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.
The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.
If the schemas were the same, or not conflicting in a way that leads
to an abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.
The proposed fix is to make the CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different: the older table will be dropped.
Fixes #11396
There's a test that's sensitive to correct dc/rack info for testing
entries. To populate them it uses the global rack-inferring snitch instance
or a special "testing" snitch. To keep it working, add a helper
that populates the topology properly (spoiler: the next branch will
replace it with an explicitly populated topology object).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question tries to be on the safe side and adds the
endpoint for which it updates the tokens into the topology. From now on
it's up to the caller to put the endpoint into the topology in advance.
So most of what this patch does is place topology.update_endpoint() calls
in the relevant places of the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tests are the only users of the batch token-updating "sugar", which
actually makes things more complicated.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the state of LSA is scattered across a handful of global variables. This series consolidates all of these into a single one: the shard tracker. Beyond reducing the number of globals (the fewer globals, the better), this paves the way for a planned de-globalization of the shard tracker itself.
There is one separate global left, the static migrators registry. This is left as-is for now.
Closes #11284
* github.com:scylladb/scylladb:
utils/logalloc: remove reclaim_timer:: globals
utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg
utils/logalloc: move global stat accessors to tracker
utils/logalloc: allocating_section: don't use the global tracker
utils/logalloc: pass down tracker::impl reference to segment_pool
utils/logalloc: move segment pool into tracker
utils/logalloc: add tracker member to basic_region_impl
utils/logalloc: make segment independent of segment pool
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.
Instead, iterate over the range_tombstone_list in the reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.
While at it, this series contains some other cleanups on this path
to improve the code's readability and maybe make the compiler's life
easier when optimizing the cleaned-up code.
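A minimal sketch of the approach (a simplified range_tombstone with integer bounds, not the real types): instead of materializing a fully reversed list up front, walk the list with reverse iterators and flip each tombstone's bounds as it is emitted, keeping only the current one in an optional.
```
#include <cassert>
#include <optional>
#include <vector>

// Simplified range tombstone: [start, end) in forward clustering order.
struct range_tombstone {
    int start;
    int end;
};

// Reverse a single tombstone for consumption in reverse clustering order.
range_tombstone reversed(const range_tombstone& rt) {
    return {rt.end, rt.start};
}

int main() {
    std::vector<range_tombstone> rts{{0, 2}, {5, 8}, {10, 12}};

    std::optional<range_tombstone> current; // the "cookie" holding one reversed rt
    std::vector<range_tombstone> consumed;

    // Walk the list backwards and reverse each element on demand, so no
    // full reversed copy of the list is ever built.
    for (auto it = rts.rbegin(); it != rts.rend(); ++it) {
        current = reversed(*it);
        consumed.push_back(*current);
    }

    assert(consumed.front().start == 12 && consumed.front().end == 10);
    assert(consumed.back().start == 2 && consumed.back().end == 0);
}
```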
Closes #11271
* github.com:scylladb/scylladb:
mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
mutation: consume_clustering_fragments: reindent
mutation: consume_clustering_fragments: shuffle emit_rt logic around
mutation: consume, consume_gently: simplify partition_start logic
mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
mutation: consume_clustering_fragments: keep the reversed schema in cookie
mutation: clustering_iterators: get rid of current_rt
mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
These are pretend free functions that access globals in the background;
make them members of the tracker instead, which has everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.
This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.
Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
Add a unit test that verifies that.
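Schematically, consuming in position_in_partition order is a two-way merge of the (already position-sorted) range tombstones and clustering rows; a minimal sketch with integer positions standing in for position_in_partition:
```
#include <cassert>
#include <string>
#include <vector>

struct entry {
    int position;      // stand-in for position_in_partition
    std::string kind;  // "rt" or "row"
};

// Merge range tombstones and rows so the consumer sees positions
// in monotonically increasing order, as range_tombstone_generator requires.
std::vector<entry> consume_in_order(const std::vector<int>& rt_positions,
                                    const std::vector<int>& row_positions) {
    std::vector<entry> out;
    size_t i = 0, j = 0;
    while (i < rt_positions.size() || j < row_positions.size()) {
        bool take_rt = j == row_positions.size() ||
                       (i < rt_positions.size() && rt_positions[i] <= row_positions[j]);
        if (take_rt) {
            out.push_back({rt_positions[i++], "rt"});
        } else {
            out.push_back({row_positions[j++], "row"});
        }
    }
    return out;
}

int main() {
    auto merged = consume_in_order({2, 7}, {1, 3, 8});
    for (size_t k = 1; k < merged.size(); ++k) {
        assert(merged[k - 1].position <= merged[k].position); // monotonic
    }
}
```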
Fixes #11198
Closes #11269
* github.com:scylladb/scylladb:
mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
frozen_mutation: consume and consume_gently in-order
frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.
This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.
Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
Add a unit test that verifies that.
Fixes #11198
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
from Tomasz Grabiec
This series fixes a lack of mutation associativity, which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in the mutations applied and read.
No known production impact.
Refs https://github.com/scylladb/scylladb/issues/11307
Closes #11312
* github.com:scylladb/scylladb:
test: mutation_test: Add explicit test for mutation commutativity
test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
db: mutation_partition: Drop unnecessary maybe_shadow()
db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
mutation_partition: row: make row marker shadowing symmetric
This pull request introduces global secondary-indexing for non-frozen collections.
The intent is to enable such queries:
```
CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);
SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```
We use the all-familiar materialized views (MVs) here. Scylla treats all
collections the same way - they're a list of (key, value) pairs. In the case
of sets, the value type is a dummy one. In the case of lists, the key type is
TIMEUUID. When describing the design, I will ignore the fact that there is more
than one collection type. Suppose that the columns in the base table
were as follows:
```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```
The MV schema is as follows (the names of columns which are not the same
as in the base table might be different). All the columns here form the primary
key.
```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```
The reason for the last additional column is that the values from a collection might not be unique.
Fixes #2962
Fixes #8745
Fixes #10707
This patch does not implement **local** secondary indexes for collection columns: Refs #10713.
Closes #10841
* github.com:scylladb/scylladb:
test/cql-pytest: un-xfail yet another passing collection-indexing test
secondary index: fix paging in map value indexing
test/cql-pytest: test for paging with collection values index
cql, view: rename and explain bytes_with_action
cql, index: make collection indexing a cluster feature
test/cql-pytest: failing tests for oversized key values in MV and SI
cql: fix secondary index "target" when column name has special characters
cql, index: improve error messages
cql, index: fix default index name for collection index
test/cql-pytest: un-xfail several collecting indexing tests
test/cql-pytest/test_secondary_index: verify that local index on collection fails.
docs/design-notes/secondary_index: add `VALUES` to index target list
test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
test/boost/secondary_index_test: update for non-frozen indexes on collections
test/cql-pytest: Uncomment collection indexes tests that should be working now
cql, index: don't use IS NOT NULL on collection column
cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections
cql3, index: Use entries() indexes on collections for queries
cql3, index: Use keys() and values() indexes on collections for queries.
types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
db/view/view.cc: compute view_updates for views over collections
view info: has_computed_column_depending_on_base_non_primary_key
column_computation: depends_on_non_primary_key_column
schema, index/secondary_index_manager: make schema for index-induced mv
index/secondary_index_manager: extract keys, values, entries types from collection
cql3/statements/: validate CREATE INDEX for index over a collection
cql3/statements/create_index_statement,index_target: rewrite index target for collection
column_computation.hh, schema.cc: collection_column_computation
column_computation.hh, schema.cc: compute_value interface refactor
Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" in the semaphore for any reason will not end up accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264
Closes #11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
All users of `column_family_test_config()` get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from its members.