Define table_schema_version as a distinct tagged_uuid class,
so it can be differentiated from other uuid-class types,
in particular table_id.
Add reversed(table_schema_version) for convenience
and uniformity, since the same logic is currently open-coded
in several places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
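The idea of a tagged UUID can be sketched as follows. This is a minimal illustration, not the actual utils::tagged_uuid: the `uuid` struct is a stand-in for utils::UUID, and `is_same_table` is a hypothetical function used only to show the type check.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-in for utils::UUID (assumption: the real class has a richer interface).
struct uuid {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    bool operator==(const uuid& o) const { return msb == o.msb && lsb == o.lsb; }
};

// A UUID wrapper parameterized by an otherwise unused tag type.
// Two instantiations with different tags are unrelated types.
template <typename Tag>
struct tagged_uuid {
    uuid id;
    tagged_uuid() = default;
    explicit tagged_uuid(uuid u) : id(u) {}
    bool operator==(const tagged_uuid& o) const { return id == o.id; }
};

using table_id = tagged_uuid<struct table_id_tag>;
using table_schema_version = tagged_uuid<struct table_schema_version_tag>;

// Neither a schema version nor a raw uuid converts implicitly to a table_id.
static_assert(!std::is_convertible_v<table_schema_version, table_id>);
static_assert(!std::is_convertible_v<uuid, table_id>);

// Hypothetical function that only accepts table ids.
inline bool is_same_table(table_id a, table_id b) { return a == b; }
```

With these definitions, calling `is_same_table` with a `table_schema_version` argument fails to compile, which is the kind of mix-up the distinct types are meant to catch.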
Fixes #11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than defining generate_random,
and use respectively in unit tests.
(It was inherited from raft::internal::tagged_id.)
This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add include statements to satisfy dependencies.
Delete the now-unneeded include directives from the upper-level
source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For generating #include directives in the generated files,
so we don't have to hand-craft the includes for the dependencies
in the right order.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass an optional truncated_at time_point to
truncate_table_on_all_shards instead of the over-complicated
timestamp_func that returns the same time_point on all shards
anyhow, and was only used for coordination across shards.
Since we now synchronize the internal execution phase in
truncate_table_on_all_shards, there is no longer a need
for this timestamp_func.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
timestamp_func
Since in the drop_table case we want to discard ALL
sstables in the table, not only those with `max_data_age()`
up until drop started.
Fixes #11232
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Following up on 1c26d49fba,
apply mutations on the correct db shard in all test cases
before we define and use database::truncate_table_on_all_shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving an exponentially increasing
amount of space. Also, to avoid large allocations, it switches the
container to chunked_vector.
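The difference between reserving one slot at a time and reserving geometrically can be sketched as below. This is an illustration under assumptions: it uses std::vector rather than chunked_vector, and `reserve_one` and `count_reallocations` are hypothetical helpers, not the names used in range_tombstone_list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: make room for one more element up front, so the
// following push_back cannot throw. Doubling the capacity keeps the total
// copying cost linear, whereas reserve(size() + 1) may reallocate and copy
// all existing entries on every call.
template <typename T>
void reserve_one(std::vector<T>& v) {
    if (v.capacity() == v.size()) {
        v.reserve(v.empty() ? 1 : v.size() * 2);
    }
}

// Counts how many times the buffer is reallocated while appending n entries.
inline std::size_t count_reallocations(std::size_t n) {
    std::vector<int> log;
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = log.capacity();
        reserve_one(log);                    // may throw, but before any state changes
        if (log.capacity() != before) {
            ++reallocations;
        }
        log.push_back(static_cast<int>(i));  // space is already reserved
    }
    return reallocations;
}
```

With doubling, appending N entries triggers on the order of log2(N) reallocations instead of up to N.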
Fixes #11211
Closes #11215
These are the first commits out of #10815.
It starts by moving pytest logic out of the common `test/conftest.py`
and into `test/topology/conftest.py`, including removing the async
support as it's not used anywhere else.
There's a fix of a bug of leaving tables in `RandomTables.tables` after
dropping all of them.
Keyspace creation is moved out of `conftest.py` into `RandomTables` as
it makes more sense and this way topology tests avoid all the
workarounds for old versions (topology needs ScyllaDB 5+ for Raft,
anyway).
And a minor fix.
Closes #11210
* github.com:scylladb/scylladb:
test.py: fix type hint for seed in ScyllaServer
test.py: create/drop keyspace in tables helper
test.py: RandomTables clear list when dropping all tables
test.py: move topology conftest logic to its own
test.py: async topology tests auto run with pytest_asyncio
Since all topology tests will use the helper, create the keyspace in the
helper.
Avoid the need to drop all tables per test and just drop the
keyspace.
While there, use blocking CQL execution so it can be used in the
constructor and avoids possible issues with scheduling on cleanup. Also,
creation and drop should happen only once per cluster and no test should
be running changes (either not started or finished).
All topology tests are for Scylla with Raft. So don't use the Cassandra
this_dc workaround as it's unnecessary for Scylla.
Remove return type of random_tables fixture to match other fixtures
everywhere else.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Clear the list of active tables when dropping them.
While there do the list element exchange atomically across active and
removed tables lists.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move asyncio, Raft checks, and RandomTables to topology test suite's own
conftest file.
While there, use non-async version of pre-checks to avoid unnecessary
complexity (we want async tests, not async setup, for now).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Async tests and fixtures in the topology directory are expected to run
with pytest_asyncio (not other async frameworks). Force this with auto
mode.
CI has an older pytest_asyncio version lacking pytest_asyncio.fixture.
Auto mode avoids the need for it, and tests and fixtures can just
be marked with the regular @pytest.mark.asyncio.
This way tests can run in both older and newer versions of the packages.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
"
There are several helpers in this .cc file that need to get datacenter
for endpoints. For it they use global snitch, because there's no other
place out there to get that data from.
The whole dc/rack info is now moving to topology, so this set patches
consistency_level.cc to get the topology. This is done in two ways.
First, the helpers that have keyspace at hand may get the topology via
ks's effective_replication_map.
Two difficult cases are db::is_local() and db.count_local_endpoints()
because both have just inet_address at hand. Those are patched to be
methods of topology itself and all their callers already mess with
token metadata and can get topology from it.
"
* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
consistency_level: Remove is_local() and count_local_endpoints()
storage_proxy: Use topology::local_endpoints_count()
storage_proxy: Use proxy's topology for DC checks
storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
storage_proxy: Use topology local_dc_filter in its methods
storage_proxy: Mark some digest_read_resolver methods private
forwarding_service: Use topology local_dc_filter
storage_service: Use topology local_dc_filter
consistency_level: Use topology local_dc_filter
consitency-level: Call count_local_endpoints from topology
consistency_level: Get datacenter from topology
replication_strategy: Remove hold snitch reference
effective_replication_map: Get datacenter from topology
topology: Add local-dc detection shugar
When the strategy is constructed there's no place to get snitch from
so the global instance is used. However, after the previous patch the
replication strategy no longer needs snitch, so this dependency can
be dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Implementing json2sstable functionality. It allows generating an sstable from a JSON description of its content. It uses the same schema as dump-data, so it is possible to regenerate an existing sstable by feeding the output of dump-data to write.
Most of the scylla storage engine features are supported. The only non-supported features are counters and non-strictly atomic data types (including frozen collections, tuples and UDTs).
Example invocation:
```
scylla sstable write --system-schema system_schema.columns --input-file ./input.json --generation 0
```
Refs: https://github.com/scylladb/scylladb/issues/9681
Future plans:
* Complete support for remaining features (counters and non-atomic types).
* Make sstable format configurable on the command line.
Closes #11181
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add test for sstable write
test/cql-pytest: test-tools.py actually test with multiple sstables
test/cql-pytest: test_tools.py: reduce the number of test-cases
tools/scylla-sstable: introduce the write operation
tools/scylla-sstable: add support for writer operations
tools/scylla-sstable: dump-data: write bound-weight as int
tools/scylla-sstable: dump-data: always write deletion time for cell tombstones
tools/scylla-sstable: dump-data: add timezone to deletion_time
types: publish timestamp_from_string()
We can now do a full circle: dump an sstable to json, generate an
sstable from it, then dump again and compare to the original json.
Expand the existing simple_no_clustering_table and
simple_clustering_table schema/data to improve coverage of things like
TTL, tombstones and static rows.
The test-cases in this suite have a parameter to run with one or
multiple input sstables. This was broken as each test table generated a
single sstable. Fix this so we actually get single/multiple input
sstable coverage.
Currently this test-case exercises all the available component dumpers
with many different schemas. This doesn't add any value for most of the
dumpers, save for the dump-data one. It does have a cost however in
run-time of these test-cases. Test the dumpers which are mostly
indifferent to the schema with just a single one, cutting the number of
generated test-cases from 70 to 30.
Start compaction_manager as a sharded service
and pass a reference to it to the database rather
than having the database construct its own compaction_manager.
This is part of the wider scope effort to decouple compaction from replica database and table.
Closes #11099
* github.com:scylladb/scylladb:
compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges
compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges
main: start compaction_manager as a sharded service
compaction_manager: keep config as member
backlog_controller: keep scheduling_group by value
backlog_controller: scheduling_group: keep io_priority_class by value
backlog_controller: scheduling_group: define default member initializers
backlog_controller: get rid of _interval member
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed, but that is
pointless: if the partition has no more data, just drop the header, as we
won't need it on the next page.
The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.
Fixes: https://github.com/scylladb/scylladb/issues/9482
Closes #11175
* github.com:scylladb/scylladb:
test/cql-pytest: add regression test for "IDL frame truncated" error
mutation_compactor: detach_state(): make it no-op if partition was exhausted
querier: use full_position in shard_mutation_querier
Calling WebAssembly UDFs requires a wasmtime instance. Creating such an instance is expensive,
but these instances can be reused for subsequent calls of the same UDF on various inputs.
This patch introduces a way of reusing wasmtime instances: a wasm instance cache.
The cache stores a wasmtime instance for each UDF and scheduling group. The instances are
evicted using an LRU strategy and their size is based on the size of their wasm memories.
The instances stored in the cache are also dropped when the UDF itself is dropped. For that reason,
the first patch modifies the current implementation of UDF dropping, so that the instance dropping may be added
later. The patch also removes the need to compile the UDF again when dropping it.
The second patch contains the implementation and use of the new cache. The cache is implemented
in `lang/wasm_instance_cache.hh` and the main ways of using it are the `run_script` methods from `wasm.hh`.
The third patch adds tests to `test_wasm.py` that check the correctness and performance of the new
cache. The tests confirm the instance reuse, size limits, instance eviction after timeout and after dropping the UDF.
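The caching scheme described above (one entry per UDF and scheduling group, LRU eviction, a budget based on memory size, removal when the UDF is dropped) can be sketched roughly as follows. This is not the code from `lang/wasm_instance_cache.hh`: `instance`, `cache_key` and `instance_cache` are hypothetical names, keys are plain strings, and the timeout-based eviction is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a wasmtime instance; only its memory size matters here.
struct instance {
    std::size_t memory_bytes;
};

// Key: UDF name plus scheduling group name (plain strings in this sketch).
using cache_key = std::pair<std::string, std::string>;

class instance_cache {
    struct entry {
        cache_key key;
        instance inst;
    };
    std::list<entry> _lru;  // front = most recently used
    std::map<cache_key, std::list<entry>::iterator> _index;
    std::size_t _used = 0;
    std::size_t _capacity;

    void erase(std::list<entry>::iterator it) {
        _used -= it->inst.memory_bytes;
        _index.erase(it->key);
        _lru.erase(it);
    }
public:
    explicit instance_cache(std::size_t capacity_bytes) : _capacity(capacity_bytes) {}

    // Returns the cached instance, if any, and marks it as most recently used.
    instance* get(const cache_key& k) {
        auto it = _index.find(k);
        if (it == _index.end()) {
            return nullptr;
        }
        _lru.splice(_lru.begin(), _lru, it->second);  // iterators stay valid
        return &it->second->inst;
    }

    // Inserts or replaces an instance, then evicts from the LRU end
    // until the total memory size fits the budget again.
    void put(const cache_key& k, instance inst) {
        auto old = _index.find(k);
        if (old != _index.end()) {
            erase(old->second);
        }
        _lru.push_front(entry{k, inst});
        _index[k] = _lru.begin();
        _used += inst.memory_bytes;
        while (_used > _capacity && !_lru.empty()) {
            erase(std::prev(_lru.end()));
        }
    }

    // Drops every instance of a UDF, e.g. when the UDF itself is dropped.
    void remove_udf(const std::string& udf) {
        for (auto it = _lru.begin(); it != _lru.end();) {
            auto next = std::next(it);
            if (it->key.first == udf) {
                erase(it);
            }
            it = next;
        }
    }

    std::size_t used_bytes() const { return _used; }
};
```

Accounting by memory size rather than by entry count means a few large instances can push out many small ones, which matches the size-based limit mentioned above.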
Closes #10306
* github.com:scylladb/scylladb:
wasm: test instances reuse
wasm: reuse UDF instances
schema_tables: simplify merge_functions and avoid extra compilation
Currently they are copied for the get_sstables function
so this change reduces copies.
Also, it will allow further decoupling of compaction_manager
from replica::database, by letting the caller of
perform_cleanup and perform_sstable_upgrade get the
owned token ranges from db and pass it to the perform_*
functions in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and return status over the rest api' from Aleksandra Martyniuk
Currently, scrub returns to the user a number indicating the operation
result, as follows:
- 1 when the operation was aborted;
- 3 in validate and segregate modes when validation errors were found
(and, in segregate mode, fixed);
- 0 if the operation ended successfully.
To achieve this, if an operation was aborted in abort mode, the
exception is propagated to storage_service.cc. The number of
validation errors for the current scrub is also gathered there and
summed across shards.
The number of validation errors is counted and registered in metrics.
However, metrics provide common counters for all scrub operations within
a compaction manager. Thus, to check the exact number of validation
errors, the counter value must be compared before and after the
scrub operation.
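The status mapping and the before/after counter comparison can be sketched as below. The names `scrub_status`, `shard_scrub_result`, `combine` and `errors_during_scrub` are hypothetical, and giving "aborted" precedence over validation errors is an assumption, not something stated here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Numeric values as listed above.
enum class scrub_status : int {
    successful = 0,
    aborted = 1,
    validation_errors = 3,
};

// Hypothetical per-shard outcome of one scrub operation.
struct shard_scrub_result {
    bool aborted = false;
    std::uint64_t validation_errors = 0;
};

// Sums validation errors over all shards and maps the outcome to a status.
// Assumption: an abort on any shard takes precedence over validation errors.
inline scrub_status combine(const std::vector<shard_scrub_result>& shards) {
    std::uint64_t total = 0;
    for (const auto& s : shards) {
        if (s.aborted) {
            return scrub_status::aborted;
        }
        total += s.validation_errors;
    }
    return total > 0 ? scrub_status::validation_errors : scrub_status::successful;
}

// The metric is shared by all scrub operations, so the errors attributable to
// one scrub are the difference between two readings of the same counter.
inline std::uint64_t errors_during_scrub(std::uint64_t before, std::uint64_t after) {
    return after - before;
}
```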
Closes #11074
* github.com:scylladb/scylladb:
scrub compaction: return status indicating aborted operations over the rest api
test: move scylla_inject_error from alternator/ to cql-pytest/
scrub compaction: count validation errors and return status over the rest api
scrub compaction: count validation errors for specific scrub task
compaction: extract statistics in compaction_result
scrub compaction: register validation errors in metrics
scrub compaction: count validation errors
This reverts commit c3bad157e5, reversing
changes made to e66809d051. The checks it
adds are triggered by some dtests. While it's possible the check is
triggered due to an existing problem, better to investigate it out-of-tree.
Fixes #11169.
over the rest api
When performing a compaction scrub, the user did not know whether the
operation was aborted.
If a compaction scrub is aborted, the return status the user gets over
the rest api is set to 1.
Move scylla_inject_error from alternator/ to cql-pytest/ so it
can be reached from various tests dirs. alternator/util.py is
renamed to alternator/alternator_util.py to avoid name shadowing.
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed, but that is
pointless: if the partition has no more data, just drop the header, as we
won't need it on the next page.
The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.
Fixes: https://github.com/scylladb/scylla/issues/9482
Closes #11137
* github.com:scylladb/scylladb:
test/cql-pytest: add regression test for "IDL frame truncated" error
query: query_result_builder: add check for missing partition-end
mutation_compactor: detach_state(): make it no-op if partition was exhausted
querier: use full_position in shard_mutation_querier
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if
the bucket to compact is inflated, having too many
sstables. In that case we don't want to add fuel
to the fire by creating yet another sstable.
Fixes #4116
Closes #10954
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
compaction_manager: add maybe_wait_for_sstable_count_reduction
time_window_compaction_strategy: get_sstables_for_compaction: clean up code
time_window_compaction_strategy: make get_sstables_for_compaction idempotent
time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
leveled_manifest: pass compaction_counter as const&
Currently logalloc::region relies on the boost binomial_heap handle to properly move the listener registration when the region (when derived from dirty_memory_manager_logalloc::size_tracked_region) is moved, like boost::intrusive link hooks do;
hence 81e20ceaab/dirty_memory_manager.cc (L89-L90) does nothing.
Unfortunately, this doesn't work as expected.
This series adds a unit test that verifies the move semantics
and a fix to size_tracked_region and region_group code to make it pass.
Also, "logalloc: region: get_impl might be called on disengaged _impl when moved"
fixes a couple of corner cases where the shared _impl could be dereferenced when disengaged,
and adds a unit test for that too.
Closes #11141
* github.com:scylladb/scylla:
logalloc: region: properly track listeners when moved
logalloc: region_impl: add moved method
logalloc: region: merge: optimize getting other impl
logalloc: region: merge: call region_impl::unlisten
logalloc: region: call unlisten rather than open coding it
logalloc: region move-ctor: initialize _impl
logalloc: region: get_impl might be called on disengaged _impl when moved
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.
Consider a simplified mutation reversal scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):
    schema_ptr s = schema_builder{"ks", "cf"}
        .with_column("pk", bytes_type, column_kind::partition_key)
        .with_column("ck1", bytes_type, column_kind::clustering_key)
        .build();
Input range tombstone positions:
    {clustered, ckp{}, before}
    {clustered, ckp{1}, after}
Clustering rows:
    {clustered, ckp{2}, equal}
    {clustered, ckp{}, after} // dummy row
During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). The read order in the example above is:
Reversed range tombstone positions:
    1: {clustered, ckp{}, before}
    2: {clustered, ckp{1}, before}
Clustering rows read backwards:
    3: {clustered, ckp{}, after} // dummy row
    4: {clustered, ckp{2}, equal}
Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
greater than each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point do we notice it's
a dummy and should be skipped. The subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).
The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.
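The effect of skipping dummy rows before the comparison can be sketched with a toy merge. This is a strong simplification and not consume_clustering_fragments itself: positions are plain integers already mapped to the reversed order (0 for the first tombstone bound, 1 for the ckp{2} row, 2 for the second tombstone bound, 3 for the dummy row), and `row` and `merge_in_order` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified clustering row: an integer position plus a dummy flag.
struct row {
    int position;
    bool dummy;
};

// Merges tombstone positions and rows into one ordered stream of positions.
// Dummy rows are skipped at the top of the loop, before their position can
// influence which tombstone positions get emitted.
inline std::vector<int> merge_in_order(const std::vector<int>& tombstone_positions,
                                       const std::vector<row>& rows) {
    std::vector<int> out;
    std::size_t t = 0;
    for (const row& r : rows) {
        if (r.dummy) {
            continue;  // ignore before any comparison is made
        }
        while (t < tombstone_positions.size() && tombstone_positions[t] <= r.position) {
            out.push_back(tombstone_positions[t++]);
        }
        out.push_back(r.position);
    }
    while (t < tombstone_positions.size()) {
        out.push_back(tombstone_positions[t++]);
    }
    return out;
}
```

If the dummy row at position 3 took part in the comparison, both tombstone positions would be emitted before the row at position 1, giving the out-of-order sequence 0, 2, 1.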
Fixes: https://github.com/scylladb/scylla/issues/11147
Closes #11129
* github.com:scylladb/scylla:
mutation: Add test if mutations are consumed in order
test: Move validating_consumer to test/lib/mutation_assertions.hh
mutation: Ignore dummy rows when consuming clustering fragments
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.
Add a unit test for these odd corner cases.
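The engaged check can be sketched as follows. This is a minimal model, not logalloc::region: `region_sketch` and its `impl` are hypothetical, and the only assumption carried over is that the implementation object holds a back-pointer to its owning region.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class region_sketch {
    struct impl {
        region_sketch* owner;  // back-pointer that must follow the owner on move
    };
    std::shared_ptr<impl> _impl;
public:
    region_sketch() : _impl(std::make_shared<impl>(impl{this})) {}

    region_sketch(region_sketch&& other) noexcept : _impl(std::move(other._impl)) {
        if (_impl) {  // `other` may already have been moved from
            _impl->owner = this;
        }
    }

    region_sketch& operator=(region_sketch&& other) noexcept {
        if (this != &other) {
            _impl = std::move(other._impl);
            if (_impl) {
                _impl->owner = this;
            }
        }
        return *this;
    }

    bool engaged() const noexcept { return static_cast<bool>(_impl); }
    bool owns_impl() const noexcept { return _impl && _impl->owner == this; }
};
```

Without the `if (_impl)` checks, moving from an already moved-from object would dereference a null pointer.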
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-table, and per-shard reporting makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Instead of a histogram per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).
Closes #11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
The series unifies memtable flush error handling into table::seal_active_memtable
following up on f6d9d6175f.
The goal here is to prevent an infinite retry loop as in #10498
by aborting on any error that is not bad_alloc.
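The retry policy can be sketched as a plain loop. This is an illustration only: the real table::seal_active_memtable is asynchronous, and `flush_with_retry` and its `max_attempts` parameter are hypothetical stand-ins for the "abort if retried for too long" logic.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Calls flush() until it succeeds and returns the number of attempts made.
// Only std::bad_alloc is considered retryable; any other exception, or too
// many retries, aborts the process instead of looping forever.
template <typename Flush>
unsigned flush_with_retry(Flush&& flush, unsigned max_attempts) {
    for (unsigned attempt = 1;; ++attempt) {
        try {
            flush();
            return attempt;
        } catch (const std::bad_alloc&) {
            if (attempt >= max_attempts) {
                std::abort();  // retried for too long
            }
            // Otherwise fall through and retry; memory may have been released.
        } catch (...) {
            std::abort();      // unexpected error: do not retry
        }
    }
}
```

Aborting trades availability of a single node for a bounded failure mode: a flush that can never succeed no longer blocks writes indefinitely.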
Fixes #10498
Closes #10691
* github.com:scylladb/scylla:
test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
test: memtable_test: failed_flush_prevents_writes: extend error injection
table: seal_active_memtable: abort if retried for too long
table: seal_active_memtable: abort on unexpected error
table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
dirty_memory_manager: flush_permit: add has_sstable_write_permit
dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
dirty_memory_manager: flush_permit: make _sstable_write_permit optional
table: reindent seal_active_memtable
table: coroutinize seal_active_memtable
memtable_list: mark functions noexcept
commitlog: make discard_completed_segments and friends noexcept
dirty_memory_manager: flush_when_needed: target error handling at flush_one
database: delete unused seal_delayed_fn_type
dirty_memory_manager: mark functions noexcept
memtable: mark functions noexcept
memtable: memtable_encoding_stats_collector: mark functions noexcept
encoding_state: mark functions noexcept
logalloc: mark free functions noexcept
logalloc: allocating_section: mark functions noexcept
logalloc: allocating_section: guard: mark constructor noexcept
logalloc: reclaim_lock: mark functions noexcept
logalloc: tracker_reclaimer_lock: mark constructor noexcept
logalloc: mark shard_tracker noexcept
logalloc: region: mark functions const/noexcept
logalloc: basic_region_impl: mark functions noexcept
logalloc: region_impl: mark functions noexcept
utils: log_heap: mark functions noexcept
logalloc: region_impl: object_descriptor: mark functions noexcept
logalloc: region_group: mark functions noexcept
logalloc: tracker: mark functions const/noexcept
logalloc: tracker::impl: make region_occupancy and friends const
logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
logalloc: tracker::impl: mark functions noexcept
logalloc: segment: mark functions const / noexcept
logalloc: segment_pool: add const variant of descriptor method
logalloc: segment_pool: move descriptor method to class definition
logalloc: segment_pool: mark functions const/noexcept
logalloc: segment_pool: delete unused free_or_restore_to_reserve method
utils: dynamic_bitset: mark functions noexcept
utils: dynamic_bitset: delete unused members
logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
logalloc: segment_store: make can_allocate_more_segments const
logalloc: segment_store: mark functions noexcept
logalloc: segment_descriptor: mark functions noexcept
logalloc: occupancy_stats: mark functions noexcept
min_max_tracker: mark functions noexcept
gc_clock, db_clock: mark functions noexcept
dirty_memory_manager: region_group: mark functions noexcept
dirty_memory_manager: region_group: make simple constructor noexcept
dirty_memory_manager: region_group_reclaimer mark functions noexcept
logalloc: lsa_buffer: mark functions noexcept