When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.
This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)
When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.
This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.
Fixes: #6200Fixes: #12014Closes#12031
* github.com:scylladb/scylladb:
cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
boost/restrictions-test: uncomment part of the test that passes now
cql-pytest: enable test for filtering combined multi column and regular column restrictions
cql3: don't ignore other restrictions when a multi column restriction is present during filtering
(cherry picked from commit 2d2034ea28)
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
Fixes#11211Closes#11215
(cherry picked from commit 7f80602b01)
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421Closes#11422
(cherry picked from commit be9d1c4df4)
from Tomasz Grabiec
This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.
No known production impact.
Refs https://github.com/scylladb/scylladb/issues/11307Closes#11312
* github.com:scylladb/scylladb:
test: mutation_test: Add explicit test for mutation commutativity
test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
db: mutation_partition: Drop unnecessary maybe_shadow()
db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
mutation_partition: row: make row marker shadowing symmetric
(cherry picked from commit 484004e766)
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.
Closes#11260
(cherry picked from commit 8ee5b69f80)
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264Closes#11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
(cherry picked from commit afa7960926)
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes#11239Closes#11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
Currently they are copied for the get_sstables function
so this change reduces copies.
Also, it will allow further decoupling of compaction_manager
from replica::database, by letting the caller of
perform_cleanup and perform_sstable_upgrade get the
owned token ranges from db and pass it to the perform_*
functions in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
over the rest api
Performing compaction scrub user did not know whether an operation
was aborted.
If compaction scrub is aborted, return status the user gets over
rest api is set to 1.
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if there
the bucket to compact is inflated, having too many
sstables. In that case we don't want to add fuel
to the fire by creating yet another sstable.
Fixes#4116Closes#10954
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
compaction_manager: add maybe_wait_for_sstable_count_reduction
time_window_compaction_strategy: get_sstables_for_compaction: clean up code
time_window_compaction_strategy: make get_sstables_for_compaction idempotent
time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
leveled_manifest: pass compaction_counter as const&
Currently logalloc::region is relying on boost binomial_heap handle to properly move listeners registration when the region (when derived from dirty_memory_manager_logalloc::size_tracked_region) is moved, like boost::intrusive link hooks do -
hence 81e20ceaab/dirty_memory_manager.cc (L89-L90) does nothing.
Unfortunately, this doesn't work as expected.
This series adds a unit test that verifies the move semantics
and a fix to size_tracked_region and region_group code to make it pass.
Also "logalloc: region: get_impl might be called on disengaged _impl when moved"
fixes a couple corner cases where the shared _impl could be dereferenced when disengaged, and
the change also adds a unit test for that too.
Closes#11141
* github.com:scylladb/scylla:
logalloc: region: properly track listeners when moved
logalloc: region_impl: add moved method
logalloc: region: merge: optimize getting other impl
logalloc: region: merge: call region_impl::unlisten
logalloc: region: call unlisten rather than open coding it
logalloc: region move-ctor: initialize _impl
logalloc: region: get_impl might be called on disengaged _impl when moved
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.
Consider a simplified mutation reversal scenario scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):
schema_ptr s = schema_builder{"ks", "cf"}
.with_column("pk", bytes_type, column_kind::partition_key)
.with_column("ck1", bytes_type, column_kind::clustering_key)
.build();
Input range tombstone positions:
{clustered, ckp{}, before}
{clustered, ckp{1}, after}
Clustering rows:
{clustered, ckp{2}, equal}
{clustered, ckp{}, after} // dummy row
During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). The read order in the example above is:
Reversed range tombstone positions:
1: {clustered, ckp{}, before}
2: {clustered, ckp{1}, before}
Clustering rows read backwards:
3: {clustered, ckp{}, after} // dummy row
4: {clustered, ckp{2}, equal}
Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
to be gt each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point we notice it's
dummy and should be skipped. Subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).
The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.
Fixes: https://github.com/scylladb/scylla/issues/11147Closes#11129
* github.com:scylladb/scylla:
mutation: Add test if mutations are consumed in order
test: Move validating_consumer to test/lib/mutation_assertions.hh
mutation: Ignore dummy rows when consuming clustering fragments
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.
Add unit test for these odd orner cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.
The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).
Closes#11058
* github.com:scylladb/scylla:
Add summary_test
database: Reduce the number of per-table metrics
replica/table.cc: Do not register per-table metrics for system
histogram_metrics_helper.hh: Add to_metrics_summary function
Unified histogram, estimated_histogram, rates, and summaries
Split the timed_rate_moving_average into data and timer
utils/histogram.hh: should_sample should use a bitmask
estimated_histogram: add missing getter method
The series unifies memtable flush error handling into table::seal_active_memtable
following up on f6d9d6175f.
The goal here is to prevent an infinite retry loop as in #10498
by aborting on any error that is not bad_alloc.
Fixes#10498Closes#10691
* github.com:scylladb/scylla:
test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
test: memtable_test: failed_flush_prevents_writes: extend error injection
table: seal_active_memtable: abort if retried for too long
table: seal_active_memtable: abort on unexpected error
table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
dirty_memory_manager: flush_permit: add has_sstable_write_permit
dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
dirty_memory_manager: flush_permit: make _sstable_write_permit optional
table: reindent seal_active_memtable
table: coroutinize seal_active_memtable
memtable_list: mark functions noexcept
commitlog: make discard_completed_segments and friends noexcept
dirty_memory_manager: flush_when_needed: target error handling at flush_one
database: delete unused seal_delayed_fn_type
dirty_memory_manager: mark functions noexcept
memtable: mark functions noexcept
memtable: memtable_encoding_stats_collector: mark functions noexcept
encoding_state: mark functions noexcept
logalloc: mark free functions noexcept
logalloc: allocating_section: mark functions noexcept
logalloc: allocating_section: guard: mark constructor noexcept
logalloc: reclaim_lock: mark functions noexcept
logalloc: tracker_reclaimer_lock: mark constructor noexcept
logalloc: mark shard_tracker noexcept
logalloc: region: mark functions const/noexcept
logalloc: basic_region_impl: mark functions noexcept
logalloc: region_impl: mark functions noexcept
utils: log_heap: mark functions noexcept
logalloc: region_impl: object_descriptor: mark functions noexcept
logalloc: region_group: mark functions noexcept
logalloc: tracker: mark functions const/noexcept
logalloc: tracker::impl: make region_occupancy and friends const
logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
logalloc: tracker::impl: mark functions noexcept
logalloc: segment: mark functions const / noexcept
logalloc: segment_pool: add const variant of descriptor method
logalloc: segment_pool: move descriptor method to class definition
logalloc: segment_pool: mark functions const/noexcept
logalloc: segment_pool: delete unused free_or_restore_to_reserve method
utils: dynamic_bitset: mark functions noexcept
utils: dynamic_bitset: delete unused members
logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
logalloc: segment_store: make can_allocate_more_segments const
logalloc: segment_store: mark functions noexcept
logalloc: segment_descriptor: mark functions noexcept
logalloc: occupancy_stats: mark functions noexcept
min_max_tracker: mark functions noexcept
gc_clock, db_clock: mark functions noexcept
dirty_memory_manager: region_group: mark functions noexcept
dirty_memory_manager: region_group: make simple constructor noexcept
dirty_memory_manager: region_group_reclaimer mark functions noexcept
logalloc: lsa_buffer: mark functions noexcept
We know that sstable_run is supposed to contain disjoint files only,
but this assumption can temporarily break when switching strategies
as TWCS, for example, can incorrectly pick the same run id for
sstables in different windows during segregation. So when switching
from TWCS to ICS, it could happen a sstable_run won't contain disjoint
files. We should definitely fix TWCS and any other strategy doing
that, but sstable_run should have disjointness as actual invariant,
not be relaxed on it. Otherwise, we cannot build readers on this
assumption, so more complicated logic have to be added to merge
overlapping files.
After this patch, sstable_run will reject insertion of a file that
will cause the invariant to break, so caller will have to check
that and push that file into a different sstable run.
Closes#11116
Now that memtable flush error handling was moved entirely
to table::seal_active_memtable, we don't need to notify_soft_pressure
to keep retry going. The inifinite retry loop should
eventually either succeed or die (by isolating the node or aborting)
on its own.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And let seal_active_memtable decide about how to handle them
as now all flush error handling logic is implemented there.
In particular, unlike today, sstable write errors will
cause internal error rather than loop forever.
Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.
Refs #10498
(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
logalloc manages regions of log-structured allocated memory, and region_groups
containing such regions and other region_groups. region_groups were introduced
for accounting purposes - first to limit the amount of memory in memtables, then to
match new dirty memory allocation rate with memtable flushing rate so we never
hit a situation where allocation rate exceeded flush rate, and we exceed our limit.
The problem is that the abstraction is very weak - if we want to change anything
in memtable flush control we'll need to change region_groups too - and also
expensive to maintain.
The solution is to break the abstraction and move region_groups to memtable
dirty memory management code. Instead introduce a new, simpler abstraction,
the region_listener, which communicates changes in region memory consumption
to an external piece of code, which can then choose to do with it what it likes.
The long term plan is to completely remove region_groups and fold them into dirty_memory_manager:
- make each memtable a region_listener so it gets called back after size changes
- make memtables inform their dirty_memory_manager about the size to dirty_memory_manager can decide to throttle writes and which memtable to pick to flush
Closes#10839
* github.com:scylladb/scylla:
logalloc: drop region_impl public accessors
logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc
logalloc: relax lifetime rules around region_listener
logalloc, dirty_memory_manager: move region_group and associated code
logalloc: expose tracker_reclaimer_lock
logalloc: reimplement tracker_reclaim_lock to avoid using hidden classes
logalloc: reduce friendship between region and region_group
logalloc: decouple region_group from region
memtable: stop using logalloc::region::group() to test for flushed memtables
This pull request introduces a "synchronous mode" for global views. In this mode, all view updates are applied synchronously as if the view was local.
Marking view as a synchronous one can be done using `CREATE MATERIALIZED VIEW` and `ALTER MATERIALIZED VIEW`. E.g.:
```cql
ALTER MATERIALIZED VIEW ks.v WITH synchronous_updates = true;
```
Marking view as a synchronous one was done using tags (originally used by alternator). No big modifications in the view's code were needed.
Fixes: https://github.com/scylladb/scylla/issues/10545Closes#11013
* github.com:scylladb/scylla:
cql-pytest: extend synchronous mv test with new cases
cql-pytest: allow extra parameters in new_materialized_view
docs: add a paragraph on view synchronous updates
test/boost/cql_query_test: add test setting synchronous updates property
test: cql-pytest: add a test for synchronous mode materialized views
db: view: react to synchronous updates tag
cql3: statements: cf_prop_defs: apply synchronous updates tag
alternator, db: move the tag code to db/tags
cql3: statements: add a synchronous_updates property
The region_group mechanism used an intrusive heap handle embedded in
logalloc::region to allow region_group:s to track the largest region. But
with region_group moved out of logalloc, the handle is out of place.
Move it out, introducing a new intermediate class size_tracked_region
to hold the heap handle. We might eventually merge the new class into
memtable (which derives from it), but that requires a large rearrangement
of unit tests, so defer that.
Currently, a region_listener is added during construction and removed
during destruction. This was done to mimick the old region(region_group&)
constructor, as region_listener replaces region_group.
However, this makes moving the binomial heap handle outside logalloc
difficult. The natural place for the handle is in a derived class
of logalloc::region (e.g. memtable), but members of this derived class
will be destroyed earlier than the logalloc::region here. We could play
trickes with an earlier base class but it's better to just decouple
region lifecycle from listener lifecycle.
Do that be adding listen()/unlisten() methods. Some small awkwardness
remains in that merge() implicitly unlistens (see comment in
region::unlisten).
Unit tests are adjusted.
region_group is an abstraction that allows accounting for groups of
regions, but the cost/benefit ratio of maintaining the abstraction
is poor. Each time we need to change decision algorithm of memtable
flushing (admittedly rarely), we need to distill that into an abstraction
for region_groups and then use it. An example is virtual regions groups;
we wanted to account for the partially flushed memtables and had to
invent region groups to stand in their place.
Rather than continuing to invest in the abstraction, break it now
and move it to the memtable dirty memory manager which is responsible
for making those decisions. The relevant code is moved to
dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and
a new unit test file is added as well.
A downside of the change is that unit testing will be more difficult.
Move closer to the goal of accepting a generic expression for WHERE
clause by accepting a generic expression in statement_restrictions. The
various callers will synthesize it from a vector of terms.
When analyzing a WHERE clause, we want to separate individual
factors (usually relations), and later partition them into
partition key, clustering key, and regular column relations. The
first step is separation, for which this helper is added.
Currently, it is not required since the grammar supplies the
expression in separated form, but this will not work once it is
relaxed to allow any expression in the WHERE clause.
A unit test is added.
This PR removes all code that used classes `restriction`, `restrictions` and their children.
There were two fields in `statement_restrictions` that needed to be dealt with: `_clustering_columns_restrictions` and `_nonprimary_key_restrictions`.
Each function was reimplemented to operate on the new expression representaiion and eventually these fields weren't needed anymore.
After that the restriction classes weren't used anymore and could be deleted as well.
Now all of the code responsible for analyzing WHERE clause and planning a query works on expressions.
Closes#11069
* github.com:scylladb/scylla:
cql3: Remove all remaining restrictions code
cql3: Move a function from restrictions class to the test
cql3: Remove initial_key_restrictions
cql3: expr: Remove convert_to_restriction
cql3: Remove _new from _new_nonprimary_key_restrictions
cql3: Remove _nonprimary_key_restrictions field
cql3: Reimplement uses of _nonprimary_key_restrictions using expression
cql3: Keep a map of single column nonprimary key restrictions
cql3: Remove _new from _new_clustering_columns_restrictions
cql3: Remove _clustering_columns_restrictions from statement_restrictions
cql3: Use a variable instead of dynamic cast
cql3: Use the new map of single column clustering restrictions
cql3: Keep a map of single column clustering key restrictions
cql3: Return an expression in get_clustering_columns_restrctions()
cql3: Reimplement _clustering_columns_restrictions->has_supporting_index()
cql3: Don't create single element conjunction
cql3: Add expr::index_supports_some_column
cql3: Reimplement has_unrestricted_components()
cql3: Reimplement _clustering_columns_restrictions->need_filtering()
cql3: Reimplement num_prefix_columns_that_need_not_be_filtered
cql3: Use the new clustering restrictions field instead of ->expression
cql3: Reimplement _clustering_columns_restrictions->size() using expressions
cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions
cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions
cql3: expr: Add has_only_eq_binops function
cql3: Reimplement _clustering_columns_restrictions->empty() using expressions
The classes restriction, restrictions and its children
aren't used anywhere now and can be safely removed.
Some includes need to be modified for the code to compile.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
statement_restrictions_test uses a function that is defined
in multi_column_restriction.hh.
This file will be removed soon and for the test to still work
the function is moved to the test source.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```
Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11066
* github.com:scylladb/scylla:
database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard
table: try_flush_memtable_to_sstable: consume: close reader on error
This PR extends #9209. It consists of 2 main points:
To enable parallelization of user-defined aggregates, reduction function was added to UDA definition. Reduction function is optional and it has to be scalar function that takes 2 arguments with type of UDA's state and returns UDA's state
All currently implemented native aggregates got their reducible counterpart, which return their state as final result, so it can be reduced with other result. Hence all native aggregates can now be distributed.
Local 3-node cluster made with current master. `node1` updated to this branch. Accessing node with `ccm <node-name> cqlsh`
I've tested belowed things from both old and new node:
- creating UDA with reduce function - not allowed
- selecting count(*) - distributed
- selecting other aggregate function - not distributed
Fixes: #10224Closes#10295
* github.com:scylladb/scylla:
test: add tests for parallelized aggregates
test: cql3: Add UDA REDUCEFUNC test
forward_service: enable multiple selection
forward_service: support UDA and native aggregate parallelization
cql3:functions: Add cql3::functions::functions::mock_get()
cql3: selection: detect parallelize reduction type
db,cql3: Move part of cql3's function into db
selection: detect if selectors factory contains only simple selectors
cql3: reducible aggregates
DB: Add `scylla_aggregates` system table
db,gms: Add SCYLLA_AGGREGATES schema features
CQL3: Add reduce function to UDA
gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```
Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.
Fixes#11076
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Current code accepts priotity class as an argument to various functions
that need it and all its callers use streaming class. Next patches will
needs to sometimes use default class, but it will require heavy patching
of the distributed loader. Things get simpler if the priority class is
kept on sstable_directory on start.
This change also simplifies the ongoing effort on unification of sched
and IO classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This work gets us a step closer to compaction groups.
Everything in compaction layer but compaction_manager was converted to table_state.
After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any.
Closes#11028
* github.com:scylladb/scylla:
compaction: remove forward declaration of replica::table
compaction_manager: make add() and remove() switch to table_state
compaction_manager: make run_custom_job() switch to table_state
compaction_manager: major: switch to table_state
compaction_manager: scrub: switch to table_state
compaction_manager: upgrade: switch to table_state
compaction: table_state: add get_sstables_manager()
compaction_manager: cleanup: switch to table_state
compaction_manager: offstrategy: switch to table_state()
compaction_manager: rewrite_sstables(): switch to table_state
compaction_manager: make run_with_compaction_disabled() switch to table_state
compaction_manager: compaction_reenabler: switch to table_state
compaction_manager: make submit(T) switch to table_state
compaction_manager: task: switch to table_state
compaction: table_state: Add is_auto_compaction_disabled_by_user()
compaction: table_state: Add on_compaction_completion()
compaction: table_state: Add make_sstable()
compaction_manager: make can_proceed switch to table_state
compaction_manager: make stop compaction procedures switch to table_state
compaction_manager: make get_compactions() switch to table_state
compaction_manager: change task::update_history() to use table_state instead
compaction_manager: make can_register_compaction() switch to table_state
compaction_manager: make get_candidates() switch to table_state
compaction_manager: make propagate_replacement() switch to table_state
compaction: Move table::in_strategy_sstables() and switch to table_state
compaction: table_state: Add maintenance sstable set
compaction_manager: make has_table_ongoing_compaction() switch to table_state
compaction_manager: make compaction_disabled() switch to table_state
compaction_manager: switch to table_state for mapping of compaction_state
compaction_manager: move task ctor into source
and use an async thread around `directory_lister`
rather than `lister::scan_dir` to simplify the implementation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on the `clear_snapshot` test.
Test with multiple snapshots and different
combinations of parameters to database::clear_snapshot.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor test_drop_table_with_auto_snapshot out of
drop_table_with_snapshots, adding a auto_snapshot param,
controlling how to configure the cql_test_env db:.config::auto_snapshot,
so we can test both cases - auto_snapshot enabled and disabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of just `tmpdir_for_data`, so we can easily set auto_snapshot
for `drop_table_with_snapshots` in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>