Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
LWT `IF` (column_condition) duplicates the expression prepare and evaluation code. Annoyingly,
LWT IF semantics are a little different than the rest of CQL: a NULL equals NULL, whereas usually
NULL = NULL evaluates to NULL.
This series converts `IF` prepare and evaluate to use the standard expression code. We employ
expression rewriting to adjust for the slightly different semantics.
In a few places, we adjust LWT semantics to harmonize them with the rest of CQL. These are pointed
out in their own separate patches so the changes don't get lost in the flood.
Closes#12356
* github.com:scylladb/scylladb:
cql3: lwt: move IF clause expression construction to grammar
cql3: column_condition: evaluate column_condition as a single expression
cql3: lwt: allow negative list indexes in IF clause
cql3: lwt: do not short-circuit col[NULL] in IF clause
cql3: column_condition: convert _column to an expression
cql3: expr: generalize evaluation of subscript expressions
cql3: expr: introduce adjust_for_collection_as_maps()
cql3: update_parameters: use evaluation_inputs compatible row prefetch
cql3: expr: protect extract_column_value() from partial clustering keys
cql3: expr: extract extract_column_value() from evaluation machinery
cql3: selection: introduce selection_from_partition_slice
cql3: expr: move check for ordering on duration types from restrictions to prepare
cql3: expr: remove restrictions oper_is_slice() in favor of expr::is_slice()
cql3: column_condition: optimize LIKE with constant pattern after preparing
cql3: expr: add optimizer for LIKE with constant pattern
test: lib: add helper to evaluate an expression with bind variables but no table
cql3: column_condition: make the left-hand-side part of column_condition::raw
cql3: lwt: relax constraints on map subscripts and LIKE patterns
cql3: expr: fix search_and_replace() for subscripts
cql3: expr: fix function evaluation with NULL inputs
cql3: expr: add LWT IF clause variants of binary operators
cql3: expr: change evaluate_binop_sides to return more NULL information
It's only used by the single test and apparently exists since the times
seastar was missing the future::discard_result() sugar
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12803
Sometimes we want to defeat the expression optimizer's ability to
fold constant expressions. A bind variable is a convenient way to
do this, without the complexity of faking a schema and row inputs.
Add a helper to evaluate an expression with bind variable parameters,
doing all the paperwork for us.
A companion make_bind_variable() is added to likewise simplify
creating bind variables for tests.
We currently have two method families to generate partition keys:
* make_keys() in test/lib/simple_schema.hh
* token_generation_for_shard() in test/lib/sstable_utils.hh
Both work only for schemas with a single partition key column of `text` type and both generate keys of fixed size.
This is very restrictive and simplistic. Tests, which wanted anything more complicated than that had to rely on open-coded key generation.
Also, many tests started to rely on the simplistic nature of these keys, in particular two tests started failing because the new key generation method generated keys of varying size:
* sstable_compaction_test.sstable_run_based_compaction_test
* sstable_mutation_test.test_key_count_estimation
These two tests seems to depend on generated keys all being of the same size. This makes some sense in the case of the key count estimation test, but makes no sense at all to me in the case of the sstable run test.
Closes#12657
* github.com:scylladb/scylladb:
test/lib/sstable_utils: remove now unused token_generation_for_shard() and friends
test/lib/simple_schema: remove now unused make_keys() and friends
test: migrate to tests::generate_partition_key[s]()
test/lib/test_services: add table_for_tests::make_default_schema()
test/lib: add key_utils.hh
test/lib/random_schema.hh: value_generator: add min_size_in_bytes
The system_keyspace defines several auxiliary methods to help view_builder update system.scylla_views_builds_in_progress and system.built_views tables. All use global qctx thing.
It only takes adding view_builder -> system_keyspace dependency in order to de-static all those helpers and let them use query-processor from it, not the qctx.
Closes#12728
* github.com:scylladb/scylladb:
system_keysace: De-static calls that update view-building tables
storage_service: Coroutinize mark_existing_views_as_built()
api: Unset column_famliy endpoints
api: Carry sharded<db::system_keyspace> reference over
view_builder: Add system_keyspace dependency
Introduces task manager's compaction module. That's an initial
part of integration of compaction with task manager.
When fully integrated, task manager will allow user to track compaction
operations, check status and progress of each individual one. It will help
with creating an asynchronous version of rest api that forces any compaction.
Currently, users can see with /task_manager/list_modules api call that
compaction is one of the modules accessible through task manager.
They won't get any additional information though, since compaction
tasks are not created yet.
A shared_ptr to compaction module is kept in compaction manager.
Closes#12635
* github.com:scylladb/scylladb:
compaction: test: pass task_manager to compaction_manager in test environment
compaction: create and register task manager's module for compaction
tasks: add task_manager constructor without arguments
This series switches memtable and cache to use a new representation for mutation data,
called `mutation_partition_v2`. In this representation, range tombstone information is stored
in the same tree as rows, attached to row entries. Each entry has a new tombstone field,
which represents range tombstone part which applies to the interval between this entry and
the previous one. See docs/dev/mvcc.md for more details about the format.
The transient mutation object still uses the old model in order to avoid work needed to adapt
old code to the new model. It may also be a good idea to live with two models, since the
transient mutation has different requirements and thus different trade-offs can be made.
Transient mutation doesn't need to support eviction and strong exception guarantees,
so its algorithms and in-memory representation can be simpler.
This allows us to incrementally evict range tombstone information. Before this series,
range tombstones were accumulated and evicted only when the whole partition entry was evicted. This
could lead to inefficient use of cache memory.
Another advantage of the new representation is that reads don't have to lookup
range tombstone information in a different tree while reading. This leads to simpler
and more efficient readers.
There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes.
Secondly, update algorithms are more complex because they need to deoverlap range tombstone
information. Also, to handle preemption and provide strong exception guarantees, update
algorithms may need to allocate sentinel entries, which adds complexity and reduces performance.
The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.
Remaining work:
- performance optimizations to apply_monotonically() to avoid regressions
- performance testing
- preemption support in apply_to_incomplete (cache update from memtable)
Fixes#2578Fixes#3288Fixes#10587Closes#12048
* github.com:scylladb/scylladb:
test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction
test: mvcc: Extract mvcc_container::allocate_in_region()
row_cache, lru: Introduce evict_shallow()
test: mvcc: Avoid copies of mutation under failure injection
test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic
mutation_partition_v2: Avoid full scan when applying mutation to non-evictable
Pass is_evictable to apply()
tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1
tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint
tests: mutation_partition_v2: Add test for exception safety of mutation merging
tests: Add tests for the mutation_partition_v2 model
mutation_partition_v2: Implement compact()
cache_tracker: Extract insert(mutation_partition_v2&)
mvcc, mutation_partition: Document guarantees in case merging succeeds
mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically()
mutation_partition_v2: Simplify get_continuity()
row_cache: Distinguish dummy insertion site in trace log
db: Use mutation_partition_v2 in mvcc
range_tombstone_change_merger: Introduce peek()
readers: Extract range_tombstone_change_merger
mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots
mvcc: partition_snapshot_row_cursor: Support digest calculation
mutation_partition_v2: Store range tombstones together with rows
db: Introduce mutation_partition_v2
doc: Introduce docs/dev/mvcc.md
db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU
db: Print range_tombstone bounds as position_in_partition
test: memtable_test: Relax test_segment_migration_during_flush
test: cache_flat_mutation_reader: Avoid timestamp clash
test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows
test: mvcc: Fix sporadic failures due to compact_for_compaction()
test: lib: random_mutation_generator: Produce partition tombstone less often
test: lib: random_utils: Introduce with_probability()
test: lib: Improve error message in has_same_continuity()
test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker
test: mvcc: Insert entries in the tracker
test: mvcc_test: Do not set dummy::no on non-clustering rows
mutation_partition: Print full position in error report in append_clustered_row()
db: mutation_cleaner: Extract make_region_space_guard()
position_in_partition: Optimize equality check
mvcc: Fix version merging state resetting
mutation_partition: apply_resume: Mark operator bool() as explicit
The view builder updates system.scylla_views_builds_in_progress and
.built_views tables and thus needs the system keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Each instance of compaction manager should have compaction module pointer
initialized. All contructors get task_manager reference with which
the module is created.
Now any boost test can run with multiple compaction groups by default,
without any change in the boost test cases whatsoever.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scylla unit tests are limited to command line options defined by
Seastar testing framework.
For extending the set of options, Scylla unit tests can now
include test/lib/scylla_test_case.hh instead of seastar/testing/test_case.hh,
which will "hijack" the entry point and will process the command line
options, then feed the remaining options into seastar testing entry
point.
This is how it looks like when asking for help:
Scylla tests additional options:
--help Produces help message
--x-log2-compaction_groups arg (=0) Controls static number of compaction
groups per table per shard. For X groups,
set the option to log (base 2) of X.
Example: Value of 3 implies 8 groups.
Running 1 test case...
App options:
-h [ --help ] show help message
--help-seastar show help message about seastar options
--help-loggers print a list of logger names and exit
--random-seed arg Random number generator seed
--fail-on-abandoned-failed-futures arg (=1)
Fail the test if there are any
abandoned failed futures
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use the newly introduced key generation facilities, instead of the the
old inflexible alternatives and hand-rolled code.
Most of the migrations are mechanic, but there are two tests that
were tricky to migrate:
* sstable_compaction_test.sstable_run_based_compaction_test
* sstable_mutation_test.test_key_count_estimation
These two tests seems to depend on generated keys all being of the same
size. This makes some sense in the case of the key count estimation
test, but makes no sense at all to me in the case of the sstable run
test.
Creating the default schema, used in the default constructor of
table_for_tests. Allows for getting the default schema without creating
an instance first.
Contains methods to generate partition and clustering keys. In the case
of the former, one can specify the shard to generate keys for.
We currently have some methods to generate these but they are not
generic. Therefore the tests are littered by open-coded variants.
The methods introduced here are completely generic: they can generate
keys for any schema.
Allow caller to specify the minimum size in bytes of the generated
value. Only really works with string-like types (and collections of
these).
Also fixed max size enforcement for strings: before this patch, the
provided max size was dividied by wide string size, instead of the
char width of the actual string type the value is generated for.
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.
The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.
Range tombstone eviction in cache has now fine granularity, like with
rows.
Fixes#2578Fixes#3288Fixes#10587
This tombstone has a high chance of obliterating all data, which will
make tests which involve partition version merging not very
interesting. The result will be an empty partition with a
tombstone. Reduce its frequency, so that in MVCC there is a
significant chance of having live data in the combined entry where
individual versions come from the generator.
This adds the test/lib's tmpdir instance _and_ configures the
data_file_directories with this path. This makes sure sstables manager
and the rest of the test use the same directory for sstables. For now
it doesn't change anything, but helps next patching.
(A neat side effect of this change is that sstable_test_env is now
configured the same way as cql_test_env does)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make it possible to make config with custom-initialized
options in test_env::impl's constructor initializer list (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The CQL protocol and specification call for lists with NULLs in
some places. For example, the statement:
```cql
UPDATE tab
SET x = 3
IF y IN (1, 2, NULL)
WHERE pk = 4
```
has a list `(1, 2, NULL)` that contains NULL. Although the syntax is tuple-like, the value is a list;
consider the same statement as a prepared statement:
```cql
UPDATE tab
SET x = :x
IF y IN :y_values
WHERE pk = :pk
```
`:y_values` must have a list type, since the number of elements is unknown.
Currently, this is done with special paths inside LWT that bypass normal
evaluation, but if we want to unify those paths, we must allow NULLs in
lists (except in storage). This series does that.
Closes#12411
* github.com:scylladb/scylladb:
test: materialized view: add test exercising synthetic empty-type columns
cql3: expr: relax evaluate_list() to allow allow NULL elements
types: allow lists with NULL
test: relax NULL check test predicate
cql3, types: validate listlike collections (sets, lists) for storage
types: make empty type deserialize to non-null value
following tests are integrated into scylla executable
- perf_fast_forward
- perf_row_cache_update
- perf_simple_query
- perf_row_cache_update
- perf_sstable
before this change
```console
$ size build/release/scylla
text data bss dec hex filename
82284664 288960 335897 82909521 4f11951 build/release/scylla
$ ls -l build/release/scylla
-rwxrwxr-x 1 kefu kefu 1719672112 Jan 19 17:51 build/release/scylla
```
after this change
```console
$ size build/release/scylla
text data bss dec hex filename
84349449 289424 345257 84984130 510c142 build/release/scylla
$ ls -l build/release/scylla
-rwxrwxr-x 1 kefu kefu 1774204800 Jan 19 17:52 build/release/scylla
```
Fixes#12484Closes#12558
* github.com:scylladb/scylladb:
main: move perf_sstable into scylla
main: move perf_row_cache_update into scylla
test: perf_row_cache_update: add static specifier to local functions
main: move perf_fast_forward into scylla
main: move perf_simple_query into scylla
test: extract debug::the_database out
main: shift the args when checking exec_name
main: extract lookup_main_func() out
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).
However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.
Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.
Closes#12542
we want to integrate some perf test into scylla executable, so we
can run them on a regular basis. but `test/lib/cql_test_env.cc`
shares `debug::the_database` with `main.cc`, so we cannot just
compile them into a single binary without changing them.
before this change, both `test/lib/cql_test_env.cc`
and `main.cc` define `debug::the_database`.
after this change, `debug::the_database` is extracted into
`debug.cc`, so it compiles into a separate compiling unit.
and scylla and tests using seastar testing framework are linked
against `debug.cc` via `scylla_core` respectively. this paves the road to
integrating scylla with the tests linking aginst
`test/lib/cql_test_env.cc`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The reader concurrency semaphore has no mechanism to limit the memory consumption of already admitted read. Once memory collective memory consumption of all the admitted reads is above the limit, all it can do is to not admit any more. Sometimes this is not enough and the memory consumption of the already admitted reads balloons to the point of OOMing the node. This pull-request offers a solution to this: it introduces two more layers of defense above this: a soft and a hard limit. Both are multipliers applied on the semaphores normal memory limit.
When the soft limit threshold is surpassed, all readers but one are blocked via a new blocking `request_memory()` call which is used by the `tracking_file_impl`. The reader to be allowed to proceed is chosen at random, it is the first reader which happens to request memory after the limit is surpassed. This is both very simple and should avoid situations where the algorithm choosing the reader to be allowed to proceed chooses a reader which will then always time out.
When the hard limit threshold is surpassed, `reader_concurrency_semaphore::consume()` starts throwing `std::bad_alloc`. This again will result in eliminating whichever reader was unlucky enough to request memory at the right moment.
With this, the semaphore is now effectively enforcing an upper bound for memory consumption, defined by the hard limit.
Refs: https://github.com/scylladb/scylladb/issues/11927Closes#11955
* github.com:scylladb/scylladb:
test: reader_concurrency_semaphore_test: add tests for semaphore memory limits
reader_permit: expose operator<<(reader_permit::state)
reader_permit: add id() accessor
reader_concurrency_semaphore: add foreach_permit()
reader_concurrency_semaphore: document the new memory limits
reader_concurrency_semaphore: add OOM killer
reader_concurrency_semaphore: make consume() and signal() private
test: stop using reader_concurrency_semaphore::{consume,signal}() directly
reader_concurrency_semaphore: move consume() out-of-line
reader_permit: consume(): make it exception-safe
reader_permit: resource_units::reset(): only call consume() if needed
reader_concurrency_semaphore: tracked_file_impl: use request_memory()
reader_concurrency_semaphore: add request_memory()
reader_concurrency_semaphore: wrap wait list
reader_concurrency_semaphore: add {serialize,kill}_limit_multiplier parameters
test/boost/reader_concurrency_semaphore_test: dummy_file_impl: don't use hardoced buffer size
reader_permit: add make_new_tracked_temporary_buffer()
reader_permit: add get_state() accessor
reader_permit: resource_units: add constructor for already consumed res
reader_permit: resource_units: remove noexcept qualifier from constructor
db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier
scylla-gdb.py: scylla-memory: extract semaphore stats formatting code
scylla-gdb.py: fix spelling of "graphviz"
expression tests often need to create instances of untyped_constant.
Creating them by hand is tedious because the required code is overly verbose.
Having convenience functions for it speeds up test writing.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Tests are similarly relaxed. A test is added in lwt_test to show
that insertion of a list with NULL is still rejected, though we
allow NULLs in IF conditions.
One test is changed from a list of longs to a list of ints, to
prevent churn in the test helper library.
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
The CQL binary protocol introduced "unset" values in version 4
of the protocol. Unset values can be bound to variables, which
cause certain CQL fragments to be skipped. For example, the
fragment `SET a = :var` will not change the value of `a` if `:var`
is bound to an unset value.
Unsets, however, are very limited in where they can appear. They
can only appear at the top-level of an expression, and any computation
done with them is invalid. For example, `SET list_column = [3, :var]`
is invalid if `:var` is bound to unset.
This causes the code to be littered with checks for unset, and there
are plenty of tests dedicated to catching unsets. However, a simpler
way is possible - prevent the infiltration of unsets at the point of
entry (when evaluating a bind variable expression), and introduce
guards to check for the few cases where unsets are allowed.
This is what this long patch does. It performs the following:
(general)
1. unset is removed from the possible values of cql3::raw_value and
cql3::raw_value_view.
(external->cql3)
2. query_options is fortified with a vector of booleans,
unset_bind_variable_vector, where each boolean corresponds to a bind
variable index and is true when it is unset.
3. To avoid churn, two compatiblity structs are introduced:
cql3::raw_value{,_view}_vector_with_unset, which can be constructed
from a std::vector<raw_value{,_view/}>, which is what most callers
have. They can also be constructed with explicit unset vectors, for
the few cases they are needed.
(cql3->variables)
4. query_options::get_value_at() now throws if the requested bind variable
is unset. This replaces all the throwing checks in expression evaluation
and statement execution, which are removed.
5. A new query_options::is_unset() is added for the users that can tolerate
unset; though it is not used directly.
6. A new cql3::unset_operation_guard class guards against unsets. It accepts
an expression, and can be queried whether an unset is present. Two
conditions are checked: the expression must be a singleton bind
variable, and at runtime it must be bound to an unset value.
7. The modification_statement operations are split into two, via two
new subclasses of cql3::operation. cql3::operation_no_unset_support
ignores unsets completely. cql3::operation_skip_if_unset checks if
an operand is unset (luckily all operations have at most one operand that
tolerates unset) and applies unset_operation_guard to it.
8. The various sites that accept expressions or operations are modified
to check for should_skip_operation(). This are the loops around
operations in update_statement and delete_statement, and the checks
for unset in attributes (LIMIT and PER PARTITION LIMIT)
(tests)
9. Many unset tests are removed. It's now impossible to enter an
unset value into the expression evaluation machinery (there's
just no unset value), so it's impossible to test for it.
10. Other unset tests now have to be invoked via bind variables,
since there's no way to create an unset cql3::expr::constant.
11. Many tests have their exception message match strings relaxed.
Since unsets are now checked very early, we don't know the context
where they happen. It would be possible to reintroduce it (by adding
a format string parameter to cql3::unset_operation_guard), but it
seems not to be worth the effort. Usage of unsets is rare, and it is
explicit (at least with the Python driver, an unset cannot be
introduced by ommission).
I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't
recognize unsets) with cql3::maybe_unset_value (that does), but that
caused huge amounts of churn, so I abandoned that in favor of the
current approach.
Closes#12517
The CQL binary protocol version 3 was introduced in 2014. All Scylla
version support it, and Cassandra versions 2.1 and newer.
Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.
Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.
Fixes#10607Closes#12432
* github.com:scylladb/scylladb:
treewide: drop cql_serialization_format
cql: modification_statement: drop protocol check for LWT
transport: drop cql protocol versions 1 and 2
Unlike other experimental feature we want to raft to be opt in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
raft: replace experimental raft option with dedicated flag
main: move supervisor notification about group registry start where it actually starts
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).
A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.
Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.
One test checking the 16-bit serialization format is removed.
raft_group0 used to register RPC verbs only on shard 0.
This worked on clusters with the same --smp setting on
all nodes, since RPCs in this case are (usually)
processed on the same shard as the calling code,
and raft_group0 methods only run on shard 0.
A new test test_nodes_with_different_smp was added
to identify the problem.
Fixes: #12252
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
This is to define the API sstable needs from underlying storage. When implementing object-storage backend it will need to implement those. The API looks like
future<> snapshot(const sstable& sst, sstring dir, absolute_path abs) const;
future<> quarantine(const sstable& sst, delayed_commit_changes* delay);
future<> move(const sstable& sst, sstring new_dir, generation_type generation, delayed_commit_changes* delay);
void open(sstable& sst, const io_priority_class& pc); // runs in async context
future<> wipe(const sstable& sst) noexcept;
future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity);
It doesn't have "list" or alike, because it's not a method of an individual sstable, but rather the one from sstables_manager. It will come as separate PR.
Closes#12217
* github.com:scylladb/scylladb:
sstable, storage: Mark dir/temp_dir private
sstable: Remove get_dir() (well, almost)
sstable: Add quarantine() method to storage
sstable: Use absolute/relative path marking for snapshot()
sstable: Remove temp_... stuff from sstable
sstable: Move open_component() on storage
sstable: Mark rename_new_sstable_component_file() const
sstable: Print filename(type) on open-component error
sstable: Reorganize new_sstable_component_file()
sstable: Mark filename() private
sstable: Introduce index_filename()
tests: Disclosure private filename() calls
sstable: Move wipe_storage() on storage
sstable: Remove temp dir in wipe_storage()
sstable: Move unlink parts into wipe_storage
sstable: Remove get_temp_dir()
sstable: Move write_toc() to storage
sstable: Shuffle open_sstable()
sstable: Move touch_temp_dir() to storage
sstable: Move move() to storage
sstable: Move create_links() to storage
sstable: Move seal_sstable() to storage
sstable: Tossing internals of seal_sstable()
sstable: Move remove_temp_dir() to storage
sstable: Move create_links_common() to storage
sstable: Move check_create_links_replay() to storage
sstable: Remove one of create_links() overloads
sstable: Remove create_links_and_mark_for_removal()
sstable: Indentation fix after prevuous patch
sstable: Coroutinize create_links_common()
sstable: Rename create_links_common()'s "dir" argument
sstable: Make mark_for_removal bool_class
sstable, table: Add sstable::snapshot() and use in table::take_snapshot
sstable: Move _dir and _temp_dir on filesystem_storage
sstable: Use sync_directory() method
test, sstable: Use component_basename in test
sstables: Move read_{digest|checksum} on sstable
This PR fixes several bugs related to handling of non-full
clustering keys.
One is in trim_clustering_row_ranges_to(), which is broken for non-full keys in reverse
mode. It will trim the range to position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.
Fixes#12180
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys as after_key() is used
in various parts in the read path.
Refs #1446Closes#12234
* github.com:scylladb/scylladb:
position_in_partition: Make after_key() work with non-full keys
position_in_partition: Introduce before_key(position_in_partition_view)
db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order
types: Fix comparison of frozen sets with empty values
The sstable::filename() is going to become private method. Lots of tests
call it, but tests do call a lot of other sstable private methods,
that's OK. Make the sstable::filename() yet another one of that kind in
advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method initiates the sstable creation. Effectively it's the first
step in sstable creation transaction implemented on top of rename()
call. Thus this method is moved onto storage under respective name.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an sstable is prepared to be written on disk the .write_toc() is
called on it which created temporary toc file. Prior to this, the writer
code calls generate_toc() to collect components on the sstable.
This patch adds the .open_sstable() API call that does both. This
prepares the write_toc() part to be moved to storage, because it's not
just "write data into TOC file", it's the first step in transaction
implemeted on top of rename()s.
The test need care -- there's rewrite_toc_without_scylla_component()
thing in utils that doesn't want the generate_toc() part to be called.
It's not patched here and continues calling .write_toc().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is currently used in two places: sstable::snapshot() and
sstable::seal_sstable(). The latter additionally touches the target
backup/ subdir.
This patch moves the whole thing on storage and adds touch for all the
cases. For snapshots this might be excessive, but harmless.
Tests get their private-disclosure way to access sstable._storage in
few places to call create_links directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them -- one API call and the other one that just
"seals" it. The latter one also changes the _marked_for_deletion bit on
the sstable.
This patch makes the latter method prepared to be moved onto storage,
because sealing means comitting TOC file on disk with the help of rename
system call which is purely storage thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those two fields define the way sstable is stored as collection of
on-disk files. First step towards making the storage access abstract is
in moving the paths onto filesystem_storage embedded class.
Both are made public for now, the rest of the code is patched to access
them via _storage.<smth>. The rest of the set moves parts of sstable::
methods into the filesystem_storage, then marks the paths private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This fixes a long standing bug related to handling of non-full
clustering keys, issue #1446.
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys.