This patch series splits up parts of the repair pipeline to allow unit testing
various bits of code without having to run the full dtest suite. The reason why
the repair pipeline has no unit tests is that repair, by definition, requires
multiple nodes, while the unit test environment works only for a single node.
However, it is possible to explicitly define interfaces between the various parts
of the pipeline, inject dependencies and test them individually. This patch series
is focused on taking repair_rows_on_wire (the frozen-mutation representation of
changes coming from another node) and flushing them to an sstable.
The commits are split into the following parts:
- pulling out classes to separate headers so that they can be included (potentially indirectly) from the test,
- pulling out repair_meta::to_repair_rows_list and part of repair_meta::flush_rows_in_working_row_buf so that they can be tested,
- refactoring repair_writer so that the actual writing logic can be injected as a dependency (see the sketch after this list),
- creating the unit test.
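To illustrate the dependency-injection approach, here is a minimal, self-contained sketch (the types below are simplified stand-ins for the repair_writer::impl idea, not the actual Scylla interfaces):
```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-in for a row coming off the wire.
struct repair_row { std::string data; };

// The injectable writing interface (modeled loosely after the
// repair_writer::impl extraction in this series; the real one differs).
struct writer_impl {
    virtual ~writer_impl() = default;
    virtual void write(const repair_row& row) = 0;
    virtual void flush() = 0;
};

// The writer only orchestrates; the actual sink is injected.
class repair_writer {
    std::unique_ptr<writer_impl> _impl;
public:
    explicit repair_writer(std::unique_ptr<writer_impl> impl)
        : _impl(std::move(impl)) {}
    void write_rows(const std::vector<repair_row>& rows) {
        for (const auto& r : rows) {
            _impl->write(r);
        }
        _impl->flush();
    }
};

// In a unit test, a recording fake replaces the sstable-backed impl.
struct recording_impl : writer_impl {
    std::vector<repair_row> written;
    bool flushed = false;
    void write(const repair_row& row) override { written.push_back(row); }
    void flush() override { flushed = true; }
};
```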
tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)
Closes #10345
* github.com:scylladb/scylla:
repair: Add unit test for flushing repair_rows_on_wire to disk.
repair: Extract mutation_fragment_queue and repair_writer::impl interfaces.
repair: Make parts of repair_writer interface private.
repair: Rename inputs to flush_rows.
repair: Make repair_meta::flush_rows a free function.
repair: Split flush_rows_in_working_row_buf to two functions and make one static.
repair: Rename inputs to to_repair_rows_list.
repair: Make to_repair_rows_list a free function.
repair: Make repair_meta::to_repair_rows_list a static function.
repair: Fix indentation in repair_writer.
repair: Move repair_writer to separate header.
repair: Move repair_row to a separate header.
repair: Move repair_sync_boundary to a separate header.
repair: Move decorated_key_with_hash to separate header.
repair: Move row_repair hashing logic to separate class and file.
"
Optimize consuming from a single partition.
This gives us a significant improvement with single, small mutations,
as shown with perf_mutation_readers, compared to the vector-based
flat_mutation_reader_from_mutations_v2.
These are expected to be common on the write path,
and can be optimized for view building.
results from: perf_mutation_readers -c1 --random-seed=840478750
(userspace cpu-frequency governor, 2.2GHz)

test                   iterations  median     mad      min        max
Before:
combined.one_row       720118      825.668ns  1.020ns  824.648ns  827.750ns
After:
combined.one_mutation  881482      751.157ns  0.397ns  750.211ns  751.912ns
combined.one_row       843270      756.553ns  0.303ns  755.889ns  757.911ns
The grand plan is to follow up
with make_flat_mutation_reader_from_frozen_mutation_v2
so that we can read directly from either a mutation
or frozen_mutation without having to unfreeze it e.g. in
table::push_view_replica_updates.
Test: unit(dev)
Perf: perf_mutation_readers(release)
"
* tag 'flat_mutation_reader_from_mutation-v3' of https://github.com/bhalevy/scylla:
perf: perf_mutation_readers: add one_mutation case
test: mutation_query_test: make make_source static
mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2
mutation readers: add make_flat_mutation_reader_from_mutation_v2
readers: delete slice_mutation.hh
test: flat_mutation_reader_test: mock_consumer: add debug logging
test: flat_mutation_reader_test: mock_consumer: make depth counter signed
"
There's a generic way to start-stop services in scylla that includes
5 "actions" (some are optional and/or implicit though):
service_config cfg = ...
sharded<service>.start(cfg)
service.invoke_on_all(&service::start)
service.invoke_on_all(&service::shutdown)
service.invoke_on_all(&service::stop)
sharded<service>.stop()
and most of the services out there conform to that scheme. Not snitch
(spoiler: and not tracing), for which there's a couple of helpers that
do all that magic behind the scenes; "configuring" the snitch is done
with the help of overloaded constructors. The latter is extra complicated
by the need to register snitch drivers in the class-registry for each
constructor overload. Also there's external cross-shard synchronization
on stop.
This set brings the snitch start/stop code to the described standard: the
create/stop helpers are removed, creation accepts the config structure,
and per-shard start/stop (snitch has no drain for now) happens in the
simple invoke-on-all manner.
The intended side effect of this change is the ability to add explicit
dependencies to snitch (in the future, not in this set).
tests: unit(dev)
"
* 'br-snitch-config' of https://github.com/xemul/scylla:
snitch: Remove create_snitch/stop_snitch
snitch: Simplify stop (and pause_io)
snitch: Move io_is_stopped to property-file driver
snitch: Remove init_snitch_obj()
snitch: Move instance creation into snitch_ptr constructor
snitch: Make config-based construction of all drivers
snitch: Declare snitch_ptr peering and rework container() method
snitch: Introduce container() method
We want to return stop_iteration::yes once we've crossed
the initial depth threshold; with an unsigned depth counter,
it might wrap around and look > 1.
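A minimal standalone illustration of the failure mode (not the actual consumer code):
```cpp
#include <iostream>

int main() {
    // With an unsigned counter, decrementing below zero wraps around,
    // so a "depth > 1" threshold check passes when it should not.
    unsigned udepth = 0;
    --udepth;                          // wraps to a huge value
    std::cout << (udepth > 1) << '\n'; // prints 1 (true)

    // A signed counter behaves as expected.
    int sdepth = 0;
    --sdepth;                          // -1
    std::cout << (sdepth > 1) << '\n'; // prints 0 (false)
}
```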
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no user impact.
Fixes#10364.
Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no known user impact.
Fixes#10363.
Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
The unit test executes a simplified repair scenario by:
- producing a random stream of mutation_fragments,
- converting them to repair_rows_on_wire,
- converting them to a list of repair_rows using the conversion logic
extracted in previous commits from repair_meta,
- flushing the rows to an sstable using the logic extracted in previous
commits from repair_meta,
- comparing the sstable contents with the originally produced mutation
fragments.
The test checks only the flushing part and is not concerned with any
other piece of the repair pipeline.
After the previous patches, both create_snitch() and stop_snitch() now look
like the classical sharded service start/stop sequence. Finally, both
helpers can be removed and the remaining users can just call start/stop
on locally obtained sharded references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently snitch drivers register themselves in the class-registry with all
sorts of construction options possible. All those different constructors
are in fact "config options".
When the snitch later declares its dependencies (gossiper and system
keyspace), it will require patching all these registrations, which is very
inconvenient.
This patch introduces the snitch_config struct and replaces all the
snitch constructors with a single snitch_driver(snitch_config cfg) one.
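A rough sketch of the shape of this change (the field names are hypothetical; only the single-config-constructor pattern is the point):
```cpp
#include <string>
#include <utility>

// All construction options gathered in one struct (illustrative fields;
// the real snitch_config is defined by this patch and may differ).
struct snitch_config {
    std::string name = "SimpleSnitch";
    std::string properties_file_name; // e.g. for the property-file driver
};

// Every driver now has exactly one constructor shape, so a single
// class-registry entry per driver suffices.
class snitch_driver {
    snitch_config _cfg;
public:
    explicit snitch_driver(snitch_config cfg) : _cfg(std::move(cfg)) {}
};
```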
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series is part of the shared storage project.
The STORAGE option is designed to hold a map of options
used for customizing storage for a given keyspace.
The option is kept in a new schema table, `system_schema.scylla_keyspaces`,
and is therefore guarded by a schema feature.
Example of the contents of the new table:
```cql
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;
keyspace_name | storage_options | storage_type
---------------+------------------------------------------------+--------------
ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} | S3
```
Native storage options are not kept in the table, as this format doesn't hold any extra options and it would therefore just be a waste of storage.
Closes #10144
* github.com:scylladb/scylla:
test: regenerate schema_change_test for storage options case
test: improve output of schema_change_test regeneration
docs: add a paragraph on keyspace storage options
test: add test cases for keyspace storage options
database,cql3: add STORAGE option to keyspaces
db: add keyspace-storage-options experimental feature
db,schema_tables: add scylla_keyspaces table
db,gms: add SCYLLA_KEYSPACE schema feature
db,gms: add KEYSPACE_STORAGE_OPTIONS feature
While reviewing "utils/chunked_managed_vector: Fix corruption in case there is more
than one chunk", I was worried that there could be a correctness issue
when pop_back() pops off the first element of the last chunk, but it turns
out I made an off-by-one error in my theory. Anyway, I wrote a unit test
to verify my assumption and found it worth submitting upstream.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408133555.12397-2-raphaelsc@scylladb.com>
Keyspace storage options series adds a new schema table:
system_schema.scylla_keyspaces. The regenerated cases ensure
that this new table is taken into account when the schema feature
is available.
The schema change test operates on pre-generated sstables, and sometimes
this set of sstables needs to be regenerated. In order to make the
regeneration process more ergonomic, the output is now directly
copyable as a valid C++ representation of UUIDs.
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk; doing so can result in items being pushed to the
wrong chunk, breaking internal invariants.
Likewise, pop_back() should not work with the last chunk; this breaks when
there is more than one chunk.
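The invariant can be illustrated with a self-contained toy model (this is not the real chunked_managed_vector code):
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t chunk_size = 4;

// Toy chunked vector: elements live in fixed-size chunks.
struct toy_chunked_vector {
    std::vector<std::vector<int>> chunks;
    std::size_t size = 0;

    void reserve(std::size_t n) {
        // reserve() may allocate several chunks up front.
        while (chunks.size() * chunk_size < n) {
            chunks.emplace_back();
            chunks.back().reserve(chunk_size);
        }
    }

    void push_back(int v) {
        // The element at index `size` belongs to chunk size / chunk_size,
        // which after reserve() is not necessarily the last chunk;
        // writing to chunks.back() here would be the bug described above.
        chunks[size / chunk_size].push_back(v);
        ++size;
    }
};

int main() {
    toy_chunked_vector v;
    v.reserve(10);   // allocates 3 chunks
    v.push_back(42); // must land in chunk 0, not chunk 2
    assert(v.chunks[0].size() == 1);
}
```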
Currently, the container is only used in the sstable partition index
cache.
Manifests as crashes in the sstable reader when touching sstables which have
partition index pages with more than 1638 partition entries.
Introduced in 78e5b9fd85 (4.6.0)
Fixes #10290
Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
There's a public call on replica::table to get back the compaction
manager reference. It's not actually needed. The users of the call are the
distributed loader, which already has the database at hand, and a test that
creates its own instance of the compaction manager for its testing tables
and thus also has it available.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220406171351.3050-1-xemul@scylladb.com>
Allow consuming the frozen_mutation directly
to a stream rather than unfreezing it first
and then consuming the unfrozen mutation.
Streaming directly from the frozen_mutation
saves both cpu and memory, and will make it
easier to be made async as a follow-up, to allow
yielding, e.g. between rows.
This is used today only in to_data_query_result,
which is invoked on the read-repair path.
Refs #10038
Fixes #10021
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>
Bucket awareness in cleanup was introduced in a69d98c3d0.
STCS and TWCS already support it, and now LCS will receive it.
The goal of bucket awareness is to reduce writeamp in cleanup,
therefore reducing operation time. Additionally, garbage collection
becomes more efficient as shadowed data can now be potentially
compacted with the data that shadows it, assuming they're on
the same level.
The implementation for LCS is simple. It reuses the procedure
from STCS for returning jobs in level 0, and one job will be
returned for each non-empty level > 0. What allows us to do this
is the incremental selection approach used in compaction,
which sets a limit on memory usage and the disk space requirement.
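A self-contained sketch of the job selection described above (toy types; not the Scylla compaction API):
```cpp
#include <map>
#include <utility>
#include <vector>

struct sstable { int level; };
using job = std::vector<sstable>;

// Stand-in for the STCS grouping of level 0 (the real code buckets
// the sstables by size tier).
std::vector<job> stcs_jobs(std::vector<sstable> level0) {
    return {std::move(level0)};
}

std::vector<job> lcs_cleanup_jobs(const std::vector<sstable>& candidates) {
    std::map<int, std::vector<sstable>> by_level;
    for (const auto& sst : candidates) {
        by_level[sst.level].push_back(sst);
    }
    std::vector<job> jobs;
    for (auto& [level, ssts] : by_level) {
        if (level == 0) {
            for (auto& j : stcs_jobs(std::move(ssts))) {
                jobs.push_back(std::move(j));
            }
        } else {
            // One job per non-empty level > 0: the whole level is
            // cleaned up together and released at once.
            jobs.push_back(std::move(ssts));
        }
    }
    return jobs;
}

int main() {
    auto jobs = lcs_cleanup_jobs({{0}, {0}, {1}, {1}, {2}});
    // Level 0 grouped via the STCS stand-in, levels 1 and 2 one job each.
    return jobs.size() == 3 ? 0 : 1;
}
```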
Fixes #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220331173417.211257-1-raphaelsc@scylladb.com>
"
First migrate all users to the v2 variant, all of which are tests.
However, to be able to properly migrate all tests off it, a v2 variant
of the reversing reader is also needed. All reversing reader users are
then migrated to the freshly introduced v2 variant and the v1 variant is
removed.
Users include:
* replica::table::make_reader_v2()
* streaming_virtual_table::as_mutation_source()
* sstables::make_reader()
* tests
This allows us to get rid of a bunch of conversions on the query path,
which was mostly v2 already.
With a few tests we did kick the can down the road by wrapping the v2
reader in `downgrade_to_v1()`, but this series is long enough already.
Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug)
"
* 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla:
readers: remove now unused v1 reader from mutations
test: move away from v1 reader from mutations
test/boost/mutation_reader_test: use fragment_scatterer
test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh
test/boost: mutation_fragment_test: refactor fragment_scatterer
readers: remove now unused v1 reversing reader
test/boost/flat_mutation_reader_test: convert to v2
frozen_mutation: fragment_and_freeze(): convert to v2
frozen_mutation: coroutinize fragment_and_freeze()
readers: migrate away from v1 reversing reader
db/virtual_table: use v2 variant of reversing and forwardable readers
replica/table: use v2 variant of reversing reader
sstables/sstable: remove unused make_crawling_reader_v1()
sstables/sstable: remove make_reader_v1()
readers: add v2 variant of reversing reader
readers/reversing: remove FIXME
readers: reader from mutations: use mutation's own schema when slicing
Instead of taking an output parameter in the constructor, take just the
desired number of mutations to build and return the mutation list from
`consume_end_of_stream()`.
No external users; it is only used internally, by make_reader(), which
delegates cases currently unsupported by v2 to it. The code needed from
make_reader_v1() is inlined into make_reader() and the former is
removed.
The v2 format allows for a much simpler reversing mechanism since
clustering fragments can simply be reversed as they are read. Fragments
are directly pushed into the reader's buffer, eliminating a separate move
phase.
Existing reverse reader unit tests are converted to test the v2 one.
In most files it was unused. We should move these changes to the patch which
moved the last interesting reader out of mutation_reader.hh (and added
the corresponding new header include), but it's probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide the reader
concurrency semaphore and some other miscellaneous reader-related definitions.
"
Quoting patch 3/4:
"This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used, which cleans up one file at a
time, starting from the smallest.
The cleanup strategy for TWCS is simple: it calls the
size-tiered cleanup method for each bucket, so there will be
one job for each tier in each window.
The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).
Refs #10097."
** Simply put, the goal is to reduce writeamp when performing cleanup
on a TWCS table, thereby reducing the operation time. **
tests: unit(dev).
"
* 'twcs_cleanup_bucket_aware/v1' of https://github.com/raphaelsc/scylla:
tests: sstable_compaction_test: Add test for TWCS' bucket-aware cleanup
compaction: TWCS: Implement cleanup method for bucket awareness
compaction: TWCS: change get_buckets() signature to work with const qualified functions
compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value
"
Cleanup compaction works by rewriting all sstables that need cleanup, one at
a time.
This approach can cause bad write amplification because the output data is
being made incrementally available for regular compaction.
Cleanup is a long operation on large data sets, and while it's happening,
new data can be written to buckets, triggering regular compaction.
Cleanup fighting for resources with regular compaction is a known problem.
With cleanup adding one file at a time to buckets, regular compaction may
require multiple rounds to compact the data in a given bucket B, producing
bad writeamp.
To fix this problem, cleanup will be made bucket aware. As each compaction
strategy has its own definition of bucket, strategies will implement their
own method to retrieve cleanup jobs. The method will be implemented such that
all files in a bucket B will be cleaned up together, and on completion,
they'll be made available for regular compaction at once.
For STCS / ICS, a bucket is a size tier.
For TWCS, a bucket is a window.
For LCS, a bucket is a level.
In this way, the writeamp problem is fixed, as regular compaction won't have
to perform multiple rounds to compact the data in a given bucket. Additionally, cleanup
will now be able to deduplicate data and will become way more efficient at
garbage collecting expired data.
The space requirement shouldn't be an issue, as compacting an entire bucket
happens during regular compaction anyway.
With the leveled strategy, compacting an entire level is also not a problem because
files in a level L don't overlap and therefore incremental compaction is
employed to limit the space requirement.
For the time being, only STCS cleanup has been made bucket aware. The others will be
using a default method, where one file is cleaned up at a time. Making cleanup
of the other strategies bucket aware is relatively easy now and will be done soon.
Refs #10097.
"
* 'cleanup-compaction-revamp/v3' of https://github.com/raphaelsc/scylla:
test: sstable_compaction_test: Add test for strategy cleanup method
compaction: STCS: Implement cleanup strategy
compaction_manager: Wire cleanup task into the strategy cleanup method
compaction_strategy: Allow strategies to define their own cleanup strategy
compaction: Introduce compaction_descriptor::sstables_size
compaction: Move decision of garbage collection from strategy to task type
"
The only real user is view building, which is converted to v2 and then
the v1 version of the mutation from fragments reader is removed.
Tests: unit(dev, release)
"
* 'v2-only-from-fragments-mutations/v1' of https://github.com/denesb/scylla:
readers: remove now unused v1 reader from fragments
test/boost: flat_mutation_reader_test: remove reader from fragments test
replica/table: migrate generate_and_propagate_view_updates() to v2
replica/table: migrate populate_views() to v2
db/view: convert view_update_builder interface to v2
db/view: migrate view_update_builder to v2
For compaction to be able to purge expired data, like tombstones, an
sstable set snapshot is set in the compaction descriptor.
That's a decision that belongs to the task type. For example, all regular
compactions enable GC, whereas scrub doesn't, for safety
reasons.
The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and adds lots of boilerplate to the code, making it hard to
understand and work with.
As the sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.
can_purge_tombstones, previously used only by the rewrite task, is being
reused for communicating the GC intention into task::compact_sstables().
The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, the snapshot could end up being
discarded if the compaction attempt had to be postponed, which is a
waste of cpu cycles.
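A schematic of the intended shape (type and method names here are hypothetical, for illustration only):
```cpp
#include <functional>
#include <optional>

struct sstable_set_snapshot {};

// Toy descriptor: strategies no longer decide about GC; the task type
// flips the intent, and the snapshot is materialized lazily, only when
// the compaction really runs.
struct compaction_descriptor {
    bool gc_enabled = false;
    void enable_garbage_collection() { gc_enabled = true; }

    std::optional<sstable_set_snapshot>
    make_snapshot(std::function<sstable_set_snapshot()> take_snapshot) const {
        if (!gc_enabled) {
            return std::nullopt; // e.g. scrub: no purging, for safety
        }
        return take_snapshot(); // created only when compaction really runs
    }
};
```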
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In CQL, table names are limited to so-called word characters (letters,
numbers and underscores), but column names don't have such a limitation.
When we create a secondary index, its default name is constructed from
the column name, so it can contain problematic characters - it can even
include the "/" character. The problem is that the index name is then used,
like a table name, to create a directory with that name.
The test included in this patch demonstrates that before this patch, this
can be misused to create subdirectories anywhere in the filesystem, or to
crash Scylla when it fails to create a directory (which it considers an
unrecoverable I/O error).
In this patch we do what Cassandra does - remove all non-word
characters from the indexed column name before constructing the default
index name. In the included test - which can run on both Scylla and
Cassandra - we verify that the constructed index name is the same as
in Cassandra, which is useful to know (e.g., because knowing the index
name is needed to DROP the index).
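A minimal sketch of the sanitization rule described above (illustrative code, not the actual Scylla implementation; Cassandra's default index name has the form <table>_<column>_idx):
```cpp
#include <cctype>
#include <iostream>
#include <string>

// Keep only "word" characters (letters, digits, underscores) from the
// column name before building the default index name.
std::string sanitize(const std::string& column) {
    std::string out;
    for (char c : column) {
        if (std::isalnum(static_cast<unsigned char>(c)) || c == '_') {
            out += c;
        }
    }
    return out;
}

int main() {
    // e.g. a column named "a/b c" yields index name "tbl_abc_idx".
    std::cout << "tbl_" << sanitize("a/b c") << "_idx\n";
}
```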
Also, this patch adds a second line of defense against the security problem
described above: It is now an error to create a schema with a slash or
null (the two characters not allowed in Unix filenames) in the keyspace
or table names. So if the first line of defense (CQL checking the validity
of its commands) fails, we'll have that second line of defense. I verified
that if I revert the default-index-name fix, the second line of defense
kicks in, and the index creation is aborted and cannot create files in
the wrong place to crash Scylla.
Fixes #3403
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220320162543.3091121-1-nyh@scylladb.com>