Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.
before:
real 8m51.347s
user 7m5.743s
sys 5m11.185s
after:
real 7m4.249s
user 5m14.085s
sys 2m11.197s
Note real time is higher than user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.
Closes #9319
The name compaction_options is confusing as it overlaps in meaning
with compaction_descriptor: it is hard to reason about the exact
differences between them without digging into the implementation.
compaction_options is intended to carry only options specific to
a given compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.
[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
Corrupt keys might be printed as non-UTF-8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7).
For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```
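One way to avoid the problem is to escape the key bytes before logging. The helper below is a minimal sketch of that idea (a hypothetical function, not the actual Scylla implementation): any byte outside printable ASCII is replaced with a `\xNN` escape, so the resulting log line is always valid UTF-8.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not Scylla's actual code): render a possibly-corrupt
// key as printable ASCII by escaping every non-printable byte as \xNN, so
// log readers that decode UTF-8 (like the Python traceback above) never fail.
std::string sanitize_for_log(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (unsigned char c : raw) {
        if (c >= 0x20 && c < 0x7f) {
            out.push_back(static_cast<char>(c));
        } else {
            char buf[5];
            std::snprintf(buf, sizeof(buf), "\\x%02x", c);
            out += buf;
        }
    }
    return out;
}
```

With this, the 0xb3 byte from the traceback above would be logged as the four ASCII characters `\xb3` instead of a raw invalid byte.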
Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210730105428.2844668-1-bhalevy@scylladb.com>
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.
That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:
1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb with respect to future changes to the data
layout and its documentation.
Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.
Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.
Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.
Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.
Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.
So, now the snapshots schema looks like this:
CREATE TABLE raft_snapshots (
group_id timeuuid,
server_id uuid,
snapshot_id uuid,
idx int,
term int,
-- no `config` field here, moved to `raft_config` table
PRIMARY KEY ((group_id, server_id))
)
CREATE TABLE raft_config (
group_id timeuuid,
my_server_id uuid,
server_id uuid,
disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
can_vote bool,
ip_addr inet,
PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
);
This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
Tests: unit(dev)
* manmanson/raft_snapshots_new_schema_v2:
test: adjust `schema_change_test` to include new `system.raft_config` table
raft: new schema for storing raft snapshots
raft: pass server id to `raft_sys_table_storage` instance
The interval template member functions mostly accept
tri-comparators but a few functions accept less-comparators.
To reduce the chance of error, and to provide better error
messages, constrain comparator parameters to the expected
signature.
In one case (db/size_estimates_virtual_reader.cc) the caller
had to be adjusted. The comparator supported comparisons
of the interval value type against other types, but not
against itself. To simplify things, we add that signature too,
even though it will never be called.
Closes #9291
column_value_tuple overlaps both column_value and tuple_constructor
(in different respects) and can be replaced by a combination: a
tuple_constructor of column_value. The replacement is more expressive
(we can have a tuple of column_value and other expression types), though
the code (especially the grammar) does not allow it yet.
So remove column_value_tuple and replace it everywhere with
tuple_constructor. Visitors get the merged behavior of the existing
tuple_constructor and column_value_tuple, which is usually trivial
since tuple_constructor and column_value_tuple came from different
hierarchies (term::raw and relation), so usually one of the types
just calls on_internal_error().
The change results in awkward casts in two areas: WHERE clause
filtering (equal() and related), and clustering key range evaluations
(limits() and related). When equal() is replaced by the recursive
evaluate() visitor, the casts will go away. Clustering key range
extraction will remain limited
to tuples of column_value, so the prepare phase will have to vet
the expressions to ensure the casts don't fail (and use the
filtering path if they will).
Tests: unit (dev)
Closes #9274
Currently scrub compaction filters out sstables that are undergoing
(regular) compaction. This is surprising to the user and we would like
scrub (in validate mode or otherwise) to examine all sstables in the
table.
Scrub in VALIDATE mode is read-only, therefore it can run in parallel to
regular compaction. However, this series makes sure it selects all
sstables in the table, without filtering sstables undergoing compaction.
For scrub in non-validation mode, we would like to ensure that it
examines all sstables that were sealed when it started and fixes any
corruption (based on the scrub mode). Therefore, we stop ongoing
compactions when running scrub in non-validation modes. Otherwise
compaction might just copy the corrupt data onto new sstables, requiring
scrub to run again.
Also, acquire _compaction_locks write lock for the table to serialize
with other custom compaction jobs like major compaction, reshape, and
reshard.
Fixes#9256
Test: unit(dev) DTest:
nodetool_additional_test.py:TestNodetool.{validate_sstable_with_invalid_fragment_test,
validate_ks_sstable_with_invalid_fragment_test,validate_with_one_node_expect_data_loss_test}
Closes #9258
* github.com:scylladb/scylla:
compaction_manager: rewrite_sstables: acquire _compaction_locks
compaction_manager: perform_sstable_scrub: run_with_compaction_disabled
compaction: don't rule out compacting sstables in validate-mode scrub
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* dump-index - dumps the content of the sstable index(es), similar to scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
the sstable(s);
* custom - a hackable operation for the expert user (until scripting
support is implemented);
* validate - validate the content of the sstable(s) with the mutation
fragment stream validator, same as scrub in validate mode;
The sstables to be examined are passed as positional command line
arguments. Sstables will be processed by the selected operation
one-by-one (can be changed with `--merge`). Any number of sstables can
be passed but mind the open file limits. Pass the full path to the data
component of the sstables (*-Data.db). For now it is required that the
sstable is found at a valid data path:
/path/to/datadir/{keyspace_name}/{table_name}-{table_id}/
The schema to read the sstables is read from a `schema.cql` file. This
should contain the keyspace and table definitions, as well as any UDTs
used.
Filtering the sstable(s) to process only certain partition(s) is supported
via the `--partition` and `--partitions-file` command line flags.
Partition keys are expected to be in the hexdump format used by scylla
(hex representation of the raw buffer).
Operations write their output to stdout, or file(s). The tool logs to
stderr, with a logger called `scylla-sstable-crawler`.
Examples:
# dump the content of the sstable
$ scylla-sstable-crawler --dump /path/to/md-123456-big-Data.db
# dump the content of the two sstable(s) as a unified stream
$ scylla-sstable-crawler --dump --merge /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db
# generate a joint histogram for the specified partition
$ scylla-sstable-crawler --writetime-histogram --partition={{myhexpartitionkey}} /path/to/md-123456-big-Data.db
# validate the specified sstables
$ scylla-sstable-crawler --validate /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db
Future plans:
* JSON output for dump.
* A simple way of generating `schema.cql` for any schema, other than copying it from snapshots, or copying from `cqlsh`. None of these generate a complete output.
* Relax sstable path checks, so sstables can be loaded from any path.
* Add scripting support (Lua), allowing custom operations to be written
in a scripting language.
Refs: #9241
Closes #9271
* github.com:scylladb/scylla:
tools: remove scylla-sstable-index
tools: introduce scylla-sstable
tools: extract finding selected operation (handler) into function
tools: add schema_loader
cql3: query_processor: add parse_statements()
cql3: statements/create_type: expose create_type()
cql3: statements/create_keyspace: add get_keyspace_metadata()
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
reversed streams, this layer will create a reversed reader to handle
the transition.
* The mutation source proper: which will obtain the underlying schema
and will emit the data in reverse order.
Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.
Refs: #1413
Tests: unit(dev)
TODO:
* v2
* more testing
Also on: https://github.com/denesb/scylla.git reverse-reads/v3
Changelog
v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
in the schema itself, instead make reversing deterministic w.r.t.
schema version, such that:
`s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
`reconcilable_results` sent to the coordinator during read repair
still use the old reverse format;
v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
- `schema::underlying_type()`: return this when unengaged;
- `schema::make_transformed()`: noop when applying the same
transformation again;
* Generalize reversed into transformation. Add support to transferring
to remote nodes and shards by way of making `schema_tables` aware of
the transformation;
* Use reverse schema everywhere in reverse reader;
Closes #9184
* github.com:scylladb/scylla:
range_tombstone_accumulator: drop _reversed flag
test/boost/mutation_test: add test for mutation::consume() monotonicity
test/boost/flat_mutation_reader_test: more reversed reader tests
flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
flat_mutation_reader: make_reversing_reader(): take ownership of the reader
test/lib/mutation_source_test: add consistent log to all methods
mutation: introduce reverse()
mutation_rebuilder: make it standalone
mutation: make copy constructor compatible with mutation_opt
treewide: switch to native reversed format for reverse reads
mutation: consume(): add native reverse order
mutation: consume(): don't include dummy rows
query: add slice reversing functions
partition_slice_builder: add range mutating methods
partition_slice_builder: add constructor with slice
query: specific_ranges: add non-const ranges accessor
range_tombstone: add reverse()
clustering_bounds_comparator: add reverse_kind()
schema: introduce make_reversed()
schema: add a transforming copy constructor
utils: UUID_gen: introduce negate()
types: add reversed(data_type)
docs: design-notes: add reverse-reads.md
Check that the reverse reader emits a stream identical to that emitted
by a reader reading in native order from a table with reversed
clustering order.
Most test methods log their own name either via testlog.info() or
BOOST_TEST_MESSAGE() so failures can be more easily located. Not all do,
however. This commit fixes this and also converts all those using
BOOST_TEST_MESSAGE() for this to testlog.info(), for consistency.
Currently `_data` is assumed to be engaged by the copy constructor which
is not necessarily the case with `mutation_opt` objects (which is an
`optimized_optional<mutation>`). Fix this by only copying `_data` if
non-null.
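A minimal sketch of the fix (assumed member and type names, not the real mutation class): deep-copy `_data` only when it is engaged, so copying a mutation whose `_data` is null (as can happen through `mutation_opt`) no longer dereferences a null pointer.

```cpp
#include <cassert>
#include <memory>

// Stand-ins for the real types, for illustration only.
struct mutation_data { int value; };

struct mutation {
    std::unique_ptr<mutation_data> _data;
    mutation() = default;  // disengaged state, as seen via mutation_opt
    explicit mutation(int v) : _data(std::make_unique<mutation_data>(mutation_data{v})) {}
    // The fix: copy _data only if it is non-null.
    mutation(const mutation& o)
        : _data(o._data ? std::make_unique<mutation_data>(*o._data) : nullptr) {}
};
```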
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
The existing consume_in_reverse::yes is renamed to
consume_in_reverse::legacy_half_reverse and consume_in_reverse::yes now
means native reverse order. This is because we expect the legacy order
to die out at some point, and when that happens we can just remove that
ugly third option and be left with yes and no as before.
Intended to be used to modify an existing slice. We want to move the
slice in the same direction as the schema: make it completely
immutable, with all mutations happening through the slice builder class.
Take write lock for cf to serialize cleanup/upgrade sstables/scrub
with major compaction/reshape/reshard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since we might potentially have ongoing compactions, and we
must ensure that all sstables created before we run are scrubbed,
we need to barrier out any previously running compactions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Even sstables being compacted must be validated. Otherwise scrub
validate may return false negatives.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`make_reversed()` creates a schema identical to the schema instance it is
called on, but with clustering order reversed. To distinguish the reversed
schema from the original one, the node-id part of its version UUID is
bit-flipped. This ensures that reversing a schema twice results in
a schema identical to the original one (although a different C++
object).
This reversed schema will be used in reversed reads, so intermediate
layers can be ignorant of the fact that the read happens in reverse.
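The version-derivation trick can be sketched as follows (assumed UUID layout and function name; not the actual utils::UUID_gen::negate() code): XOR the node-id bits of the version UUID with a fixed mask. Since XOR with a constant is an involution, reversing twice restores the original version, giving `s.version() == s.make_reversed().make_reversed().version()`.

```cpp
#include <cassert>
#include <cstdint>

// Toy UUID split into most/least significant halves; in this sketch we
// assume the node id occupies the low 48 bits of the lsb half.
struct uuid {
    uint64_t msb;
    uint64_t lsb;
};

// Flip the node-id bits. Applying this twice yields the original uuid.
constexpr uuid negate_node_id(uuid v) {
    constexpr uint64_t node_mask = 0x0000ffffffffffffULL; // low 48 bits: node id
    v.lsb ^= node_mask;
    return v;
}
```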
In order to avoid a deadlock when the CQL server cannot be started,
the view builder shutdown procedure is now split into two parts:
drain and stop. Drain is performed before storage proxy shutdown,
but stop() will be called even before drain is scheduled.
The deadlock is as follows:
- view builder creates a reader permit in order to be able
to read from system tables
- CQL server fails to start, shutdown procedure begins
- view builder stop() is not called (because it was not scheduled
yet), so it holds onto its reader permit
- database shutdown procedure waits for all permits to be destroyed,
and it hangs indefinitely because view builder keeps holding
its permit.
Fixes #9306
Closes #9308
* github.com:scylladb/scylla:
main: schedule view builder stopping earlier
db,view: split stopping view builder to drain+stop
The empty-range check causes more bugs than it fixes. Replace it with
an explicit check for =NULL (see #7852).
Fixes #9311.
Fixes #9290.
Tests: unit (dev), cql-pytest on Cassandra 4.0
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9314
There is typically just 1-2 digests per query, so we can allocate
space for them in digest_read_resolver using small_vector, saving
an allocation.
Results: (perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10)
before: median 215301.75 tps ( 75.1 allocs/op, 12.1 tasks/op, 45238 insns/op)
after: median 221121.37 tps ( 74.1 allocs/op, 12.1 tasks/op, 45186 insns/op)
While the throughput numbers are not reliable due to frequency throttling,
it's clear there are fewer allocations and instructions executed.
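The small-vector idea can be illustrated with a toy class (for illustration only; not Scylla's utils::small_vector): store up to N elements inline, falling back to the heap only when the inline capacity is exceeded. With 1-2 digests per query, the common case then performs no allocation at all.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy small-vector: inline storage for the first N elements, heap spill
// beyond that. Only push_back, size and indexing are sketched.
template <typename T, std::size_t N>
class tiny_small_vector {
    std::array<T, N> _inline{};
    std::vector<T> _heap;   // used only once size exceeds N
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size < N) {
            _inline[_size] = v;             // common case: no allocation
        } else {
            if (_heap.empty()) {
                // First spill: move the inline elements to the heap copy.
                _heap.assign(_inline.begin(), _inline.end());
            }
            _heap.push_back(v);
        }
        ++_size;
    }
    std::size_t size() const { return _size; }
    const T& operator[](std::size_t i) const {
        return _size <= N ? _inline[i] : _heap[i];
    }
};
```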
Closes #9296
In order to avoid a deadlock described in the previous commit,
view builder stopping is registered earlier, so that its destructor
is called and its reader permit is released before the database starts
shutting down.
Note that draining the view builder is still scheduled later,
because it needs to happen before storage proxy drain
to keep the existing deinitialization order.
Fixes#9306
In order to avoid a deadlock when the CQL server cannot be started,
the view builder shutdown procedure is now split into two parts:
drain and stop. Drain is performed before storage proxy shutdown,
but stop() will be called even before drain is scheduled.
The deadlock is as follows:
- view builder creates a reader permit in order to be able
to read from system tables
- CQL server fails to start, shutdown procedure begins
- view builder stop() is not called (because it was not scheduled
yet), so it holds onto its reader permit
- database shutdown procedure waits for all permits to be destroyed,
and it hangs indefinitely because view builder keeps holding
its permit.
We previously forbade selecting a static column when an index is
used. But Cassandra allows it, so we should, too -- see #8869.
After removing the static-column check, the existing code gets the
correct result without any further changes (though it may read
multiple rows from the same partition).
Fixes#8869.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9307
In the quest to have explicit dependencies and the ability to run
multiple nodes in one process, remove some uses of get_gossiper() and
get_local_gossiper() and replace them with parameters passed from main()
or its equivalents.
Some uses still remain, mostly in snitch, but this series removes a
majority.
* https://github.com/avikivity/scylla.git gossiper-deglobal-1/v1
alternator: remove uses of get_local_gossiper()
storage_service: remove stray get_gossiper(), get_local_gossiper() calls
migration_manager: remove use of get_gossiper() from passive_announce()
storage_proxy: start_hints_manager(): don't require caller to provide gossiper
migration_manager: remove uses of get_local_gossiper()
storage_proxy: remove uses of get_local_gossiper()
gossiper: remove get_local_gossiper() from some inline helpers
gossiper: remove get_gossiper() from stop_gossiping()
gossiper: remove uses of get_local_gossiper for its rpc server
api: remove use of get_local_gossiper()
gossiper: remove calls to global get_gossiper from within the gossiper itself
migration_manager already has a reference to _gossiper, but
passive_announce is static and so can't use it. Luckily the only
caller (in storage_service) uses it as if it weren't static, so
we can just unstaticify it.
Pass gossiper as a constructor parameter instead. cql_test_env
gains a use of get_gossiper() instead, but at least these uses
are concentrated in one place.