Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.
Should probably be backported to all CDC enabled versions.
Fixes#10473Closes#10474
"
- Alternator gets gossiper for its proxy dependency
- Forward service method that takes global gossiper can re-use
proxy method (forward -> proxy reference is already there)
- Table code is patched to require gossiper argument
- Snitch gets a dependency reference on snitch_ptr and some extra
care for snitch driver vs snitch-ptr interaction and gossip test
- Cql test env should carry gossiper reference on-board
- Few places can re-use the existing local gossiper reference
- Scylla-gdb needs to get gossiper from debug namespace and needs
_not_ to get feature service from gossiper
"
* 'br-gossiper-deglobal-2' of https://github.com/xemul/scylla:
code: De-globalize gossiper
scylla-gdb, main: Get feature service without gossiper help
test: Use cql-test-env gossiper
cql test env: Keep gossiper reference on board
code: Use gossiper reference where possible
snitch: Use local gossiper in drivers
snitch: Keep gossiper reference
test: Remove snitch from manual gossip test
gossiper: Use container() instead of the global pointer
main, cql_test_env: Start snitch later
snitch: Move snitch_base::get_endpoint_info()
forward service: Re-use proxy's helper with duplicated code
table: Don't use global gossiper
alternator: Don't use global gossiper
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectTest.java into our our cql-pytest framework.
This large test file includes 78 tests for various types of SELECT
operations. Four additional tests require UDF in Java syntax,
and were skipped.
All 78 tests pass on Cassandra. 25 of the tests fail on Scylla
reproducing 3 already known Scylla issues and 8 previously-unknown
issues:
Previously known issues:
Refs #2962: Collection column indexing
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #8627: Cleanly reject updates with indexed values where
value > 64k
Newly-discovered issues:
Refs #10354: SELECT DISTINCT should allow filter on static columns,
not just partition keys
Refs #10357: Spurious static row returned from query with filtering,
despite not matching filter
Refs #10358: Comparison with UNSET_VALUE should produce an error
Refs #10359: "CONTAINS NULL" and "CONTAINS KEY NULL" restrictions
should match nothing
Refs #10361: Null or UNSET_VALUE subscript should generate an
invalid request error
Refs #10366: Enforce Key-length limits during SELECT
Refs #10443: SELECT with IN and ORDER BY orders rows per partition
instead of for the entire response
Refs #10448: The CQL token() function should validate its parameters
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10449
No code uses global gossiper instance, it can be removed. The main and
cql-test-env code now have their own real local instances.
This change also requires adding the debug:: pointer and fixing the
scylle-gdb.py to find the correct global location.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's yet another -test-env -- the alternator- one -- which needs
gossiper. It now uses global reference, but can grab gossiper reference
from the cql-test-env which partitipates in initialization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference is already available at the env initialization, but it's
not kept on the env instance itself. Will be used by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some places in the code has function-local gossiper reference but
continue to use global instance. Re-use the local reference (it's going
to become sharded<> instance soon).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reference is put on the snitch_ptr because this is the sharded<>
thing and because gossiper reference is the same for different snitch
drivers. Also, getting gossiper from snitch_ptr by driver will look
simpler than getting it from any base class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Snitch depends on gossiper and system keyspace, so it needs to be
started after those two do.
fixes#10402
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Today, both operations are picking the highest level as the ideal level for
placing the output, but the size of input should be used instead.
The formula for calculating the ideal level is:
ceil(log base(fan_out) of (total_input_size / max_fragment_size))
where fan_out = 10 by default,
total_input_size = total size of input data and
max_fragment_size = maximum size for fragment (160M by default)
such that 20 fragments will be placed at level 2, as level 1
capacity is 10 fragments only.
By placing the output in the incorrect level, tons of backlog will be generated
for LCS because it will either have to promote or demote fragments until the
levels are properly balanced.
"
* 'optimize_lcs_major_and_reshape/v2' of https://github.com/raphaelsc/scylla:
compaction: LCS: avoid needless work post major compaction completion
compaction: LCS: avoid needless work post reshape completion
compaction: LCS: extract calculation of ideal level for input
compaction: LCS: Fix off-by-one in formula used to calculate ideal level
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.
This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.
New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.
Fixes#10440.
Closes#10456
* github.com:scylladb/scylla:
loading_cache_test: test prepared stmts cache
loading_cache: force minimum size of unprivileged
loading_cache: extract dropping entries to lambdas
loading_cache: separately track size of sections
loading_cache: fix typo in 'privileged'
When executing internal queries, it is important that the developer
will decide if to cache the query internally or not since internal
queries are cached indefinitely. Also important is that the programmer
will be aware if caching is going to happen or not.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.
Fixes: #10450Closes#10451
* github.com:scylladb/scylla:
replica/database: drop_column_family(): drop querier cache entries after waiting for ops
replica/database: finish coroutinizing drop_column_family()
replica/database: make remove(const column_family&) private
Fix hangs on Scylla node startup with Raft enabled that were caused by:
- a deadlock when enabling the USES_RAFT feature,
- a non-voter server forgetting who the leader is and not being able to forward a `modify_config` entry to become a voter.
Read the commit messages for details.
Fixes: #10379
Refs: #10355Closes#10380
* github.com:scylladb/scylla:
raft: actively search for a leader if it is not known for a tick duration
raft: server: return immediately from `wait_for_leader` if leader is known
service: raft: don't support/advertise USES_RAFT feature
Add a new test that sets up a CQL environment with a very small prepared
statements cache. The test reproduces a scenario described in #10440,
where a privileged section of prepared statement cache gets large
and that could possibly starve the unprivileged section, making it
impossible to execute BATCH statements. Additionally, at the end of the
test, prepared statements/"simulated batches" with prepared statements
are executed a random number of times, stressing the cache.
To create a CQL environment with small prepared cache, cql_test_config
is extended to allow setting custom memory_config value.
This patch enforces a minimum size of unprivileged section when
performing shrink() operation.
When the cache is shrank, we still drop entries first from unprivileged
section (as before this commit), however if this section is already small
(smaller than max_size / 2), we will drop entries from the privileged
section.
For example if the cache could store at most 50 entries and there are 49
entries in privileged section, after adding 5 entries (that would go to
unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
New tests are added to check this behavior and bookkeeping of section
sizes.
Fixes#10440.
That's done by picking the ideal level for the input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
After the recent conversion of the row-cache, two v1 mutation sources
remained: the memtable and the kl sstable reader.
This series converts both to a native v2 implementation. The conversion
is shallow: both continue to read and process the underlying (v1) data
in v1, the fragments are converted to v2 right before being pushed to
the reader's buffer. This conversion is simple, surgical and low-risk.
It is also better than the upgrade_to_v2() used previously.
Following this, the remaining v1 reader implementations are removed,
with the exception of the downgrade_to_v1(), which is the only one left
at this point. Removing this requires converting all mutation sinks to
accept a v2 stream.
upgrade_to_v2() is now not used in any production code. It is still
needed to properly test downgrade_to_v1() (which is till used), so we
can't remove it yet. Instead it hidden as a private method of
mutation_source. This still allows for the above mentioned testing to
continue, while preventing anyone from being tempted to introduce new
usage.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191
"
* 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla:
readers: make upgrade_to_v2() private
test/lib/mutation_source_test: remove upgrade_to_v2 tests
readers: remove v1 forwardable reader
readers: remove v1 empty_reader
readers: remove v1 delegating_reader
sstables/kl: make reader impl v2 native
sstables/kl: return v2 reader from factory methods
sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
partition_snapshot_reader: convert implementation to native v2
mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.
After commit e5fc4b6, partial sst cannot be deleted because it is
incorrectly assuming all files being deleted unconditionally has
TOC, but that's not true for partial files that need to be removed.
The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.
Let's fix this by taking into account temp TOC for partial files.
Fixes#10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#10411
* github.com:scylladb/scylla:
sstables: Fix deletion of partial SSTables
sstables: Fix fsync_directory()
sstables: Rename dirname() to a more descriptive name
We don't have any upgrade_to_v2() left in production code, so no need to
keep testing it. Removing it from this test paves the way for removing
it for good (not in this series).
The only user is row level repair: it is replaced with
downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has
lots of downgrade_to_v1() calls, we will deal with these later all at
once.
Another use is the empty mutation source, this is trivially converted to
use the v2 variant.
Reads (part of operations) running concurrent to `drop_column_family()`
can create querier cache entries while we wait for them to finish in
`await_pending_ops()`. Move the cache entry eviction to after this, to
ensure such entries are also cleaned up before destroying the table
object.
This moves the `_querier_cache.evict_all_for_table()` from
`database::remove()` to `database::drop_column_family()`. With that the
former doesn't have to return `future<>` anymore. While at it (changing
the signature) also rename `column_family` -> `table`.
Also add a regression unit test.
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respecive
definitions of mutation and frozen_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10430
DynamoDB limits partition-key length to 2048 bytes and sort-key length
to 1024 bytes. Alternator currently has no such limits officially, but
if a user tries a key length of over 64 KB, the result will be an
"internal server error" as Alternator runs into Scylla's low-level key
length limit of 64 KB.
In this patch we add (mostly xfailing) tests confirming all the above
observations. The tests include extensive comments on what they are
testing and why. Some of these tests (specifically, the ones checking
what happens above 64 KB) should pass once Alternator is fixed. Other
tests - requiring that the limits be exactly what they are in DynamoDB -
may either not pass or change in the future, depending on what we decide
the limits should be in Alternator.
Refs #10347
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10438
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.
After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.
The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.
This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().
Fixes#10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.
Closes#10433
For a follower to forward requests to a leader the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
case the leader will send nothing to the follower and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to the raft's log a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting specially crafted append reject to all
nodes in the cluster on a tick in case a leader is not known. The leader
responds to this message with an empty append request which will cause the
node to learn about the leader. For optimisation purposes the patch
sends the broadcast only in case there is actually an operation that
waits for leader to be known.
Fixes#10379
In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues #10361, #10399, #10401. The existing test `test_null.py::test_map_subscript_null` reproduced all these bugs sporadically.
In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then **define** both `m[NULL]` and `NULL[2]` to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., `NULL < 2` is also defined as resulting in NULL.
However, this decision differs from Cassandra, where `m[NULL]` is considered an error but `NULL[2]` is allowed. We believe that making `m[NULL]` be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicate expressions like `m[a]`, where the column `a` can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
Fixes#10361Fixes#10399Fixes#10401Closes#10420
* github.com:scylladb/scylla:
expressions: don't dereference invalid map subscript in filter
expressions: fix invalid dereference in map subscript evaluation
test/cql-pytest: improve tests for map subscripts and nulls
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).
Cassandra returns a invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:
* For NULL, we consider m[NULL] to result in NULL - instead of an error.
This behavior is more consistent with other expressions that contain
null - for example NULL[2] and NULL<2 both result in NULL as well.
Moreover, if in the future we allow more complex expressions, such
as m[a] (where a is a column), we can find the subscript to be null
for some rows and non-null for other rows - and throwing an "invalid
query" in the middle of the filtering doesn't make sense.
* For UNSET_VALUE, we do consider this an error like Cassandra, and use
the same error message as Cassandra. However, the current implementation
checks for this error only when the expression is evaluated - not
before. It means that if the scan is empty before the filtering, the
error will not be reported and we'll silently return an empty result
set. We currently consider this ok, but we can also change this in the
future by binding the expression only once (today we do it on every
evaluation) and validating it once after this binding.
Fixes#10361Fixes#10399
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When we have an filter such as "WHERE m[2] = 3" (where m is a map
column), if a row had a null value for m, our expression evaluation
code incorrectly dereferences an unset optional, and continued
processing the result of this dereference which resulted in undefined
behavior - sometimes we were lucky enough to get "marshaling error"
but other times Scylla crashed.
The fix is trivial - just check before dereferencing the optional value
of the map. We return null in that case, which means that we consider
the result of null[2] to be null. I think this is a reasonable approach
and fits our overall approach of making null dominate expressions (e.g.,
the value of "null < 2" is also null).
The test test_filtering.py::test_filtering_null_map_with_subscript,
which used to frequently fail with marshaling errors or crashes, now
passes every time so its "xfail" mark is removed.
Fixes#10417
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_null.py::test_map_subscript_null turned out to reproduce
multiple bugs related to using map subscripts in filtering expressions.
One was issue #10361 (m[null] resulted in a bizarre error) or #10399
(m[null] resulted in a crash), and a different issue was #10401 (m[2]
resulted in a bizarre error or a crash if m itself was null). Moreover,
the same test uncovered different bugs depending how it was run - alone
or with other tests - because it was using a shared table.
In this patch we introduce two separate tests in test_filtering.py
which are designed to reproduce these separate bugs instead of mixing
them into one test. The new tests also cover a few more corners which
the previous test (which focused on nulls) missed - such as UNSET_VALUE.
The two new tests (and the old test_map_subscript_null) pass on
Cassandra so still assume that the Cassandra behavior - that m[null]
should be an error - is the correct behavior. We may want to change
the desired behavior (e.g., to decide that m[null] be null, not an
error), and change the tests accordingly later - but for now the
tests follow Cassandra's behavior exactly, and pass on Cassandra
and fail on Scylla (so are marked xfail).
The bugs reproduced by these tests involve randomness or reading
uninitialized memory, so these tests sometimes pass, sometimes fail,
and sometimes even crash (as reported in #10399 and #10401). So to
reproduce these bugs run the tests multiple times. For example:
test/cql-pytest/run --count 100 --runxfail
test_filtering.py::test_filtering_null_map_with_subscript
Refs #10361
Refs #10399
Refs #10401
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
This means there is still conversion ingoing when reading from cache and
when populating, but when reading from underlying, the stream can now be
passed through as-is without conversions.
Also, any future v2 related changes to the in-memory storage will now be
limited to the cache reader implementation itself.
In the process, the non-forwarding reader, whose only user is the cache,
is also converted to v2.
"
Performance results reported by Botond:
"
build/release/test/perf/perf_simple_query -c1 -m2G --flush --
duration=20
BEFORE
median 130421.76 tps ( 71.1 allocs/op, 12.1 tasks/op, 47462
insns/op)
median absolute deviation: 319.64
maximum: 131028.33
minimum: 127502.55
AFTER
median 133297.41 tps ( 64.1 allocs/op, 12.2 tasks/op, 45406
insns/op)
median absolute deviation: 2964.24
maximum: 137581.56
minimum: 123739.4
Getting rid of those upgrade/downgrade was good for allocs and ops.
Curiously there is a 0.1 rise in number of tasks though.
"
* 'row-cache-readers-v2/v1' of https://github.com/denesb/scylla:
row_cache: update reader implementations to v2
range_tombstone_change_generator: flush(): add end_of_range
readers/nonforwardable: convert to v2
read_context: fix indentation
read_context: coroutinize move_to_next_partition()
row_cache: cache_entry::read(): return v2 reader
row_cache: return v2 readers from make_reader*()
readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
The patchset embeds the mutation_fragment upgrading logic from v1 to v2 into the mutation_fragment_queue. This way the mutation fragments coming to the mutation_fragment_queue can be v1, but the underlying query_reader receives mutation_fragment_v2, eliminating the last usage of query_reader (v1). The last commit removes query_reader, query_reader_handle and associated factory functions.
tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)
Closes#10371
* github.com:scylladb/scylla:
readers: Remove queue_reader v1 and associated code.
repair: Make mutation_fragment_queue internally upgrade fragments to v2
repair: Make mutation_fragment_queue::impl a seastar::shared_ptr
It makes mutation_fragment_queue copyable and makes the pointer to
pending mutation fragments in next commit stable. This allows moving the
mutation_fragment_queue without breaking the underlying
upgrading_consumer.
And adjust callers. The factory functions just sprinkle upgrade_to_v2()
on returned readers for now.
One test in row_cache_test.cc had to be disabled, because the upgrade to
v2 wrapper we now have over cache readers doesn't allow it to directly
control the reader's buffer size and so the test fails. There is a FIXME
left in the test code and the test will be re-enabled once a native v2
reader implementation allows us to get rid of the upgrade wrapper.
It turns out that Cassandra does not allow IN restrictions together with
filtering, except, curiously, when the restriction is on a clustering key.
There is no real reason for this limitation - the error message even says
it is not *yet* supported.
Scylla, on the other hand, does support this case. Of course it's not
enough that we support it - we need to support it correctly... But we don't
have a full regression test that this support is correct - in
filtering_test.cc we test it with clustering and regular columns - but not
partition key columns.
So this patch adds a simple cql-pytest test that this sort of filtering
works in Scylla correctly for partition, clustering and regular columns
(and also confirms that these cases don't work, yet, on Cassandra).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220420075553.1008062-1-nyh@scylladb.com>
This patch series splits up parts of repair pipeline to allow unit testing
various bits of code without having to run full dtest suite. The reason why
repair pipeline has no unit tests is that by definition repair requires multiple
nodes, while unit test environment works only for a single node.
However, it is possible to explicitly define interfaces between various parts of the
pipeline, inject dependencies and test them individually. This patch series is focused
on taking repair_rows_on_wire (frozen mutation representation of changes coming from
another node) and flushing them to an sstable.
The commits are split into the following parts:
- pulling out classes to separate headers so that they can be included (potentially indirectly) from the test,
- pulling out repair_meta::to_repair_rows_list and part of repair_meta::flush_rows_in_working_row_buf so that they can be tested,
- refactoring repair_writer so that the actual writing logic can be injected as dependency,
- creating the unit test.
tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)
Closes#10345
* github.com:scylladb/scylla:
repair: Add unit test for flushing repair_rows_on_wire to disk.
repair: Extract mutation_fragment_queue and repair_writer::impl interfaces.
repair: Make parts of repair_writer interface private.
repair: Rename inputs to flush_rows.
repair: Make repair_meta::flush_rows a free function.
repair: Split flush_rows_in_working_row_buf to two functions and make one static.
repair: Rename inputs to to_repair_rows_list.
repair: Make to_repair_rows_list a free function.
repair: Make repair_meta::to_repair_rows_list a static function
repair: Fix indentation in repair_writer.
repair: Move repair_writer to separate header.
repair: Move repair_row to a separate header.
repair: Move repair_sync_boundary to a separate header.
repair: Move decorated_key_with_hash to separate header.
repair: Move row_repair hashing logic to separate class and file.
"
Optimize consuming from a single partition.
This gives us significant improvement with single, small mutations,
as shown with perf_mutation_readers, compared to the vector-based
flat_mutation_reader_from_mutations_v2.
These are expected to be common on the write path,
and can be optimized for view building.
results from: perf_mutation_readers -c1 --random-seed=840478750
(userspace cpu-frequency governer, 2.2GHz)
test iterations median mad min max
Before:
combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns
After:
combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns
combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns
The grand plan is to follow up
with make_flat_mutation_reader_from_frozen_mutation_v2
so that we can read directly from either a mutation
or frozen_mutation without having to unfreeze it e.g. in
table::push_view_replica_updates.
Test: unit(dev)
Perf: perf_mutation_readers(release)
"
* tag 'flat_mutation_reader_from_mutation-v3' of https://github.com/bhalevy/scylla:
perf: perf_mutation_readers: add one_mutation case
test: mutation_query_test: make make_source static
mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2
mutation readers: add make_flat_mutation_reader_from_mutation_v2
readers: delete slice_mutation.hh
test: flat_mutation_reader_test: mock_consumer: add debug logging
test: flat_mutation_reader_test: mock_consumer: make depth counter signed
"
There's a generic way to start-stop services in scylla, that includes
5 "actions" (some are optional and/or implicit though)
service_config cfg = ...
sharded<service>.start(cfg)
service.invoke_on_all(&service::start)
service.invoke_on_all(&service::shutdown)
service.invoke_on_all(&servuce::stop)
sharded<service>.stop()
and most of the service out there conforms to that scheme. Not snitch
(spoiler: and not tracing), for which there's a couple of helpers that
do all that magic behind the scenes, "configuring" snitch is done with
the help of overloaded constructors. The latter is extra complicated
with the need to register snitch drivers in class-registry for each
constructor overload. Also there's an external shards synchronization
on stop.
This set brings snitch start/stop code to the described standard: the
create/stop helpers are removed, creation acceps the config structure,
per-shard start/stop (snitch has no drain for now) happens in the
simple invoke-on-all manner.
The intended side effect of this change is the ability to add explicit
dependencies to snitch (in the future, not in this set).
tests: unit(dev)
"
* 'br-snitch-config' of https://github.com/xemul/scylla:
snitch: Remove create_snitch/stop_snitch
snitch: Simplify stop (and pause_io)
snitch: Move io_is_stopped to property-file driver
snitch: Remove init_snitch_obj()
snitch: Move instance creation into snitch_ptr constructor
snitch: Make config-based construction of all drivers
snitch: Declare snitch_ptr peering and rework container() method
snitch: Introduce container() method
Measure performance of the single-mutation reader:
make_flat_mutation_reader_from_mutation_v2.
Comparable to the `one_row` case that consumes the
single mutation using the multi-mutatio reader:
make_flat_mutation_reader_from_mutations_v2
perf_mutation_readers shows ~20-30% improvement of
make_flat_mutation_reader_from_mutation_v2
the same single mutation, just given as a single-item vector
to make_flat_mutation_reader_from_mutations_v2.
test iterations median mad min max
Before:
combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns
After:
combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns
combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>