With the new context manager it's now easier to request an error
to be injected via REST API. Note that error injection is only
enabled in certain build modes (dev, debug, sanitize)
and the test case will be skipped if it's not possible to use
this mechanism.
Some of the tests in test/cql-pytest share the same table but use
different keys to ensure they don't collide. Before this patch we used a
random key, which was usually fine, but we recently noticed that the
pytest-randomly plugin may cause different tests to run through the *same*
sequence of random numbers and ruin our intent that different tests use
different keys.
So instead of using a *random* key, let's use a *unique* key. We can
achieve this uniqueness trivially - using a counter variable - because
the uniqueness is anyway only needed inside a single temporary table -
which is different in every run.
Another benefit is that it will now be clearer that the tests are
deterministic and not random - the intent of a random_string() key
was never to randomly walk the entire key space (random_string()
anyway had a pretty narrow idea of what a random string looks like) -
it was just to get a unique key.
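The counter approach can be sketched in a few lines of Python (illustrative names, not the actual cql-pytest helpers):

```python
import itertools

# A process-wide counter: keys are unique within a single test run,
# regardless of how (or whether) the random generator was seeded.
_key_counter = itertools.count(1)

def unique_key(prefix="k"):
    """Return a key that is unique within this process."""
    return f"{prefix}{next(_key_counter)}"
```

Unlike a random_string() key, two tests calling unique_key() can never collide, even if pytest-randomly reseeds the RNG identically for both.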
Refs #9988 (fixes it for cql-pytest, but not for test/alternator)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.
Fix by running emplace() under allocating section, which disables
memory reclamation.
The bug manifests with assert failures, e.g.:
./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.
Fixes #9915
Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
It was observed (perhaps it depends on the Python implementation)
that an identical seed was used for multiple test cases,
which violated the assumption that generated values are in fact
unique. Using a global generator instead makes sure that it is
seeded only once.
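The failure mode and the fix can be sketched like this (a Python illustration, not the actual test code):

```python
import random

def per_test_values(seed, n=3):
    # Buggy pattern: each test seeds its own generator; identical seeds
    # yield identical "unique" values across tests.
    rng = random.Random(seed)
    return [rng.randrange(10**9) for _ in range(n)]

# Fixed pattern: one module-level generator, seeded exactly once, so
# each caller keeps advancing the same sequence.
_global_rng = random.Random(12345)

def global_values(n=3):
    return [_global_rng.randrange(10**9) for _ in range(n)]
```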
Tests: unit(dev) # alternator tests used to fail for me locally
before this patch was applied
Message-Id: <315d372b4363f449d04b57f7a7d701dcb9a6160a.1643365856.git.sarna@scylladb.com>
When performing a change through group 0 (which right now means schema
changes), clear entries from group 0 history table which are older
than one week.
This is done by including an appropriate range tombstone in the group 0
history table mutation.
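The effect of that tombstone is equivalent to the following sketch (Python for illustration; the real implementation does this with a range tombstone in the mutation):

```python
import datetime

RETENTION = datetime.timedelta(weeks=1)

def prune_group0_history(entries, now):
    """Drop history entries older than one week.

    `entries` maps a state-change timestamp to its description; the
    actual code achieves the same by writing a range tombstone covering
    everything older than the cutoff.
    """
    cutoff = now - RETENTION
    return {ts: desc for ts, desc in entries.items() if ts >= cutoff}
```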
* kbr/g0-history-gc-v2:
idl: group0_state_machine: fix license blurb
test: unit test for clearing old entries in group0 history
service: migration_manager: clear old entries from group 0 history when announcing
Raft does not need to persist the commit index since a restarted node will
either learn it from an append message from a leader or (if the entire cluster
is restarted and hence there is no leader) the new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before restart
as soon as possible without any external communication.
For them this patch introduces new persistence API that allows saving
and restoring last seen committed index.
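A sketch of what such an API could look like (illustrative names, not the actual Raft persistence interface):

```python
class InMemoryPersistence:
    """Toy stand-in for the persistence layer, extended with the new
    commit-index save/restore calls."""
    def __init__(self):
        self._commit_idx = 0

    def store_commit_idx(self, idx):
        # The commit index only ever grows; never persist a regression.
        self._commit_idx = max(self._commit_idx, idx)

    def load_commit_idx(self):
        # The last saved index, or 0 if nothing was ever saved.
        return self._commit_idx
```

On restart, a node can feed load_commit_idx() to its state machine immediately, instead of waiting to relearn the index from a leader.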
Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses for
converting the v2 input into its v1 output, enforces this with a check in
`consume_end_of_partition()`. Normally the producer of the stream the
compactor is consuming takes care of closing the active tombstone before
the stream ends. The compactor however (or its consumer) can decide to
end consumption early, e.g. to cut the current page. When this happens
the compactor must take care of closing the tombstone itself.
Furthermore it has to keep this tombstone around to re-open it on the
next page.
This patch implements this mechanism which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.
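The page-boundary handling can be sketched as follows (a Python illustration of the mechanism, not the compactor's actual code):

```python
class TombstoneCarrier:
    """Carries an active range tombstone across page boundaries."""
    def __init__(self):
        self.active = None    # currently open tombstone, if any
        self.last_pos = None  # last clustering position emitted

    def emit_row(self, pos):
        self.last_pos = pos

    def open_tombstone(self, tomb):
        self.active = tomb

    def end_page(self):
        """Close the active tombstone at the last emitted position, so
        it only covers the range that was actually emitted. The
        tombstone is kept so the next page can re-open it."""
        if self.active is None:
            return None
        return ("close", self.last_pos, self.active)

    def begin_page(self):
        if self.active is None:
            return None
        return ("open", self.last_pos, self.active)
```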
Fixes: #9907
Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
`announce` now takes a `group0_guard` by value. A `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved; it cannot be constructed outside `migration_manager`.
The guard will be a method of ensuring linearizability for group 0
operations.
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
Without this fix, generate_random_content could generate 0 entries
and the expected exception would never be injected.
With it, we generate at least 1 entry and the test passes
with the offending random-seed:
```
random-seed=1898914316
Generated 1 dir entries
Aborting lister after 1 dir entries
test/boost/lister_test.cc(96): info: check 'exception "expected_exception" raised as expected' has passed
```
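The fix amounts to clamping the generated entry count to a minimum of one, roughly (illustrative Python, not the actual C++ test helper):

```python
import random

def generate_random_content(rng, max_entries=1000):
    # Generate at least 1 entry, so the abort path (which needs to see
    # an entry before throwing) is always exercised.
    count = rng.randint(1, max_entries)
    return [f"entry{i}" for i in range(count)]
```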
Fixes #9953
Test: lister_test.test_lister_abort --random-seed=1898914316 (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123122921.14017-1-bhalevy@scylladb.com>
Take into account that get_reshaping_job selects only
buckets that have more than min_threshold sstables in them.
Therefore, with 256 disjoint sstables in different windows,
allow the first or last windows not to be selected by get_reshaping_job,
which will return at least disjoint_sstable_count - min_threshold + 1
sstables, and not more than disjoint_sstable_count.
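The acceptance bound the test uses can be expressed as (a sketch; the min_threshold value is illustrative, matching STCS's usual default):

```python
MIN_THRESHOLD = 4  # illustrative; STCS's default minimum bucket size

def reshaping_job_size_ok(disjoint_sstable_count, selected):
    """get_reshaping_job skips buckets with at most min_threshold
    sstables, so up to min_threshold - 1 sstables (in a first or last
    window) may be left out of the job."""
    lower = disjoint_sstable_count - MIN_THRESHOLD + 1
    return lower <= selected <= disjoint_sstable_count
```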
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>
Base tables that use compact storage may have a special artificial
column that has an empty type.
c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.
This patch makes sure that generation of preimage ignores such empty
columns as well.
Fixes #9876
Closes #9910
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size:
https://github.com/scylladb/scylla/issues/9897
https://github.com/scylladb/scylla/issues/9898
It also adds a couple of unit tests to test the snapshot_ctl functionality.
Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug)
Closes #9899
* github.com:scylladb/scylla:
table: get_snapshot_details: count allocated_size
snapshot_ctl: cleanup true_snapshots_size
snapshot_ctl: true_snapshots_size: do not map_reduce across all shards
snapshot_ctl uses map_reduce over all database shards,
each counting the size of the snapshots directory,
which is shared, not per-shard.
So the total live size returned by it is multiplied by the number of shards.
Add a unit test to test that.
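The bug boils down to this (sketched in Python; snapshot_dir_size stands for the size of the shared snapshots directory):

```python
def buggy_true_snapshots_size(snapshot_dir_size, shard_count):
    # map_reduce over all shards: every shard counts the same shared
    # directory, so the result is inflated shard_count times.
    return sum(snapshot_dir_size for _ in range(shard_count))

def fixed_true_snapshots_size(snapshot_dir_size, shard_count):
    # The directory is shared, not per-shard: count it exactly once.
    return snapshot_dir_size
```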
Fixes #9897
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes #9922
storage proxy uses is_timeout_exception to traverse different code paths.
a6202ae079 broke this (due to bit rot and
intermixing) by wrapping exceptions for informational purposes.
This adds check of nested types in exception handling, as well as a test
for the routine itself.
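The idea of the check, sketched in Python (walking the cause chain is the analogue of unwrapping std::nested_exception):

```python
def is_timeout_exception(exc):
    """Return True if `exc` is a timeout, or wraps one at any depth."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if isinstance(exc, TimeoutError):
            return True
        # Follow the explicit cause first, then the implicit context.
        exc = exc.__cause__ or exc.__context__
    return False
```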
Closes #9932
* github.com:scylladb/scylla:
database/storage_proxy: Use "is_timeout_exception" instead of catch match
utils::is_timeout_exception: Ensure we handle nested exception types
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.
To solve this problem, STCS reshape logic will now be reused in
each time bucket, so only similar-sized files are compacted together,
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
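Per-bucket size tiering can be sketched like so (illustrative Python; a factor-of-two tier width stands in for the real STCS options):

```python
def size_tiers(file_sizes, factor=2.0):
    """Group file sizes into tiers where each file is within `factor`
    of the tier's smallest member, so similar-sized files are compacted
    together instead of huge files with tiny ones."""
    tiers = []
    for size in sorted(file_sizes):
        if tiers and size <= tiers[-1][0] * factor:
            tiers[-1].append(size)
        else:
            tiers.append([size])
    return tiers
```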
Fixes #9938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
"
Said method should take care of checking that parsing stopped in a valid
state. This patch-set expands the existing but very lacking
implementation by improving the existing error message and adding an
additional check for prematurely exiting the parser in the middle of
parsing an index entry, something we've seen recently in #9446.
To help in debugging such issues, some additional information is added
to the trace messages.
The series also fixes a bug in the error handling code of the partition
index cache.
Refs: #9446
Tests: unit(dev)
"
* 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla:
sstables/index_reader: process_state(): add additional information to trace logging
sstables/index_reader: verify_end_state(): add check for premature EOS
sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
sstables/index_reader: add const sstable& to index_consume_entry_context
sstables/index_reader: remove unused members from index_consume_entry_context
If the compared mutations have binary keys, `colordiff` will declare the
file as binary and will refuse to compare them, beyond a very unhelpful
"binary files differ" summary. Add "-a" to the command line to force a
treating all files as text.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220117131347.106585-1-bdenes@scylladb.com>
It's currently used only by unit tests
and it is dangerous to use on a populated token_metadata,
as update_normal_tokens assumes that the set of tokens
owned by the given endpoint is complete, i.e. previous
tokens owned by the endpoint are no longer owned by it,
but the single-token update_normal_token interface
seems cumulative (and has no documentation whatsoever).
It is better to remove this interface and calculate a
complete map of endpoint->tokens from the tests.
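The intended semantics, sketched in Python (illustrative, not the actual token_metadata API):

```python
def update_normal_tokens(token_map, endpoint, tokens):
    """The passed set of tokens is complete: any token previously owned
    by `endpoint` but absent from `tokens` is no longer owned by it.
    A cumulative single-token update would instead use set union."""
    token_map[endpoint] = set(tokens)
```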
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117101242.122512-1-bhalevy@scylladb.com>
Add the missing partition-key validation in INSERT JSON statements.
Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).
Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.
The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.
Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
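The shared validation amounts to roughly the following (a Python sketch; the 65535-byte limit is an assumption, matching Cassandra's maximum key size):

```python
MAX_KEY_SIZE = 65535  # assumed limit, matching Cassandra's max key size

def validate_partition_key(key):
    """Validation every modification path - including INSERT ... JSON -
    must go through."""
    if len(key) == 0:
        raise ValueError("Key may not be empty")
    if len(key) > MAX_KEY_SIZE:
        raise ValueError(
            f"Key length of {len(key)} is longer than maximum of {MAX_KEY_SIZE}")
```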
This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.
Reported by @FortTell
Fixes #9853.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
Some of the CQL tests translated from Cassandra into the test/cql-pytest
framework used the flush() function to force a flush to sstables -
presumably because this exercised yet another code path, or because it
reproduced bugs that Cassandra once had that were only visible when
reading from sstables - not from memtables.
Until now, this flush() function was stubbed and did nothing.
But we do have in test/cql-pytest a flush() implementation in
nodetool.py - which uses the REST API if possible and if not (e.g., when
running against Cassandra) uses the external "nodetool" command.
So in this patch flush() starts to use nodetool.flush() instead of
doing nothing.
The tests continue to pass as before after this patch, and there is no
noticeable slowdown (the flush does take time, but it is done so few
times that the cost is negligible in these tests).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220117073112.83994-1-nyh@scylladb.com>
"
As a prerequisite the mutation fragment stream validator is converted to
v2 as well (but it still supports v1). We get one step closer to
eliminating conversions altogether from compaction.cc.
Tests: unit(dev)
"
* 'scrub-v2/v1' of https://github.com/denesb/scylla:
mutation_writer: remove v1 version segregate_by_partition()
compaction/compaction: remove v1 version of validate and scrub reader factory methods
tools/scylla-sstable: migrate to v2
test/boost/sstable_compaction_test: migrate validation tests to v2
test/boost/sstable_compaction_test: migrate scrub tests to v2
test/lib/simple_schema: add v2 of make_row() and make_static_row()
compaction: use v2 version of mutation_writer::segregate_by_partition()
mutation_writer: add v2 version of segregate_by_partition()
compaction: migrate scrub and validate to v2
mutation_fragment_stream_validator: migrate validator to v2
Now that issues #7586 and #9487 were fixed, reverse queries - even in
long partitions - work well, so we can drop the claim in
alternator/docs/compatibility.md that reverse queries are buggy for
large partitions.
We can also remove the "xfail" mark from the test that checks this
feature, as it now passes.
Refs #7586
Refs #9487
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #9831
dirty_memory_manager monitors memory and triggers memtable flushing if
there is too much pressure. If bad_alloc happens during the flush, it
may break the loop and flushes won't be triggered automatically, leading
to blocked writes as memory won't be automatically released.
The solution is to add exception handling to the loop, so that the inner
part always returns a non-exceptional future (meaning the loop will
break only on node shutdown).
try/catch is used around on_internal_error instead of
on_internal_error_noexcept, as the latter doesn't have a version that
accepts an exception pointer. To get the exception message from
std::exception_ptr a rethrow is needed anyway, so this was a simpler
approach.
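The shape of the fix, sketched synchronously in Python (the real code deals in futures; a plain loop stands in here):

```python
def run_flush_loop(flush_once, should_stop, report_error):
    """Keep the flush loop alive across allocation failures: the body
    must never propagate an exception, or flushes stop being triggered
    and writes block on memory that is never released."""
    while not should_stop():
        try:
            flush_once()
        except MemoryError as e:  # the bad_alloc analogue
            report_error(e)       # log it, then keep looping
```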
Fixes: #4174
Message-Id: <20220114082452.89189-1-mikolaj.sieluzycki@scylladb.com>
Just a facade using converters behind the scenes. The actual segregator
is not worth migrating to v2 while mutation and the flushing readers
don't have v2 versions. Still, migrating all users to a v2 API allows
the conversion to happen at a single point where more work is necessary,
instead of scattered around all the users.
We leave the v1 version in place to aid incremental migration to the v2
one.
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"
* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
table: add make_reader_v2()
querier: convert querier_cache and {data,mutation}_querier to v2
compaction: upgrade compaction::make_interposer_consumer() to v2
mutation_reader: remove unnecessary stable_flattened_mutations_consumer
compaction/compaction_strategy: convert make_interposer_consumer() to v2
mutation_writer: migrate timestamp_based_splitting_writer to v2
mutation_writer: migrate shard_based_splitting_writer to v2
mutation_writer: add v2 clone of feed_writer and bucket_writer
flat_mutation_reader_v2: add reader_consumer_v2 typedef
mutation_reader: add v2 clone of queue_reader
compact_mutation: make start_new_page() independent of mutation_fragment version
compact_mutation: add support for consuming a v2 stream
compact_mutation: extract range tombstone consumption into own method
range_tombstone_assembler: add get_range_tombstone_change()
range_tombstone_assembler: add get_current_tombstone()
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), that corresponds to
the original later().
This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's less clear-cut,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, any measurement that relies on the work done in
the last tick is inaccurate anyway (you need thousands of ticks to get
any amount of confidence in the measurement), so in the end it doesn't
matter which one we pick.
Tests: unit (dev)
Closes #9904
Split off of #9835.
The series removes extraneous includes of lister.hh from header files
and adds a unit test for lister::scan_dir to test throwing an exception
from the walker function passed to `scan_dir`.
Test: unit(dev)
Closes #9885
* github.com:scylladb/scylla:
test: add lister_list
lister: add more overloads of fs::path operator/ for std::string and string_view
resource_manager: remove unnecessary include of lister.hh from header file
sstables: sstable_directory: remove unnecessary include of lister.hh from header file
Test the lister class.
In particular the ability to abort the lister
when the walker function throws an exception.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The first patch introduces evictable_reader_v2, and the second one
further simplifies it. We clone instead of converting because there
is at least one downstream use (by way of multishard_combining_reader)
that is not itself straightforward to convert at the moment
(multishard_mutation_query), and because evictable_reader instances
cannot be {up,down}graded (since users also access the underlying
buffers). This also means that shard_reader, reader_lifecycle_policy
and multishard_combining_reader have to be cloned.
"
* tag 'clone-evictable-reader-to-v2/v3' of https://github.com/cmm/scylla:
convert make_multishard_streaming_reader() to flat_mutation_reader_v2
convert table::make_streaming_reader() to flat_mutation_reader_v2
convert make_flat_multi_range_reader() to flat_mutation_reader_v2
view_update_generator: remove unneeded call to downgrade_to_v1()
introduce multishard_combining_reader_v2
introduce shard_reader_v2
introduce the reader_lifecycle_policy_v2 abstract base
evictable_reader_v2: further code simplifications
introduce evictable_reader_v2 & friends
After the rewrite of the test/scylla-gdb, the test for "scylla fiber"
was disabled - and this patch brings it back.
For the "scylla fiber" operation to do something interesting (and not just
print an error message and seem to succeed...) it needs a real task pointer.
The old code interrupted Scylla in a breakpoint and used get_local_tasks(),
but in the new test framework we attach to Scylla while it's idle, so
there are no ready tasks. So in this patch we use the find_vptrs()
function to find a continuation from http_server::do_accept_one() - it
has an interesting fiber of 5 continuations.
After this patch all 33 tests in test/scylla-gdb/test_misc.py pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220110211813.581807-1-nyh@scylladb.com>
distributed_loader is a replica-side thing, so it belongs in the
replica module ("distributed" refers to its ability to load
sstables in their correct shards). So move it to the replica
module.
The change exposes a dependency on the construction
order of static variables (which isn't defined), so we remove
the dependency in the first two patches.
Closes #9891
* github.com:scylladb/scylla:
replica: move distributed_loader into replica module
tracing: make sure keyspace and table names are available to static constructors
auth: make sure keyspace and table names are available to static constructors
As already noted in commit eac6fb8, many of the scylla-gdb tests fail on
aarch64 for various reasons. The solution used in that commit was to
have test/scylla-gdb/run pretend to succeed - without testing anything -
when not running on x86_64. This workaround was accidentally lost when
scylla-gdb/run was recently rewritten.
This patch brings this workaround back, but in a slightly different form -
instead of the run script not doing anything, the tests do get called, but
the "gdb" fixture in test/scylla-gdb/conftest.py causes each individual
test to be skipped.
The benefit of this approach is that it can easily be improved in the
future to only skip (or xfail) specific tests which are known to fail on
aarch64, instead of all of them - as half of the tests do pass on aarch64.
Fixes#9892.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220109152630.506088-1-nyh@scylladb.com>
Said wrapper was conceived to wrap the then-unmovable `compact_mutation`,
because readers wanted movable consumers. But `compact_mutation` has been
movable for years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.