Commit Graph

2427 Commits

Author SHA1 Message Date
Calle Wilund
e314158708 cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474

(cherry picked from commit 78350a7e1b)
2022-05-05 11:34:56 +02:00
Tomasz Grabiec
46586532c9 loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinaotr will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
avaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
2022-05-04 15:38:11 +03:00
Avi Kivity
0114244363 Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private

(cherry picked from commit 7f1e368e92)
2022-05-01 17:11:52 +03:00
Tomasz Grabiec
0265d56173 utils/chunked_managed_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no user impact.

Fixes #10364.

Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
(cherry picked from commit 0c365818c3)
2022-04-13 10:29:30 +03:00
Tomasz Grabiec
e50452ba43 utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)

[avi: make max_chunk_capacity() public for backport]
2022-04-13 10:29:03 +03:00
Tomasz Grabiec
f136b5b950 utils/chunked_managed_vector: Fix corruption in case there is more than one chunk
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.

Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.

Currently, the container is only used in the sstable partition index
cache.

Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.

Introduced in 78e5b9fd85 (4.6.0)

Fixes #10290

Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
(cherry picked from commit 41fe01ecff)
2022-04-08 10:53:52 +03:00
Piotr Sarna
00bb1e8145 cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which lead to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302

(cherry picked from commit c0fd53a9d7)
2022-04-03 11:21:43 +03:00
Benny Halevy
b77ca07709 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
2022-03-24 18:08:07 +02:00
Benny Halevy
bb0a38f889 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttl:s.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator that computes the ttl
based on the expiry time by subtracting the expiry time from the current
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that the same expiry (by design) but have different ttl values.

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
2022-03-24 18:08:07 +02:00
Piotr Jastrzebski
93cf43ae4b cdc: Handle compact storage correctly in preimage
Base tables that use compact storage may have a special artificial
column that has an empty type.

c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.

This patch makes sure that generation of preimage ignores such empty
column as well.

Fixes #9876
Closes #9910

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 09d4438a0d)
2022-03-10 14:25:02 +02:00
Nadav Har'El
2f2d22a864 cql: INSERT JSON should refuse empty-string partition key
Add the missing partition-key validation in INSERT JSON statements.

Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).

Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.

The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.

Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.

This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.

Reported by @FortTell
Fixes #9853.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
(cherry picked from commit 8fd5041092)
2022-03-02 22:00:15 +02:00
Avi Kivity
5f92f54f06 Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that they these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.

The series also adds two safety checks to LSA to catch such problems earlier.

Fixes #10056

\cc @slivne @bhalevy

Closes #10130

* github.com:scylladb/scylla:
  lsa: Abort when trying to free a standard allocator object not allocated through the region
  lsa: Abort when _non_lsa_memory_in_use goes negative
  tests: utils: cached_file: Validate occupancy after eviction
  test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
  utils: cached_file: Fix alloc-dealloc mismatch during eviction

(cherry picked from commit ff2cd72766)
2022-02-26 11:28:53 +02:00
Nadav Har'El
4c780d0265 alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
2022-02-08 11:48:18 +02:00
Tomasz Grabiec
5b5a300a9e util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under allocating section, which disables
memory reclamation.

The bug manifests with assert failures, e.g:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
(cherry picked from commit b734615f51)
2022-01-31 01:24:47 +02:00
Pavel Emelyanov
2c65c4a569 Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec
Appending an empty range adjacent to an existing range tombstone would
not deoverlap (by dropping the empty range tombstone) resulting in
different (non canoncial) result depending on the order of appending.

Suppose that range tombstone [a, b] covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b] split around position x.

Appending [a, x) then [x, b) then [x, x) would give [a, b)
Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b)

The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical.

Fixes #9661

Closes #9764

* github.com:scylladb/scylla:
  range_tombstone_list: Deoverlap adjacent empty ranges
  range_tombstone_list: Convert to work in terms of position_in_partition

(cherry picked from commit b2a62d2b59)
2022-01-20 12:35:21 +02:00
Nadav Har'El
7b82aaf939 alternator: fix error on UpdateTable for non-existent table
When the UpdateTable operation is called for a non-existent table, the
appropriate error is ResourceNotFoundException, but before this patch
we ran into an exception, which resulted in an ugly "internal server
error".

In this patch we use the existing get_table() function which most other
operations use, and which does all the appropriate verifications and
generates the appropriate Alternator api_error instead of letting
internal Scylla exceptions escape to the user.

This patch also includes a test for UpdateTable on a non-existent table,
which used to fail before this patch and pass afterwards. We also add a
test for DeleteTable in the same scenario, and see it didn't have this
bug. As usual, both tests pass on DynamoDB, which confirms we generate
the right error codes.

Fixes #9747.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>
(cherry picked from commit 31eeb44d28)
2021-12-29 22:59:25 +02:00
Calle Wilund
11bb03e46d commitlog: Recalculate footprint on delete_segment exceptions
Fixes #9348

If we get exceptions in delete_segments, we can, and probably will, loose
track of footprint counters. We need to recompute the used disk footprint,
otherwise we will flush too often, and even block indefinately on new_seg
iff using hard limits.
2021-11-29 14:56:48 +00:00
Calle Wilund
810e410c5d commitlog_test: Add test for exception in alloc w. deleted underlying file
Tests that we can handle exception-in-alloc cleanup if the file actually
does not exist. This however uncovers another weakness (addressed in next
patch) - that we can loose track of disk footprint here, and w. hard limits
end up waiting for disk space that never comes. Thus test does not use hard
limit.
2021-11-29 14:56:43 +00:00
Calle Wilund
97f6da0c3e commitlog: Ensure failed-to-create-segment is re-deleted
Fixes #9343

If we fail in allocate_segment_ex, we should push the file opened/created
to the delete set to ensure we reclaim the disk space. We should also
ensure that if we did not recycle a file in delete_segments, we still
wake up any recycle waiters iff we made a file delete instead.

Included a small unit test.
2021-11-29 14:51:39 +00:00
Tomasz Grabiec
afc18d5070 cql: Fix missing data in indexed queries with base table short reads
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.

Fix by restoring the "remaining" count when adjusting the paging state
on short read.

Tests:

  - index_with_paging_test
  - secondary_index_test

Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>

(cherry picked from commit 1e4da2dcce)
2021-11-23 11:22:00 +02:00
Tomasz Grabiec
31bc1eb681 Merge 'Memtable reversing reader: fix computing rt slice, if there was previously emitted range tombstone.' from Michał Radwański
This PR started by realizing that in the memtable reversing reader, it
never happened on tests that `do_refresh_state` was called with
`last_row` and `last_rts` which are not `std::nullopt`.

Changes
- fix memtable test (`tesst_memtable_with_many_versions_conforms_to_mutation_source`), so that there is a background job forcing state refreshes,
- fix the way rt_slice is computed (was `(last_rts, cr_range_snapshot.end]`, now is `[cr_range_snapshot.start, last_rts)`).

Fixes #9486

Closes #9572

* github.com:scylladb/scylla:
  partition_snapshot_reader: fix indentation in fill_buffer
  range_tombstone_list: {lower,upper,}slice share comparator implementation
  test: memtable: add full_compaction in background
  partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set
  range_tombstone_list: add lower_slice
2021-11-05 15:27:03 +01:00
Nadav Har'El
b95e431228 alternator: fix bug in ReturnValues=ALL_NEW
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.

The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.

So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.

This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.

In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().

Fixes #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
2021-11-04 16:34:58 +01:00
Michał Radwański
cac9ac5126 test: memtable: add full_compaction in background
Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source
in background. Without it, some paths in the partition snapshot reader
weren't covered, as the tests always managed to read all range
tombstones and rows which cover a given clustering range from just a
single snapshot. Now, when full_compaction happens in process of reading
from a clustering range, we can force state refresh with non-nullopt
positions of last row and last range tombstone.

Note: this inability to test affected only the reversing reader.
2021-11-04 16:19:54 +01:00
Pavel Emelyanov
6e97d2ce87 Merge branch 'compaction_cleanup_and_improvements_v2' from Raphael S. Carvalho
Cleanup and improvements for compaction

* 'compaction_cleanup_and_improvements_v2' of https://github.com/raphaelsc/scylla:
  compaction: fix outdated doc of compact_sstables()
  table: fix indentation in compact_sstables()
  table: give a more descriptive name to compaction_data in compact_sstables()
  compaction_manager: rename submit_major_compaction to perform_major_compaction
  compaction: fix indentantion in compaction.hh
  compaction: move incremental_owned_ranges_checker into cleanup_compaction
  compaction: make owned ranges const in cleanup_compaction
  compaction: replace outdated comment in regular_compaction
  compaction: give a more descriptive name to compaction_data
  compaction_manager: simplify creation of compaction_data
2021-11-04 17:27:07 +03:00
Raphael S. Carvalho
8ce9cda391 compaction_manager: rename submit_major_compaction to perform_major_compaction
for symmetry, let's call it perform_* as it doesn't work like submission
functions which doesn't wait for result, like the one for minor
compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:54:00 -03:00
Raphael S. Carvalho
63dc4e2107 compaction_manager: simplify creation of compaction_data
there's no need for wrapping compaction_data in shared_ptr, also
let's kill unused params in create_compaction_data to simplify
its creation.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:35:49 -03:00
Pavel Emelyanov
12cf69e5f5 test, raft: Keep many-400 case out of debug mode
This case takes 45+ minutes which is 1.5 times longer then the
second longest case out there. I propose to keep the many-400
case out of debug runs, there's many-100 one which is configured
the same way but uses 4x times less "nodes".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-04 10:47:13 +03:00
Avi Kivity
08ce119703 Merge "Fix twcs reshape disjoint test case" from Pavel E
"
There are 3 overlapping problems with the test case. It
has use after move that covers wrond window selection and
relies on a time-since-epoch being aligned with the time
window by chance.

tests: unit(dev)
"

* 'br-twcs-test-fixes' of https://github.com/xemul/scylla:
  test, compaction: Do not rely on random timestamp
  test, compaction: Fix use after move in twcs reshape
2021-11-03 17:38:29 +02:00
Pavel Emelyanov
9628d72964 test, compaction: Do not rely on random timestamp
Again, there's a sub-case with sequential time stamps that still
works by chance. This time it's because splitting 256 sstables
into buckets of maximum 8 ones is allowed to have the 1st and the
last ones with less than 8 items in it, e.g. 3, 8, ..., 8, 5. The
exact generation depends on the time-since-epoch at which it
starts.

When all the cases are run altogether this time luckily happens
to be well-aligned with 8-hours and the generated buckets are
filled perfectly. When this particular test-case is run all alone
(e.g. by --run_test or --parallel-cases) then the starting time
becomes different and it gets less than 4 sstables in its first
bucket.

The fix is in adjusting the starting time to be aligned with the
8 hours window.

Actually, the 8 hours appeared in the previous patch, before which
it was 24 hours. Nonetheless, the above reasoning applies to any
size of the time window that's less than 256, so it's still an
independent fix.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-03 15:41:19 +03:00
Pavel Emelyanov
c6ce6b9ca1 test, compaction: Fix use after move in twcs reshape
The options are std::move-d twice -- first into schema builder then
into compaction strategy. Surprisingly, but the 2nd move makes the
test work.

There's a sub-case in this case that checks sstables with incremental
timestamps with 1 hour step -- 0h, 1h, 2h, ... 255h. Next, the twcs
buckets generator obeys a minimal threshold of 4 sstables per bucket.
Those with less sstables in are not included in the job. Finally,
since the options used to create the twcs are empty after the 1st
move the default window of 24 hours is used. If they were configured
correctly with 1 hour window then all buckets would contain 1 sstable
and the generated job would become empty.

So the fix is both -- don't move after move and make the window size
large enough to fit more sstables than the mentioned minimum.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-03 15:09:15 +03:00
Piotr Sarna
f36bbe05b4 Merge 'alternator: add support for AttributeUpdates Add operation' from Nadav Har'El
In UpdateItem's AttributeUpdates (old-style parameter) we were missing
support for the ADD operation - which can increment a number, or add
items to sets (or to lists, even though this fact isn't documented).

This two-patch series add this missing feature. The first patch just moves
an existing function to where we can reuse it, and the second patch is
the actual implementation of the feature (and enabling its test).

Fixes #5893

Closes #9574

* github.com:scylladb/scylla:
  alternator: add support for AttributeUpdates ADD operation
  alternator: move list_concatenate() function
2021-11-03 09:33:50 +01:00
Nadav Har'El
00335b1901 alternator: add support for AttributeUpdates ADD operation
In UpdateItem's AttributeUpdates (old-style parameter) we were missing
support for the ADD operation - which can increment a number, or add
items to sets (or to lists, even though this fact isn't documented).

This patch adds this feature, and the test for it begins to pass so its
"xfail" marker is removed.

Fixes #5893

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-11-03 10:19:26 +02:00
Nadav Har'El
56eb994d8f alternator: allow Authorization header to be without spaces
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.

At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.

In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).

Fixes #9568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
2021-11-03 06:38:28 +02:00
Avi Kivity
2f23d22739 Merge 'Scrub compaction: add a new in-memory partition segregation method' from Botond Dénes
The current disk-based segregation method works well enough for most cases, but it struggles greatly when there are a lot of partitions in the input. When this is the case it will produce tons of buckets (sstables), in the order of hundreds or even thousands. This puts a huge strain on different parts of the system.
This series introduces a new segregation method which specializes on the lots of small partitions case. If the conditions are right, it can cause a drastic reduction of buckets. In one case I tested, a 1.1GB sstable with 3.6M partitions in it produced just 2 output sstables, down from the 500+ with the on-disk method.
This new method uses a memtable to sort out-of-order partitions. In-order partitions bypass this sorting altogether and go to the disk directly. This method is not suitable for cases where either the partition are large or the total amount of data is large. For those the disk-based method should be used. Scrub compaction decides on the method to use based on heuristics.

Tests: unit(dev)

Closes #9548

* github.com:scylladb/scylla:
  compaction: scrub_compaction: add bucket count to finish message
  test/boost: mutation_writer_test: harden the partition-based segregator test
  mutation_writer: remove now unused on-disk partition segregator
  compaction,test: use the new in-memory segregator for scrub
  mutation_writer/partition_based_splitting_writer: add memtable-based segregator
2021-11-02 15:18:41 +02:00
Tomasz Grabiec
00814dcadc Merge "raft: randomized_nemesis_test: perform cluster reconfigurations" from Kamil
We introduce a new operation to the framework: `reconfiguration`.
The operation sends a reconfiguration request to a Raft cluster. It
bounces a few times in case of `not_a_leader` results.

A side effect of the operation is modifying a `known` set of nodes which
the operation's state has a reference to. This `known` set can then be
used by other operations (such as `raft_call`s) to find the current
leader.

For now we assume that reconfigurations are performed sequentially. If a
reconfiguration succeeds, we change `known` to the new configuration. If
it fails, we change `known` to be the set sum of the previous
configuration and the current configuration (because we don't know what
the configuration will eventually be - the old or the attempted one - so
any member of the set sum may eventually become a leader).

We use a dedicated thread (similarly to the network partitioning thread)
to periodically perform random reconfigurations.

* kbr/reconfig-v2:
  test: raft: randomized_nemesis_test: perform reconfigurations in basic_generator_test
  test: raft: randomized_nemesis_test: improve the bouncing algorithm
  test: raft: randomized_nemesis_test: handle more error types
  test: raft: randomized_nemesis_test put `variant` and `monostate` `ostream` `operator<<`s into `std` namespace
  test: raft: randomized_nemesis_test: `reconfiguration` operation
2021-11-02 13:55:45 +01:00
Botond Dénes
e4e369053b test/boost: mutation_writer_test: harden the partition-based segregator test
Test both methods: the "old" disk-based one and the recently added
in-memory one, with different configurations and also add additional
checks to ensure they don't loose data.
2021-11-02 12:24:37 +02:00
Botond Dénes
74f2290e49 mutation_writer: remove now unused on-disk partition segregator
Also removes related tests, including the exception safety test which
just spins forever with the memtable method.
2021-11-02 12:24:33 +02:00
Botond Dénes
f2f529855d compaction,test: use the new in-memory segregator for scrub 2021-11-02 09:00:44 +02:00
Nadav Har'El
e6d17d8de2 test/cql-pytest: remove "xfail" label on a reproducer for a fixed bug
The two cql-pytest tests:
        test_frozen_collection.py::test_wrong_set_order_in_nested
        test_frozen_collection.py::test_wrong_set_order_in_nested_2

Which used to fail, and therefore marked "xfail", got fixed by commit
5589f348e7 ("cql3: expr: Implement
evaluate(expr::bind_variable). That commit made the handling of bound
variables in prepared statements more rigorous, and in particular
made sure that sets are re-sorted not only if they are at the top
level of the value (as happened in the old code), but also if they
are nested inside some other container. This explains the surprising
fact that we could only reproduce bug with prepared statements, and
only with nested sets - while top-level sets worked correctly.

As the tests no longer failed and the bug tested by them really did
get fixed, in this patch we remove the "xfail" marker from these
tests.

Closes #7856. This issue was really fixed by the aforementioned commit,
but let's close it now.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211029221731.1113554-1-nyh@scylladb.com>
2021-11-01 10:29:32 +02:00
Avi Kivity
bd0b573a92 Merge 'Partition based splitting writer exception safety' from Botond Dénes
The partition based splitting writer (used by scrub) was found to be exception-unsafe, converting an `std::bad_alloc` to an assert failure. This series fixes the problem and adds a unit test checking the exception safety against `std::bad_alloc`:s fixing any other related problems found.

Fixes: https://github.com/scylladb/scylla/issues/9452

Closes #9453

* github.com:scylladb/scylla:
  test: mutation_writer_test: add exception safety test for segregate_by_partition()
  mutation_writer: segregate_by_partition(): make exception safe
  mutation_reader: queue_reader_handle: make abandoned() exception safe
  mutation_writer: feed_writers(): make it a coroutine
  mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement
2021-10-31 21:15:19 +02:00
Nadav Har'El
6ae0ea0c48 alternator: return the correct Content-Type header
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.

Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).

Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries will need to differenciate between the two
reply formats, Alternator better return the right one.

This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.

Fixes #9554.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
2021-10-31 10:50:25 +01:00
Nadav Har'El
666017f2f0 Merge 'Convert last uses of sprint() to fmt::format()' from Avi Kivity
sprint() uses the printf-style formatting language while most of our
code uses the Python-derived format language from fmt::format().

The last mass conversion of sprint() to fmt (in 1129134a4a)
missed some callers (principally those that were on multiple lines, and
so the automatic converter missed them). Convert the remainder to
fmt::format(), and some sprintf() and printf() calls, so we have just
one format language in the code base. Seastar::sprint() ought to be
deprecated and removed.

Test: unit (dev)

Closes #9529

* github.com:scylladb/scylla:
  utils: logalloc: convert debug printf to fmt::print()
  utils: convert fmt::fprintf() to fmt::print()
  main: convert fprint() to fmt::print()
  compress: convert fmt::sprintf() to fmt::format()
  tracing: replace seastar::sprint() with fmt::format()
  thrift: replace seastar::sprint() with fmt::format()
  test: replace seastar::sprint() with fmt::format()
  streaming: replace seastar::sprint() with fmt::format()
  storage_service: replace seastar::sprint() with fmt::format()
  repair: replace seastar::sprint() with fmt::format()
  redis: replace seastar::sprint() with fmt::format()
  locator: replace seastar::sprint() with fmt::format()
  db: replace seastar::sprint() with fmt::format()
  cql3: replace seastar::sprint() with fmt::format()
  cdc: replace seastar::sprint() with fmt::format()
  auth: replace seastar::sprint() with fmt::format()
2021-10-28 22:33:23 +03:00
Benny Halevy
531e32957d compaction: time_window_compaction_strategy: get_reshaping_job: consider disjointness only when trimming
With 062436829c,
we return all input sstables in strict mode
if they are dosjoint even if they don't need reshaping at all.

This leads to an infinite reshape loop when uploading sstables
with TWCS.

The optimization for disjoint sstables is worth it
also in relaxed mode, so this change first makes sorting of the input
sstables by first_key order independent of reshape_mode,
and then it add a check for sstable_set_overlapping_count
before trimming either the multi_window vector or
any single_window bucket such that we don't trim the list
if the candidates are disjoint.

Adjust twcs_reshape_with_disjoint_set_test accordingly.

And also add some debug logging in
time_window_compaction_strategy::get_reshaping_job
so one can figure out what's going on there.

Test: unit(dev)
DTest: cdc_snapshot_operation.py:CDCSnapshotOperationTest.test_create_snapshot_with_collection_list_with_base_rows_delete_type

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211025071828.509082-1-bhalevy@scylladb.com>
2021-10-28 14:35:51 +03:00
Avi Kivity
27a2c74b64 test: replace seastar::sprint() with fmt::format()
sprint() is obsolete.
2021-10-27 17:02:00 +03:00
Botond Dénes
6a76e12768 mutation_partition: row: make row marker shadowing symmetric
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakyness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.

This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.

There is still one vulnerability left: `row_marker& row_marker()`, which
allow overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.

Fixes: #9483

Tests: unit(dev)

Closes #9519
2021-10-26 20:40:31 +02:00
Benny Halevy
4062cd17e0 test: hashers_test: mutation_fragment_sanity_check: stop semaphore
To stop the semaphore as required we need run
the test in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024053402.990142-1-bhalevy@scylladb.com>
2021-10-24 11:29:23 +03:00
Botond Dénes
0d744fd3fa test: mutation_writer_test: add exception safety test for segregate_by_partition() 2021-10-21 06:50:22 +03:00
Benny Halevy
0746b5add6 storage_service: replicate_to_all_cores: update all keyspaces
Currently we update the effective_replication_map
only on non-system keyspace, leaving the system keyspace,
that uses the local replication strategy, with the empty
replication_map, as it was first initialized.

This may lead to a crash when get_ranges is called later
as seen in #9494 where get_ranges was called from the
perform_sstable_upgrade path.

This change updates the effective_replication_map on all
keyspaces rather than just on the non-system ones
and adds a unit test that reproduces #9494 without
the fix and passes with it.

Fixes #9494

Test: unit(dev), database_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211020143217.243949-1-bhalevy@scylladb.com>
2021-10-20 17:54:23 +03:00
Nadav Har'El
e4a6569258 config: experimental flag UNUSED_CDC shouldn't be distinct from UNUSED
When an experimental feature graduates from being experimental, we want
to continue allow the old "--experimental-features=..." option to work,
in case some user's configuration uses it - just do nothing. The way
we do it is to map in db::experimental_features_t::map() the feature's
name to the UNUSED value - this way the feature's name is accepted, but
doesn't change anything.

When the CDC feature graduated from being experimental, a new bit
UNUSED_CDC was introduced to do the same thing. This separate bit was
not actually necessary - if we ever check for UNUSED_CDC bit anywhere
in the code it means the flag isn't actually unused ;-) And we don't
check it.

So simplify the code by conflating UNUSED_CDC into UNUSED.
This will also make it easy to build from db::experimental_features_t::map()
a list of current experimental features - now it will simply be those that
do not map to UNUSED.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013105107.123544-1-nyh@scylladb.com>
2021-10-20 17:54:17 +03:00
Nadav Har'El
88afcc7fe3 Merge 'cql-pytest: Forbid deletions based on secondary index' from Piotr Sarna
This series fixes a bug which allowed using a secondary index in a restriction for a DELETE statement, which resulted in generating incorrect slices and deleting the whole partition instead. Secondary indexes are not meant to be used for deletes, which this series enforces by marking the indexes as not queriable.

It also comes with a reproducing test case, originally provided by @fee-mendes (thanks!).

Fixes #9495
Tests: unit(release)

Closes #9496

* github.com:scylladb/scylla:
  cql-pytest: add reproducer for deleting based on secondary index
  cql3: forbid querying indexes for deletions
2021-10-20 17:54:17 +03:00