"
After the data segregation feature, anything that causes out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
feature. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.
Fixes #6928.
"
* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
compaction/twcs: improve further debug messages
compaction/twcs: Improve debug log which shows all windows
test: Check that TWCS properly performs size-tiered compaction on past windows
compaction/twcs: Make task estimation take into account the size-tiered behavior
compaction/stcs: Export static function that estimates pending tasks
compaction/stcs: Make get_buckets() static
compact/twcs: Perform size-tiered compaction on past time windows
compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
compaction/twcs: Make newest_bucket() non-static
compaction/twcs: Move TWCS implementation into source file
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.
This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.
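The fix pattern described above can be sketched as follows. This is only an
illustration, not the actual code: `batch_registry`, `add()` and
`process_all()` are hypothetical names, plain strings stand in for sstables
and tables, and a callback that re-registers an entry stands in for a
registration that arrives while a batch is being processed.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the component that owns `_sstables_with_tables`.
class batch_registry {
    std::unordered_map<std::string, std::vector<std::string>> _pending;
public:
    void add(const std::string& sstable, const std::string& table) {
        _pending[sstable].push_back(table);
    }

    // Process everything registered so far. `process_one` may call add()
    // again, which models a registration made while a batch is in flight.
    void process_all(const std::function<void(const std::string&)>& process_one) {
        // Take ownership of the current batch first. `_pending` is left
        // empty and only receives entries registered from now on, so
        // nothing registered during processing is dropped with the batch.
        auto batch = std::exchange(_pending, {});
        for (const auto& entry : batch) {
            process_one(entry.first);
        }
    }

    std::size_t pending() const { return _pending.size(); }
};
```

With this shape, an entry added for the same key during processing is still
present in the registry afterwards and is picked up by the next batch.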
Fixes: #6892
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
Previously system.paxos TTL was set as max(3h, gc_grace_seconds).
Introduce a new per-table option named `paxos_grace_seconds` to set
the number of seconds used as the TTL for data in paxos tables
when using LWT queries against the base table.
Default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.
This change makes it easy to test various issues related to paxos TTL.
Fixes #6284
Tests: unit (dev, debug)
Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
While Alternator doesn't yet support creating a table with different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.
This patch also adds a test for this case, which failed on Alternator
before this patch.
Fixes #7031.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
This patch adds a test that attributes which serve as a key for a
secondary index still appear in the OldImage in an Alternator Stream.
This is a special case, because although usually Alternator attributes
are saved as map elements, not stand-alone Scylla columns, in the special
case of secondary-index keys they *are* saved as actual Scylla columns
in the base table. And it turns out we produce wrong results in this case:
CDC's "preimage" does not currently include these columns if they didn't
change, while DynamoDB requires that all columns, not just the changed ones,
appear in OldImage. So the test added in this patch xfails on Alternator
(and as usual, passes on DynamoDB).
Refs #7030.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812144656.148315-1-nyh@scylladb.com>
select() is too generic a name for the method that retrieves sstable runs,
and it has a completely different meaning than the former select
method, which was used to select sstables based on token range.
Let's give it a more descriptive name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811193401.22749-1-raphaelsc@scylladb.com>
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.
When the same SSTable is compacted in parallel, the strategy invariant
can be broken (for example, overlapping can be introduced in LCS), and
deletions can fail because more than one compaction process would try
to delete the same files.
Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.
Fixes #6938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
This testcase was temporarily commented out in 37ebe52, because it
relied on buggy (#6369) behaviour fixed by that commit. Specifically,
it expected a NULL comparison to match a NULL cell value. We now
bring it back, with corrected result expectation.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously, the
`count` function was often used for this in various ways.
`contains` not only expresses the intent of the code better but also
does so in a more uniform way.
This commit replaces all such occurrences of `count` with
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/7028
By Calle Wilund:
Changes the "preimage" option from binary true/false to on/off/full (accepting true/false, and using old style notation for normal to string - for upgrade reasons), where "full" will force us to include all columns in pre image log rows.
Adds small test (just adding the case to preimage test).
Uses the feature in alternator
Fixes #7030
alternator: Set "preimage" to "full" for streams
cdc_test: Do small test of "full"
cdc: Make pre image optionally "full" (include all columns)
* 'espindola/move-out-of-line' of https://github.com/espindola/scylla:
test: Move code in sstable_run_based_compaction_strategy_for_tests.hh out of line
test: Drop ifdef now that we always use c++20
test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib
The test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.
Use the table's row_cache stats instead.
Fixes #6773
Test: cql_query_test.test_cache_bypass(dev, debug)
Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
"
This patch set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test.
Fixes #6940
Fixes #6975
Fixes #6976
"
* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
repair: Use clear_gently in get_sync_boundary to avoid stall
utils: Add clear_gently
repair: Use merge_to_gently to merge two lists
utils: Add merge_to_gently
"
This series adds support for the "md" sstable format.
Support is based on the following:
* Do not use clustering-based filtering in the presence
of static rows or tombstones.
* Disable min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does
and until we process them accurately).
* Fix the way we maintain min/max column names by
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes #4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intent of the code
more clearly.
This commit replaces all the occurrences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
"
Make sure to close sstable files also on error paths.
Refs #5509
Fixes #6448
Tests: unit (dev)
"
* tag 'sstable-close-files-on-error-v6' of github.com:bhalevy/scylla:
sstable: file_writer: auto-close in destructor
sstable: file_writer: add optional filename member
sstable: add make_component_file_writer
sstable: remove_by_toc_name: accept std::string_view
sstable: remove_by_toc_name: always close file and input stream
sstable: delete_sstables: delete outdated FIXME comment
sstable: remove_by_toc_name: drop error_handler parameter
sstable: remove_by_toc_name: make static
sstable: read_toc: always close file
sstable: mark read_toc and methods calling it noexcept
sstable: read_toc: get rid of file_path
sstable: open_data, create_data: set member only on success.
sstable: open_file: mark as noexcept
sstable: new_sstable_component_file: make noexcept
sstable: new_sstable_component_file: close file on failure
sstable: rename_new_sstable_component_file: do not pass file
sstable: open_sstable_component_file_non_checked: mark as noexcept
sstable: open_integrity_checked_file_dma: make noexcept
sstable: open_integrity_checked_file_dma: close file on failure
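The "auto-close in destructor" idea from the series above can be sketched
roughly like this. It is a simplified, synchronous illustration: `fake_file`,
`auto_closing_writer` and `write_component()` are hypothetical names, and the
real file_writer deals with asynchronous closing, which this sketch does not
model.

```cpp
#include <stdexcept>

// Hypothetical stand-in for a file handle; it only counts close() calls.
struct fake_file {
    int* close_count;
    void close() { ++*close_count; }
};

class auto_closing_writer {
    fake_file _f;
    bool _closed = false;
public:
    explicit auto_closing_writer(fake_file f) : _f(f) {}
    auto_closing_writer(const auto_closing_writer&) = delete;
    auto_closing_writer& operator=(const auto_closing_writer&) = delete;

    // Explicit close for the success path; safe to call more than once.
    void close() {
        if (!_closed) {
            _closed = true;
            _f.close();
        }
    }

    // The destructor covers every path that did not reach close(),
    // including stack unwinding after an exception.
    ~auto_closing_writer() { close(); }
};

inline void write_component(fake_file f, bool fail) {
    auto_closing_writer w(f);
    if (fail) {
        throw std::runtime_error("simulated write failure");
    }
    w.close();
}
```

Both the failing and the successful call close the handle exactly once.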
Add unit tests reproducing https://github.com/scylladb/scylla/issues/3552
with clustering-key filtering enabled.
enable_sstables_md_format option is set to true as clustering-key
filtering is enabled only for md-format sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
MD format is disabled by default at this point.
The option extends enable_sstables_mc_format
so that both must be set to support
the md format.
The MD_FORMAT cluster feature will be added in
a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is required for test applications that may select an sstable
format different from the default mc format, like perf_fast_forward.
These apps don't use the gossip-based sstables_format_selector
to set the format based on the cluster feature and so they
need to rely on the db config.
Call set_format_by_config in single_node_cql_env::do_with.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Run the test cases that write sstables using both the
mc and md versions. Note that we can still compare the
resulting Data, Index, Digest, and Filter components
with the prepared mc sstables we have since these
haven't changed in md.
We take special consideration around validating
min/max column names that are now calculated using
a revised algorithm in the md format.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test also the md format in all_sstable_versions.
Add pre-computed md-sstable files generated using Cassandra version 3.11.7
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Partition tombstones represent an implicit clustering range
that is unbound on both sides, so reflect that in min/max
column names metadata using empty clustering key prefixes.
If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstones, for which
we can do a better filtering job, using the sstable min/max
column names metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.
By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.
Extend the sstable_tombstone_metadata_check unit test
to cover these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.
For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.
This change keeps the min or max keys as a whole.
It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.
Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erroneously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
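The difference between the two approaches can be reproduced with a small
sketch. `clustering_key` here is a hypothetical simplification (a vector of
strings compared lexicographically), and the special handling of shorter
prefixes as range tombstone bounds is not modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using clustering_key = std::vector<std::string>;   // hypothetical simplification
using min_max = std::pair<clustering_key, clustering_key>;

// Old approach: track the min and max of every component independently.
// Assumes a non-empty input in which all keys have the same length.
inline min_max component_wise_min_max(const std::vector<clustering_key>& keys) {
    clustering_key lo = keys.front();
    clustering_key hi = keys.front();
    for (const auto& k : keys) {
        for (std::size_t i = 0; i < k.size(); ++i) {
            lo[i] = std::min(lo[i], k[i]);
            hi[i] = std::max(hi[i], k[i]);
        }
    }
    return {lo, hi};
}

// New approach: compare keys as a whole, so min and max are always keys
// that were actually seen. Assumes a non-empty input.
inline min_max whole_key_min_max(const std::vector<clustering_key>& keys) {
    auto [lo, hi] = std::minmax_element(keys.begin(), keys.end());
    return {*lo, *hi};
}
```

For the keys {"a", "2"} and {"b", "1"}, the first function yields
min={"a", "1"} max={"b", "2"}, while the second yields min={"a", "2"}
max={"b", "1"}.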
Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:
<collection>.erase(
std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
<collection>.end());
In C++20 the same can be expressed with:
std::erase_if(<collection>, <predicate>);
This commit replaces all the occurrences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
read_toc can be marked as noexcept now that new_sstable_component_file is.
With that, other methods that call it can be marked noexcept too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
While answering a stackoverflow question on how to create an item but only
if we don't already have an item with the same key, I realized that we never
actually tested that ConditionExpression works on key columns: all the
tests we had in test_condition_expression.py had conditions on non-key
attributes. So in this patch we add two tests with a condition on the key
attribute.
Most examples of conditions on the key attributes would be silly, but
in these two tests we demonstrate how a test on key attributes can be
useful to solve the above need of creating an item if no such item
exists yet. We demonstrate two ways to do this using a condition on
the key - using either the "<>" (not equal) operator, or the
"attribute_not_exists()" function.
These tests pass - we don't have a bug in this. But it's nice to have
a test that confirms that we don't (and don't regress in that area).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200806200322.1568103-1-nyh@scylladb.com>
"
The current implementation of the B+ tree can benefit from using SIMD
instructions in the intra-node key search. This set adds this
functionality.
The general idea behind the implementation is to "ask"
the less comparator whether it is the plain "<" and allows
key simplification for this natural comparison. If it
does, the search key is simplified to int64_t, the node's
array of keys is cast to an array of integers, and both are
fed into an AVX-optimized searcher.
The searcher should work on nodes that are not filled with
keys. For performance the "unused" keys are set to int64_t
minimum, the search loop compares them too (!) and adjusts
the result index by node size. This needs some care in the
maybe_key{} wrapper.
fixes: #186
tests: unit(dev)
"
* 'br-bptree-avx-b' of https://github.com/xemul/scylla:
utils: AVX searcher
bptree: Special intra-node key search when possible
bptree: Add lesses to maybe_key template
token: Restrict TokenCarrier concept with noexcept
In case of an initialization failure after
db.get_compaction_manager().enable();
but before stop_database, we would never stop the compaction manager
and it would assert during destruction.
I am trying to add a test for this using the memory failure injector,
but that will require fixing other crashes first.
Found while debugging #6831.
Refs #6831.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200805181840.196064-1-espindola@scylladb.com>
If the key type is int64_t and the less-comparator is "natural" (i.e. it's
literally 'a < b') we may use the SIMD instructions to search for the key
on a node. Before doing so, the maybe_key and the searcher should be prepared
for that, in particular:
1. maybe_key should set unused keys to the minimal value
2. the searcher for this case should call the gt() helper with
primitive types -- int64_t search key and array of int64_t values
To tell the B+ code that the key/less pair qualifies, the less comparator should
define the simplify_key() method, which converts search keys to int64_t.
This searcher is selected automatically; if any mismatch happens, it silently
falls back to the default one. Thus also add a static assertion to the row cache
to mitigate this.
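A scalar sketch of how such a search can work when unused slots hold the
minimal value. This is one reading of the description above, not the actual
searcher: `node_capacity`, `count_greater()` and `first_greater()` are
hypothetical names, and no AVX intrinsics are used. The point is only that
the loop visits every slot, used or not, and the index is derived from the
node size afterwards.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::size_t node_capacity = 8;   // hypothetical node capacity
constexpr std::int64_t unused_key = std::numeric_limits<std::int64_t>::min();

using key_array = std::array<std::int64_t, node_capacity>;

// Counts the slots holding a value greater than `k`. Every slot is
// compared, so the loop has a fixed trip count and no data-dependent
// branch. Unused slots hold the minimal value and never compare greater.
inline std::size_t count_greater(std::int64_t k, const key_array& keys) {
    std::size_t n = 0;
    for (std::int64_t v : keys) {
        n += (v > k);
    }
    return n;
}

// Index of the first key greater than `k` in a sorted node that holds
// `size` keys in slots [0, size); the remaining slots must be unused_key.
inline std::size_t first_greater(std::int64_t k, const key_array& keys, std::size_t size) {
    return size - count_greater(k, keys);
}
```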
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The single-node test Scylla run by test/alternator/run uses, as the
default, 256 vnodes. When we have 256 vnodes and two shards, our CDC
implementation produces 512 separate "streams" (called "shards" in
DynamoDB lingo). As a result, each test in test_streams.py
that needs to read data from the stream has to make 1024 (!) API
requests (512 calls to GetShardIterator and 512 calls to GetRecords),
which takes significant time - about a second per test.
In this patch, we reduce the number of vnodes to 16. We still have
a non-negligible number of stream "shards" (32) so this part of the
CDC code is still exercised. Moreover, to ensure we still routinely
test the paging feature of DescribeStream (whose default page size
is 100), the patch changes the request to use a Limit of 10, so
paging will still be used to retrieve the list of 32 shards.
The time to run the 27 tests in test_streams.py, on my laptop:
Before this patch: 26 seconds
After this patch: 6 seconds.
Fixes #6979
Message-Id: <20200805093418.1490305-1-nyh@scylladb.com>
This patch adds additional tests for Alternator Streams, which helped
uncover 9 new issues.
The stream tests are noticeably slower than most other Alternator tests -
test_streams.py now has 27 tests taking a total of 20 seconds. Much of this
slowness is attributed to Alternator Stream's 512 "shards" per stream in the
single-node test setup with 256 vnodes, meaning that we need over 1000 API
requests per test using GetRecords. These tests could be made significantly
faster (as little as 4 seconds) by setting a lower number of vnodes.
Issue #6979 is about doing this in the future.
The tests in this patch have comments explaining clearly (I hope) what they
test, and also pointing to issues I opened about the problems discovered
through these tests. In particular, the tests reproduce the following bugs:
Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942
The tests also work around the following issues (and can be changed to
be more strict and reproduce these issues):
Refs #6918
Refs #6931
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
"
This patch series converts a few more global variables from sstring to
constexpr std::string_view.
Doing that makes it impossible for them to be part of any
initialization order problems.
"
* 'espindola/more-constexpr-v2' of https://github.com/espindola/scylla:
auth: Turn DEFAULT_USER_NAME into a std::string_view variable
auth: Turn SALTED_HASH into a std::string_view variable
auth: Turn meta::role_members_table::qualified_name into a std::string_view variable
auth: Turn meta::roles_table::qualified_name into a std::string_view variable
auth: Turn password_authenticator_name into a std::string_view variable
auth: Inline default_authorizer_name into only use
auth: Turn allow_all_authorizer_name into a std::string_view variable
auth: Turn allow_all_authenticator_name into a std::string_view variable
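As a rough illustration of why this conversion sidesteps initialization
order problems: a constexpr std::string_view is initialized at compile time,
so no constructor runs at startup and other globals can read it safely. The
namespace, the variable name and its value below are placeholders for this
sketch, not the values used in the auth code.

```cpp
#include <string>
#include <string_view>

namespace auth_sketch {   // hypothetical namespace

// Constant-initialized: there is no dynamic initializer that could run
// before or after the initializer of some other global.
inline constexpr std::string_view default_user_name = "example_user";

// Usable in constant expressions, which a string object constructed at
// startup could not be.
static_assert(default_user_name.size() == 12);
static_assert(!default_user_name.empty());

// Callers that need an owning string can still build one on demand.
inline std::string to_owned(std::string_view sv) {
    return std::string(sv);
}

} // namespace auth_sketch
```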