"
This path set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test.
Fixes#6940Fixes#6975Fixes#6976
"
* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
repair: Use clear_gently in get_sync_boundary to avoid stall
utils: Add clear_gently
repair: Use merge_to_gently to merge two lists
utils: Add merge_to_gently
"
This series adds support for the "md" sstable format.
Support is based on the following:
* do not use clustering based filtering in the presence
of static row, tombstones.
* Disabling min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does
and until we process them accurately).
* Fix the way we maintain min/max column names by:
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes#4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intend of the code
more clearly.
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
Add unit tests reproducing https://github.com/scylladb/scylla/issues/3552
with clustering-key filtering enabled.
enable_sstables_md_format option is set to true as clustering-key
filtering is enabled only for md-format sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Run the test cases that write sstables using both the
mc and md versions. Note that we can still compare the
resulting Data, Index, Digest, and Filter components
with the prepared mc sstables we have since these
haven't changed in md.
We take special consideration around validating
min/max column names that are now calculated using
a revised algorithm in the md format.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Partition tombstones represent an implicit clustering range
that is unbound on both sides, so reflect than in min/max
column names metadata using empty clustering key prefixes.
If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstone, for which
we can do a better filtering job, using the sstable min/max
column names metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.
By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.
Extend the sstable_tombstone_metadata_check unit test
to cover these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.
For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.
This change keeps the min or max keys as a whole.
It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.
Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erronuously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:
<collection>.erase(
std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
<collection>.end());
In C++20 the same can be expressed with:
std::erase_if(<collection>, <predicate>);
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
If the key type is int64_t and the less-comparator is "natural" (i.e. it's
literally 'a < b') we may use the SIMD instructions to search for the key
on a node. Before doing so, the maybe_key and the searcher should be prepared
for that, in particular:
1. maybe_key should set unused keys to the minimal value
2. the searcher for this case should call the gt() helper with
primitive types -- int64_t search key and array of int64_t values
To tell to B+ code that the key-less pair is such the less-er should define
the simplify_key() method converting search keys to int64_t-s.
This searcher is selected automatically, if any mismatch happens it silently
falls back to default one. Thus also add a static assertion to the row-cache
to mitigate this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This patch series converts a few more global variables from sstring to
constexpr std::string_view.
Doing that makes it impossible for them to be part of any
initialization order problems.
"
* 'espindola/more-constexpr-v2' of https://github.com/espindola/scylla:
auth: Turn DEFAULT_USER_NAME into a std::string_view variable
auth: Turn SALTED_HASH into a std::string_view variable
auth: Turn meta::role_members_table::qualified_name into a std::string_view variable
auth: Turn meta::roles_table::qualified_name into a std::string_view variable
auth: Turn password_authenticator_name into a std::string_view variable
auth: Inline default_authorizer_name into only use
auth: Turn allow_all_authorizer_name into a std::string_view variable
auth: Turn allow_all_authenticator_name into a std::string_view variable
There is no constexpr operator+ for std::string_view, so we have to
concatenate the strings ourselves.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Add a unit test checking if the top 32 bits of the number of remaining rows in paging state
is used correctly and a manual test checking if it's possible to select over 2^32 rows from a table
and a virtual reader for this table.
Currently, we cannot select more than 2^32 rows from a table because we are limited by types of
variables containing the numbers of rows. This patch changes these types and sets new limits.
The new limits take effect while selecting all rows from a table - custom limits of rows in a result
stay the same (2^32-1).
In classes which are being serialized and used in messaging, in order to be able to process queries
originating from older nodes, the top 32 bits of new integers are optional and stay at the end
of the class - if they're absent we assume they equal 0.
The backward compatibility was tested by querying an older node for a paged selection, using the
received paging_state with the same select statement on an upgraded node, and comparing the returned
rows with the result generated for the same query by the older node, additionally checking if the
paging_state returned by the upgraded node contained new fields with correct values. Also verified
if the older node simply ignores the top 32 bits of the remaining rows number when handling a query
with a paging_state originating from an upgraded node by generating and sending such a query to
an older node and checking the paging_state in the reply(using python driver).
Fixes#5101.
Fixes#6341
Since scylla no longer supports upgrading from a version without the
"new" (dedicated) truncation record table, we can remove support for these
and the migtration thereof.
Make sure the above holds whereever this is committed.
Note that this does not remove the "truncated_at" field in
system.local.
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
"
Non-paged queries completely ignore the query result size limiter
mechanism. They consume all the memory they want. With sufficiently
large datasets this can easily lead to a handful or even a single
unpaged query producing an OOM.
This series continues the work started by 134d5a5f7, by introducing a
configurable pair of soft/hard limit (default to 1MB/100MB) that is
applied to otherwise unlimited queries, like reverse and unpaged ones.
When an unlimited query reaches the soft limit a warning is logged. This
should give users some heads-up to adjust their application. When the
hard limit is reached the query is aborted. The idea is to not greet
users with failing queries after an upgrade while at the same time
protect the database from the really bad queries. The hard limit should
be decreased from time to time gradually approaching the desired goal of
1MB.
We don't want to limit internal queries, we trust ourselves to either
use another form of memory usage control, or read only small datasets.
So the limit is selected according to the query class. User reads use
the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration
items, while internal reads are not limited. The limit is obtained by
the coordinator, who passes it down to replicas using the existing
`max_result_size` parameter (which is not a special type containing the
two limits), which is now passed on every verb, instead of once per
connection. This ensures that all replicas work with the same limits.
For normal paged queries `max_result_size` is set to the usual
`query::result_memory_limiter::maximum_result_size` For queries that can
consume unlimited amount of memory -- unpaged and reverse queries --
this is set to the value of the aforementioned
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration item,
but only for user reads, internal reads are not limited.
This has the side-effect that reverse reads now send entire
partitions in a single page, but this is not that bad. The data was
already read, and its size was below the limit, the replica might as well
send it all.
Fixes: #5870
"
* 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits)
test: database_test: add test for enforced max result limit
mutation_partition: abort read when hard limit is exceeded for non-paged reads
query-result.hh: move the definition of short_read to the top
test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit
test: set the allow_short_read slice option for paged queries
partition_slice_builder: add with_option()
result_memory_accounter: remove default constructor
query_*(): use the coordinator specified memory limit for unlimited queries
storage_proxy: use read_command::max_result_size to pass max result size around
query: result_memory_limiter: use the new max_result_size type
query: read_command: add max_result_size
query: read_command: use tagged ints for limit ctor params
query: read_command: add separate convenience constructor
service: query_pager: set the allow_short_read flag
result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit
storage_proxy: add get_max_result_size()
result_memory_limiter: add unlimited_result_size constant
database: add get_statement_scheduling_group()
database: query_mutations(): obtain the memory accounter inside
query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field
...
If the read is not paged (short read is not allowed) abort the query if
the hard memory limit is reached. On reaching the soft memory limit a
warning is logged. This should allow users to adjust their application
code while at the same time protecting the database from the really bad
queries.
The enforcement happens inside the memory accounter and doesn't require
cooperation from the result builders. This ensures memory limit set for
the query is respected for all kind of reads. Previously non-paged reads
simply ignored the memory accounter requesting the read to stop and
consumed all the memory they wanted.
Some tests use the lower level methods directly and meant to use paging
but didn't and nobody noticed. This was revealed by the enforcement of
max result size (introduced in a later patch), which caused these tests
to fail due to exceeding the max result size.
This patch fixes this by setting the `allow_short_reads` slice option.
If somebody wants to bypass proper memory accounting they should at
the very least be forced to consider if that is indeed wise and think a
second about the limit they want to apply.
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.
To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
only interested in the semaphore only anyway and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
This field will replace max size which is currently passed once per
established rpc connection via the CLIENT_ID verb and stored as an
auxiliary value on the client_info. For now it is unused, but we update
all sites creating a read command to pass the correct value to it. In the
next patch we will phase out the old max size and use this field to pass
max size on each verb instead.
The convenience constructor of read_command now has two integer
parameter next to each other. In the next patch we intend to add another
one. This is recipe for disaster, so to avoid mistakes this patch
converts these parameters to tagged integers. This makes sure callers
pass what they meant to pass. As a matter of fact, while fixing up
call-sites, I already found several ones passing `query::max_partitions`
to the `row_limit` parameter. No harm done yet, as
`query::max_partitions` == `query::max_rows` but this shows just how
easy it is to mix up parameters with the same type.
query::read_command currently has a single constructor, which serves
both as an idl constructor (order of parameters is fixed) and a convenience one
(most parameters have default values). This makes it very error prone to
add new parameters, that everyone should fill. The new parameter has to
be added as last, with a default value, as the previous ones have a
default value as well. This means the compiler's help cannot be enlisted
to make sure all usages are updated.
This patch adds a separate convenience constructor to be used by normal
code. The idl constructor looses all default parameters. New parameters
can be added to any position in the convenience constructor (to force
users to fill in a meaningful value) while the removed default
parameters from the idl constructor means code cannot accidentally use
it without noticing.
We want to switch from using a single limit to a dual soft/hard limit.
As a first step we switch the limit field of `query_class_config` to use
the recently introduced type for this. As this field has a single user
at the moment -- reverse queries (and not a lot of propagation) -- we
update it in this same patch to use the soft/hard limit: warn on
reaching the soft limit and abort on the hard limit (the previous
behaviour).
This pair of limits replace the old max_memory_for_unlimited_query one,
which remains as an alias to the hard limit. The soft limit inherits the
previous value of the limit (1MB), when this limit is reached a warning
will be logged allowing the users to adjust their client codes without
downtime. The hard limit starts out with a more permissive default of
100MB. When this is reached queries are aborted, the same behaviour as
with the previous single limit.
The idea is to allow clients a grace period for fixing their code, while
at the same time protecting the database from the really bad queries.
Currently, encountering an error when loading view build progress
would result in view builder refusing to start - which also means
that future views would not be built until the server restarts.
A more user-friendly solution would be to log an error message,
but continue to boot the view builder as if no views are currently
in progress, which would at least allow future views to be built
correctly.
The test case is also amended, since now it expects the call
to return that "no view builds are in progress" instead of
an exception.
Fixes#6934
Tests: unit(dev)
Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
I manually checked that each individual defense added is already
preventing the crash from happening.
Fixes: #6613Fixes: #6907Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
Merged patch set by Botond Dénes:
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
test/boost: view_build_test: add test_view_update_generator_buffering
test/boost: view_build_test: add test test_view_update_generator_deadlock
reader_permit: reader_resources: add operator- and operator+
reader_concurrency_semaphore: add initial_resources()
test: cql_test_env: allow overriding database_config
mutation_reader: expose new_reader_base_cost
db/view: view_updating_consumer: allow passing custom update pusher
db/view: view_update_generator: make staging reader evictable
db/view: view_updating_consumer: move implementation from table.cc to view.cc
database: add make_restricted_range_sstable_reader()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
---
db/view/view_updating_consumer.hh | 51 ++++++++++++++++++++++++++++---
db/view/view.cc | 39 +++++++++++++++++------
db/view/view_update_generator.cc | 19 +++++++++---
3 files changed, 91 insertions(+), 18 deletions(-)
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
It's better to assert a certain vector size first and only then
dereference its elements - otherwise, if a bug causes the size
to be different, the test can crash with a segfault on an invalid
dereference instead of graciously failing with a test assertion.
Currently inactive read handles are only unique within the same
semaphore, allowing for an unregister against another semaphore to
potentially succeed. This can lead to disasters ranging from crashes to
data corruption. While a handle should never be used with another
semaphore in the first place, we have recently seen a bug (#6613)
causing exactly that, so in this patch we prevent such unregister
operations from ever succeeding by making handles unique across all
semaphores. This is achieved by adding a pointer to the semaphore to the
handle.
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A test case which reproduces the view update generator hang, where the
staging reader consumes all resources and leaves none for the pre-image
reader which blocks on the semaphore indefinitely.