There is no constexpr operator+ for std::string_view, so we have to
concatenate the strings ourselves.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Fixes#6341
Since scylla no longer supports upgrading from a version without the
"new" (dedicated) truncation record table, we can remove support for these
and the migtration thereof.
Make sure the above holds whereever this is committed.
Note that this does not remove the "truncated_at" field in
system.local.
Merged pull request https://github.com/scylladb/scylla/pull/6914
by By Juliusz Stasiewicz:
The goal is to have finer control over CDC "delta" rows, i.e.:
disable them totally (mode off);
record only base PK+CK columns (mode keys);
make them behave as usual (mode full, default).
The editing of log rows is performed at the stage of finishing CDC mutation.
Fixes#6838
tests: Added CQL test for `delta mode`
cdc: Implementations of `delta_mode::off/keys`
cdc: Infrastructure for controlling `delta_mode`
Alternator Streams have a "alternator_streams_time_window_s" parameter which
is used to allow for correct ordering in the stream in the face of clock
differences between Scylla nodes and possibly network delays. This parameter
currently defaults to 10 seconds, and there is a discussion on issue #6929
on whether it is perhaps too high. But in any case, for tests running on a
single node there is no reason not to set this parameter to zero.
Setting this parameter to zero greatly speeds up the Alternator Streams
tests which use ReadRecords to read from the stream. Previously each such
test took at least 10 seconds, because the data was only readable after a
10 second delay. With alternator_streams_time_window_s=0, these tests can
finish in less than a second. Unfortunately they are still relatively slow
because our Streams implementation has 512 shards, and thus we need over a
thousand (!) API calls to read from the stream).
Running "test/alternator/run test_streams.py" with 25 tests took before
this patch 114 seconds, after this patch, it is down to 18 seconds.
Refs #6929
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200728184612.1253178-1-nyh@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
The tests in this patch, which pass on DynamoDB but fail on Alternator,
reproduce a bug described in issue #6951. This bug makes it impossible for
a Query (or Scan) to filter on an attribute if that attribute is not
requested to be included in the output.
This patch includes two xfailing tests of this type: One testing a
combination of FilterExpression and ProjectionExpression, and the second
testing a combination of QueryFilter and AttributesToGet; These two
pairs are, respectively, DynamoDB's newer and older syntaxes to achieve
the same thing.
Additionally, we add two xfailing tests that demonstrates that combining
old and new style syntax (e.g., FilterExpression with AttributesToGet)
should not have been allowed (DynamoDB doesn't allow such combinations),
but Alternator currently accepts these combinations.
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200729210346.1308461-1-nyh@scylladb.com>
The test/alternator/run script follows the pytest log with a full log of
Scylla. This saved log can be useful in diagnosing problems, but most of
it is filled with non-useful "INFO"-level messages. The two biggest
offenders are compaction - which logs every single compaction happening,
and the migration manager, which is just a second (and very long) message
about schema change operations (e.g., table creations). Neither of these
are interesting for Alternator's tests, which shouldn't care exactly when
compaction of which sstable is happening. These two components alone
are reponsible for 80% of the log lines, and 90% of the log bytes!
In this patch we increase the log level of just these two components -
compaction and migration_manager - to WARN, which reduces the log
by the same percentages (80% by lines, 90% by bytes).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200728191420.1254961-1-nyh@scylladb.com>
"
Non-paged queries completely ignore the query result size limiter
mechanism. They consume all the memory they want. With sufficiently
large datasets this can easily lead to a handful or even a single
unpaged query producing an OOM.
This series continues the work started by 134d5a5f7, by introducing a
configurable pair of soft/hard limit (default to 1MB/100MB) that is
applied to otherwise unlimited queries, like reverse and unpaged ones.
When an unlimited query reaches the soft limit a warning is logged. This
should give users some heads-up to adjust their application. When the
hard limit is reached the query is aborted. The idea is to not greet
users with failing queries after an upgrade while at the same time
protect the database from the really bad queries. The hard limit should
be decreased from time to time gradually approaching the desired goal of
1MB.
We don't want to limit internal queries, we trust ourselves to either
use another form of memory usage control, or read only small datasets.
So the limit is selected according to the query class. User reads use
the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration
items, while internal reads are not limited. The limit is obtained by
the coordinator, who passes it down to replicas using the existing
`max_result_size` parameter (which is not a special type containing the
two limits), which is now passed on every verb, instead of once per
connection. This ensures that all replicas work with the same limits.
For normal paged queries `max_result_size` is set to the usual
`query::result_memory_limiter::maximum_result_size` For queries that can
consume unlimited amount of memory -- unpaged and reverse queries --
this is set to the value of the aforementioned
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration item,
but only for user reads, internal reads are not limited.
This has the side-effect that reverse reads now send entire
partitions in a single page, but this is not that bad. The data was
already read, and its size was below the limit, the replica might as well
send it all.
Fixes: #5870
"
* 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits)
test: database_test: add test for enforced max result limit
mutation_partition: abort read when hard limit is exceeded for non-paged reads
query-result.hh: move the definition of short_read to the top
test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit
test: set the allow_short_read slice option for paged queries
partition_slice_builder: add with_option()
result_memory_accounter: remove default constructor
query_*(): use the coordinator specified memory limit for unlimited queries
storage_proxy: use read_command::max_result_size to pass max result size around
query: result_memory_limiter: use the new max_result_size type
query: read_command: add max_result_size
query: read_command: use tagged ints for limit ctor params
query: read_command: add separate convenience constructor
service: query_pager: set the allow_short_read flag
result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit
storage_proxy: add get_max_result_size()
result_memory_limiter: add unlimited_result_size constant
database: add get_statement_scheduling_group()
database: query_mutations(): obtain the memory accounter inside
query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field
...
If the read is not paged (short read is not allowed) abort the query if
the hard memory limit is reached. On reaching the soft memory limit a
warning is logged. This should allow users to adjust their application
code while at the same time protecting the database from the really bad
queries.
The enforcement happens inside the memory accounter and doesn't require
cooperation from the result builders. This ensures memory limit set for
the query is respected for all kind of reads. Previously non-paged reads
simply ignored the memory accounter requesting the read to stop and
consumed all the memory they wanted.
Some tests use the lower level methods directly and meant to use paging
but didn't and nobody noticed. This was revealed by the enforcement of
max result size (introduced in a later patch), which caused these tests
to fail due to exceeding the max result size.
This patch fixes this by setting the `allow_short_reads` slice option.
If somebody wants to bypass proper memory accounting they should at
the very least be forced to consider if that is indeed wise and think a
second about the limit they want to apply.
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.
To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
only interested in the semaphore only anyway and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
This field will replace max size which is currently passed once per
established rpc connection via the CLIENT_ID verb and stored as an
auxiliary value on the client_info. For now it is unused, but we update
all sites creating a read command to pass the correct value to it. In the
next patch we will phase out the old max size and use this field to pass
max size on each verb instead.
The convenience constructor of read_command now has two integer
parameter next to each other. In the next patch we intend to add another
one. This is recipe for disaster, so to avoid mistakes this patch
converts these parameters to tagged integers. This makes sure callers
pass what they meant to pass. As a matter of fact, while fixing up
call-sites, I already found several ones passing `query::max_partitions`
to the `row_limit` parameter. No harm done yet, as
`query::max_partitions` == `query::max_rows` but this shows just how
easy it is to mix up parameters with the same type.
query::read_command currently has a single constructor, which serves
both as an idl constructor (order of parameters is fixed) and a convenience one
(most parameters have default values). This makes it very error prone to
add new parameters, that everyone should fill. The new parameter has to
be added as last, with a default value, as the previous ones have a
default value as well. This means the compiler's help cannot be enlisted
to make sure all usages are updated.
This patch adds a separate convenience constructor to be used by normal
code. The idl constructor looses all default parameters. New parameters
can be added to any position in the convenience constructor (to force
users to fill in a meaningful value) while the removed default
parameters from the idl constructor means code cannot accidentally use
it without noticing.
We want to switch from using a single limit to a dual soft/hard limit.
As a first step we switch the limit field of `query_class_config` to use
the recently introduced type for this. As this field has a single user
at the moment -- reverse queries (and not a lot of propagation) -- we
update it in this same patch to use the soft/hard limit: warn on
reaching the soft limit and abort on the hard limit (the previous
behaviour).
This pair of limits replace the old max_memory_for_unlimited_query one,
which remains as an alias to the hard limit. The soft limit inherits the
previous value of the limit (1MB), when this limit is reached a warning
will be logged allowing the users to adjust their client codes without
downtime. The hard limit starts out with a more permissive default of
100MB. When this is reached queries are aborted, the same behaviour as
with the previous single limit.
The idea is to allow clients a grace period for fixing their code, while
at the same time protecting the database from the really bad queries.
Currently, encountering an error when loading view build progress
would result in view builder refusing to start - which also means
that future views would not be built until the server restarts.
A more user-friendly solution would be to log an error message,
but continue to boot the view builder as if no views are currently
in progress, which would at least allow future views to be built
correctly.
The test case is also amended, since now it expects the call
to return that "no view builds are in progress" instead of
an exception.
Fixes#6934
Tests: unit(dev)
Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>
In commit 8d27e1b, we added tracing (see docs/tracing.md) support to
Alternator requests. However, we never had a functional test that
verifies this feature actually works as expected, and we recently
noticed that for the GetItem and BatchGetItem requestd, the trace
doesn't really work (it returns an empty list of events).
So this patch adds a test, test/alternator/test_tracing.py, which verifies
that the tracing feature works for the PutItem, GetItem, DeleteItem,
UpdateItem, BatchGetItem, BatchWriteItem, Query and Scan operations.
This test is very peculiar. It needs to use out-of-band REST API
requests to enable and disable tracing (of course, the test is skipped
when running against AWS - this is a Scylla-only feature). It also needs
to read CQL-only system tables and does this using Alternator's
".scylla.alternator" interface for system tables - which came through
for us here beautifully and demonstrated their usefulness.
I paid a lot of attention for this test to remain reasonably fast -
this entire test now runs in a little less than one second. Achieving
this while testing eight different requests was a bit of a challenge,
because traces take time until they are visible in the trace table.
This is the main reason why in this patch the test for all eight
request types are done in one test, instead of eight separate tests.
Fixes#6891
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727115401.1199024-1-nyh@scylladb.com>
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
I manually checked that each individual defense added is already
preventing the crash from happening.
Fixes: #6613Fixes: #6907Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
Merged patch set by Botond Dénes:
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
test/boost: view_build_test: add test_view_update_generator_buffering
test/boost: view_build_test: add test test_view_update_generator_deadlock
reader_permit: reader_resources: add operator- and operator+
reader_concurrency_semaphore: add initial_resources()
test: cql_test_env: allow overriding database_config
mutation_reader: expose new_reader_base_cost
db/view: view_updating_consumer: allow passing custom update pusher
db/view: view_update_generator: make staging reader evictable
db/view: view_updating_consumer: move implementation from table.cc to view.cc
database: add make_restricted_range_sstable_reader()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
---
db/view/view_updating_consumer.hh | 51 ++++++++++++++++++++++++++++---
db/view/view.cc | 39 +++++++++++++++++------
db/view/view_update_generator.cc | 19 +++++++++---
3 files changed, 91 insertions(+), 18 deletions(-)
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
It's better to assert a certain vector size first and only then
dereference its elements - otherwise, if a bug causes the size
to be different, the test can crash with a segfault on an invalid
dereference instead of graciously failing with a test assertion.
Currently inactive read handles are only unique within the same
semaphore, allowing for an unregister against another semaphore to
potentially succeed. This can lead to disasters ranging from crashes to
data corruption. While a handle should never be used with another
semaphore in the first place, we have recently seen a bug (#6613)
causing exactly that, so in this patch we prevent such unregister
operations from ever succeeding by making handles unique across all
semaphores. This is achieved by adding a pointer to the semaphore to the
handle.
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).
This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A test case which reproduces the view update generator hang, where the
staging reader consumes all resources and leaves none for the pre-image
reader which blocks on the semaphore indefinitely.
test.py passes the "--junit-xml" option to test/alternator/run, which passes
this option to pytest to get an xunit-format summary of the test results.
However, unfortunately until very recent versions (which aren't yet in Linux
distributions), pytest defaulted to a non-standard xunit format which tools
like Jenkins couldn't properly parse. The more standard format can be chosen
by passing the option "-o junit_family=xunit2", so this is what we do in
this patch.
Fixes#6767 (hopefully).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200719203414.985340-1-nyh@scylladb.com>
"
The set's goal is to reduce the indirect fanout of 3 headers only,
but likely affects more. The measured improvement rates are
flat_mutation_reader.hh: -80%
mutation.hh : -70%
mutation_partition.hh : -20%
tests: dev-build, 'checkheaders' for changed headers (the tree-wide
fails on master)
"
* 'br-debloat-mutation-headers' of https://github.com/xemul/scylla:
headers:: Remove flat_mutation_reader.hh from several other headers
migration_manager: Remove db/schema_tables.hh inclustion into header
storage_proxy: Remove frozen_mutation.hh inclustion
storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
storage_proxy: Move hint_wrapper from .hh to .cc
headers: Remove mutation.hh from trace_state.hh
The schema_tables.hh -> migration_manager.hh couple seems to work as one
of "single header for everyhing" creating big blot for many seemingly
unrelated .hh's.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In test_streams.py, we had the line:
assert desc['StreamDescription'].get('StreamLabel')
In Alternator, the 'StreamLabel' attribute is missing, which the author of
this test probably thought would cause this test to fail (which is expected,
the test is marked with "xfail"). However, my version of pytest actually
doesn't like that assert is given a value instead of a comparison, and we
get the warning message:
PytestAssertRewriteWarning: asserting the value None, please use "assert is None"
I think that the nicest replacement for this line is
assert 'StreamLabel' in desc['StreamDescription']
This is very readable, "pythonic", and checks the right thing - it checks
that the JSON must include the 'StreamLabel' item, as the get() assertion
was supposed to have been doing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200716124621.906473-1-nyh@scylladb.com>
"
Fix#6825 by explicitly distinguishing single- from multi-column expressions in AST.
Tests: unit (dev), dtest secondary_indexes_test.py (dev)
"
* dekimir-single-multiple-ast:
cql3/restrictions: Separate AST for single column
cql3/restrictions: Single-column helper functions