"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall back to the per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handler registration time. This method needs the default client idx for
each verb, but if verb registration ran under the system group it
instead got the non-default one. As a result, the per-handler isolation
was not set up for the default statement connection, so default
statement verb handlers ran in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
In particular this caused severe problems with range scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in the system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle of the readers and that of
the slice and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
I manually checked that each individual defense added is already
preventing the crash from happening.
Fixes: #6613
Fixes: #6907
Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
Merged patch set by Botond Dénes:
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
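The anti-thrashing rule above can be sketched roughly as follows. This is
a minimal illustration only: `eviction_gate` and its methods are
hypothetical names, not the actual view update generator code. Only the
1MB threshold comes from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: tracks how much was read since the reader was
// last (re)admitted and only allows eviction past a threshold.
class eviction_gate {
    static constexpr std::size_t min_bytes_before_eviction = 1024 * 1024; // 1MB
    std::size_t _bytes_read = 0;
public:
    // Called after each chunk is read from the staging reader.
    void on_read(std::size_t bytes) { _bytes_read += bytes; }
    // The reader may be registered as evictable only once enough was read.
    bool may_evict() const { return _bytes_read >= min_bytes_before_eviction; }
    // Called when an evicted reader is recreated and resumes reading.
    void on_resume() { _bytes_read = 0; }
};
```

With such a gate, a reader that has consumed only a small partition is
never offered for eviction, while a long-running one eventually is.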
test/boost: view_build_test: add test_view_update_generator_buffering
test/boost: view_build_test: add test test_view_update_generator_deadlock
reader_permit: reader_resources: add operator- and operator+
reader_concurrency_semaphore: add initial_resources()
test: cql_test_env: allow overriding database_config
mutation_reader: expose new_reader_base_cost
db/view: view_updating_consumer: allow passing custom update pusher
db/view: view_update_generator: make staging reader evictable
db/view: view_updating_consumer: move implementation from table.cc to view.cc
database: add make_restricted_range_sstable_reader()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
---
db/view/view_updating_consumer.hh | 51 ++++++++++++++++++++++++++++---
db/view/view.cc | 39 +++++++++++++++++------
db/view/view_update_generator.cc | 19 +++++++++---
3 files changed, 91 insertions(+), 18 deletions(-)
In some cases the estimated number of partitions can be 0, which, albeit
a legitimate estimation result, breaks a lot of low-level sstable writer
code, so some of these writers have assertions to ensure estimated
partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone, as #6913 has shown. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, as it is unnecessary now that the
writer does it itself.
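A minimal sketch of the idea, assuming a hypothetical `sketch_writer`
class (this is not the real sstable writer interface): the clamp lives in
the writer's constructor, so callers can pass 0 without tripping the
assertion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an sstable writer.
class sketch_writer {
    uint64_t _estimated_partitions;
public:
    // Clamp once, inside the writer, instead of at every call site.
    explicit sketch_writer(uint64_t estimated_partitions)
        : _estimated_partitions(std::max<uint64_t>(estimated_partitions, 1)) {
        // The invariant the low-level code relies on.
        assert(_estimated_partitions > 0);
    }
    uint64_t estimated_partitions() const { return _estimated_partitions; }
};
```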
Fixes #6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
It's better to assert a certain vector size first and only then
dereference its elements - otherwise, if a bug causes the size
to be different, the test can crash with a segfault on an invalid
dereference instead of gracefully failing with a test assertion.
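The ordering can be sketched as below. `checked_at` is a hypothetical
helper written for illustration; a real test would use the test
framework's own size assertion rather than throwing by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical test helper: verify the size first, so a wrong size is
// reported as a failure instead of an invalid dereference.
int checked_at(const std::vector<int>& rows, std::size_t expected_size,
               std::size_t index) {
    if (rows.size() != expected_size) {
        throw std::runtime_error("unexpected number of rows");
    }
    // Only now is it known that the vector has the expected shape.
    return rows.at(index);
}
```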
Currently inactive read handles are only unique within the same
semaphore, allowing for an unregister against another semaphore to
potentially succeed. This can lead to disasters ranging from crashes to
data corruption. While a handle should never be used with another
semaphore in the first place, we have recently seen a bug (#6613)
causing exactly that, so in this patch we prevent such unregister
operations from ever succeeding by making handles unique across all
semaphores. This is achieved by adding a pointer to the semaphore to the
handle.
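A simplified sketch of the technique (C++17), assuming hypothetical
`sketch_semaphore` and `inactive_read_handle` types with a string
standing in for the reader; the real classes differ. The handle carries a
pointer to the issuing semaphore, so an id collision between two
semaphores can no longer produce a match.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

class sketch_semaphore;

// The handle remembers which semaphore issued it.
struct inactive_read_handle {
    const sketch_semaphore* sem = nullptr;
    uint64_t id = 0;
};

class sketch_semaphore {
    uint64_t _next_id = 1;
    std::map<uint64_t, std::string> _inactive_reads; // id -> stand-in reader
public:
    inactive_read_handle register_inactive_read(std::string reader) {
        const uint64_t id = _next_id++;
        _inactive_reads.emplace(id, std::move(reader));
        return inactive_read_handle{this, id};
    }
    // Returns the reader only if this semaphore issued the handle.
    std::optional<std::string> unregister_inactive_read(inactive_read_handle h) {
        if (h.sem != this) {
            return std::nullopt; // foreign handle, even if the id collides
        }
        auto it = _inactive_reads.find(h.id);
        if (it == _inactive_reads.end()) {
            return std::nullopt;
        }
        std::string reader = std::move(it->second);
        _inactive_reads.erase(it);
        return reader;
    }
};
```

Two semaphores that each hand out id 1 now reject each other's handles,
which is the failure mode seen in #6613.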
A test case which reproduces the view update generator hang, where the
staging reader consumes all resources and leaves none for the pre-image
reader which blocks on the semaphore indefinitely.
"
The set's goal is to reduce the indirect fanout of 3 headers only,
but likely affects more. The measured improvement rates are
flat_mutation_reader.hh: -80%
mutation.hh : -70%
mutation_partition.hh : -20%
tests: dev-build, 'checkheaders' for changed headers (the tree-wide
fails on master)
"
* 'br-debloat-mutation-headers' of https://github.com/xemul/scylla:
headers:: Remove flat_mutation_reader.hh from several other headers
migration_manager: Remove db/schema_tables.hh inclustion into header
storage_proxy: Remove frozen_mutation.hh inclustion
storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
storage_proxy: Move hint_wrapper from .hh to .cc
headers: Remove mutation.hh from trace_state.hh
The schema_tables.hh -> migration_manager.hh couple seems to work as a
"single header for everything", creating big bloat for many seemingly
unrelated .hh's.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Fix #6825 by explicitly distinguishing single- from multi-column expressions in AST.
Tests: unit (dev), dtest secondary_indexes_test.py (dev)
"
* dekimir-single-multiple-ast:
cql3/restrictions: Separate AST for single column
cql3/restrictions: Single-column helper functions
The corresponding overload of `storage_proxy::mutate_locally`
was hardcoded to pass `db::commitlog::force_sync::no` to
`database::apply`. Turn it into a parameter and pass `force_sync::no`
at all existing call sites (as it was before).
`force_sync::yes` will be used later for paxos learn writes
when trying to apply mutations upgraded from an obsolete
schema version (similar to the current case when applying
locally a `frozen_mutation` stored in accepted proposal).
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>
The existing AST assumes the single-column expression is a special case
of multi-column expressions, so it cannot distinguish `c=(0)` from
`(c)=(0)`. This leads to incorrect behaviour and dtest failures. Fix
it by separating the two cases explicitly in the AST representation.
Modify AST-creation code to create different AST for single- and
multi-column expressions.
Modify AST-consuming code to handle column_name separately from
vector<column_name>. Drop code relying on cardinality testing to
distinguish single-column cases.
Add a new unit test for `c=(0)`.
Fixes #6825.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Merge pull request https://github.com/scylladb/scylla/pull/6834 by
Juliusz Stasiewicz:
NULLs used to give false positives in GT, LT, GEQ and LEQ ops performed upon
ALLOW FILTERING. That was a consequence of not distinguishing NULL from an
empty buffer.
This patch excludes NULLs at a high level, preventing them from entering the LHS
of any comparison, i.e. it assumes that any binary operation should return
false whenever the LHS operand is NULL (note: at the moment filters with
RHS NULL, such as ...WHERE x=NULL ALLOW FILTERING, return empty sets anyway).
Fixes #6295
* '6295-do-not-compare-nulls-v2' of github.com:jul-stas/scylla:
filtering_test: check that NULLs do not compare to normal values
cql3/restrictions: exclude NULLs from comparison in filtering
Tested operators are: `<` and `>`. Tests all types of NULLs except
`duration` because durations are explicitly not comparable. Values
to compare against were chosen arbitrarily.
The collection is a K:V store
bplus::tree<Key = K, Value = array_trusted_bounds<V>>
It will be used as the partitions cache. The outer tree is used to
quickly map a token to a cache_entry, the inner array -- to resolve
(expected to be rare) hash collisions.
It also must be equipped with two comparators -- a less one for
keys and a full one for values. The latter is not kept on-board,
but it is required on all calls.
The core API consists of just 2 calls
- Heterogeneous lower_bound(search_key) -> iterator : finds the
  element that's greater than or equal to the provided search key
  Other than the iterator the call returns a "hint" object
  that helps the next call.
- emplace_before(iterator, key, hint, ...) : the call constructs
  the element right before the given iterator. The key and hint
  are needed for a more optimal algorithm, but strictly speaking
  are not required.
Adding an entry to the double_decker may result in growing the
node's array. Here the B+ iterator's .reconstruct() method
comes into play. The new array is created, old elements are
moved onto it, then the fresh node replaces the old one.
// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.
Insertion into this collection _may_ invalidate iterators, but
may also leave them intact. Invalidation only happens in case of
a hash conflict, which can be clearly seen from the hint object,
so there's good room for improvement.
The main usage by row_cache (the find_or_create_entry) looks like
    cache_entry find_or_create_entry() {
        bound_hint hint;
        auto it = lower_bound(decorated_key, &hint);
        if (!hint.found) {
            it = emplace_before(it, decorated_key.token(), hint,
                                <constructor args>);
        }
        return *it;
    }
Now the hint. It contains 3 booleans, that are
- match: set to true when the "greater or equal" condition
evaluated to "equal". This frees the caller from the need
to manually check whether the entry returned matches the
search key or the new one should be inserted.
This is the "!found" check from the above snippet.
To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:
{ 3, "a" }, { 5, "z" }
As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:
{ 3, "z" } -> { 5, "z" }
{ 4, "a" } -> { 5, "z" }
{ 5, "a" } -> { 5, "z" }
Apparently, the lower bound for those 3 elements is the same,
but the code-flows of emplacing them before it differ drastically.
{ 3, "z" } : need to get the previous element from the tree and
push the element to its vector's back
{ 4, "a" } : need to create new element in the tree and populate
its empty vector with the single element
{ 5, "a" } : need to put the new element in the found tree
element right before the found vector position
To make one of the above decisions the .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information was already known inside the
lower_bound call and can be reported via the hint.
That said,
- key_match: set to true if tree.lower_bound() found the element
for the Key (which is token). For above examples this will be
true for cases 3z and 5a.
- key_tail: set to true if the tree element was found, but when
comparing values from array the bounding element turned out
to belong to the next tree element and the iterator was ++-ed.
For above examples this would be true for case 3z only.
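The three hint bits can be illustrated with a small stand-in model. This
is only a sketch under stated assumptions: `std::map` plays the role of
the B+ tree, `std::vector` the role of the inner array, and the names
`outer_tree`, `position` and `dd_lower_bound` are hypothetical. It is not
the double_decker implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// token -> sorted partition keys sharing that token
using outer_tree = std::map<int, std::vector<std::string>>;

struct bound_hint {
    bool match = false;     // the exact {token, key} element was found
    bool key_match = false; // the token is present in the outer tree
    bool key_tail = false;  // token present, but the bound is in the next element
};

struct position {
    outer_tree::const_iterator outer; // tree element holding the bound, or end()
    std::size_t idx = 0;              // index inside that element's array
};

position dd_lower_bound(const outer_tree& t, int token,
                        const std::string& key, bound_hint& hint) {
    hint = bound_hint{};
    auto outer = t.lower_bound(token);
    if (outer == t.end() || outer->first != token) {
        return {outer, 0}; // token absent: bound is the head of the next element
    }
    hint.key_match = true;
    const auto& arr = outer->second;
    assert(!arr.empty()); // tree elements never hold an empty array in this model
    auto inner = std::lower_bound(arr.begin(), arr.end(), key);
    if (inner == arr.end()) {
        hint.key_tail = true; // every key under this token sorts before `key`
        return {std::next(outer), 0};
    }
    hint.match = (*inner == key);
    return {outer, static_cast<std::size_t>(inner - arr.begin())};
}
```

For the tree { 3, "a" }, { 5, "z" } all three lookups from the example
land on { 5, "z" }, yet the hints differ, which is what lets
emplace_before pick the right code-flow without comparing again.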
And last, but not least -- the "erase self" feature: given only
the cache_entry pointer at hand, remove the entry from the
collection. To make this happen we need to take two steps:
1. get the array the entry sits in
2. get the b+ tree node the vector sits in
Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get iterator from the given T pointer, the algo
looks like
- Walk back the T array until hitting the head element
- Call array_trusted_bounds::from_element() getting the array
- Construct b+ iterator from obtained array
- Construct the double_decker iterator from b+ iterator and from
the number of "steps back" from above
- Call double_decker::iterator.erase()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A plain array of elements that grows and shrinks by
constructing the new instance from an existing one and
moving the elements from it.
Behaves similarly to vector's external array, but has
0-bytes overhead. The array bounds (0-th and N-th
elements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.
To remove an element from the array there's also a nothrow
option that drops the requested element from the array,
shifts the ones to its right leftwards and keeps the trailing
unused memory (the so-called "train") until reconstruction
or destruction.
Also comes with a lower_bound() helper that helps keep
the elements sorted and a from_element() one that
returns a reference back to the array in which the element
sits.
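How bounds can be recovered from per-element flags alone is sketched
below. The `flagged_element` type and both functions are hypothetical and
use plain public fields instead of the getters and setters mentioned
above; the real array_trusted_bounds is more involved.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type: the array keeps no size or header, the
// first and last elements mark themselves.
struct flagged_element {
    int value;
    bool is_head; // set on the 0-th element only
    bool is_tail; // set on the last element only
};

// Walk back from any element to the head; returns the number of steps,
// which is also the element's index in the array.
std::size_t steps_back_to_head(const flagged_element* e) {
    assert(e != nullptr);
    std::size_t steps = 0;
    while (!e->is_head) {
        --e;
        ++steps;
    }
    return steps;
}

// Walk forward from the head to the tail to learn the array size.
std::size_t size_from_head(const flagged_element* head) {
    assert(head != nullptr && head->is_head);
    std::size_t n = 1;
    while (!head->is_tail) {
        ++head;
        ++n;
    }
    return n;
}
```

Both walks trust the flags completely: a missing head or tail flag makes
them run off the array, so the flags have to be maintained on every
mutation.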
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ
This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.
1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys
The design, briefly is:
There are 3 types of nodes: inner, leaf and data, inner and leaf
keep a built-in array of N keys and N(+1) nodes. Leaf nodes sit in
a doubly linked list. Data nodes live separately from the leaf ones
and keep pointers on them. Tree handler keeps pointers on root and
left-most and right-most leaves. Nodes do _not_ keep pointers or
references on the tree (except 3 of them, see below).
changes in v9:
- explicitly marked keys/kids indices with type aliases
- marked the whole erase/clear stuff noexcept
- disposers now accept object pointer instead of reference
- clear tree in destructor
- added more comments
- style/readability review comments fixed
Prior changes
**
- Add noexcepts where possible
- Restrict Less-comparator constraint -- it must be noexcept
- Generalized node_id
- Packed code for begin()/cbegin()
**
- Unsigned indices everywhere
- Cosmetics changes
**
- Const iterators
- C++20 concepts
**
- The index_for() implementation is templatized the other way
to make it possible for AVX key search specialization (further
patching)
**
- Insertion tries to push kids to siblings before split
Before this change insertion into a full node resulted in this
node being split into two equal parts. This behaviour for random
keys stress gives a tree with ~2/3 of nodes half-filled.
With this change before splitting the full node try to push one
element to each of the siblings (if they exist and not full).
This slows the insertion a bit (but it's still way faster than
the std::set), but gives 15% less total number of nodes.
- Iterator method to reconstruct the data at the given position
The helper creates a new data node, emplaces data into it and
replaces the iterator's one with it. Needed to keep arrays of
data in tree.
- Milli-optimize erase()
- Return back an iterator that will likely be not re-validated
- Do not try to update ancestors separation key for leftmost kid
This caused clear()-like workloads to perform poorly compared to
std::set. In particular the row_cache::invalidate() method does
exactly this and this change improves its timing.
- Perf test to measure drain speed
- Helper call to collect tree counters
**
- Fix corner case of iterator.emplace_before()
- Clean heterogeneous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
the bound hit the key or not
- Addressed style/cleanness review comments
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
NULLs used to give false positives in GT, LT, GEQ and LEQ ops
performed upon `ALLOW FILTERING`. That was a consequence of
not distinguishing NULL from an empty buffer. This patch excludes
NULLs at a high level, preventing them from entering any comparison,
i.e. it assumes that any binary operator should return `false`
whenever one of the operands is NULL (note: ATM filters such as
`...WHERE x=NULL ALLOW FILTERING` return empty sets anyway).
`restriction_test/regular_col_slice` had to be updated accordingly.
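The rule can be sketched as follows (C++17). Here `cell` and
`filter_less` are hypothetical names: `std::nullopt` models NULL and an
empty string models an empty buffer. Note that `std::optional`'s built-in
`operator<` orders `std::nullopt` below every value, so it cannot be used
directly for this.

```cpp
#include <cassert>
#include <optional>
#include <string>

// nullopt models NULL; an engaged but empty string is an empty buffer.
using cell = std::optional<std::string>;

// Hypothetical filter predicate for `lhs < rhs`: NULL never compares.
bool filter_less(const cell& lhs, const cell& rhs) {
    if (!lhs || !rhs) {
        return false; // exclude NULLs before any byte comparison happens
    }
    return *lhs < *rhs;
}
```

An empty buffer still compares as a regular value, only NULL is excluded.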
Fixes #6295
Merged pull request https://github.com/scylladb/scylla/pull/6741
by Piotr Dulikowski:
This PR changes the algorithm used to generate preimages and postimages
in CDC log. While its behavior is the same for non-batch operations
(with one exception described later), it generates pre/postimages that
are organized more nicely, and account for multiple updates to the same
row in one CQL batch.
Fixes #6597, #6598
Tests:
- unit(dev), for each consecutive commit
- unit(debug), for the last commit
Previous method
The previous method worked on a per delta row basis. First, the base
table is queried for the current state of the rows being modified in
the processed mutation (this is called the "preimage query"). Then,
for each delta row (representing a modification of a row):
If preimage is enabled and the row was already present in the table,
a corresponding preimage row is inserted before the delta row.
The preimage row contains data taken directly from the preimage
query result. Only columns that are modified by the delta are
included in the preimage.
If postimage is enabled, then a postimage row is inserted after the
delta row. The postimage row contains data which was a result of
taking row data directly from the preimage query result and applying
the change the corresponding delta row represented. All columns
of the row are included in the postimage.
The above works well for simple cases such as a singular CQL INSERT,
UPDATE, DELETE, or simple CQL BATCH-es. An example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v) VALUES (0, 1, 111);
INSERT INTO tbl (pk, ck, v) VALUES (0, 2, 222);
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v from ks.tbl_scylla_cdc_log ;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v
------------------+---------------+---------+----+----+-----
...snip...
0 | 0 | null | 0 | 1 | 100
1 | 2 | null | 0 | 1 | 111
2 | 9 | null | 0 | 1 | 111
3 | 0 | null | 0 | 2 | 200
4 | 2 | null | 0 | 2 | 222
5 | 9 | null | 0 | 2 | 222
Preimage rows are represented by cdc operation 0, and postimage by 9.
Please note that all rows presented above share the same value of
cdc$time column, which was not shown here for brevity.
Problems with previous approach
This simple algorithm has some conceptual and implementational problems
which arise when processing more complicated CQL BATCH-es. Consider
the following example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v1) VALUES (0, 0, 1) USING TTL 1000;
INSERT INTO tbl (pk, ck, v2) VALUES (0, 0, 2) USING TTL 2000;
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v1, v2 FROM tbl_scylla_cdc_log;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v1 | v2
------------------+---------------+---------+----+----+------+------
...snip...
0 | 0 | null | 0 | 0 | null | 0
1 | 2 | 2000 | 0 | 0 | null | 2
2 | 9 | null | 0 | 0 | 0 | 2
3 | 0 | null | 0 | 0 | 0 | null
4 | 1 | 1000 | 0 | 0 | 1 | null
5 | 9 | null | 0 | 0 | 1 | 0
A single cdc group (corresponding to rows sharing the same cdc$time)
might have more than one delta that modify the same row. For example,
this happens when modifying two columns of the same row with
different TTLs - due to our choice of CDC log schema, we must
represent such change with two delta rows.
It does not make sense to present a postimage after the first delta
and preimage before the second - both deltas are applied
simultaneously by the same CQL BATCH, so the middle "image" is purely
imaginary and does not appear at any point in the table.
Moreover, in this example, the last postimage is wrong - v1 is updated,
but v2 is not. None of the postimages presented above represent the
final state of the row.
New algorithm
The new algorithm now works on a per cdc group basis, not per delta row.
When starting processing a CQL BATCH:
Load preimage query results into a data structure representing
current state of the affected rows.
For each cdc group:
For each row modified within the group, a preimage is produced,
regardless of whether the row was present in the table. The preimage
is calculated based on the current state. Only include columns
that are modified for this row within the group.
For each delta, produce a delta row and update the current state
accordingly.
Produce postimages in the same way as preimages - but include all
columns for each row in the postimage.
The new algorithm produces the postimage correctly when multiple deltas
affect one row, because the state of the row is updated on the fly.
This algorithm moves preimage and postimage rows to the beginning and
the end of the cdc group, accordingly. This solves the problem of
imaginary preimages and postimages appearing inside a cdc group.
Unfortunately, it is possible for one CQL BATCH to contain changes that
use multiple timestamps. This will result in one CQL BATCH creating
multiple cdc groups, with different cdc$time. As it is impossible, with
our choice of schema, to tell that those cdc groups were created from
one CQL BATCH, instead we pretend as if those groups were separate CQL
operations. By tracking the state of the affected rows, we make sure
that preimage in later groups will reflect changes introduced in
previous groups.
One more thing - this algorithm should have the same results for
singular CQL operations and simple CQL BATCH-es, with one exception.
Previously, the preimage was not produced if a row was not present in the
table. Now, the preimage row will appear unconditionally - it will have
nulls in place of column values.
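The per-group flow can be sketched with a toy model. Everything here is
an assumption made for illustration: rows are keyed by an int, columns
hold ints, a column missing from the map stands for null, and
`process_group` is a hypothetical name. It is not the cdc transformer
code.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class op { preimage, delta, postimage };

using row_key = int;
using columns = std::map<std::string, int>; // absent column models null
using table_state = std::map<row_key, columns>;

struct delta { row_key row; std::string column; int value; };
struct log_row { op kind; row_key row; columns values; };

// Processes one cdc group: preimages first, then deltas, then postimages.
std::vector<log_row> process_group(table_state& state,
                                   const std::vector<delta>& deltas) {
    // Columns modified per row within this group.
    std::map<row_key, std::set<std::string>> touched;
    for (const auto& d : deltas) {
        touched[d.row].insert(d.column);
    }
    std::vector<log_row> log;
    // Preimage: emitted for every touched row, modified columns only.
    for (const auto& kv : touched) {
        log_row pre{op::preimage, kv.first, {}};
        const auto it = state.find(kv.first);
        for (const auto& c : kv.second) {
            if (it != state.end() && it->second.count(c)) {
                pre.values[c] = it->second.at(c);
            }
        }
        log.push_back(pre);
    }
    // Deltas: one log row each, applied to the tracked state on the fly.
    for (const auto& d : deltas) {
        log.push_back(log_row{op::delta, d.row, {{d.column, d.value}}});
        state[d.row][d.column] = d.value;
    }
    // Postimage: all columns of each touched row, from the updated state.
    for (const auto& kv : touched) {
        log.push_back(log_row{op::postimage, kv.first, state[kv.first]});
    }
    return log;
}
```

For the two-delta batch from the problem example this yields one
preimage, two deltas and one postimage carrying both new values, with no
intermediate images in between.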
* 'cdc-pre-postimage-persistence' of github.com:piodul/scylla:
cdc: fix indentation
cdc: don't update partition state when not needed
cdc: implement pre/postimage persistence
cdc: add interface for producing pre/postimages
cdc: load preimage query result into partition state fields
cdc: introduce fields for keeping partition state
cdc: rename set_pk_columns -> allocate_new_log_row
cdc: track batch_no inside transformer
cdc: move cdc$time generation to transformer
cdc: move find_timestamp to split.cc
cdc: introduce change_processor interface
cdc: remove redundant schema arguments from cdc functions
cdc: move management of generated mutations inside transformer
cdc: move preimage result set into a field of transformer
cdc: keep ts and tuuid inside transformer
cdc: track touched parts of mutations inside transformer
cdc: always include preimage for affected rows
cquery_nofail returns the query result, not a future. Invoking .get()
on its result is unnecessary. This just happened to compile because
shared_ptr has a get() method with the same signature as future::get.
Tests: cql_query_test unit test (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Moves responsibility for generating pre/postimage rows from the
"process_change" method to "produce_preimage" and "produce_postimage".
This commit actually affects the contents of generated CDC log
mutations.
Added a unit test that verifies more complicated cases with CQL BATCH.
"
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of their arguments.
This patch employs specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
Fixes #6768
"
* jul-stas-6768-use-type-comparators-for-minmax:
tests: Test min/max on set
aggregate_fcts: Use per-type comparators for dynamic types
Expected behavior is the lexicographical comparison of sets
(element by element), so this test was failing when raw byte
representations were compared.
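Why the two orderings disagree can be shown with a toy example. The
serialization below (element count followed by the elements) is an
assumption made purely for illustration and all function names are
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy serialization: element count first, then the elements in order.
std::vector<uint8_t> toy_serialize(const std::set<uint8_t>& s) {
    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(s.size()));
    out.insert(out.end(), s.begin(), s.end());
    return out;
}

// Compares the raw byte representations.
bool less_by_bytes(const std::set<uint8_t>& a, const std::set<uint8_t>& b) {
    return toy_serialize(a) < toy_serialize(b);
}

// Compares the sets element by element (lexicographically).
bool less_by_elements(const std::set<uint8_t>& a, const std::set<uint8_t>& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}
```

With this toy format, {1, 2} sorts before {3} element by element, but
after it by bytes, because the leading count dominates the comparison.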
"
This is the first stage of replacing the existing restrictions code with a new representation. It adds a new class `expression` to replace the existing class `restriction`. Lots of the old code is deleted, though not all -- that will come in subsequent stages.
Tests: unit (dev, debug restrictions_test), dtest (next-gating)
"
* dekimir-restrictions-rewrite:
cql3/restrictions: Drop dead code
cql3/restrictions: Use free functions instead of methods
cql3/restrictions: Create expression objects
cql3/restrictions: Add free functions over new classes
cql3/restrictions: Add new representation
Instead of `restriction` class methods, use the new free functions.
Specific replacement actions are listed below.
Note that class `restrictions` (plural) remains intact -- both its
methods and its type hierarchy remain intact for now.
Ensure full test coverage of the replacement code with new file
test/boost/restrictions_test.cc and some extra testcases in
test/cql/*.
Drop some existing tests because they codify buggy behaviour
(reference #6369, #6382). Drop others because they forbid relation
combinations that are now allowed (eg, mixing equality and
inequality, comparing to NULL, etc.).
Here are some specific categories of what was replaced:
- restriction::is_foo predicates are replaced by using the free
function find_if; sometimes it is used transitively (see, eg,
has_slice)
- restriction::is_multi_column is replaced by dynamic casts (recall
that the `restrictions` class hierarchy still exists)
- utility methods is_satisfied_by, is_supported_by, to_string, and
uses_function are replaced by eponymous free functions; note that
restrictions::uses_function still exists
- restriction::apply_to is replaced by free function
replace_column_def
- when checking infinite_bound_range_deletions, the has_bound is
replaced by local free function bounded_ck
- restriction::bounds and restriction::value are replaced by the more
general free function possible_lhs_values
- using free functions allows us to simplify the
multi_column_restriction and token_restriction hierarchies; their
methods merge_with and uses_function became identical in all
subclasses, so they were moved to the base class
- single_column_primary_key_restrictions<clustering_key>::needs_filtering
was changed to reuse num_prefix_columns_that_need_not_be_filtered,
which uses free functions
Fixes #5799.
Fixes #6369.
Fixes #6371.
Fixes #6372.
Fixes #6382.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Merged patch set from Piotr Sarna:
This series addresses issue #6700 again (it was reopened),
by forbidding all non-local schema changes to be performed
from within the database via CQL interface. These changes
are dangerous since they are not directly propagated to other
nodes.
Tests: unit(dev)
Fixes #6700
Piotr Sarna (4):
test: make schema changes in query_processor_test global
cql3: refuse to change schema internally for distributed tables
test: expand testing internal schema changes
cql3: add explanatory comments to execute_internal
cql3/query_processor.hh | 13 ++++++++++++-
cql3/statements/alter_table_statement.cc | 6 ------
cql3/statements/schema_altering_statement.cc | 15 +++++++++++++++
test/boost/cql_query_test.cc | 8 ++++++--
test/boost/query_processor_test.cc | 16 ++++++++--------
5 files changed, 41 insertions(+), 17 deletions(-)
WHERE clauses with start point above the end point were handled
incorrectly. When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0. This is clearly incorrect, as the user's intent was to filter out
all possible values of a.
Fix it by explicitly short-circuiting to false when start > end. Add
a test case.
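A minimal sketch of the short-circuit, assuming exclusive integer bounds
and a hypothetical `wrapping_interval` type that mimics the wrap-around
interpretation; the real restriction and interval classes differ.

```cpp
#include <cassert>

// Hypothetical wrap-around interval with exclusive bounds: when
// start > end it is read as (start, +inf) U (-inf, end).
struct wrapping_interval {
    int start;
    int end;
    bool contains(int v) const {
        if (start <= end) {
            return start < v && v < end;
        }
        return v > start || v < end; // wrap-around reading
    }
};

// Evaluates `a > lower AND a < upper` for a value of column a.
bool slice_matches(int value, int lower, int upper) {
    if (lower > upper) {
        return false; // the user asked for an empty slice, never wrap
    }
    return wrapping_interval{lower, upper}.contains(value);
}
```

Without the early return, a slice such as lower = 5, upper = 1 would
match 7 and 0 through the wrap-around branch.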
Fixes #5799.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Merged patch series by Piotr Sarna:
The alternator project was in need of a more optimized
JSON library, which resulted in creating "rjson" helper functions.
Scylla generally used libjsoncpp for its JSON handling, but in order
to reduce the dependency hell, the usage is now migrated
to rjson, which is faster and offers the same functionality.
The original plan was to be able to drop the dependency
on libjsoncpp-lib altogether and remove it from install-dependencies.sh,
but one last usage of it remains in our test suite,
namely cql_repl. The tool compares its output JSON textually,
so it depends on how a library presents JSON - what are the delimiters,
indentation, etc. It's possible to provide a layer of translation
to force rjson to print in an identical format, but the other issue
is that libjsoncpp keeps subobjects sorted by their name,
while rjson uses an unordered structure.
There are two possible solutions for the last remaining usage
of libjsoncpp:
1. change our test suite to compare JSON documents with a JSON parser,
so that we don't rely on internal library details
2. provide a layer of translation which forces rjson to print
its objects in a format identical to libjsoncpp.
(1.) would be preferred, since now we're also vulnerable to changes
inside libjsoncpp itself - if they change anything in their output
format, tests would start failing. The issue is not critical however,
so it's left for later.
Tests: unit(dev), manual(json_test),
dtest(partitioner_tests.TestPartitioner.murmur3_partitioner_test)
Piotr Sarna (8):
alternator,utils: move rjson.hh to utils/
alternator: remove ambiguous string overloads in rjson
rjson: add parse_to_map helper function
rjson: add from_string_map function
rjson: add non-throwing parsing
rjson: move quote_json_string to rjson
treewide: replace libjsoncpp usage with rjson
configure: drop json.cc and json.hh helpers
alternator/base64.hh | 2 +-
alternator/conditions.cc | 2 +-
alternator/executor.hh | 2 +-
alternator/expressions.hh | 2 +-
alternator/expressions_types.hh | 2 +-
alternator/rmw_operation.hh | 2 +-
alternator/serialization.cc | 2 +-
alternator/serialization.hh | 2 +-
alternator/server.cc | 2 +-
caching_options.hh | 9 +-
cdc/log.cc | 4 +-
column_computation.hh | 5 +-
configure.py | 3 +-
cql3/functions/functions.cc | 4 +-
cql3/statements/update_statement.cc | 24 ++--
cql3/type_json.cc | 212 ++++++++++++++++++----------
cql3/type_json.hh | 7 +-
db/legacy_schema_migrator.cc | 12 +-
db/schema_tables.cc | 1 -
flat_mutation_reader.cc | 1 +
index/secondary_index.cc | 80 +++++------
json.cc | 80 -----------
json.hh | 113 ---------------
schema.cc | 25 ++--
test/boost/cql_query_test.cc | 9 +-
test/manual/json_test.cc | 4 +-
test/tools/cql_repl.cc | 1 +
{alternator => utils}/rjson.cc | 75 +++++++++-
{alternator => utils}/rjson.hh | 40 +++++-
29 files changed, 344 insertions(+), 383 deletions(-)
delete mode 100644 json.cc
delete mode 100644 json.hh
rename {alternator => utils}/rjson.cc (86%)
rename {alternator => utils}/rjson.hh (81%)
In order to eventually switch to a single JSON library,
most of the libjsoncpp usage is dropped in favor of rjson.
Unfortunately, one usage still remains:
test/utils/test_repl utility heavily depends on the *exact textual*
format of its output JSON files, so replacing a library results
in all tests failing because of differences in formatting.
It is possible to force rjson to print its documents in the exact
matching format, but that's left for later, since the issue is not
critical. It would be nice though if our test suite compared
JSON documents with a real JSON parser, since there are more
differences - e.g. libjsoncpp keeps children of the object
sorted, while rapidjson uses an unordered data structure.
This change should not alter semantics; it only strives
to replace all usage of libjsoncpp with rjson.
It looks like an older version of my patch series was merged. The only
difference is that the new one had more tests. This patch adds the
missing ones.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630141150.1286893-1-espindola@scylladb.com>
Testing the multishard reader's various read-ahead related corner cases
requires a non-trivial setup. Currently there is just one such test,
but we plan to add more so in this patch we extract this setup code to a
free function to allow reuse across multiple tests.
A fast-forwarded puppet reader goes immediately to EOS. A counter is
added to the remote control to allow tests to check which readers were
actually fast forwarded.
Currently the puppet reader will do an automatic (half) buffer-fill in
the constructor. This makes it very hard to reason about when and how
the action that was passed to it will be executed. Refactor it to take a
list of actions and only execute those, no hidden buffer-fill anymore.
No better proof is needed for this than the fact that the test which is
supposed to test the multishard reader being destroyed with a pending
read-ahead was silently broken (not testing what it should).
This patch fixes this test too.
Also fixed in this patch are the `pending` and `destroyed` fields of the
remote control. Tests can now rely on these being correct and add
additional checkpoints to ensure the test is indeed doing what it was
intended to do.
needs_cleanup() returns true if an sstable needs cleanup.
It turns out to be very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
O(num_sstables * local_ranges)
We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.
So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).
With num_sstables=1000 and RF=3, local_ranges=256(num_tokens)*3=768, so
the max # of checks performed will go from 768000 to ~9584.
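The per-sstable lookup can be sketched with a binary search over the sorted, non-overlapping ranges. This is a simplified illustration rather than the actual implementation: `token_range` and `needs_cleanup_sketch` are hypothetical names, tokens are modelled as `int64_t` with inclusive bounds, and an sstable is assumed to need cleanup when its [first, last] token span is not fully contained in a single owned range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified token range, inclusive on both ends.
struct token_range {
    int64_t start;
    int64_t end;
};

// Returns true if the sstable span [first, last] is not fully covered by one
// owned range. `sorted_ranges` must be sorted by start and non-overlapping.
bool needs_cleanup_sketch(int64_t first, int64_t last,
                          const std::vector<token_range>& sorted_ranges) {
    // Precondition check (debug builds only): ranges are sorted by start.
    assert(std::is_sorted(sorted_ranges.begin(), sorted_ranges.end(),
        [](const token_range& a, const token_range& b) { return a.start < b.start; }));

    // Binary search for the first range whose end is not before `first`.
    // Because ranges do not overlap, it is the only one that can contain `first`.
    auto it = std::lower_bound(sorted_ranges.begin(), sorted_ranges.end(), first,
        [](const token_range& r, int64_t t) { return r.end < t; });
    if (it == sorted_ranges.end()) {
        return true; // span lies entirely after the last owned range
    }
    return !(it->start <= first && last <= it->end);
}
```

Each sstable costs one `std::lower_bound` call, which is where the log(local_ranges) factor comes from: log2(768) is about 9.58 comparisons per sstable instead of up to 768.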
Fixes #6730.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
"
The snapshotting code is already well isolated from the rest of
the storage_service, so it's relatively easy to move it into
an independent component, thus de-bloating the storage_service.
As a side effect this allows painless removal of calls to global
get_storage_service() from schema::describe code.
Test: unit(debug), dtest.snapshot_test(dev), manual start-stop
"
* 'br-snapshot-controller-4' of https://github.com/xemul/scylla:
snap: Get rid of storage_service reference in schema.cc
main: Stop http server
snapshot: Make check_snapshot_not_exist a method
snapshots: Move ops gate from storage_service
snapshot: Move lock from storage_service
snapshot: Move all code into db::snapshot_ctl class
storage_service: Move all snapshot code into snapshot-ctl.cc
snapshots: Initial skeleton
snapshots: Properly shutdown API endpoints
api: Rewrap set_server_snapshot lambda
Now that snapshot stopping is handled correctly, we may pull the database
reference all the way down to schema::describe().
One tricky place is table::snapshot() -- the local db reference is pulled
through an smp::submit_to call, but thanks to the shard checks in the place
where it is needed, the db is still "local".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This behavior is different from cassandra, but without arithmetic
operations it doesn't seem possible to notice the difference from
CQL. Using avg produces the same results, since we use an initial
value of 0 (scale = 0).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A negative scale was being passed as a positive value to
boost::multiprecision::pow, which would never finish.
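One plausible way a negative scale ends up as a positive exponent is an implicit signed-to-unsigned conversion. The sketch below only illustrates that assumed mechanism with the standard library. It does not reproduce the patched code: `as_exponent_unchecked`, `scale_magnitude` and `check_scale_conversion` are hypothetical helpers, and the exponent type is assumed to be `unsigned`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Converting a negative scale straight to unsigned wraps around to a huge
// value, which would make an exponentiation loop effectively endless.
unsigned as_exponent_unchecked(int32_t scale) {
    return static_cast<unsigned>(scale);
}

// Taking the magnitude explicitly keeps the exponent small. The sign has to
// be handled separately by the caller (multiply vs. divide by 10^magnitude).
unsigned scale_magnitude(int32_t scale) {
    int64_t wide = scale; // widen first so negating INT32_MIN cannot overflow
    return static_cast<unsigned>(wide < 0 ? -wide : wide);
}

// Usage: -1 becomes the largest unsigned value when converted unchecked.
void check_scale_conversion() {
    assert(as_exponent_unchecked(-1) == std::numeric_limits<unsigned>::max());
    assert(scale_magnitude(-3) == 3u);
}
```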
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>