1. storage_service: Make handle_state_left more robust when tokens are empty
In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.
2. token_metadata: Do not throw if empty tokens are passed to remove_bootstrap_tokens
Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to a gossip propagation issue
as we saw in #6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such a throw will cause the node to terminate, which
is overkill.
To avoid such an error taking down the whole cluster in the worst case,
just log a warning when empty tokens are passed to remove_bootstrap_tokens.
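A minimal sketch of the guard described above (not the actual Scylla code; the function signature here is hypothetical): on empty tokens, log and return instead of throwing, so a noexcept caller such as the gossip on_change callback cannot be terminated by the exception.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Sketch: refuse to throw on empty tokens; a noexcept caller survives.
bool remove_bootstrap_tokens(const std::unordered_set<std::string>& tokens) {
    if (tokens.empty()) {
        std::cerr << "remove_bootstrap_tokens: tokens are empty, nothing to remove\n";
        return false; // previously this path threw, taking the node down
    }
    // ... remove each token from the bootstrap token set ...
    return true;
}
```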
Refs #6468
Currently, when update_normal_tokens is called, a warning is logged:
Token X changing ownership from A to B
Logging this is not correct, because update_normal_tokens can be called
against a temporary token_metadata object during topology calculation.
Refs: #6437
The default ninja build target now builds artifacts and packages. Let's
add a 'build' target that only builds the artifacts.
Message-Id: <20200714105042.416698-1-penberg@scylladb.com>
Consider a cluster with two nodes:
- n1 (dc1)
- n2 (dc2)
A third node is bootstrapped:
- n3 (dc2)
n3 fails to bootstrap as follows:
[shard 0] init - Startup failed: std::runtime_error
(bootstrap_with_repair: keyspace=system_distributed,
range=(9183073555191895134, 9196226903124807343], no existing node in
local dc)
The system_distributed keyspace is using SimpleStrategy with RF 3. For
keyspaces that do not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.
Fixes: #6744
Backports: 4.0, 4.1, 4.2
This adds a short project description to README to make the git
repository more discoverable. The text is an edited version of a Scylla
blurb provided by Peter Corless.
Message-Id: <20200714065726.143147-1-penberg@scylladb.com>
This patch changes the per-table latency histograms: read, write,
cas_prepare, cas_accept, and cas_learn.
Besides changing the definition type and the insertion method, the API
was changed to support the new metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch changes the row locking latencies to use
time_estimated_histogram.
The change consists of changing the histogram definition and changing how
values are inserted into the histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch moves the alternator latencies histograms to use the time_estimated_histogram.
The change requires changing the defined type and using the simpler
insertion method.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In case of a row hash collision, a hash in set_diff will map to more than one
row from get_row_diff.
For example,
Node1 (Repair master):
row1 -> hash1
row2 -> hash2
row3 -> hash3
row3' -> hash3
Node2 (Repair follower):
row1 -> hash1
row2 -> hash2
We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:
repair - Got error in row level repair: std::runtime_error
(row_diff.size() != set_diff.size())
In this case, node1 should send both row3 and row3' to the peer node
instead of failing the whole repair: node2 has neither row3 nor
row3', otherwise node1 would not have sent the row with hash3 to node2
in the first place.
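The hash-to-rows relationship above can be sketched as follows (types and names are hypothetical illustrations, not the repair code): hashing is not injective, so one hash in set_diff can map back to several rows, and the row diff may legitimately be larger than the hash diff.

```cpp
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

using row = std::string;

// Resolve a set of missing hashes back to rows; one hash can yield
// several rows when their hashes collide.
std::vector<row> get_row_diff(const std::unordered_multimap<int, row>& rows_by_hash,
                              const std::set<int>& set_diff) {
    std::vector<row> diff;
    for (int h : set_diff) {
        auto [first, last] = rows_by_hash.equal_range(h);
        for (auto it = first; it != last; ++it) {
            diff.push_back(it->second); // every row sharing the hash is sent
        }
    }
    return diff;
}
```

With the example above, set_diff = {hash3} resolves to both row3 and row3', so row_diff.size() > set_diff.size() is not an error.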
Refs: #6252
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.
But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectory is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.
The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.
Fixes #6750
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.
By resetting the default location from /tmp to testlog/{mode}
we allow test.py to consolidate all test artefacts in a single
place.
Fixes #6062, "test.py uses tmpfs"
* seastar 5632cf2146...0fe32ec596 (11):
> futures: Add a test for a broken promise in a parallel_for_each
> future: Simplify finally_body implementation
> futures_test: Extend nested_exception test
> Merge "make gate methods noexcept" from Benny
> tutorial: fix service_loop example
> sharded: fix doxygen \example clause for sharded_parameter
> Merge "future: Don't call need_preempt in 'then' and 'then_impl'" from Rafael
> future: Refactor a bit of duplicated code
> Merge "Add with_file helpers" from Benny
> Merge "Fix doxygen warnings" from Benny
> build: add doxygen to install-dependencies.sh
While fetching CDC generations, various exceptions can occur. They
are divided into "fatal" and "nonfatal", where "fatal" ones prevent
retrying of the fetch operation.
This patch makes `read_failure_exception` "non-fatal", because such an
error may appear during restart. In general this type of error can
mean a few different things (e.g. an error code in a response from a
replica, but also a broken connection), so retrying seems reasonable.
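The fatal/non-fatal split can be sketched like this (the exception type and retry loop are simplified stand-ins, not the actual fetch code): a non-fatal error such as a read failure triggers a retry, while any other exception propagates immediately.

```cpp
#include <stdexcept>

// Stand-in for the real exception type.
struct read_failure_exception : std::runtime_error {
    read_failure_exception() : std::runtime_error("read failure") {}
};

// Retry the fetch on non-fatal errors, up to a retry limit.
template <typename Fetch>
auto fetch_with_retries(Fetch fetch, int max_retries) {
    for (int attempt = 0; ; ++attempt) {
        try {
            return fetch(); // e.g. fetch a CDC generation
        } catch (const read_failure_exception&) {
            if (attempt == max_retries) {
                throw; // still give up eventually
            }
            // non-fatal: fall through and retry
        }
        // any other exception type is fatal and propagates immediately
    }
}
```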
Fixes #6804
Consolidate the Docker image build instructions into the "Building Scylla"
section of the README instead of having it in a separate section in a different
place of the file.
Message-Id: <20200713132600.126360-1-penberg@scylladb.com>
Currently, if a user tries to CreateTable with a forbidden set of tags,
e.g., the Tags list is too long or contains an invalid value for
system:write_isolation, then the CreateTable request fails but the table
is still created. Without the tag, of course.
This patch fixes this bug, and adds two test cases for it that fail
before this patch, and succeed with it. One of the test cases is
scylla_only because it checks the Scylla-specific system:write_isolation
tag, but the second test case works on DynamoDB as well.
What this patch does is to split the update_tags() function into two
parts - the first part just parses the Tags, validates them, and builds
a map. Only the second part actually writes the tags to the schema.
CreateTable now does the first part early, before creating the table,
so failure in parsing or validating the Tags will not leave a created
table behind.
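The validate-then-apply split can be sketched as follows (names and the limit are hypothetical, not the Alternator code): the first part only parses and validates the Tags into a map and may throw before anything is created; only the second part writes to the schema.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using tag_map = std::map<std::string, std::string>;

// Part one: parse and validate only. May throw, but has no side effects,
// so calling it before table creation cannot leave a table behind.
tag_map validate_tags(const std::vector<std::pair<std::string, std::string>>& tags) {
    if (tags.size() > 50) {
        throw std::invalid_argument("Tags list is too long");
    }
    tag_map m;
    for (const auto& [key, value] : tags) {
        if (key.empty()) {
            throw std::invalid_argument("empty tag key");
        }
        m[key] = value;
    }
    return m;
}
// Part two (not shown) would take the validated tag_map and write it to
// the schema, after the table has been created.
```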
Fixes #6809.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713120611.767736-1-nyh@scylladb.com>
When tests are run in parallel, it is hard to tell how much time each test
ran. The time difference between consecutive printouts (indicating a test's
end) says nothing about the test's duration.
This patch adds in "--verbose" mode, at the end of each test result, the
duration in seconds (in wall-clock time) of the test. For example,
$ ./test.py --mode dev --verbose alternator
================================================================================
[N/TOTAL] TEST MODE RESULT
------------------------------------------------------------------------------
[1/2] boost/alternator_base64_test dev [ PASS ] 0.02s
[2/2] alternator/run dev [ PASS ] 26.57s
These durations are useful for recognizing tests which are especially slow,
or runs where all the tests are unusually slow (which might indicate some
sort of misconfiguration of the test machine).
Fixes #6759
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200706142109.438905-1-nyh@scylladb.com>
This adds a new "dist-<mode>" target, which builds the server package in
selected build mode together with the other packages, and wires it to
the "<mode>" target, which is built as part of default "ninja"
invocation.
This allows us to perform a full build, package, and test cycle across
all build modes with:
./configure.py && ninja && ./test.py
Message-Id: <20200713101918.117692-1-penberg@scylladb.com>
Improve the "pull_github_pr.sh" to detect the number of commits in a
pull request, and use "git cherry-pick" to merge single-commit pull
requests.
Message-Id: <20200713093044.96764-1-penberg@scylladb.com>
There are a bunch of renames that are done if PRODUCT is not the
default, but the Python code for them is incorrect. Path.glob()
is not a static method, and Path does not support .endswith().
Fix by constructing a Path object, and later casting to str.
Let's report each missing CPU feature individually, and improve the
error message a bit. For example, if the "clmul" instruction is missing,
the report looks as follows:
ERROR: You will not be able to run Scylla on this machine because its CPU lacks the following features: pclmulqdq
If this is a virtual machine, please update its CPU feature configuration or upgrade to a newer hypervisor.
Fixes #6528
Add links to the users and developers mailing lists, and the Slack
channel in README.md to make them more discoverable.
Message-Id: <20200713074654.90204-1-penberg@scylladb.com>
Simplify the build and run instructions by splitting the text in three
sections (prerequisites, building, and running) and streamlining the
steps a bit.
Message-Id: <20200713065910.84582-1-penberg@scylladb.com>
scylla fiber often fails to fully unwind the entire fiber, stopping
sooner than expected. This is expected, as scylla fiber only recognizes
the most standard continuations and can drop the ball as soon as there
is an unusual transition.
This commit adds a message below the found tasks explaining that the
list might not be exhaustive and prints a command which can be used to
explain why the unwinding stopped at the last task.
While at it also rephrase an out-of-date comment.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200710120813.100009-1-bdenes@scylladb.com>
As requested in #5763 feedback, require that Fn be callable with
binary_operator in the functions mentioned above.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The problem is that this option is defined in the seastar testing wrapper,
while no unit tests use it; they all just start themselves with app.run() and
would complain about an unknown option.
"Would", because nowadays every single test declares its own options
in suite.yaml, which override test.py's defaults. Once an option-less unit
test is added (the B+ tree ones), it will complain.
The proposal is to remove this option from the defaults; if any unit test
uses the seastar testing wrappers and needs this option, it can add one
to its suite.yaml.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200709084602.8386-1-xemul@scylladb.com>
The goal is to make the lambdas that are fed into the partition cache's
clear_and_dispose() and erase_in_dispose() noexcept.
This is to satisfy the B+ tree, which strictly requires those to be noexcept
(the currently used collections don't care).
The set covers not only the strictly required minimum, but also some
other methods that happened to be nearby.
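The requirement can be sketched like this (entry is a stand-in type, not the cache's real one): the disposer lambda is declared noexcept, and that property can be verified at compile time with the noexcept operator.

```cpp
struct entry { int v; };

// A disposer lambda marked noexcept, as a container that strictly
// requires noexcept disposal (like the B+ tree) would demand.
inline constexpr auto dispose = [](entry* e) noexcept { delete e; };

// Compile-time check: calling the disposer cannot throw.
static_assert(noexcept(dispose(static_cast<entry*>(nullptr))),
              "the disposer must be noexcept");
```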
* https://github.com/xemul/scylla/tree/br-noexcepts-over-the-row-cache:
row_cache: Mark invalidation lambda as noexcept
cache_tracker: Mark methods noexcept
cache_entry: Mark methods noexcept
region: Mark trivial noexcept methods as such
allocation_strategy: Mark returning lambda as noexcept
allocation_strategy: Mark trivial noexcept methods as such
dht: Mark noexcept methods
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.
This patch adds a test to reproduce this case (the test passes on DynamoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.
Fixes #6816.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
If either of the compared bytes_views is empty,
consider the empty prefix equal and proceed to compare
the sizes of the suffixes.
A similar issue exists in legacy_compound_view::tri_comparator::operator().
It too must not pass nullptr to memcmp if either of the compared bytes_views
is empty.
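A minimal sketch of the fix (std::string_view stands in for bytes_view): memcmp is only called when both views are non-empty, since an empty view's data pointer may be null and passing a null pointer to memcmp is undefined behavior even for a zero length; otherwise the common (possibly empty) prefix is considered equal and ordering falls back to comparing sizes.

```cpp
#include <algorithm>
#include <cstring>
#include <string_view>

// Three-way compare that never hands a null pointer to memcmp.
int tri_compare(std::string_view a, std::string_view b) {
    if (!a.empty() && !b.empty()) {
        auto n = std::min(a.size(), b.size());
        if (int r = std::memcmp(a.data(), b.data(), n)) {
            return r < 0 ? -1 : 1;
        }
    }
    // Empty (or equal) prefix: order by size of the remainder.
    if (a.size() != b.size()) {
        return a.size() < b.size() ? -1 : 1;
    }
    return 0;
}
```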
Fixes#6797
Refs #6814
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test: unit(dev)
Branches: all
Message-Id: <20200709123453.955569-1-bhalevy@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6741
by Piotr Dulikowski:
This PR changes the algorithm used to generate preimages and postimages
in CDC log. While its behavior is the same for non-batch operations
(with one exception described later), it generates pre/postimages that
are organized more nicely, and account for multiple updates to the same
row in one CQL batch.
Fixes #6597, #6598
Tests:
- unit(dev), for each consecutive commit
- unit(debug), for the last commit
Previous method
The previous method worked on a per delta row basis. First, the base
table is queried for the current state of the rows being modified in
the processed mutation (this is called the "preimage query"). Then,
for each delta row (representing a modification of a row):
If preimage is enabled and the row was already present in the table,
a corresponding preimage row is inserted before the delta row.
The preimage row contains data taken directly from the preimage
query result. Only columns that are modified by the delta are
included in the preimage.
If postimage is enabled, then a postimage row is inserted after the
delta row. The postimage row contains data which was a result of
taking row data directly from the preimage query result and applying
the change the corresponding delta row represented. All columns
of the row are included in the postimage.
The above works well for simple cases such as a singular CQL INSERT,
UPDATE, DELETE, or a simple CQL BATCH. An example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v) VALUES (0, 1, 111);
INSERT INTO tbl (pk, ck, v) VALUES (0, 2, 222);
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v from ks.tbl_scylla_cdc_log ;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v
------------------+---------------+---------+----+----+-----
...snip...
0 | 0 | null | 0 | 1 | 100
1 | 2 | null | 0 | 1 | 111
2 | 9 | null | 0 | 1 | 111
3 | 0 | null | 0 | 2 | 200
4 | 2 | null | 0 | 2 | 222
5 | 9 | null | 0 | 2 | 222
Preimage rows are represented by cdc operation 0, and postimage by 9.
Please note that all rows presented above share the same value of
cdc$time column, which was not shown here for brevity.
Problems with previous approach
This simple algorithm has some conceptual and implementational problems
which arise when processing more complicated CQL BATCH-es. Consider
the following example:
cqlsh:ks> BEGIN UNLOGGED BATCH
INSERT INTO tbl (pk, ck, v1) VALUES (0, 0, 1) USING TTL 1000;
INSERT INTO tbl (pk, ck, v2) VALUES (0, 0, 2) USING TTL 2000;
APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
pk, ck, v1, v2 FROM tbl_scylla_cdc_log;
cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v1 | v2
------------------+---------------+---------+----+----+------+------
...snip...
0 | 0 | null | 0 | 0 | null | 0
1 | 2 | 2000 | 0 | 0 | null | 2
2 | 9 | null | 0 | 0 | 0 | 2
3 | 0 | null | 0 | 0 | 0 | null
4 | 1 | 1000 | 0 | 0 | 1 | null
5 | 9 | null | 0 | 0 | 1 | 0
A single cdc group (corresponding to rows sharing the same cdc$time)
might have more than one delta that modifies the same row. For example,
this happens when modifying two columns of the same row with
different TTLs - due to our choice of CDC log schema, we must
represent such change with two delta rows.
It does not make sense to present a postimage after the first delta
and preimage before the second - both deltas are applied
simultaneously by the same CQL BATCH, so the middle "image" is purely
imaginary and does not appear at any point in the table.
Moreover, in this example, the last postimage is wrong - v1 is updated,
but v2 is not. None of the postimages presented above represent the
final state of the row.
New algorithm
The new algorithm now works on a per-cdc-group basis, not per delta row.
When starting to process a CQL BATCH:
Load preimage query results into a data structure representing
current state of the affected rows.
For each cdc group:
For each row modified within the group, a preimage is produced,
regardless of whether the row was present in the table. The preimage
is calculated based on the current state. Only include columns
that are modified for this row within the group.
For each delta, produce a delta row and update the current state
accordingly.
Produce postimages in the same way as preimages - but include all
columns for each row in the postimage.
The new algorithm produces the postimage correctly when multiple deltas
affect one row, because the state of the row is updated on the fly.
This algorithm moves preimage and postimage rows to the beginning and
the end of the cdc group, respectively. This solves the problem of
imaginary preimages and postimages appearing inside a cdc group.
Unfortunately, it is possible for one CQL BATCH to contain changes that
use multiple timestamps. This will result in one CQL BATCH creating
multiple cdc groups, with different cdc$time. As it is impossible, with
our choice of schema, to tell that those cdc groups were created from
one CQL BATCH, we instead pretend those groups were separate CQL
operations. By tracking the state of the affected rows, we make sure
that preimages in later groups reflect changes introduced in
previous groups.
One more thing - this algorithm should have the same results for
singular CQL operations and simple CQL BATCH-es, with one exception.
Previously, a preimage was not produced if a row was not present in the
table. Now, the preimage row will appear unconditionally - it will have
nulls in place of column values.
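The per-group state tracking described above can be sketched as follows (all types and names are hypothetical stand-ins, not the cdc code): preimages are read from the tracked state, the deltas are then applied to that state, and postimages are read back from it, so later groups observe changes made by earlier ones.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

using row_state = std::map<std::string, std::optional<int>>; // column -> value

struct delta {
    int key;                 // row being modified
    std::string column;
    std::optional<int> value;
};

struct image {
    int key;
    row_state columns;
};

// Process one cdc group: emit preimages, apply deltas, emit postimages.
void process_group(std::map<int, row_state>& state,
                   const std::vector<delta>& deltas,
                   std::vector<image>& preimages,
                   std::vector<image>& postimages) {
    // Preimage: current state restricted to the columns touched in this
    // group; produced even for rows absent from the table (values stay null).
    std::map<int, row_state> touched;
    for (const auto& d : deltas) {
        touched[d.key][d.column] = state[d.key][d.column];
    }
    for (const auto& [key, columns] : touched) {
        preimages.push_back({key, columns});
    }
    // Apply the deltas, updating the tracked state on the fly.
    for (const auto& d : deltas) {
        state[d.key][d.column] = d.value;
    }
    // Postimage: the full updated state of each affected row.
    for (const auto& [key, columns] : touched) {
        postimages.push_back({key, state[key]});
    }
}
```

Because `state` persists across calls, a second group produced by the same BATCH (e.g. with a different timestamp) sees the first group's changes in its preimage.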
* 'cdc-pre-postimage-persistence' of github.com:piodul/scylla:
cdc: fix indentation
cdc: don't update partition state when not needed
cdc: implement pre/postimage persistence
cdc: add interface for producing pre/postimages
cdc: load preimage query result into partition state fields
cdc: introduce fields for keeping partition state
cdc: rename set_pk_columns -> allocate_new_log_row
cdc: track batch_no inside transformer
cdc: move cdc$time generation to transformer
cdc: move find_timestamp to split.cc
cdc: introduce change_processor interface
cdc: remove redundant schema arguments from cdc functions
cdc: move management of generated mutations inside transformer
cdc: move preimage result set into a field of transformer
cdc: keep ts and tuuid inside transformer
cdc: track touched parts of mutations inside transformer
cdc: always include preimage for affected rows
All but a few are trivially such.
clear_continuity() calls cache_entry::set_continuous(), which became noexcept
a patch ago.
allocator() calls region.allocator(), which was marked noexcept a few patches
back.
on_partition_erase() calls allocator().invalidate_references(); both were
marked noexcept a few patches back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All but one are trivially such; position() calls is_dummy_entry(),
which has just become noexcept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
cquery_nofail returns the query result, not a future. Invoking .get()
on its result is unnecessary. This just happened to compile because
shared_ptr has a get() method with the same signature as future::get.
Tests: cql_query_test unit test (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When "trace"-level logging is enabled for Alternator, we log every request,
but currently only the request's body. For debugging, it is sometimes useful
to also see the headers - which are important to debug authentication,
for example. So let's print the headers as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709103414.599883-1-nyh@scylladb.com>