In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
[avi: adjust sstable_datafile_test's use of compaction_descriptor and make_permit]
(cherry picked from commit fe127a2155)
from Botond.
Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.
Refs: #6603Fixes: #6707
Tests: unit(dev)
* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
db/view: view_update_generator: use partitioned sstable set
sstables: make_partitioned_sstable_set(): return an sstable_set
(cherry picked from commit e4b74356bb)
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account if the SSTable requires view building, so a Staging SSTable can
be added to the tracker after a ALTER table, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.
Fixes#6798.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
(cherry picked from commit b67066cae2)
In case a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.
For example,
Node1 (Repair master):
row1 -> hash1
row2 -> hash2
row3 -> hash3
row3' -> hash3
Node2 (Repair follower):
row1 -> hash1
row2 -> hash2
We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:
repair - Got error in row level repair: std::runtime_error
(row_diff.size() != set_diff.size())
In this case, node1 should send both row3 and row3' to peer node
instead of fail the whole repair. Because node2 does not have row3 or
row3', otherwise node1 won't send row with hash3 to node1 in the first
place.
Refs: #6252
(cherry picked from commit a00ab8688f)
* seastar 78f626af6c...c9c1dc5fa7 (2):
> futures: Add a test for a broken promise in a parallel_for_each
> future: Call set_to_broken_promise earlier
Fixes#6749 (probably).
We could hit "cannot serialize '_io.BufferedReader' object" when request get 404 error from the server
Now you will get legit error message in the case.
Fixes#6690
(cherry picked from commit de82b3efae)
WHERE clauses with start point above the end point were handled
incorrectly. When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0. This is clearly incorrect, as the user's intent was to filter out
all possible values of a.
Fix it by explicitly short-circuiting to false when start > end. Add
a test case.
Fixes#5799.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 921dbd0978)
Counter update is a RMW operation. Until now the "Read" part was
not guarded by a timeout, which is changed in this patch.
Fixes#5069
(cherry picked from commit e04fd9f774)
We already have a docker image option to enable alternator on an unencrypted
port, "--alternator-port", but we forgot to also allow the similar option
for enabling alternator on an encrypted (HTTPS) port: "--alternator-https-port"
so this patch adds the missing option, and documents how to use it.
Note that using this option is not enough. When this option is used,
Alternator also requires two files, /etc/scylla/scylla.crt and
/etc/scylla/scylla.key, to be inserted into the image. These files should
contain the SSL certificate, and key, respectively. If these files are
missing, you will get an error in the log about the missing file.
Fixes#6583.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200621125219.12274-1-nyh@scylladb.com>
(cherry picked from commit e4eca5211a)
When a token is calculated for stream_id, we check that the key is
exactly 16 bytes long. If it's not - `minimum_token` is returned
and client receives empty result.
This used to be the expected behavior for empty keys; now it's
extended to keys of any incorrect length.
Fixes#6570
(cherry picked from commit 8628ede009)
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.
Before commit 7d86a3b208, the node being
replaced will not be considered as a source node because it is already
replaced by the replacing node before the replace operation is finished.
This is the reason why it works in the past.
To fix, filter out the node being replaced as a source node explicitly.
Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
(cherry picked from commit e338028b7e22b0a80be7f80c337c52f958bfe1d7)
SSTable upgrade is requiring 2x the space of input SSTables because
we aren't releasing references of the SSTables that were already
upgraded. So if we're upgrading 1TB, it means that up to 2TB may be
required for the upgrade operation to succeed.
That can be fixed by moving all input SSTables when rewrite_sstables()
asks for the set of SSTables to be compacted, so allowing their space
to be released as soon as there is no longer any ref to them.
Spotted while auditting code.
Fixes#6682.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619205701.92891-1-raphaelsc@scylladb.com>
(cherry picked from commit 52180f91d4)
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.
Fixes#6720
"
* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
big_decimal: Add a test for a corner case
big_decimal: Correctly handle negative scales
big_decimal: Add a as_rational member function
big_decimal: Move constructors out of line
(cherry picked from commit 3e2eeec83a)
Updating tags was erroneously done locally, which means that
the schema change was not propagated to other nodes.
The new code announces new schema globally.
Fixes#6513
Branches: 4.0,4.1
Tests: unit(dev)
dtest(alternator_tests.AlternatorTest.test_update_condition_expression_and_write_isolation)
Message-Id: <3a816c4ecc33c03af4f36e51b11f195c231e7ce1.1592935039.git.sarna@scylladb.com>
(cherry picked from commit f4e8cfe03b)
"
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:
```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.
In the past, the replace operation with keyspace of RF 1 passes, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```
Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.
After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```
Since 127.0.0.5 was dead, the source node check failed, so the bootstrap
operation.
Ta fix, we ignore the table of RF 1 when it is unable to find a source
node to stream.
Fixes#6351
"
* asias-fix_bootstrap_with_rf_one_in_range_streamer:
range_streamer: Handle table of RF 1 in get_range_fetch_map
streaming: Use separate streaming reason for replace operation
(cherry picked from commit 9afd599d7c)
Current sender sends stream_mutation_fragments_cmd::end_of_stream to
receiver when an error is received from a peer node. To be safe, send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to prevent end_of_stream to
be written into the sstable when a partition is not closed yet.
In addition, use mutation_fragment_stream_validator to valid the
mutation fragments emitted from the reader, e.g., check if
partition_start and partition_end are paired when the reader is done. If
not, fail the stream session and send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to isolate the problematic
sstables on the sender node.
Refs: #6478
(cherry picked from commit a521c429e1)
LWT batches conditions can't span multiple tables.
This was detected in batch_statement::validate() called in ::prepare().
But ::cas_result_set_metadata() was built in the constructor,
causing a bitset assert/crash in a reported scenario.
This patch moves validate() to the constructor before building metadata.
Closes#6332
Tested with https://github.com/scylladb/scylla-dtest/pull/1465
[avi: adjust spelling of exception message to 4.1 spelling]
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit d1521e6721)
The get token range API can become big which can cause large allocation
and stalls.
This patch replace the implementation so it would stream the results
using the http stream capabilities instead of serialization and sending
one big buffer.
Fixes#6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 7c4562d532)
Even if there are no attributes to return from PutItem requests,
we should return a valid JSON object, not an empty string.
Fixes#6568
Tests: unit(dev)
(cherry picked from commit 8fc3ca855e)
Client libraries (e.g. PynamoDB) expect the UnprocessedKeys
and UnprocessedItems attributes to appear in the response
unconditionally - it's hereby added, along with a simple test case.
Fixes#6569
Tests: unit(dev)
(cherry picked from commit 3aff52f56e)
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).
There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:
1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.
2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.
Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:
auto prev = std::prev(it);
if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
it->set_continuous(true);
but handle_end_of_stream() is not invoked under allocating section.
Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:
return with_linearized_managed_bytes([&] {
...
return _worker_state->alloc_section(region, [&] {
Fixes#6637.
Refs #6108.
Tests:
- unit (all)
Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e81fc1f095)
When a replacing node is in early boot up and is not in HIBERNATE sate
yet, if the node is killed by a user, the node will wrongly send a
shutdown message to other nodes. This is because UNKNOWN is not in
SILENT_SHUTDOWN_STATES, so in gossiper::do_stop_gossiping, the node will
send shutdown message. Other nodes in the cluster will call
storage_service::handle_state_normal for this node, since NORMAL and
SHUTDOWN status share the same status handler. As a result, other nodes
will incorrectly think the node is part of the cluster and the replace
operation is finished.
Such problem was seen in replace_node_no_hibernate_state_test dtest:
n1, n2 are in the cluster
n2 is dead
n3 is started to replace n2, but n3 is killed in the middle
n3 announces SHUTDOWN status wrongly
n1 runs storage_service::handle_state_normal for n3
n1 get tokens for n3 which is empty, because n3 hasn't gossip tokens yet
n1 skips update normal tokens for n3, but think n3 has replaced n2
n4 starts to replace n2
n4 checks the tokens for n2 in storage_service::join_token_ring (Cannot
replace token {} which does not exist!) or
storage_service::prepare_replacement_info (Cannot replace_address {}
because it doesn't exist in gossip)
To fix, we add UNKNOWN into SILENT_SHUTDOWN_STATES and avoid sending
shutdown message.
Tests: replace_address_test.py:TestReplaceAddress.replace_node_no_hibernate_state_test
Fixes: #6436
(cherry picked from commit dddde33512)
Commit 968177da04 has changed the schema
of cdc_topology_description and cdc_description tables in the
system_distributed keyspace.
Unfortunately this was a backwards-incompatible change: these tables
would always be created, irrespective of whether or not "experimental"
was enabled. They just wouldn't be populated with experimental=off.
If the user now tries to upgrade Scylla from a version before this change
to a version after this change, it will work as long as CDC is protected
b the experimental flag and the flag is off.
However, if we drop the flag, or if the user turns experimental on,
weird things will happen, such as nodes refusing to start because they
try to populate cdc_topology_description while assuming a different schema
for this table.
The simplest fix for this problem is to rename the tables. This fix must
get merged in before CDC goes out of experimental.
If the user upgrades his cluster from a pre-rename version, he will simply
have two garbage tables that he is free to delete after upgrading.
sstables and digests need to be regenerated for schema_digest_test since
this commit effectively adds new tables to the system_distributed keyspace.
This doesn't result in schema disagreement because the table is
announced to all nodes through the migration manager.
(cherry picked from commit d89b7a0548)
Fixes#6537.
GC writer, used for incremental compaction, cannot be currently used if interposer
consumer is used. That's because compaction assumes that GC writer will be operated
only by a single compaction writer at a given point in time.
With interposer consumer, multiple writers will concurrently operate on the same
GC writer, leading to race condition which potentially result in use-after-free.
Let's disable GC writer if interposer consumer is enabled. We're not losing anything
because GC writer is currently only needed on strategies which don't implement an
interposer consumer. Resharding will always disable GC writer, which is the expected
behavior because it doesn't support incremental compaction yet.
The proper fix, which allows GC writer and interposer consumer to work together,
will require more time to implement and test, and for that reason, I am postponing
it as #6472 is a showstopper for the current release.
Fixes#6472.
tests: mode(dev).
[Raphael: Fixed compilation failure in unit test test_bug_6472 for backport]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 097a5e9e07)
Message-Id: <20200610203928.86717-1-raphaelsc@scylladb.com>
In 28c3d4 `out()` was used without `shell=True` and was the spliting of arguments
failed cause of the complex commands in the cmd (pipe and such)
Fixes#6159
(cherry picked from commit a2bb48f44b)
We generate a coredump as part of "scylla_coredump_setup" to verify that
coredumps are working. However, we need to *remove* that test coredump
to avoid people and test infrastructure reporting those coredumps.
Fixes#6159
(cherry picked from commit 28c3d4f8e8)
Currently we use a systemd mount (var-lib-systemd-coredump.mount) to mount
default coredump directory (/var/lib/systemd/coredump) to
(/var/lib/scylla/coredump). The /var/lib/scylla had been mounted to a big
storage, so we will have enough space for coredump after the mount.
Currently in coredump_setup, we only enabled var-lib-systemd-coredump.mount,
but not start it. The directory won't be mounted after coredump_setup, so the
coredump will still be saved to default coredump directory.
The mount will only effect after reboot.
Fixes#6566
(cherry picked from commit abf246f6e5)
This reverts commit e77dad3adf because its
incorrect.
Amos explains:
"Quote from https://www.freedesktop.org/software/systemd/man/systemd.mount.html
What=
Takes an absolute path of a device node, file or other resource to
mount. See mount(8) for details. If this refers to a device node, a
dependency on the respective device unit is automatically created.
Where=
Takes an absolute path of a file or directory for the mount point; in
particular, the destination cannot be a symbolic link. If the mount
point does not exist at the time of mounting, it is created as
directory.
So the mount point is '/var/lib/systemd/coredump' and
'/var/lib/scylla/coredump' is the file to mount, because /var/lib/scylla
had mounted a second big storage, which has enough space for Huge
coredumps.
Bentsi or other touched problem with old scylla-master AMI, a coredump
occurred but not successfully saved to disk for enospc. The directory
/var/lib/systemd/coredump wasn't mounted to /var/lib/scylla/coredump.
They WRONGLY thought the wrong mount was caused by the config problem,
so he posted a fix.
Actually scylla-ami-setup / coredump wasn't executed on that AMI, err:
unit scylla-ami-setup.service not found Because
'scylla-ami-setup.service' config file doesn't exist or is invalid.
Details of my testing: https://github.com/scylladb/scylla/issues/6300#issuecomment-637324507
So we need to revert Bentsi's patch, it changed the right config to wrong."
(cherry picked from commit 9d9d54c804)
Our parsing of values in a KeyConditions paramter of Query was done naively.
As a result, we got bizarre error messages "condition not met: false" when
these values had incorrect type (this is issue #6490). Worse - the naive
conversion did not decode base64-encoded bytes value as needed, so
KeyConditions on bytes-typed keys did not work at all.
This patch fixes these bugs by using our existing utility function
get_key_from_typed_value(), which takes care of throwing sensible errors
when types don't match, and decoding base64 as needed.
Unfortunately, we didn't have test coverage for many of the KeyConditions
features including bytes keys, which is why this issue escaped detection.
A patch will follow with much more comprehensive tests for KeyConditions,
which also reproduce this issue and verify that it is fixed.
Refs #6490Fixes#6495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-1-nyh@scylladb.com>
(cherry picked from commit 6b38126a8f)
Alternator supports four ways in which write operations can use quorum
writes or LWT or both, which we called "write isolation policies".
Until this patch, Alternator defaulted to the most generally safe policy,
"always_use_lwt". This default could have been overriden for each table
separately, but there was no way to change this default for all tables.
This patch adds a "--alternator-write-isolation" configuration option which
allows changing the default.
Moreover, @dorlaor asked that users must *explicitly* choose this default
mode, and not get "always_use_lwt" without noticing. The previous default,
"always_use_lwt" supports any workload correctly but because it uses LWT
for all writes it may be disappointingly slow for users who run write-only
workloads (including most benchmarks) - such users might find the slow
writes so disappointing that they will drop Scylla. Conversely, a default
of "forbid_rmw" will be faster and still correct, but will fail on workloads
which need read-modify-write operations - and suprise users that need these
operations. So Dor asked that that *none* of the write modes be made the
default, and users must make an informed choice between the different write
modes, rather than being disappointed by a default choice they weren't
aware of.
So after this patch, Scylla refuses to boot if Alternator is enabled but
a "--alternator-write-isolation" option is missing.
The patch also modifies the relevant documentation, adds the same option to
our docker image, and the modifies the test-running script
test/alternator/run to run Scylla with the old default mode (always_use_lwt),
which we need because we want to test RMW operations as well.
Fixes#6452
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524160338.108417-1-nyh@scylladb.com>
(cherry picked from commit c3da9f2bd4)
In order to be sure that all nodes acknowledged that a table was
created, the CreateTable request will now only return after
seeing that schema agreement was reached.
Rationale: alternator users check if the table was created by issuing
a DescribeTable request, and assume that the table was correctly
created if it returns nonempty results. However, our current
implementation of DescribeTable returns local results, which is
not enough to judge if all the other nodes acknowledge the new table.
CQL drivers are reported to always wait for schema agreement after
issuing DDL-changing requests, so there should be no harm in waiting
a little longer for alternator's CreateTable as well.
Fixes#6361
Tests: alternator(local)
(cherry picked from commit 5f2eadce09)
The existing text did not explain what happens if additional DCs are added
to the cluster, so this patch improves the explanation of the status of
our support for global tables, including that issue.
Fixes#6353
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200513175908.21642-1-nyh@scylladb.com>
(cherry picked from commit f3fd976120)
In write_end_of_stream, it does:
1) Write write_partition_end
2) Write empty mutation_fragment_opt
If 1) fails, 2) will be skipped, the consumer of the queue will wait for
the empty mutation_fragment_opt forever.
Found this issue when injecting random exceptions between 1) and 2).
Refs #6272
Refs #6248
(cherry picked from commit b744dba75a)
The implementation of get_range_to_address_map has a default behaviour,
when getting an empty keypsace, it uses the first non-system keyspace
(first here is basically, just a keyspace).
The current implementation has two issues, first, it uses a reference to
a string that is held on a stack of another function. In other word,
there's a use after free that is not clear why we never hit.
The second, it calls get_non_system_keyspaces twice. Though this is not
a bug, it's redundant (get_non_system_keyspaces uses a loop, so calling
that function does have a cost).
This patch solves both issues, by chaning the implementation to hold a
string instead of a reference to a string.
Second, it stores the results from get_non_system_keyspaces and reuse
them it's more efficient and holds the returned values on the local
stack.
Fixes#6465
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 69a46d4179)
When the 'forbid_rmw' write isolation policy is selected, read-modify-write
are intentionally forbidden. The error message in this case used to say:
"Read-modify-write operations not supported"
Which can lead users to believe that this operation isn't supported by this
version of Alternator - instead of realizing that this is in fact a
configurable choice.
So in this patch we just change the error message to say:
"Read-modify-write operations are disabled by 'forbid_rmw' write isolation policy. Refer to https://github.com/scylladb/scylla/blob/master/docs/alternator/alternator.md#write-isolation-policies for more information."
Fixes#6421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200518125538.8347-1-nyh@scylladb.com>
(cherry picked from commit 5ef9854e86)
When index file is larger than 4GB, offset calculation will overflow
uint32_t and _promoted_index_end will be too small.
As a result, promoted_index_size calculation will underflow and the
rest of the page will be interpretd as a promoted index.
The partitions which are in the remainder of the index page will not
be found by single-partition queries.
Data is not lost.
Introduced in 6c5f8e0eda.
Fixes#6040
Message-Id: <20200521174822.8350-1-tgrabiec@scylladb.com>
(cherry picked from commit a6c87a7b9e)