Fixes#6995
In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken, by the fact that we no longer flush all sstables
unless auto snapshot is enabled.
This means the low_mark assertion does not hold, because we maybe/probably
never got around to creating the sstables that would hold said mark.
Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.
(cherry picked from commit 9620755c7f)
"
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.
This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.
Fixes: #6409, Fixes#6776
"
* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
hinted handoff: disable warnings about segments left on disk
hinted handoff: release memory on commitlog termination
(cherry picked from commit 4c221855a1)
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates storage engine's assumptions and
lead to undefined behavior.
It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:
scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.
Fixes#6486.
Tests:
- added a new dtest to thrift_tests.py which reproduces the problem
Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit bfd129cffe)
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.
This patch adds a test to reproduce this case (the test passes on DyanmoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.
Fixes#6816.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
(cherry picked from commit f549d147ea)
On some CLI tools, command options may different between latest version
vs older version.
To maximize compatibility of setup scripts, we should always use
relocatable CLI tools instead of distribution version of the tool.
Related #6954
(cherry picked from commit a19a62e6f6)
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.
This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
Fixes#6768
(cherry picked from commit 5b438e79be)
Fixes#6828
When using the scylla list index from UUID extension,
null values were not handled properly causing throws
from underlying layer.
(cherry picked from commit 3b74b9585f)
On CentOS7, systemd does not support percentage-based parameter.
To apply memory parameter on CentOS7, we need to override the parameter
in bytes, instead of percentage.
Fixes#6783
(cherry picked from commit 3a25e7285b)
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.
This will typically manifest as a segfault.
Fixes#6953
Introduced in 79935df
Tests:
- manual using scylla binary. Reproduced the problem then verified the fix makes it go away
Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3486eba1ce)
On GCE, /dev/sda14 reported as unused disk but it's BIOS boot partition,
should not use for scylla data partition, also cannot use for it since it's
too small.
It's better to exclude such partiotion from unsed disk list.
Fixes#6636
(cherry picked from commit d7de9518fe)
We saw scylla hit user after free in repair with the following procedure during tests:
- n1 and n2 in the cluster
- n2 ran decommission
- n2 sent data to n1 using repair
- n2 was killed forcely
- n1 tried to remove repair_meta for n1
- n1 hit use after free on repair_meta object
This was what happened on n1:
1) data was received -> do_apply_rows was called -> yield before create_writer() was called
2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
with _writer_done[node_idx] not engaged
3) step 1 resumed, create_writer() was called and _repair_writer object was referenced
4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed
5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object
To fix, we should call wait_for_writer_done() after any pending
operations were done which were protected by repair_meta::_gate. This
prevents wait for writer done finishes before the writer is in the
process of being created.
Fixes: #6853Fixes: #6868
Backports: 4.0, 4.1, 4.2
(cherry picked from commit e6f640441a)
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to context, but it misses
all the input streams that can be opened in the consumer for promote index
reading. Consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes as that could cause
memory corruption due to read ahead.
Fixes#6924.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
(cherry picked from commit 0d70efa58e)
Merged patch set by Botond Dénes:
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
test/boost: view_build_test: add test_view_update_generator_buffering
test/boost: view_build_test: add test test_view_update_generator_deadlock
reader_permit: reader_resources: add operator- and operator+
reader_concurrency_semaphore: add initial_resources()
test: cql_test_env: allow overriding database_config
mutation_reader: expose new_reader_base_cost
db/view: view_updating_consumer: allow passing custom update pusher
db/view: view_update_generator: make staging reader evictable
db/view: view_updating_consumer: move implementation from table.cc to view.cc
database: add make_restricted_range_sstable_reader()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
(cherry picked from commit f488eaebaf)
Fixes#6892.
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
I manually checked that each individual defense added is already
preventing the crash from happening.
Fixes: #6613Fixes: #6907Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
(cherry picked from commit 3f84d41880)
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
Fixes#6913
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
(cherry picked from commit fe127a2155)
from Botond.
Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.
Refs: #6603Fixes: #6707
Tests: unit(dev)
* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
db/view: view_update_generator: use partitioned sstable set
sstables: make_partitioned_sstable_set(): return an sstable_set
(cherry picked from commit e4b74356bb)
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account if the SSTable requires view building, so a Staging SSTable can
be added to the tracker after a ALTER table, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.
Fixes#6798.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
(cherry picked from commit b67066cae2)
* seastar 8aad24a5f8...4641f4f2d3 (4):
> httpd: Don't warn on ECONNABORTED
> httpd: Avoid calling future::then twice on the same future
Fixes#6709.
> httpd: Use handle_exception instead of then_wrapped
> httpd: Use std::unique_ptr instead of a raw pointer
Consider a cluster with two nodes:
- n1 (dc1)
- n2 (dc2)
A third node is bootstrapped:
- n3 (dc2)
The n3 fails to bootstrap as follows:
[shard 0] init - Startup failed: std::runtime_error
(bootstrap_with_repair: keyspace=system_distributed,
range=(9183073555191895134, 9196226903124807343], no existing node in
local dc)
The system_distributed keyspace is using SimpleStrategy with RF 3. For
the keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.
Fixes: #6744
Backports: 4.0 4.1, 4.2
(cherry picked from commit 38d964352d)
In case a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.
For example,
Node1 (Repair master):
row1 -> hash1
row2 -> hash2
row3 -> hash3
row3' -> hash3
Node2 (Repair follower):
row1 -> hash1
row2 -> hash2
We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:
repair - Got error in row level repair: std::runtime_error
(row_diff.size() != set_diff.size())
In this case, node1 should send both row3 and row3' to peer node
instead of fail the whole repair. Because node2 does not have row3 or
row3', otherwise node1 won't send row with hash3 to node1 in the first
place.
Refs: #6252
(cherry picked from commit a00ab8688f)
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.
But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectry is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.
The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.
Fixes#6750
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
(cherry picked from commit 8e3be5e7d6)
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
a subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.
By resetting the default location from /tmp to testlog/{mode}
we allow test.py we consolidate all test artefacts in a single
place.
Fixes#6062, "test.py uses tmpfs"
(cherry picked from commit e628da863d)
* seastar 1e762652c4...8aad24a5f8 (2):
> futures: Add a test for a broken promise in a parallel_for_each
> future: Call set_to_broken_promise earlier
Fixes#6749 (probably).
In one of the longevity tests, we observed 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.
I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that seastar
oversized allocation checker did not warn in my tests where around 300K
repair hashes were inserted into the container.
- unordered_set:
hash_sets=295634, time=333029199 ns
- flat_hash_set:
hash_sets=295634, time=312484711 ns
- node_hash_set:
hash_sets=295634, time=346195835 ns
- btree_set:
hash_sets=295634, time=341379801 ns
The btree_set is a bit slower than unordered_set but it does not have
huge memory allocation. I do not measure real difference of total time
to finish repair of the same dataset with unordered_set and btree_set.
To fix, switch to absl btree_set container.
Fixes#6190
(cherry picked from commit 67f6da6466)
We could hit "cannot serialize '_io.BufferedReader' object" when request get 404 error from the server
Now you will get legit error message in the case.
Fixes#6690
(cherry picked from commit de82b3efae)
WHERE clauses with start point above the end point were handled
incorrectly. When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0. This is clearly incorrect, as the user's intent was to filter out
all possible values of a.
Fix it by explicitly short-circuiting to false when start > end. Add
a test case.
Fixes#5799.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 921dbd0978)
We currently does not able to apply version number fixup for .orig.tar.gz file,
even we applied correct fixup on debian/changelog, becuase it just reading
SCYLLA-VERSION-FILE.
We should parse debian/{changelog,control} instead.
Fixes#6736
(cherry picked from commit a107f086bc)
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.
Before commit 7d86a3b208, the node being
replaced will not be considered as a source node because it is already
replaced by the replacing node before the replace operation is finished.
This is the reason why it works in the past.
To fix, filter out the node being replaced as a source node explicitly.
Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
(cherry picked from commit e338028b7e22b0a80be7f80c337c52f958bfe1d7)
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.
Fixes#6720
"
* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
big_decimal: Add a test for a corner case
big_decimal: Correctly handle negative scales
big_decimal: Add a as_rational member function
big_decimal: Move constructors out of line
(cherry picked from commit 3e2eeec83a)
This makes the code a bit easier to read as there are no discarded
futures and no references to having to keep a subscription alive,
which we don't with current seastar.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200527013120.179763-1-espindola@scylladb.com>
"
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquired permit, a deadlock happens.
To prevent this, we make the local repair reader evictable. For this we
reuse the already existing evictable reader mechanism of the multishard
combining reader. This patchset refactors this evictable reader
mechanism into a standalone flat mutation reader, then exposes it to the
outside world.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over 32MB read.
The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogenous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.
Fixes: #6272
Tests: unit(dev, release, debug)
"
* 'repair-row-level-evictable-local-reader/v3' of https://github.com/denesb/scylla:
repair: row_level: destroy reader on EOS or error
repair: row_level: use evictable_reader for local reads
mutation_reader: expose evictable_reader
mutation_reader: evictable_reader: add auto_pause flag
mutation_reader: make evictable_reader a flat_mutation_reader
mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
mutation_reader: move inactive_shard_reader code up
mutation_reader: fix indentation
mutation_reader: shard_reader: extract remote_reader as evictable_reader
mutation_reader: reader_lifecycle_policy: make semaphore() available early