* seastar 428f4ac...11546d4 (9):
> reactor: Fix an infinite loop caused the by high resolution timer not being monitored
> build: Add back `SEASTAR_SHUFFLE_TASK_QUEUE`
> build: Unify dependency versions
> future-util: optimize parallel_for_each() with single element
> core/sharded.hh: fix doxygen for "Multicore" group
> build: switch from travis-ci to circleci
> perftune.py: fix irqbalance tuning on Ubuntu 18
> build: Make the use of sanitizers transitive
> net: ipv6: fix ipv6 detection and tests by binding to loopback
"
Recently, there has been a series of incidents of the multishard
combining reader deadlocking, when the concurrency of reads were
severely restricted and there was no timeout for the read.
Several fixes have been merged (414b14a6b, 21b4b2b9a, ee193f1ab,
170fa382f) but eliminating all occurrences of deadlocks proved to be a
whack-a-mole game. After the last bug report I have decided that instead
of trying to plug new wholes as we find them, I'll try to make wholes
impossible to appear in the first place. To translate this into the
multishard reader, instead of sprinkling new `reader.pause()` calls all
over the place in the multishard reader to solve the newly found
deadlocks, make the pausing of readers fully automatic on the shard
reader level. Readers are now always kept in a paused state, except when
actually used. This eliminates the entire class of deadlock bugs.
This patch-set also aims at simplifying the multishard reader code, as
well as the code of the existing `lifecycle_policy` implementations.
This effort resulted in:
* mutation_reader.cc: no change in SLOC, although it now also contains
logic that used to be duplicated in every `lifecycle_policy`
implementation;
* multishard_mutation_query.cc: 150 SLOC removed;
* database.cc: 30 SLOC removed;
Also the code is now (hopefully) simpler, safer and has a clearer
structure.
Fixes#4050 (main issue)
Fixes#3970Fixes#3998 (deprecates really)
"
* 'simplify-and-fix-multishard-reader/v3.1' of https://github.com/denesb/scylla:
query_mutations_on_all_shards(): make states light-weight
query_mutations_on_all_shards(): get rid of read_context::paused_reader
query_mutations_on_all_shards(): merge the dismantling and ready_to_save states into saving state
query_mutations_on_all_shards(): pause looked-up readers
query_mutation_on_all_shards(): remove unecessary indirection
shard_reader: auto pause readers after being used
reader_concurrency_semaphore::inactive_read_handle: fix handle semantics
shard_reader: make reader creation sync
shard_reader: use semaphore directly to pause-resume
shard_reader: recreate_reader(): fix empty range case
foreign_reader: rip out the now unused private API
shard_reader: move away from foreign_reader
multishard_combining_reader: make shard_reader a shared pointer
multishard_combining_reader: move the shard reader definition out
multishard_combining_reader: disentangle shard_reader
Previously the different states a reader can be in were all separate
structs, and were joined together by a variant. When this was designed
this made sense as states were numerous and quite different. By this
point however the number of states has been reduced to 4, with 3 of them
being almost the same. Thus it makes sense to merge these states into
single struct and keep track of the current state with an enum field.
This can theoretically increase the chances of mistakes, but in practice
I expect the opposite, due to the simpler (and less) code. Also, all the
important checks that verify that a reader is in the state expected by
the code are all left in place.
A byproduct of this change is that the amount of cross-shard writes is
greatly reduced. Whereas previously the whole state object had to be
rewritten on state change, now a single enum value has to be updated.
Cross shard reads are reduced as well to the read of a few foreign
pointers, all state-related data is now kept on the shard where the
associated reader lives.
These two states are now the same, with the artificial distinction that
all readers are promoted to readey_to_save state after the compaction
state and the combined buffer is dismantled. From a practical
perspective this distinction is meaningless so merge the two states into
a single `saving` state.
On the beginning of each page, all saved readers from the previous pages
(if any) are looked up, so they can be reused. Some of these saved
readers can end up not being used at all for the current page, in which
case they will needlessly sit on their permit for the duration of
filling the page. Avoid this by immediately pausing all looked-up
readers. This also allows a nice unifying of the reader saving logic, as
now *all* readers will be in a paused state when `save_reader()` is
called. Previously, looked-up, but not used readers were an exception to
this, requiring extra logic to handle both cases. This logic can now be
removed.
Previously it was the responsibility of the layer above (multishard
combining reader) to pause readers, which happened via an explicit
`pause()` call. This proved to be a very bad design as we kept finding
spots where the multishard reader should have paused the reader to avoid
potential deadlocks (due to starved reader concurrency semaphores), but
didn't.
This commit moves the responsibility of pausing the reader into the
shard reader. The reader is now kept in a paused state, except when it
is actually used (a `fill_buffer()` or `fast_forward_to()` call is
executing). This is fully transparent to the layer above.
As a side note, the shard reader now also hides when the reader is
created. This also used to be the responsibility of the multishard
reader, and although it caused no problems so far, it can be considered
a leak of internal details. The shard reader now automatically creates
the remote reader on the first time it is attempted to be used.
The code has been reorganized, such that there is now a clear separation
of responsibilities. The multishard combining reader handles the
combining of the output of the shard readers, as well as issuing
read-aheads. The shard reader handles read-ahead and creating the
remote reader when needed, as well as transferring the results of remote
reads to the "home" shard. The remote reader
(`shard_reader::remote_reader`, new in this patch) handles
pausing-resuming as well as recreating the reader after it was evicted.
Layers don't access each other's internals (like they used to).
After this commit, the reader passed to `destroy_reader()` will always
be in paused state.
Reader creation happens through the `reader_lifecycle_policy` interface,
which offers a `create_reader()` method. This method accepts a shard
parameter (among others) and returns a future. Its implementation is
expected to go to the specified shard and then return with the created
reader. The method is expected to be called from the shard where the
shard reader (and consequently the multishard reader) lives. This API,
while reasonable enough, has a serious flaw. It doesn't make batching
possible. For example, if the shard reader issues a call to the remote
shard to fill the remote reader's buffer, but finds that it was evicted
while paused, it has to come back to the local shard just to issue the
recreate call. This makes the code both convoluted and slow.
Change the reader creation API to be synchronous, that is, callable from
the shard where the reader has to be created, allowing for simple call
sites and batching.
This change requires that implementations of the lifecycle policy update
any per-reader data-structure they have from the remote shard. This is
not a problem however, as these data-structures are usually partitioned,
such that they can be accessed safely from a remote shard.
Another, very pleasant, consequence of this change is that now all
methods of the lifecycle interface are sync and thus calls to them
cannot overlap anymore.
This patch also removes the
`test_multishard_combining_reader_destroyed_with_pending_create_reader`
unit test, which is not useful anymore.
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize
noise.
The shard reader relies on the `reader_lifecycle_policy` for pausing and
resuming the remote reader. The lifecycle policy's API was designed to
be as general as possible, allowing for any implementation of
pause/resume. However, in practice, we have a single implementation of
pause/resume: registering/unregistering the reader with the relevant
`reader_concurrency_semaphore`, and we don't expect any new
implementations to appear in the future.
Thus, the generic API of the lifecycle policy, is needlessly abstract
making its implementations needlessly complex. We can instead make this
very concrete and have the lifecycle policy just return the relevant
semaphore, removing the need for every implementor of the lifecycle
policy interface to have a duplicate implementation of the very same
logic.
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize noise.
If the shard reader is created for a singular range (has a single
partition), and then it is evicted after reaching EOS, when recreated we
would have to create a reader that reads an empty range, since the only
partition the range has was already read. Since it is not possible to
create a reader with an empty range, we just didn't recreate the reader
in this case. This is incorrect however, as the code might still attempt
to read from this reader, if only due to a bug, and would trigger a
crash. The correct fix is to create an empty reader that will
immediately be at EOS.
Drop all the glue code, needed in the past so the shard reader can be
implemented on top of foreign reader. As the shard reader moved away
from foreign reader, this glue code is not needed anymore.
In the past, shard reader wrapped a foreign reader instance, adding
functionality required by the multishard reader on top. This has worked
well to a certain degree, but after the addition of pause-resume of
shard reader, the cooperation with foreign reader became more-and-more a
struggle. It has now gotten to a point, where it feels like shard reader
is fighting foreign reader as much as it reuses it. This manifested
itself in the ever growing amount of glue code, and hacks baked into
foreign reader (which is supposed to be of general use), specific to
the usage in the multishard reader.
It is time we don't force this code-reuse anymore and instead implement
all the required functionality in shard reader directly.
Some members of shard reader have to be accessed even after it is
destroyed. This is required by background work that might still be
pending when the reader is destroyed. This was solved by creating a
special `state` struct, which contained all the members of the shard
readers that had to be accessed even after it was destroyed. This state
struct was managed through a shared pointer, that each continuation that
was expected to outlive the reader, held a copy of. This however created
a minefield, where each line of the code had to be carefully audited to
access only fields that will be guaranteed to remain valid.
Fix this mess by making the whole class a shared pointer, with
`enable_shared_from_this`. Now each continuation just has to make sure
to keep `this` alive and code can now access all members freely (well,
almost).
Shard reader started its life as a very thin layer above foreign reader,
with just some convenience methods added. As usual, by now it has grown
into a hairy monster, its class definition out-growing even that of the
multishard reader itself. It is time shard reader is moved into the
top-level scope, improving the readability of both classes.
Currently shard reader has a reference to the owning multishard reader
and it freely accesses its members. This resulted in a mess, where it's
not clear what exactly shard reader depends on. Disentangle this mess,
by making the shard reader self-sufficient, passing all it depends on
into its constructor.
All tests that involve writing to a base table and then reading from the
view table must use the eventually() function to account for the fact that
the view update is asynchronous, and may be visible only some time after
writing the base table. Forgetting an eventually() can cause the test
to become flaky and sometimes fail because the expected data is not *yet*
in the view. Botond noticed these failures in practice in two subtests
(test_partition_key_filtering_with_slice and
test_clustering_key_in_restrictions).
This patch fixes both tests, and I also reviewed the entire source file
view_schem_test.cc and found additional places missing an eventually()
(and also places that unnecessarily used eventually() to read from the
base table), and fixed those as well.
Fixes#4212
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212121140.14679-1-nyh@scylladb.com>
There is no value if having a default value for encoding_stats parameter
of write_components(). If anything it weakens the tests by encouraging
not using the real encoding stats which is not what the actual sstable
write path in Scylla does.
This patch removes the default value and makes most of the tests provide
real encoding statistics. The ones that do not are those that have no
easy way of obtaining those (and those stats are not that important for
the test itself) or there is a reason for not using those
(sstable_3_x_test::test_sstable_write_large_row uses row size thresholds
based on size with default-constructed encoding_stats).
Message-Id: <20190212124356.14878-1-pdziepak@scylladb.com>
Refs #4085
Changes commitlog descriptor to both accept "Recycled-Commitlog..."
file names, and preserve said name in the descriptor.
This ensures we pick up the not-yet-used recycled segments left
from a crash for replay. The replay in turn will simply ignore
the recycled files, and post actual replay they will be deleted
as needed.
Message-Id: <20190129123311.16050-1-calle@scylladb.com>
create-relocatable-package.py currently (refs #4194) builds a compressed
tar file, but does so using a painfully slow Python implementation of gzip,
which is a problem considering the huge size (around 2 gigabytes) of Scylla's
executable. On my machine, running it for a release build of Scylla takes a
whopping 6 minutes.
Just replacing the Python compression with a pipe to an external "gzip"
process speeds up the run to just 2 minutes. But gzip is still not optimal,
using only one thread even when on a many-core machine. If we switch to
"pigz", a parallel implementation of "gzip", all cores are used and on
my machine the compression speeds up to just 23 seconds - that's 15
times faster than before this patch.
So this patch has create-relocatable-package.py use an external pigz process.
"pigz" is now required on the build system (if you want to create packages),
so is added to install-dependencies.sh.
[avi: update toolchain]
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212090333.3970-1-nyh@scylladb.com>
In commit ec66dd6562, in non-interactive
runs of scylla_setup all options were unintentionally set to "false",
regardless of the options passed on the scylla_setup command line. This
can lead to all sorts of wrong behaviors, and in particular one test
setup assumed it was enabling the Scylla service (which was previously
the default) but after this commit, it no longer did.
This patch restores the previous behavior: Non-interactive invocations
of scylla_setup adhere to the defaults and the command-line options,
rather than blindly choosing "false".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190211214105.32613-1-nyh@scylladb.com>
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().
Protect against that by using get_opt() and failing authentication if we see a NULL.
Fixes#4168.
Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason for that is that Cassandra changed their counter
representation in their 2.1 version and kept transparently supporting
both representations. We do not support their old representation, nor
there is a sane way to figure out by looking at the data which one is in
use.
For safety, we had made the decision long ago to not import any
tables with counters: if a counter was generated in older Cassandra, we
would misrepresent them.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra 2.1
is by now pretty old. many users can safely say they've never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is either better, faster, or worse: the only option left. I
argue that having a flag that allow us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>
This is convenient to test scylla directly by invoking build/dev/scylla.
This needs to be done under docker because the shared objects scylla
looks for may not exist in the host system.
During quick development we may not want to go through the trouble of
packaging relocatable scylla every time to test changes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190209021033.8400-1-glauber@scylladb.com>
"
The code reading counter cells form sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.
Fixes#4206
Tests: unit(release)
"
* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
tests/sstables: test counter cell header with large number of shards
sstables/counters: fix remote counter shard detection
The logic responsible for reading counters from sstables was getting
confused by large headers. The size of the header depends directly on
the number of shards. This tests checks that we can handle cells with
large number of counter shards properly.
Each counter cell has a header with an entry for each local and global
shards. The detection of remote shards is done by checking if there are
any counter shards that do not have an entry in the header. This is done
by computing the number of counter shards in a cell and comparing it to
the number of header entries. However, the computation was wrong and
included the size taken by the header itself. As a result, if the header
was as big or larger than a single counter shard Scylla incorrectly
complained about remote shards.
The interpreter as it is right now has a bug: I incorrectly assumed that
all the shared libraries that python dynamically links would be in
lib-dynload. That is not true, and at least some of them are in
site-packages.
With that, we were loading system libraries for some shared objects.
The approach taken to fix this is to just check if we're seeing a shared
library and relocate everything we see: we will end up relocating the
ones in lib64 too, but that not only should be okay, it is probably even
more fool-proof.
While doing that I noticed that I had forgotten to incorporate one of
previous feedback from Avi (that we're leaving temporary files behind).
So I'm fixing that as well.
[avi: update toolchain]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190208115501.7234-1-glauber@scylladb.com>
I was playing with the python3 interpreter trying to get pip to work,
just to see how far we can go. We don't really need pip, but I figured
it would be a good stress test to make sure that the process is working
and robust.
And it didn't really work, because although pip will correctly install
things into $relocatable_root/local/lib, sys.path will still refer to a
hardcoded /usr/local. While this should not affect Scylla, since we
expect to have all our modules in out path anyway -- and that path is
searched before /usr/local, it is still dangerous to make an absolute
reference like this.
Unfortunately, /usr/local/ it is included unconditionally by site.py,
which is executed when the interpreter is started and there is no
environment variable I found to change that (the help string refers to
PYTHONNOUSERSITE, but I found no mention of that in site.py whatsoever)
There is a way to tell site.py not to bother to add user sites, by
passing the -s flag, which this patch does.
Aside from doing that, we also enhance PYTHONPATH to include a reference
to ./local/{lib,lib64}/python<version>/site-packages.
After applying this patch, I was able to build an interpreter containing
only python3-pip and python3-setuptools, and build the relocatable
environment from there.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190206052104.25927-1-glauber@scylladb.com>
tmpdir is a helper class representing a temporary directory.
Unfortunately, it suffers for some problems such as lack of proper
encapsulation and weak typing. This has caused bugs in the past when the
user code accidentally modified the member variable with the path to the
directory.
This patch modernises tmpdir and updates its users. The path is stored
in a std::filesystem::path and available read-only to the class users.
mkdtemp and boost are replaced by standard solution.
The users are update to use path more (when it didn't involve too many
changes to their code) and stop using lw_shared_ptr to store the tmpdir
when it wasn't necessary.
tmpdir intentionally doesn't provide any helpers for getting the path as
a string in order to discourage weak types.
Message-Id: <20190207145727.491-1-pdziepak@scylladb.com>
* seastar c3be06d...428f4ac (13):
> build: make the "dist" test respect the build type
> Merge 'Add support for docker --cpuset-cpus' from Juliana
> Merge "Add support for Coroutines TS" from Paweł
> Merge "Modernize dependency management" from Avi
> future: propagate broken_promise exception to abandoned continuations
> net/inet_address: avoid clang Wmissing-braces
> build: Default to the "Release" type if unspecified
> rpc: log an exception that may happen while processing an RPC message
> Add a --split-dwarf option to configure.py
> build: Fix the `StdFilesystem` module
> Compress debug info by default
> Add an option for building with split dwarf
> Dockerfile: install stow
Default constructed extremum_tracker has uninitialised _default_value
which basically makes it never correct to do that. Since this class is a
mechanism and not a value it doesn't really need to be a regular type,
so let's drop the default constructor.
Message-Id: <20190207162430.7460-1-pdziepak@scylladb.com>
This series contains several fixes and improvements as well as new tests
for sstable code dealing with statistics.
* https://github.com/pdziepak/scylla.git sstable-stats-fixes/v1-rebased:
sstables: compaction: don't access moved-from vector of sstables
memtable: move encoding_stats_collector implementation out of header
sstables: seal_statistics(): pass encoding_stats by constant reference
sstables/mc/writer: don't assume all schema columns are present
tests/sstable3: improvements to file compare
tests: extract mutation data model
tests/data_model: add support for expiring atomic cells
tests/data_model: allow specifying timestamp for row markers
tests/memtable: test column tracking for encoding stats
sstables: use correct source of statistics in
get_encoding_stats_for_compaction()
utils/extremum_tracking: preserve "not-set" status on merge
sstables/metadata_collector: move the default values to the global
tracker
tests/sstables: test for reading serialisation header
tests/sstables: pass encoding stats to write_components()
tests/sstable: test merging encoding_stats
Fixes#4202.
By default write_components() uses a safe default for encoding_stats
which indicates that all columns are present. This may hide so bugs, so
let's pass the real thing in the tests that this may matter.
column_stats is a per-partition tracker, while metadata_collector is the
global one. The statistics gathered by column_stats are merged into the
metadata_collector. In order to ensure that we get proper default values
in case no value of particular kind (e.g. no TTLs) was seen they need to
be set on the global tracker, not the per-partition one.
extremum_tracker allows choosing a default value that's going to be used
only if no "real" values were provided. Since it is never compared with
the actual input values it can be anything. For instance, if the minimum
tracker default value is 0 and there was one update with the value 1 the
detected minimum is going to be 1 (the default is ignored).
However, this doesn't work when the trackers are merged since that
process always leaves the destination tracker in the "set" state
regardless whether any of the merged trakcers has ever seen any value.
This is fixed by this patch, by properly preserving _is_set state on
merge.
sstable class is responsible for much more things that it should. In
particular, it takes care of both writing and reading sstables. The
problem that it causes is that it is very easy to confuse those two.
This is what has happened in get_encoding_stats_for_compaction().
Originally, it was using _c_stats as a source of the statistics, which
is used only during the write and per-partition. Needless to say, the
returned encoding_stats were bogus.
The correct source of those statistics is get_stats_metadata().
This patch introduces some improvement to file comparison:
- exception flags are set so that any error triggers an exceptions and
guarantees that they are not silently ignored
- std::ios_base::binary flag is passed to open()
- istreambuf_iterator is used instead of istream_iterator. It is better
suited for comparing binary data.