In preparation for introducing support for multiple entry types in the
querier_cache, move all insert/lookup related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successful or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
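A minimal sketch of how an optional-returning lookup reads at the call site. `toy_cache`, `entry`, `lookup_or_create()` and `example_lookup()` are hypothetical stand-ins for illustration only, not the querier_cache interface:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a cached entry; not the real type.
struct entry {
    std::string payload;
};

// Toy cache: lookup() removes and returns the entry if present,
// or std::nullopt on a miss. The caller decides what a miss means.
class toy_cache {
    std::unordered_map<int, entry> _entries;
public:
    void insert(int key, entry e) {
        _entries.insert_or_assign(key, std::move(e));
    }

    std::optional<entry> lookup(int key) {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        entry e = std::move(it->second);
        _entries.erase(it);
        return e;
    }
};

// Call site: hit or miss is explicit, and the fallback is ordinary
// code instead of a create_fun() passed into the cache.
inline entry lookup_or_create(toy_cache& cache, int key) {
    if (auto e = cache.lookup(key)) {
        return std::move(*e);
    }
    return entry{"fresh"};
}

// Usage: a hit consumes the entry, so the second lookup misses.
inline void example_lookup() {
    toy_cache cache;
    cache.insert(1, entry{"cached"});
    assert(lookup_or_create(cache, 1).payload == "cached");
    assert(lookup_or_create(cache, 1).payload == "fresh");
}
```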
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it has finished all in-progress
read-aheads and/or pending next partition calls.
The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that are
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
"
After we fixed the reloading flow, it became possible for items to be no longer cached but
still held in the underlying loading_shared_values object. Since loading_cache::size() returns
the size of its loading_shared_values object and loading_cache::begin()/end()/find() return
iterators based on loading_shared_values iterators, these APIs may return misleading values, e.g.
size() may return the same value after one of the items has been removed using the remove(key) API.
This series fixes this by switching the above-mentioned APIs to work on top of the lru_list object instead
of loading_shared_values.
"
* 'loading_cache_fix_api_semantics-v1' of https://github.com/vladzcloudius/scylla:
loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
loading_cache: make size() return the size of lru_list instead of loading_shared_values
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/),
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host, including libc and the dynamic
linker (ld.so).
We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.
With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.
The packages are created by running
    ninja build/release/scylla-package.tar
or
    ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
Reloading may hold a value in the underlying loading_shared_values while
the corresponding cache value has already been deleted.
This may create surprising situations like this:
    <populate cache with 10 entries>
    cache.remove(key1);
    for (auto& e : cache) {
        std::cout << e << std::endl;
    }
    <all 10 entries are printed, including the one for "key1">
In order to avoid such situations we are going to make loading_cache::iterator
a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator,
because lru_list contains entries only for cached items.
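The effect can be sketched with a toy model. Everything here is an assumption made for illustration: `toy_loading_cache`, `remove_while_reloading()` and `size_from_values()` are hypothetical names, `_values` merely plays the role of loading_shared_values and `_lru` the role of lru_list. It is not the real loading_cache code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Toy model: `_values` may keep an item alive while a reload is in
// flight; `_lru` holds keys of cached items only.
class toy_loading_cache {
    std::map<std::string, int> _values;
    std::list<std::string> _lru;
public:
    bool contains(const std::string& key) const {
        return std::find(_lru.begin(), _lru.end(), key) != _lru.end();
    }

    void put(const std::string& key, int value) {
        _values[key] = value;
        if (!contains(key)) {
            _lru.push_front(key);
        }
    }

    // Simulates remove(key) racing with a reload: the item leaves
    // the LRU list, but the underlying value is still held.
    void remove_while_reloading(const std::string& key) {
        _lru.remove(key);
    }

    // Old behaviour: also counts values that are no longer cached.
    std::size_t size_from_values() const { return _values.size(); }

    // New behaviour: counts cached items only.
    std::size_t size() const { return _lru.size(); }

    // Iteration follows the LRU list, so removed keys are skipped.
    auto begin() const { return _lru.begin(); }
    auto end() const { return _lru.end(); }
};

// Usage: after the removal only the LRU-based size is correct.
inline void example_remove() {
    toy_loading_cache cache;
    cache.put("key1", 1);
    cache.put("key2", 2);
    cache.remove_while_reloading("key1");
    assert(cache.size_from_values() == 2); // stale count
    assert(cache.size() == 1);             // cached items only
}
```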
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The reloading flow may hold items in the underlying loading_shared_values
after they have been removed (e.g. via the remove(key) API), so loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size(), on the other hand, does.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar 12f18ce...5712816 (6):
> tests: add signal_test to test list
> Merge "Enhancements for memory_output_stream" from Paweł
> seastar-addr2line: don't print an empty line between backtrace lines
> seastar-addr2line: add --verbose option
> seastar-addr2line: make prefix matching non-greedy
> future: make available() const
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables from the upload directory after we have already moved
them out, and fail.
Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
Additional tests for cases surrounding issue #3362, where base rows
disappear (or not) and view rows need to disappear (or not) as well.
These new tests focus on checking that view_updates::do_delete_old_entry()
is correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-2-nyh@scylladb.com>
In previous patches, we gave up on an old (and broken) attempt to track
the timestamps of many unselected base-table columns through one row marker
in the view table - and replaced them by "virtual cells", one per unselected
cell.
The do_delete_old_entry() function still contains old code which maintained
that row marker, and is no longer needed. That old code is not only no longer
needed, it also no longer did anything because all columns now appear in
the view (as virtual columns) so the code ignored them when calculating the
row marker.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-1-nyh@scylladb.com>
"
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness - existence or disappearance -
of a view-table row is tied to the liveness of the base table row. And
that, in turn, depends not only on selected columns (base-table columns
SELECTed to also appear in the view) but also on unselected columns.
This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch set we tried to build a single "row marker" in the view column which
tried to summarize the liveness information in all unselected columns.
But this proved unworkable, as explained in issue #3362 and as will be
demonstrated in unit tests at the end of this series.
Because we can't replace several unselected cells by one row marker, what
we do in this series is to add for each of the unselected cells a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.
Fixes #3362
"
* 'virtual-cols-v3' of https://github.com/nyh/scylla:
Materialized Views: test that virtual columns are not visible
Materialized Views: unit test reproducing fixed issue #3362
Materialized Views: no need for elaborate row marker calculations
Materialized Views: add unselected columns as virtual columns
Materialized Views: fill virtual columns
Do not allow selecting a virtual column
schema: persist "view virtual" columns to a separate system table
schema: add "view virtual" flag to schema's column_definition
Add "empty" type name to CQL parser, but only for internal parsing
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.
This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"
* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
database: make database's mutation apply stage inherit its scheduling group from the caller
database: make database::_mutation_query_stage inherit the scheduling group
database: make database::_data_query_stage inheriting its caller's scheduling_group
storage_proxy: make _mutate_stage inherit its caller's scheduling_group
"This series introduces a few improvements related to a reload flow.
From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.
It may also safely evict as many items from the cache as needed."
Fixes #3606
* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
utils::loading_cache: hold a shared_value_ptr to the value when we reload
utils::loading_cache::on_timer(): remove not needed capture of "this"
utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
"
Fix loading_cache_test flakiness by retrying assertions.
Tests: unit(loading_cache_test(debug, release))
Fixes #3723
"
* 'loading-cache-test-flake/v4' of https://github.com/duarten/scylla:
tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
tests/loading_cache_test: Use eventually() instead of open-coding it
tests/mutation_reader_test: Extract eventually_true() to eventually.hh
tests/cql_test_env: Lift eventually() to its own header file
* seastar 9bb1611...12f18ce (17):
> correctly configure I/O Scheduler for usage with the YAML file
> Added support for user-defined signal handlers
> Added reactor method to modify blocked_reactor_notify_ms
> configure.py: Use the user-specified compiler for dialect detection
> seastar-addr2line: clear current trace when omitting already seen trace
> seastar-addr2line: fix redirecting output to a file
> seastar-addr2line: don't require a space before the addresses
> tests: Ensure test thread is always joined
> README.md: Add cute badges
> iotune: adjust num-io-queues recommendation
> dns: add SRV record lookup
> reactor: define max_aio_per_queue for C++14
> reactor,alien: silence GCC warnings
> core,json,net: silence GCC warnings
> fstream: "using data_sink_impl::put" to silence gcc warning
> Merge 'Ensure Seastar compiles in C++14 mode' from Jesse
> Revert "foreign_ptr: allow waiting for the destruction of the managed ptr"
Implement and test support for reading range tombstones in SSTables 3.
Does not yet support reads which are using slicing or fast forwarding.
From github.com/scylladb/seastar-dev.git haaawk/sstables3/tombstones_v11:
Piotr Jastrzebski (5):
sstables: Add consumer_m::consume_range_tombstone
sstables: Support null columns in ck
sstables: Support reading range_tombstones
sstables: Test reading range_tombstones
sstables: Add test for RT with non-full key
Vladimir Krivopalov (2):
sstables: Add operator<< overload for bound_kind_m.
keys: Add clustering_key_prefix::make_full helper.
The `loading_cache_test::test_loading_cache_loading_reloading` test
case is flaky, and fails in both debug and release mode. In an
over-provisioned environment, it's possible that when the reactor
runs, the timers for the `sleep()` and for reloading the
`loading_cache` have both expired, and continuations are scheduled in
an arbitrary order, causing the test to fail.
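Retrying a check instead of asserting once can be sketched as below. This is a simplified, blocking illustration using only the standard library; `retry_until_true()` and `example_retry()` are hypothetical names and the backoff policy is an assumption. It is not the `eventually()` helper from the test suite:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: retry `check` with a growing delay instead of
// asserting once. Returns true as soon as `check` passes, or false
// if every attempt fails.
inline bool retry_until_true(
        const std::function<bool()>& check,
        int max_attempts = 10,
        std::chrono::milliseconds delay = std::chrono::milliseconds(1)) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (check()) {
            return true;
        }
        std::this_thread::sleep_for(delay);
        delay *= 2; // back off so a slow machine gets more time
    }
    return false;
}

// Usage: the condition only holds on the third evaluation, which a
// single assertion would have reported as a failure.
inline void example_retry() {
    int calls = 0;
    assert(retry_until_true([&] { return ++calls >= 3; }));
    assert(calls == 3);
}
```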
Fixes #3723
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This error is transient, since as soon as the node is up we will be able
to send the migration request. Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).
The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.
Fixes #3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
scylla-ami is no longer a submodule of the scylla repo. It now works as an
independent repository, just like scylla-jmx and scylla-tools, and provides
an .rpm package to install the AMI scripts on the AMI.
Most files are gone from dist/ami/files, but scylla_install_ami was copied
from scylla-ami: since it has to install the scylla .rpms, it cannot be
packaged in the scylla-ami rpm.
In scylla_install_ami, we dropped the ixgbevf/ena driver code; we will
provide 'scylla-ixgbevf' and 'scylla-ena' DKMS .rpms instead.
They will automatically build kernel modules for the current kernel.
A repo of the driver packages is on
https://copr.fedorainfracloud.org/coprs/scylladb/scylla-ami-drivers/
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180821201101.4631-1-syuu@scylladb.com>
"
Right now, simple_memory_input_stream takes Iterator as a template
parameter. That iterator is supposed to point to fragments in a
an underlying fragmented buffer. This makes no sense, since simple streams
deal only with contiguous buffers.
This series removes from Scylla any assumption that simple_memory_input_stream
has an iterator_type member, so that the member can be removed.
"
* tag 'prepare-simple-stream-no-iterator/v1' of https://github.com/pdziepak/scylla:
idl: deserialized_bytes_proxy do not assume presence of iterator_type
idl-compiler: specify return type of with_serialized_stream() lambdas
"
This series is a refactor of password management, motivated by a
combination of correctness bugs, improving testability, improving
clarity, and adding documentation.
Tests: unit (release)
"
* 'jhk/passwords_refactor/v2' of https://github.com/hakuch/scylla:
auth: Clean up implementation comments
auth: Remove unnecessary local variable
auth: Allow different random engines for salt
auth: Correct modulo bias in salt generation
auth: Extract random byte generation for salt
auth: Split out test for best supported scheme
auth: Rename function to use full words
auth: Add domain-specific exception for passwords
auth: Document passwords interface
auth: Move password stuff to its own namespace
auth: Identify password hashing errors correctly
auth: Add unit tests for password handling
auth: Move password handling to its own files
auth: Construct `std::random_device` instances once
There could be soft pressure, but the soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block waiting for earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.
flush will take longer to complete. Due to scheduling group isolation,
the soft-pressure flusher will keep getting the CPU.
This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test
Fixes #3717
Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.
The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.
I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.
The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.
This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
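A minimal sketch of the selection rule after this change. It is simplified to individual regions rather than memtable lists, and `toy_region`, its fields and `pick_for_flush()` are hypothetical names used only for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a memtable's region; field names are illustrative.
struct toy_region {
    std::size_t total_space;
    bool flush_started;

    // A region whose memtable already started flushing reports 0,
    // so it no longer wins the "largest region" comparison.
    std::size_t evictable_occupancy() const {
        return flush_started ? 0 : total_space;
    }
};

// Returns the index of the region the flusher would pick: the one
// with the largest evictable occupancy. Requires a non-empty vector.
inline std::size_t pick_for_flush(const std::vector<toy_region>& regions) {
    assert(!regions.empty());
    auto it = std::max_element(regions.begin(), regions.end(),
        [](const toy_region& a, const toy_region& b) {
            return a.evictable_occupancy() < b.evictable_occupancy();
        });
    return static_cast<std::size_t>(it - regions.begin());
}
```

With regions of sizes 100 (already flushing), 10 and 40, the flusher in this model picks the 40-unit region rather than the 100-unit one that cannot start flushing again.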
Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
> scylla-ami-setup.service: run only on first startup
> Use fstab to mount RAID volume on every reboot
Since a Linux system aborts booting when it fails to mount fstab entries,
the user may not be able to see an error message when we use fstab to mount
/var/lib/scylla on the AMI.
Instead of aborting the boot, we can just refuse to start scylla-server.service
when the RAID volume is not mounted, using the RequiresMountsFor directive of the
systemd unit file.
See #3640
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group. This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller. By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.
Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statement scheduling groups, later.
deserialized_bytes_proxy assumes that the provided input stream has
iterator_type that represents the iterator pointing to the next
fragment of the fragmented underlying buffer. This makes little sense
if the input stream is a contiguous one (i.e.
simple_memory_input_stream) so let's not make such assumptions.
IDL-generated code uses with_serialized_stream() to optimise for cases
when the underlying buffer is not fragmented. The provided lambda will
be called with either a simple or a fragmented stream as an argument. The
consequence of this is that both instantiations of the generic lambda need to
return the same type. This is a problem if the type is deduced and
depends on the provided input stream (e.g. different type for fragmented
and simple streams). The solution is to explicitly specify the return
type as the type returned by deserialising general utils::input_stream.
This way each instantiation of the lambda can return whatever it wants as
long as it is convertible to the type that the serialiser would return
if utils::input_stream was given.
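The constraint can be illustrated with a small, self-contained sketch. `simple_stream`, `fragmented_stream`, `with_stream()` and `read_all()` are hypothetical stand-ins chosen for the example, not the IDL-generated code or the real stream classes:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stream types standing in for the contiguous and the
// fragmented input streams.
struct simple_stream { const char* data; };
struct fragmented_stream { std::string first, second; };

// Calls `fn` with one stream type or the other. Both calls sit in
// the same function, so both instantiations of a generic lambda
// must return the same type for `auto` deduction to succeed.
template <typename Fn>
auto with_stream(bool contiguous, Fn&& fn) {
    if (contiguous) {
        return fn(simple_stream{"abc"});
    }
    return fn(fragmented_stream{"ab", "c"});
}

inline std::string read_all(bool contiguous) {
    // Without "-> std::string" the deduced return types would be
    // const char* and std::string, and with_stream() would not
    // compile. With it, each branch only has to be convertible.
    return with_stream(contiguous, [](auto stream) -> std::string {
        if constexpr (std::is_same_v<decltype(stream), simple_stream>) {
            return stream.data;
        } else {
            return stream.first + stream.second;
        }
    });
}
```

Both `read_all(true)` and `read_all(false)` yield the same `std::string` in this sketch, even though the two lambda instantiations build it from different types.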
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.
It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.
I observed writes to the system keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.
Fixes #3717.
Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
Being the single user of fnv1a, this allows us to get rid of it. As
the TODO inside fnv1a_hasher.hh indicates, and judging by any
independent benchmark, fnv1a is very slow. As we have added xx_hash
since then, and we know it to be fast, use it instead.
Tests: unit(release/cell_locker_test)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180823081715.26089-1-duarte@scylladb.com>