When we skip through a wide partition using the promoted index, we may land
at a position that lies in the middle of a range tombstone, so we need to
be aware of it. For this, we check if the previous promoted block has an
end open marker and either set the range tombstone start using it or
reset it if the marker is missing.
Note several things about the implementation.
Firstly, we have to peek back at the previous promoted index block for the
end open marker, and so we have to always preserve one more promoted
index block when we read the next batch so that we can still access it.
Secondly, we use the previous promoted block end position to build
position in partition for the range tombstone start.
Lastly, we don't have a notion of end open marker in older consumers
that work with SSTables of ka/la formats so we only call the
corresponding methods if the consumer supports them.
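The check described above can be sketched as follows. This is a minimal
illustration, not the actual reader code: `promoted_index_block`,
`open_tombstone` and `open_tombstone_on_skip()` are hypothetical names, and
positions are reduced to plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified types; not the actual Scylla definitions.
struct deletion_time {
    int64_t marked_for_delete_at;
};

struct promoted_index_block {
    uint64_t start;  // position where the block starts
    uint64_t end;    // position where the block ends
    // Engaged if a range tombstone is still open at `end`.
    std::optional<deletion_time> end_open_marker;
};

struct open_tombstone {
    uint64_t start;  // built from the previous block's end position
    deletion_time tomb;
};

// Returns the range tombstone that is open when reading starts at
// blocks[target], or nullopt if the open tombstone has to be reset.
std::optional<open_tombstone> open_tombstone_on_skip(
        const std::vector<promoted_index_block>& blocks, std::size_t target) {
    if (target == 0) {
        return std::nullopt;  // nothing precedes the first block
    }
    // Peek back at the previous block; this is why it must stay accessible.
    const promoted_index_block& prev = blocks[target - 1];
    if (!prev.end_open_marker) {
        return std::nullopt;  // no tombstone spans the block boundary
    }
    return open_tombstone{prev.end, *prev.end_open_marker};
}
```

Skipping to a block whose predecessor carries a marker yields a tombstone
starting at the predecessor's end position; any other skip resets it.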
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
We should not access the internal object stored in std::optional when
passing the end_open_marker, especially since it can be disengaged.
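A minimal sketch of the safer pattern, assuming a hypothetical `consumer`
type: the optional itself is passed along and only the callee looks inside
after checking that it is engaged.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct deletion_time { int64_t marked_for_delete_at; };  // hypothetical stand-in

// Hypothetical consumer: it receives the optional as a whole and decides
// what to do when it is disengaged.
struct consumer {
    bool has_open_tombstone = false;
    int64_t open_since = 0;

    void set_or_reset_open_tombstone(const std::optional<deletion_time>& marker) {
        if (marker) {
            has_open_tombstone = true;
            open_since = marker->marked_for_delete_at;
        } else {
            has_open_tombstone = false;  // disengaged: reset, never dereference
        }
    }
};
```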
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This method tells whether we will need to skip to reach the input
position or not.
It can be used for skipping with index when reading SSTables 3.x because
we only want to set/reset the open range tombstone bound when we
actually move to another promoted index block.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Prior to this fix, the end_open_marker was only accessible as a
plain deletion_time structure. Now it also contains the start position
of a promoted index block so that it can be used for setting range
tombstone open bound.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
so we need to create it.
Fixes #3738
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.
This is needed so that the accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to an assertion failure in
~flush_memory_accounter.
Fixes #3625 (hopefully).
Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
"
This series extends the query statefulness, introduced by f8613a841 to
point queries, to range scans as well. This means that queriers will be
saved and reused for range scans too.
This series builds heavily on the infrastructure introduced by stateful
point queries, namely the querier object and the querier_cache. It also
builds on another critical piece of infrastructure, the
multishard_combining_reader, introduced by 2d126a79b.
To make the range scan on a given node suspendable and resumable we move
away from the current code in
`storage_proxy::query_nonsingular_mutations_locally()` and use a
multishard_combining_reader to execute the read. When the page is filled
this reader is dismantled and its shard readers are saved in the
querier cache.
There are of course a lot more details to it but this is the gist of it.
Tests: unit(release, debug), dtest(paging_test.py, paging_additional_test.py)
"
* '1865/range-scans/v7.1' of https://github.com/denesb/scylla: (33 commits)
query_pagers: generate query_uuid for range-scans as well
storage_proxy: use preferred/last replicas
storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent
db::consistency_level::filter_for_query() add preferred_endpoints
storage_proxy: use query_mutations_from_all_shards() for range scans
tests: add unit test for multishard_mutation_query()
tests/mutation_assertions.hh: add missing include
multishard_mutation_query: add badness counters
database: add query_mutations_on_all_shards()
mutation_compactor: add detach_state()
flat_mutation_reader: add unpop_mutation_fragment()
Move reconcilable_result_builder declaration to mutation_query.hh
mutation_source_test: add an additional REQUIRE()
mutation: add missing assert to mutation from reader
querier: add shard_mutation_querier
querier: prepare for multi-ranges
tests/querier_cache: add tests specific for multiple entry-types
querier: split querier into separate data and mutation querier types
querier: move consume_page logic into a free function
querier: move all matching related logic into free functions
...
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves
The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its buffer and in the buffers
of any of its intermediate readers are pushed back into the originating
shard reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together these can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.
The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
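How the first pair of counters could be updated while fragments are pushed
back can be sketched like this. The counter names follow the list above;
the `multishard_query_stats` struct, the `unpop_all()` helper and the
string-based `fragment` are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Counter names follow the commit message; the struct is hypothetical.
struct multishard_query_stats {
    uint64_t unpopped_fragments = 0;
    uint64_t unpopped_bytes = 0;
    uint64_t failed_reader_stops = 0;
    uint64_t failed_reader_saves = 0;
};

// Stand-in for a mutation fragment: only its payload size matters here.
using fragment = std::string;

// Pushes unconsumed fragments back to the front of the shard reader's
// buffer, preserving their order, and accounts for the undone work.
void unpop_all(std::deque<fragment>& shard_buffer,
               std::vector<fragment> unconsumed,
               multishard_query_stats& stats) {
    // Walk backwards so that the first unconsumed fragment ends up in front.
    for (auto it = unconsumed.rbegin(); it != unconsumed.rend(); ++it) {
        stats.unpopped_fragments += 1;
        stats.unpopped_bytes += it->size();
        shard_buffer.push_front(std::move(*it));
    }
}
```

Tracking both values separately distinguishes many small fragments from a
few huge ones, as noted above.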
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
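The client-visible contract (same query_uuid on every page, start position
matching the previous stop position) can be illustrated with a toy cache.
Everything here is a simplification: `reader_cache` and `saved_reader` are
hypothetical, the uuid is an integer, and dropping a mismatching entry is
an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-ins: a uuid reduced to an integer, and a saved reader
// reduced to the position at which it stopped.
using query_uuid = uint64_t;
struct saved_reader { uint64_t stop_position; };

class reader_cache {
    std::map<query_uuid, saved_reader> _entries;
public:
    void save(query_uuid id, saved_reader r) { _entries.insert_or_assign(id, r); }

    // Returns the saved reader only if the new page starts where the
    // previous one stopped. The entry is removed from the cache either way.
    std::optional<saved_reader> lookup(query_uuid id, uint64_t start_position) {
        auto it = _entries.find(id);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        saved_reader r = it->second;
        _entries.erase(it);
        if (r.stop_position != start_position) {
            return std::nullopt;  // position mismatch: the reader is unusable
        }
        return r;
    }
};
```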
Allow the state of the compaction to be detached. The detached state is
a set of mutation fragments, which if replayed through a new compactor
object will result in the latter being in the same state as the previous
one was.
This allows for storing the compaction state in the compacted reader by
using `unpop_mutation_fragment()` to push back the fragments that
comprise the detached state into the reader. This way, if a new
compaction object is created it can just consume the reader and continue
where the previous compaction left off.
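The replay property can be shown on a toy model. The `toy_compactor` below
is hypothetical and tracks only the current partition and one open range
tombstone; unlike a real detach it copies the state instead of moving it.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified fragment and compactor.
struct fragment {
    enum class kind { partition_start, range_tombstone_open, row } type;
    std::string key;
};

class toy_compactor {
    std::optional<std::string> _partition;       // current partition key
    std::optional<std::string> _open_tombstone;  // currently active tombstone
public:
    void consume(const fragment& f) {
        switch (f.type) {
        case fragment::kind::partition_start:
            _partition = f.key;
            _open_tombstone.reset();
            break;
        case fragment::kind::range_tombstone_open:
            _open_tombstone = f.key;
            break;
        case fragment::kind::row:
            break;  // rows do not change the state kept in this sketch
        }
    }

    // Fragments that, replayed through a fresh compactor, rebuild this state.
    std::vector<fragment> detach_state() const {
        std::vector<fragment> state;
        if (_partition) {
            state.push_back({fragment::kind::partition_start, *_partition});
        }
        if (_open_tombstone) {
            state.push_back({fragment::kind::range_tombstone_open, *_open_tombstone});
        }
        return state;
    }

    bool same_state_as(const toy_compactor& o) const {
        return _partition == o._partition && _open_tombstone == o._open_tombstone;
    }
};
```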
This is the inverse of `pop_mutation_fragment()`. Allow fragments to be
pushed back into the buffer of the reader to undo a previous consumption
of the fragments.
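A minimal sketch of the pop/unpop symmetry, using a hypothetical
`toy_reader` with a deque as its buffer and strings as fragments:

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

using fragment = std::string;  // hypothetical stand-in for a mutation fragment

class toy_reader {
    std::deque<fragment> _buffer;
public:
    void push_mutation_fragment(fragment f) { _buffer.push_back(std::move(f)); }

    // Precondition: the buffer is not empty.
    fragment pop_mutation_fragment() {
        fragment f = std::move(_buffer.front());
        _buffer.pop_front();
        return f;
    }

    // Inverse of pop: the fragment becomes the next one to be popped again.
    void unpop_mutation_fragment(fragment f) { _buffer.push_front(std::move(f)); }

    bool is_buffer_empty() const { return _buffer.empty(); }
};
```

When several fragments are pushed back, they have to be unpopped in the
reverse of the order in which they were popped to restore the buffer.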
test_streamed_mutation_forwarding_is_consistent_with_slicing already has
a REQUIRE() for the mutation read with the slicing reader. Add another
one for the forwarding reader. This makes it more consistent and also
helps finding problems with either the forwarding or slicing reader.
read_mutation_from_flat_mutation_reader's internal adapter can build a
single mutation only and hence can consume only a single partition.
If more than one partition is pushed down from the producer, the
adapter will very likely crash. To avoid unnecessary investigations add
an assert() to fail early and make it clear what the real problem is.
All other consume_ methods have an assert() already for their
invariants so this is just following suit.
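The fail-early idea can be sketched on a simplified adapter. The
`single_mutation_builder` and `mutation` types are hypothetical; only the
placement of the assert() mirrors what is described above.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified mutation and adapter.
struct mutation {
    std::string key;
    std::vector<std::string> rows;
};

class single_mutation_builder {
    std::optional<mutation> _m;
public:
    void consume_new_partition(std::string key) {
        // Only one mutation can be built; a second partition is a caller bug.
        assert(!_m && "adapter can consume only a single partition");
        _m = mutation{std::move(key), {}};
    }

    void consume(std::string row) {
        assert(_m && "row received before partition start");
        _m->rows.push_back(std::move(row));
    }

    std::optional<mutation> consume_end_of_stream() { return std::move(_m); }
};
```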
The querier to be used for saving shard readers belonging to a
multishard range scan. This querier doesn't provide a `consume_page`
method as it doesn't support reading from it directly. It is more
of a storage to allow caching the reader and any objects it depends on.
In the next patch a querier will be added that reads multiple ranges as
opposed to a single range that data and mutation queriers read.
To keep `querier_cache` code seamless regarding this difference change all
range-matching logic to work in terms of `dht::partition_ranges_view`.
This allows for a cheap and seamless way of having a single code-base for
the insert/lookup logic. Code actually matching ranges is updated to be
able to handle both singular and multi-ranges while maintaining backward
compatibility.
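The idea of a view that covers both cases can be sketched as below. This is
not the actual `dht::partition_ranges_view`; the class, the integer-based
`partition_range` and `ranges_match()` are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct partition_range { int start; int end; };  // hypothetical stand-in

inline bool operator==(const partition_range& a, const partition_range& b) {
    return a.start == b.start && a.end == b.end;
}

// Non-owning view: it never copies the ranges it refers to, so it is cheap
// to construct from either a single range or a vector of ranges.
class partition_ranges_view {
    const partition_range* _data;
    std::size_t _size;
public:
    partition_ranges_view(const partition_range& r) : _data(&r), _size(1) {}
    partition_ranges_view(const std::vector<partition_range>& rs)
        : _data(rs.data()), _size(rs.size()) {}
    const partition_range* begin() const { return _data; }
    const partition_range* end() const { return _data + _size; }
    std::size_t size() const { return _size; }
};

// One matching routine serves both singular and multi-range queriers.
bool ranges_match(partition_ranges_view a, partition_ranges_view b) {
    return std::equal(a.begin(), a.end(), b.begin(), b.end());
}
```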
Instead of hiding what compaction method the querier uses (and only
exposing it via `can_be_used_for_page()` rejections) make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.
Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.
Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
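A sketch of the split, under the assumption that the two kinds are exposed
as aliases of one template. `toy_querier_cache` and the alias names are
illustrative; the point is that a lookup in one index never sees entries of
the other kind.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

enum class emit_only_live_rows { no, yes };

// The shared logic lives in one template; the two instantiations are
// distinct types, so they cannot be mixed up by accident.
template <emit_only_live_rows OnlyLive>
struct querier {
    uint64_t position = 0;
    static constexpr bool only_live = (OnlyLive == emit_only_live_rows::yes);
};

// Alias names are assumptions of this sketch.
using data_querier = querier<emit_only_live_rows::yes>;
using mutation_querier = querier<emit_only_live_rows::no>;

class toy_querier_cache {
    std::map<uint64_t, data_querier> _data_index;
    std::map<uint64_t, mutation_querier> _mutation_index;

    template <typename Q>
    static std::optional<Q> take(std::map<uint64_t, Q>& index, uint64_t key) {
        auto it = index.find(key);
        if (it == index.end()) {
            return std::nullopt;
        }
        Q q = std::move(it->second);
        index.erase(it);
        return q;
    }
public:
    void insert(uint64_t key, data_querier q) { _data_index.insert_or_assign(key, q); }
    void insert(uint64_t key, mutation_querier q) { _mutation_index.insert_or_assign(key, q); }
    std::optional<data_querier> lookup_data_querier(uint64_t key) {
        return take(_data_index, key);
    }
    std::optional<mutation_querier> lookup_mutation_querier(uint64_t key) {
        return take(_mutation_index, key);
    }
};
```

A lookup of the wrong kind simply misses and leaves the stored querier in
place, which matches the behaviour change described above.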
In preparation for the now single querier being split into multiple more
specialized ones, make it possible for multiple queriers to share the
same implementation. Also, the code can now be reused by outside code as
well, not just queriers.
So that they can be used for multiple querier classes easily, without
inheritance. The functions are not visible from the header.
Also update the comments on `querier` w.r.t. the disappeared
checking functions. Change the language to be more general. In practice
these checks are never done by client code, instead they are done by the
`querier_cache`.
In preparation for introducing support for multiple entry types in the
querier_cache, move all insert/lookup related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successful or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
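What an optional-returning lookup buys at the call site can be sketched as
follows. `toy_cache`, the string-based `querier` and `get_or_create()` are
hypothetical; they only illustrate the shape of the interface.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

struct querier { std::string range; };  // hypothetical stand-in

class toy_cache {
    std::unordered_map<int, querier> _entries;
public:
    void insert(int key, querier q) { _entries.insert_or_assign(key, std::move(q)); }

    // A miss is reported as nullopt; the caller decides how to create a
    // replacement, so no create function has to be passed in.
    std::optional<querier> lookup(int key) {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        querier q = std::move(it->second);
        _entries.erase(it);
        return q;
    }
};

// Call site: hit-or-miss is directly visible and creation stays local.
querier get_or_create(toy_cache& cache, int key, bool& was_cached) {
    auto q = cache.lookup(key);
    was_cached = q.has_value();
    return q ? std::move(*q) : querier{"fresh"};
}
```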
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.
The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
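A synchronous sketch of the dismantler hook. The real functor is described
as receiving a future to a `stopped_foreign_reader`; here the future is
left out and a hypothetical `stopped_shard_reader` value is passed
directly, so this only shows when and with what the functor is called.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using fragment = std::string;  // hypothetical stand-in

// What the dismantler receives per shard in this sketch.
struct stopped_shard_reader {
    std::size_t shard;
    std::deque<fragment> unconsumed;
};

class toy_multishard_reader {
    std::vector<std::deque<fragment>> _shard_buffers;
    std::function<void(stopped_shard_reader)> _dismantler;
public:
    toy_multishard_reader(std::vector<std::deque<fragment>> buffers,
                          std::function<void(stopped_shard_reader)> dismantler)
        : _shard_buffers(std::move(buffers)), _dismantler(std::move(dismantler)) {}

    toy_multishard_reader(const toy_multishard_reader&) = delete;
    toy_multishard_reader& operator=(const toy_multishard_reader&) = delete;

    // On destruction every shard reader is handed to the dismantler together
    // with whatever fragments it had not yet delivered.
    ~toy_multishard_reader() {
        if (!_dismantler) {
            return;
        }
        for (std::size_t s = 0; s < _shard_buffers.size(); ++s) {
            _dismantler(stopped_shard_reader{s, std::move(_shard_buffers[s])});
        }
    }
};
```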
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameter that is
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
"
After we fixed the reloading flow, situations became possible in which items are no longer cached but
are still held in the underlying loading_shared_values object. Since loading_cache::size() returns
the size of its loading_shared_values object and loading_cache::begin()/end()/find() return
iterators based on loading_shared_values iterators, these APIs may return very weird values, e.g.
size() may return the same value after one of the items has been removed using the remove(key) API.
This series fixes this by switching the above-mentioned APIs to work on top of the lru_list object instead
of loading_shared_values.
"
* 'loading_cache_fix_api_semantics-v1' of https://github.com/vladzcloudius/scylla:
loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
loading_cache: make size() return the size of lru_list instead of loading_shared_values
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/),
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host, including libc and the dynamic
linker (ld.so).
We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.
With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.
The packages are created by running
ninja build/release/scylla-package.tar
or
ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.
This may create weird situations like this:
<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
std::cout << e << std::endl;
}
<all 10 entries are printed, including the one for "key1">
In order to avoid such situations we are going to make loading_cache::iterator
a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator,
because lru_list contains entries only for cached items.
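The effect can be reproduced on a toy model. `toy_loading_cache` is
hypothetical: `_shared` plays the role of loading_shared_values, `_lru` the
role of lru_list, and the `held_by_reload` flag simulates a reload that
still holds the value.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <memory>
#include <string>

class toy_loading_cache {
    std::map<std::string, std::shared_ptr<int>> _shared;  // may outlive cache entries
    std::list<std::string> _lru;                          // exactly the cached keys
public:
    void put(const std::string& key, int value) {
        _shared[key] = std::make_shared<int>(value);
        _lru.remove(key);
        _lru.push_front(key);
    }

    // A concurrent reload may keep the shared value alive after removal.
    void remove(const std::string& key, bool held_by_reload) {
        _lru.remove(key);
        if (!held_by_reload) {
            _shared.erase(key);
        }
    }

    // Size and iteration are based on the LRU list, not on the shared values.
    std::size_t size() const { return _lru.size(); }
    std::list<std::string>::const_iterator begin() const { return _lru.begin(); }
    std::list<std::string>::const_iterator end() const { return _lru.end(); }

    std::size_t shared_size() const { return _shared.size(); }  // for comparison
};
```

After a removal that races with a reload, `shared_size()` still counts the
removed key while `size()` and iteration no longer do.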
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The reloading flow may hold the items in the underlying loading_shared_values
after they have been removed (e.g. via the remove(key) API), so loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size(), on the other hand, does.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar 12f18ce...5712816 (6):
> tests: add signal_test to test list
> Merge "Enhancements for memory_output_stream" from Paweł
> seastar-addr2line: don't print an empty line between backtrace lines
> seastar-addr2line: add --verbose option
> seastar-addr2line: make prefix matching non-greedy
> future: make available() const
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we try to
load SSTables from the upload directory after we have already moved
them out, and fail.
Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>