Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.
It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.
This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.
Fixes #3736.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.
Usually, the gap in which this happens is short, because executor
reference holders time out quickly as well. It's not always the case
though. For instance, a local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.
Fixes #3734.
Another problem is that all reads which received CL responses are
accounted as background, until all replicas respond, but if such read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.
Fixes #3745.
This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.
Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
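The rearranged accounting can be pictured with a minimal sketch. The names `read_stats` and `foreground_read_guard` are hypothetical and not taken from the patch; the guard's lifetime stands in for the window between the start of a read and the resolution of its result promise.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stats holder; only the foreground counter is tracked directly.
struct read_stats {
    uint64_t foreground_reads = 0;
};

// A read counts as foreground from creation until its result is resolved,
// regardless of how long the executor object itself stays alive.
class foreground_read_guard {
    read_stats* _stats;
public:
    explicit foreground_read_guard(read_stats& s) : _stats(&s) {
        ++_stats->foreground_reads;
    }
    foreground_read_guard(const foreground_read_guard&) = delete;
    foreground_read_guard& operator=(const foreground_read_guard&) = delete;

    // Called when the result promise is resolved; safe to call more than once.
    void resolve() {
        if (_stats) {
            --_stats->foreground_reads;
            _stats = nullptr;
        }
    }

    ~foreground_read_guard() { resolve(); }
};
```

With this shape, a read that still needs reconciliation simply stays foreground because `resolve()` has not been called yet.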
"
This patchset adds proper support for sliced reads of partitions
containing range tombstones.
Given the SSTables 3.x representation of range tombstones by separate
start and end markers, we refer to the index for the information about
the currently opened range tombstone, if any, when skipping to the next
promoted index block.
Note that for this we have to take the promoted index block immediately
preceding the one we are jumping to.
Tests: unit {release}
"
* 'projects/sstables-30/range-tombstones-slicing/v3' of https://github.com/argenet/scylla:
tests: Test filtering and forwarding on a partition with interleaved rows and RTs.
tests: Add tests for reading wide partitions with range tombstones.
sstables: Support slicing for range tombstones.
sstables: Set/reset range tombstone start from end open marker.
sstables: Fix end_open_marker population in promoted index blocks.
sstables: Add need_skip() helper to data_consume_context.
sstables: For end_open_marker, return both position in partition and deletion time.
When we skip through a wide partition using promoted index, we may land
at a position that lies in the middle of a range tombstone, so we need to
be aware of it. For this, we check if the previous promoted block has an
end open marker and either set the range tombstone start using it or
reset if missing.
Note several things about the implementation.
Firstly, we have to peek back at the previous promoted index block for the
end open marker, and so we have to always preserve one more promoted
index block when we read the next batch so that we can still access it.
Secondly, we use the previous promoted block end position to build
position in partition for the range tombstone start.
Lastly, we don't have a notion of end open marker in older consumers
that work with SSTables of ka/la formats so we only call the
corresponding methods if the consumer supports them.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
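One possible way to call a method only on consumers that support it is compile-time detection. The sketch below is an assumption about the mechanism, not a description of the patch; `set_range_tombstone_start`, `old_consumer` and `new_consumer` are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Detection idiom (C++17): true only if T has set_range_tombstone_start(int).
template <typename T, typename = void>
struct has_set_range_tombstone_start : std::false_type {};

template <typename T>
struct has_set_range_tombstone_start<
    T, std::void_t<decltype(std::declval<T&>().set_range_tombstone_start(0))>>
    : std::true_type {};

// Stand-in for a consumer of older formats: no notion of an end open marker.
struct old_consumer {};

// Stand-in for a consumer of the newer format.
struct new_consumer {
    int start = -1;
    void set_range_tombstone_start(int pos) { start = pos; }
};

// Calls the method only when the consumer type provides it.
// Returns whether the call was made.
template <typename Consumer>
bool maybe_set_start(Consumer& c, int pos) {
    if constexpr (has_set_range_tombstone_start<Consumer>::value) {
        c.set_range_tombstone_start(pos);
        return true;
    } else {
        (void)c;
        (void)pos;
        return false;
    }
}
```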
We should not access the internal object stored in std::optional when
passing the end_open_marker, especially since it can be disengaged.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
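A minimal sketch of the safe pattern, assuming a receiver that accepts the optional itself instead of a dereferenced value. `deletion_time_sketch` and `toy_consumer` are hypothetical stand-ins, not the types used in the patch.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the marker payload.
struct deletion_time_sketch {
    int local_deletion_time;
    long marked_for_delete_at;
};

class toy_consumer {
    std::optional<deletion_time_sketch> _open_tombstone;
public:
    // Takes the optional as-is, so a disengaged marker is a valid input
    // and the caller never has to dereference it.
    void set_end_open_marker(const std::optional<deletion_time_sketch>& marker) {
        _open_tombstone = marker;
    }
    bool has_open_tombstone() const { return _open_tombstone.has_value(); }
};
```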
This method tells whether we will need to skip to reach the input
position or not.
It can be used for skipping with index when reading SSTables 3.x because
we only want to set/reset the open range tombstone bound when we
actually move to another promoted index block.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When a joining node announces its join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible that the joining node hasn't learnt the tokens of other nodes,
which causes errors like the one below:
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from 127.0.4.1#0:
std::runtime_error (sorted_tokens is empty in first_token_index!)
To fix, wait for the token range setup before announcing the join
status.
Fixes: #3382
Tests: 60 runs of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test
Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
Change the validity timeout from 1s to 1h in order to avoid false alarms
on busy systems: for a short value there is a chance that
(loading_cache.size() == num_loaders) check is going to run after some elements
of the cache have already been evicted.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
Prior to this fix, the end_open_marker was only accessible as a
plain deletion_time structure. Now it also contains the start position
of a promoted index block so that it can be used for setting range
tombstone open bound.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When slice::is_satisfied_by() restriction check is performed
on raw data represented as bytes, it should always use a regular
type comparator, not a reversed one. Reversed types are used to
preserve descending clustering order, but comparison with constants
should be done with a regular underlying type comparator (for x < 1
to actually mean 'less than 1' instead of 'greater than 1, because
the clustering order is reversed').
Fixes #3741
Message-Id: <3e25fc66688c9253287f2c4f31ede8339b9bbe23.1535981852.git.sarna@scylladb.com>
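The difference between the two comparators can be shown with a small sketch over plain integers. All function names here are hypothetical; the real check operates on serialized bytes and type objects.

```cpp
#include <cassert>

// Three-way comparison of the underlying (regular) type.
inline int compare_regular(int a, int b) { return (a > b) - (a < b); }

// A reversed type flips the result to preserve descending clustering order.
inline int compare_reversed(int a, int b) { return compare_regular(b, a); }

// "value < bound" evaluated with the regular comparator: the intended meaning.
inline bool satisfies_less_than(int value, int bound) {
    return compare_regular(value, bound) < 0;
}

// The same restriction evaluated with the reversed comparator: it accepts
// values greater than the bound, which is the behaviour being fixed.
inline bool satisfies_less_than_with_reversed(int value, int bound) {
    return compare_reversed(value, bound) < 0;
}
```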
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
so we need to create it.
Fixes #3738
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.
It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.
Fixes #3625 (hopefully).
Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
"
This series extends the query statefulness, introduced by f8613a841 to
point queries, to range scans as well. This means that queriers will be
saved and reused for range scans too.
This series builds heavily on the infrastructure introduced by stateful
point queries, namely the querier object and the querier_cache. It also
builds on another critical piece of infrastructure, the
multishard_combining_reader, introduced by 2d126a79b.
To make the range scan on a given node suspendable and resumable we move
away from the current code in
`storage_proxy::query_nonsingular_mutations_locally()` and use a
multishard_combining_reader to execute the read. When the page is filled
this reader is dismantled and its shard readers are saved in the
querier cache.
There are of course a lot more details to it but this is the gist of it.
Tests: unit(release, debug), dtest(paging_test.py, paging_additional_test.py)
"
* '1865/range-scans/v7.1' of https://github.com/denesb/scylla: (33 commits)
query_pagers: generate query_uuid for range-scans as well
storage_proxy: use preferred/last replicas
storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent
db::consistency_level::filter_for_query() add preferred_endpoints
storage_proxy: use query_mutations_from_all_shards() for range scans
tests: add unit test for multishard_mutation_query()
tests/mutation_assertions.hh: add missing include
multishard_mutation_query: add badness counters
database: add query_mutations_on_all_shards()
mutation_compactor: add detach_state()
flat_mutation_reader: add unpop_mutation_fragment()
Move reconcilable_result_builder declaration to mutation_query.hh
mutation_source_test: add an additional REQUIRE()
mutation: add missing assert to mutation from reader
querier: add shard_mutation_querier
querier: prepare for multi-ranges
tests/querier_cache: add tests specific for multiple entry-types
querier: split querier into separate data and mutation querier types
querier: move consume_page logic into a free function
querier: move all matching related logic into free functions
...
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves
The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its buffer and in the buffers
of any of its intermediate readers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together these can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.
The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
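The four counters and their intended reading can be summarized in a short sketch. The struct and its helper methods are hypothetical; only the counter names mirror the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-shard stats holder mirroring the four counters.
struct multishard_query_stats {
    uint64_t unpopped_fragments = 0;   // (1) fragments moved back
    uint64_t unpopped_bytes = 0;       // (2) bytes moved back
    uint64_t failed_reader_stops = 0;  // (3) stopping a shard reader failed
    uint64_t failed_reader_saves = 0;  // (4) inserting into the cache failed

    // Size and quantity are tracked separately on purpose.
    void on_fragment_unpopped(std::size_t bytes) {
        ++unpopped_fragments;
        unpopped_bytes += bytes;
    }
    void on_reader_stop_failed() { ++failed_reader_stops; }
    void on_reader_save_failed() { ++failed_reader_saves; }

    // The second pair is expected to stay at zero.
    bool saving_healthy() const {
        return failed_reader_stops == 0 && failed_reader_saves == 0;
    }
};
```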
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
Allow the state of the compaction to be detached. The detached state is
a set of mutation fragments, which if replayed through a new compactor
object will result in the latter being in the same state as the previous
one was.
This allows for storing the compaction state in the compacted reader by
using `unpop_mutation_fragment()` to push back the fragments that
comprise the detached state into the reader. This way, if a new
compaction object is created it can just consume the reader and continue
where the previous compaction left off.
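The idea of a detachable state can be illustrated with a toy compactor. This is a simplified sketch under assumed semantics: the fragment model (`kind`, `fragment`) and the liveness rule are invented for illustration and do not reflect the real `mutation_compactor`.

```cpp
#include <cassert>
#include <optional>
#include <vector>

enum class kind { partition_start, tombstone, row };

// Toy fragment: `value` is a partition key or a timestamp, depending on kind.
struct fragment {
    kind k;
    int value;
};

class toy_compactor {
    std::optional<int> _partition;  // currently open partition, if any
    std::optional<int> _tombstone;  // timestamp of the active tombstone
    int _live_rows = 0;             // output, not part of the detached state
public:
    void consume(const fragment& f) {
        switch (f.k) {
        case kind::partition_start:
            _partition = f.value;
            _tombstone.reset();
            break;
        case kind::tombstone:
            _tombstone = f.value;
            break;
        case kind::row:
            // A row is live if no tombstone covers its timestamp.
            if (!_tombstone || f.value > *_tombstone) {
                ++_live_rows;
            }
            break;
        }
    }

    // Fragments which, replayed through a fresh compactor, rebuild the state.
    std::vector<fragment> detach_state() const {
        std::vector<fragment> state;
        if (_partition) {
            state.push_back({kind::partition_start, *_partition});
            if (_tombstone) {
                state.push_back({kind::tombstone, *_tombstone});
            }
        }
        return state;
    }

    int live_rows() const { return _live_rows; }
};
```

Replaying the detached fragments before the remaining input lets a new compactor treat later rows exactly as the old one would have.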
This is the inverse of `pop_mutation_fragment()`. Allow fragments to be
pushed back into the buffer of the reader to undo a previous consumption
of the fragments.
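A minimal sketch of the pop/unpop pair, using a toy reader whose buffer holds strings instead of mutation fragments. `toy_reader` is hypothetical; only the two method names come from the text above.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

class toy_reader {
    std::deque<std::string> _buffer;
public:
    explicit toy_reader(std::deque<std::string> fragments)
        : _buffer(std::move(fragments)) {}

    bool is_buffer_empty() const { return _buffer.empty(); }

    // Removes and returns the fragment at the front of the buffer.
    std::string pop_mutation_fragment() {
        assert(!_buffer.empty());
        std::string f = std::move(_buffer.front());
        _buffer.pop_front();
        return f;
    }

    // Inverse of pop: puts a fragment back at the front of the buffer.
    void unpop_mutation_fragment(std::string f) {
        _buffer.push_front(std::move(f));
    }
};
```

To restore the original order, fragments have to be pushed back in the reverse of the order in which they were popped.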
test_streamed_mutation_forwarding_is_consistent_with_slicing already has
a REQUIRE() for the mutation read with the slicing reader. Add another
one for the forwarding reader. This makes it more consistent and also
helps finding problems with either the forwarding or slicing reader.
read_mutation_from_flat_mutation_reader's internal adapter can build a
single mutation only and hence can consume only a single partition.
If more than one partition is pushed down from the producer the
adapter will very likely crash. To avoid unnecessary investigations add
an assert() to fail early and make it clear what the real problem is.
All other consume_ methods have an assert() already for their
invariants so this is just following suit.
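The kind of early failure described here could look like the sketch below. `single_mutation_adapter` is a hypothetical stand-in for the internal adapter, reduced to the single-partition invariant.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Toy adapter that can build exactly one mutation.
class single_mutation_adapter {
    std::optional<std::string> _partition_key;
    int _rows = 0;
public:
    void consume_new_partition(std::string key) {
        // Fail early and clearly if a second partition is pushed down.
        assert(!_partition_key && "adapter can consume only a single partition");
        _partition_key = std::move(key);
    }
    void consume_row() {
        assert(_partition_key && "row consumed before partition start");
        ++_rows;
    }
    bool has_partition() const { return _partition_key.has_value(); }
    int rows() const { return _rows; }
};
```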
The querier to be used for saving shard readers belonging to a
multishard range scan. This querier doesn't provide a `consume_page`
method as it doesn't support reading from it directly. It is more
of a storage to allow caching the reader and any objects it depends on.
In the next patch a querier will be added that reads multiple ranges as
opposed to a single range that data and mutation queriers read.
To keep `querier_cache` code seamless regarding this difference change all
range-matching logic to work in terms of `dht::partition_ranges_view`.
This allows for a cheap and seamless way of having a single code-base for
the insert/lookup logic. Code actually matching ranges is updated to be
able to handle both singular and multi-ranges while maintaining backward
compatibility.
Instead of hiding what compaction method the querier uses (and only
exposing it via rejection in `can_be_used_for_page()`), make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.
Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.
Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
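A rough sketch of the split, assuming that the data querier corresponds to `emit_only_live_rows::yes` and the mutation querier to `emit_only_live_rows::no`. The cache class, its integer key (standing in for the query UUID) and its method names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

enum class emit_only_live_rows { no, yes };

// One implementation, two distinct types selected by the template argument.
template <emit_only_live_rows OnlyLive>
struct querier {
    std::string range;
};

using data_querier = querier<emit_only_live_rows::yes>;
using mutation_querier = querier<emit_only_live_rows::no>;

class toy_querier_cache {
    // Separate indexes: a lookup can never return the wrong kind.
    std::map<int, data_querier> _data;
    std::map<int, mutation_querier> _mutation;

    template <typename Q>
    static std::optional<Q> take(std::map<int, Q>& index, int key) {
        auto it = index.find(key);
        if (it == index.end()) {
            return std::nullopt;
        }
        Q q = std::move(it->second);
        index.erase(it);
        return q;
    }
public:
    void insert(int key, data_querier q) { _data.emplace(key, std::move(q)); }
    void insert(int key, mutation_querier q) { _mutation.emplace(key, std::move(q)); }

    std::optional<data_querier> lookup_data_querier(int key) {
        return take(_data, key);
    }
    std::optional<mutation_querier> lookup_mutation_querier(int key) {
        return take(_mutation, key);
    }
};
```

A querier saved under one kind is simply not found by a lookup of the other kind, and it stays in the cache rather than being dropped.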
In preparation for the now single querier being split into multiple, more
specialized ones, make it possible for multiple queriers to share the
same implementation. Also, the code can now be reused by outside code as
well, not just queriers.
So that they can be used for multiple querier classes easily, without
inheritance. The functions are not visible from the header.
Also update the comments on `querier` w.r.t. the removed
checking functions. Change the language to be more general. In practice
these checks are never done by client code, instead they are done by the
`querier_cache`.
In preparation for introducing support for multiple entry types in the
querier_cache, move all insert/lookup related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successful or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
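A simplified sketch of what an optional-returning lookup makes possible at the call site. `toy_querier`, `toy_cache` and `get_or_create` are hypothetical, and the integer key stands in for the query UUID.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

struct toy_querier {
    std::string range;
    bool reused = false;
};

class toy_cache {
    std::map<int, toy_querier> _entries;
public:
    void insert(int key, toy_querier q) {
        _entries.insert_or_assign(key, std::move(q));
    }
    // A miss is reported as an empty optional; no creation callback needed.
    std::optional<toy_querier> lookup(int key) {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        toy_querier q = std::move(it->second);
        _entries.erase(it);
        return q;
    }
};

// The caller decides what to do on a miss, and can tell hits from misses.
inline toy_querier get_or_create(toy_cache& cache, int key, const std::string& range) {
    if (auto q = cache.lookup(key)) {
        q->reused = true;
        return std::move(*q);
    }
    return toy_querier{range, false};
}
```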
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.
The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
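The dismantler idea can be sketched synchronously. In this simplified, hypothetical version the functor receives the stopped reader's unconsumed fragments directly instead of a future, and strings stand in for mutation fragments.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for what the dismantler receives per shard.
struct stopped_shard_reader {
    std::vector<std::string> unconsumed_fragments;
};

class toy_multishard_reader {
public:
    using dismantler = std::function<void(std::size_t shard, stopped_shard_reader)>;
private:
    std::vector<std::vector<std::string>> _shard_buffers;
    dismantler _dismantle;
public:
    toy_multishard_reader(std::vector<std::vector<std::string>> buffers, dismantler d)
        : _shard_buffers(std::move(buffers)), _dismantle(std::move(d)) {}

    // On destruction, hand every shard reader's leftovers to the dismantler.
    ~toy_multishard_reader() {
        if (!_dismantle) {
            return;
        }
        for (std::size_t shard = 0; shard < _shard_buffers.size(); ++shard) {
            _dismantle(shard, stopped_shard_reader{std::move(_shard_buffers[shard])});
        }
    }
};
```

A client that wants to save the shard readers would collect them inside the functor.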
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that are
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
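To illustrate what "truly stateless" buys, the sketch below models the factory as a plain function pointer, which only a capture-less lambda can convert to. This is an assumption made for illustration: the real interface and its parameter list differ, and `reader_params` and `toy_reader` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical bundle standing in for the standard reader creation parameters.
struct reader_params {
    std::string schema;
    std::string range;
    std::string slice;
};

struct toy_reader {
    unsigned shard;
    reader_params params;
};

// Everything the factory needs arrives as an argument, so nothing is captured.
using remote_reader_factory = toy_reader (*)(unsigned shard, const reader_params& params);

inline toy_reader make_with(remote_reader_factory factory, unsigned shard,
                            const reader_params& params) {
    return factory(shard, params);
}
```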