Commit Graph

16435 Commits

Author SHA1 Message Date
Botond Dénes
6486d6c8bd storage_proxy: use preferred/last replicas 2018-09-03 10:31:44 +03:00
Botond Dénes
577a06ce1b storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent 2018-09-03 10:31:44 +03:00
Botond Dénes
6e59cee244 db::consistency_level::filter_for_query() add preferred_endpoints
To the second overload (the one without read-repair related params) too.
2018-09-03 10:31:44 +03:00
Botond Dénes
2f66bde26f storage_proxy: use query_mutations_from_all_shards() for range scans 2018-09-03 10:31:44 +03:00
Botond Dénes
6779b63dfe tests: add unit test for multishard_mutation_query() 2018-09-03 10:31:44 +03:00
Botond Dénes
c678b665b4 tests/mutation_assertions.hh: add missing include 2018-09-03 10:31:44 +03:00
Botond Dénes
253407bdc8 multishard_mutation_query: add badness counters
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves

The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate reader's buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. This together can
amount to a substantial amount of fragments.
(1) counts the amount of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.

The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
2018-09-03 10:31:44 +03:00
Botond Dénes
97364c7ad9 database: add query_mutations_on_all_shards()
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
2018-09-03 10:31:44 +03:00
Botond Dénes
33d72efa49 mutation_compactor: add detach_state()
Allow the state of the compaction to be detached. The detached state is
a set of mutation fragments, which if replayed through a new compactor
object will result in the latter being in the same state as the previous
one was.
This allows for storing the compaction state in the compacted reader by
using `unpop_mutation_fragment()` to push back the fragments that
comprise the detached state into the reader. This way, if a new
compaction object is created it can just consume the reader and continue
where the previous compaction left off.
2018-09-03 10:31:44 +03:00
Botond Dénes
48054ed810 flat_mutation_reader: add unpop_mutation_fragment()
This is the inverse of `pop_mutation_fragment()`. Allow fragments to be
pushed back into the buffer of the reader to undo a previous consumtion
of the fragments.
2018-09-03 10:31:44 +03:00
Botond Dénes
3bcd577907 Move reconcilable_result_builder declaration to mutation_query.hh
It will be used by code outside of mutation_partition.cc so it needs to
be public. The definition remains in mutation_partition.cc.
2018-09-03 10:31:44 +03:00
Botond Dénes
b8b34223a4 mutation_source_test: add an additional REQUIRE()
test_streamed_mutation_forwarding_is_consistent_with_slicing already has
a REQUIRE() for the mutation read with the slicing reader. Add another
one for the forwarding reader. This makes it more consistent and also
helps finding problems with either the forwarding or slicing reader.
2018-09-03 10:31:44 +03:00
Botond Dénes
d347866664 mutation: add missing assert to mutation from reader
read_mutation_from_flat_mutation_reader's internal adapter can build a
single mutation only and hence can consume only a single partition.
If more than one partitions are pushed down from the producer the
adaptor will very likely crash. To avoid unnecessary investigations add
an assert() to fail early and make it clear what the real problem is.
All other consume_ methods have an assert() already for their
invariants so this is just following suit.
2018-09-03 10:31:44 +03:00
Botond Dénes
ecb1e79bcc querier: add shard_mutation_querier
The querier to be used for saving shard readers belonging to a
multishard range scan. This querier doesn't provide a `consume_page`
method as it doesn't support reading from it directly. It is more
of a storage to allow caching the reader and any objects it depends on.
2018-09-03 10:31:44 +03:00
Botond Dénes
07cdf766c5 querier: prepare for multi-ranges
In the next patch a querier will be added that reads multiple ranges as
opposed to a single range that data and mutation queriers read.
To keep `querier_cache` code seamless regarding this difference change all
range-matching logic to work in terms of `dht::partition_ranges_view`.
This allows for cheap and seamless way of having a single code-base for
the insert/lookup logic. Code actually matching ranges is updated to be
able to handle both singular and multi-ranges while maintaining backward
compatibility.
2018-09-03 10:31:44 +03:00
Botond Dénes
88a7effd8d tests/querier_cache: add tests specific for multiple entry-types 2018-09-03 10:31:44 +03:00
Botond Dénes
c12008b8cb querier: split querier into separate data and mutation querier types
Instead of hiding what compaction method the querier uses (and only
expose it via rejecting 'can_be_used_for_page()`) make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.

Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.

Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
2018-09-03 10:31:44 +03:00
Botond Dénes
e46251ebf6 querier: move consume_page logic into a free function
In preparation of the now single querier being split into multiple more
specialized ones. Make it possible for the multiple queriers sharing the
same implementation. Also, the code can now be reused by outside code as
well, not just queriers.
2018-09-03 10:31:44 +03:00
Botond Dénes
c53f17ddb8 querier: move all matching related logic into free functions
So that they can be used for multiple querier classes easily, without
inheritance. The functions are not visible from the header.
Also update the comments on `querier` to w.r.t. the disappeared
checking functions. Change the language to be more general. In practice
these checks are never done by client code, instead they are done by the
`querier_cache`.
2018-09-03 10:31:44 +03:00
Botond Dénes
43f464c52d querier: inline querier::current_position() and make it public 2018-09-03 10:31:44 +03:00
Botond Dénes
86a61ded7d querier: s/position/position_view/
Also treat it as a view, that is take it by value in functions,
instead of reference.
2018-09-03 10:31:44 +03:00
Botond Dénes
6e4ec53679 querier: move position outside of querier
In preparation for having multiple querier types that can share code
without inheritance.
2018-09-03 10:31:44 +03:00
Botond Dénes
a172dfec4e querier: move clustering_position_tracker outside of querier
In preparation for having multiple querier types that can share code
without inheritance.
2018-09-03 10:31:44 +03:00
Botond Dénes
7bd955e993 querier_cache: move insert/lookup related logic into free functions
In preparations for introducing support multiple entry types in the
querier_cache move all insert/lookup related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
2018-09-03 10:31:44 +03:00
Botond Dénes
cded477b94 querier: return std::optional<querier> instead of using create_fun()
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successfull or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
2018-09-03 10:31:44 +03:00
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Botond Dénes
867f69b9d1 dht::i_partitioner: add partition_ranges_view 2018-09-03 10:31:44 +03:00
Botond Dénes
a011a9ebf2 mutation_reader: multishard_combining_reader: support custom dismantler
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.

The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
2018-09-03 10:31:44 +03:00
Botond Dénes
f13b878a94 mutation_reader: pass all standard reader params to remote_reader_factory
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that is
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
2018-09-03 10:31:44 +03:00
Botond Dénes
e67c6d9f39 flat_mutation_reader::impl: add protected buffer() member
To allow implementations to access the buffer in a read-only way.
2018-09-03 10:31:44 +03:00
Botond Dénes
8915293257 multishard_combining_reader: fix incorrect comment 2018-09-03 10:31:44 +03:00
Botond Dénes
75d60b0627 docs: add paged-queries.md design doc 2018-09-03 10:31:44 +03:00
Duarte Nunes
6593226849 Merge branch 'loading_cache: fix a consistency of size() and iterators APIs' from Vlad
"
After we fixed reloading flow it enabled situations when items are no longer cached but
still held in the underlying loading_shared_values object. Since loading_cache::size() returns
the size of its loading_shared_values object and loading_cache::begin()/end()/find() are returning
iterators based on loading_shared_values iterators these APIs may return very weird values, e.g.
size() may return the same value after one of the items have been removed using remove(key) API.

This series fixes this by switching mentioned above APIs to work on top of lru_list object instead
of loading_shared_values.
"

* 'loading_cache_fix_api_semantics-v1' of https://github.com/vladzcloudius/scylla:
  loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
  loading_cache: make size() return the size of lru_list instead of loading_shared_values
2018-09-01 11:05:28 +01:00
Avi Kivity
fd8eae50db build: add relocatable package target
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/)
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host; including libc and the dynamic
linker (ld.so).

We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.

With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.

The packages are created by running

    ninja build/release/scylla-package.tar

or

    ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
2018-08-31 23:14:42 +01:00
Vlad Zolotarov
945d26e4ee loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.

This may create weird situations like this:

<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::out << e << std::endl;
}

<all 10 entries are printed, including the one for "key1">

In order to avoid such situations we are going to make the loading_cache::iterator
to be a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator
because lru_list contains entries only for cached items.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 20:56:44 -04:00
Vlad Zolotarov
1e56c7dd58 loading_cache: make size() return the size of lru_list instead of loading_shared_values
reloading flow may hold the items in the underlying loading_shared_values
after they have been removed (e.g. via remove(key) API) thereby loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size() on the other hand - does.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 15:55:30 -04:00
Paweł Dziepak
dbbd664600 Update seastar submodule
* seastar 12f18ce...5712816 (6):
  > tests: add signal_test to test list
  > Merge "Enhancements for memory_output_stream" from Paweł
  > seastar-addr2line: don't print an empty line between backtrace lines
  > seastar-addr2line: add --verbose option
  > seastar-addr2line: make prefix matching non-greedy
  > future: make available() const
2018-08-30 11:41:27 +01:00
Glauber Costa
8dea1b3c61 database: fix directory for information when loading new SSTables from upload dir
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables that are in the upload directory when we already moved
them out and fail.

Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
2018-08-30 10:34:25 +03:00
Nadav Har'El
2f02d006b3 materialized views: more tests
Additional tests for cases surrounding issue #3362, where base rows
disappear (or not) and view rows need to disappear (or not) as well.
These new tests focus on checking that view_updates::do_delete_old_entry()
is correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-2-nyh@scylladb.com>
2018-08-29 14:33:48 +01:00
Nadav Har'El
16a6f76873 materialized views: simplify do_delete_old_entry()
In previous patches, we gave up on an old (and broken) attempt to track
the timestamps of many unselected base-table columns through one row marker
in the view table - and replaced them by "virtual cells", one per unselected
cell.

The do_delete_old_entry() function still contains old code which maintained
that row marker, and is no longer needed. That old code is no only no longer
needed, it also no longer did anything because all columns now appear in
the view (as virtual columns) so the code ignored them when calculating the
row marker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-1-nyh@scylladb.com>
2018-08-29 14:33:41 +01:00
Duarte Nunes
79d796e710 Merge 'Materialized Views: row liveness correction' from Nadav
"
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness - existance or disappearance -
of a view-table row is tied to the liveness of the base table row. And
that, in turn, depends not only on selected columns (base-table columns
SELECTed to also appear in the view) but also on unselected columns.

This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch set we tried to build a single "row marker" in the view column which
tried to summarize the liveness information in all unselected columns.
But this proved unworkable, as explained in issue #3362 and as will be
demonstrated in unit tests at the end of this series.

Because we can't replace several unselected cells by one row marker, what
we do in this series is to add for each for the unselected cells a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.

Fixes #3362
"

* 'virtual-cols-v3' of https://github.com/nyh/scylla:
  Materialized Views: test that virtual columns are not visible
  Materialized Views: unit test reproducing fixed issue #3362
  Materialized Views: no need for elaborate row marker calculations
  Materialized Views: add unselected columns as virtual columns
  Materialized Views: fill virtual columns
  Do not allow selecting a virtual column
  schema: persist "view virtual" columns to a separate system table
  schema: add "view virtual" flag to schema's column_definition
  Add "empty" type name to CQL parser, but only for internal parsing
2018-08-29 14:32:38 +01:00
Paweł Dziepak
6f1c3e6945 Merge "Convert more execution_stages to inherit scheduling_groups" from Avi
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.

This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"

* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
  database: make database's mutation apply stage inherit its scheduling group from the caller
  database: make database::_mutation_query_stage inherit the scheduling group
  database: make database::_data_query_stage inheriting its caller's scheduling_group
  storage_proxy: make _mutate_stage inherit its caller's scheduling_group
2018-08-28 13:49:31 +01:00
Duarte Nunes
f6aadd8077 Merge 'utils::loading_cache: improve reload() robustness' from Vlad
"This series introduces a few improvements related to a reload flow.

From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.

It may also safely evict as many items from the cache as needed."

Fixes #3606

* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: hold a shared_value_ptr to the value when we reload
  utils::loading_cache::on_timer(): remove not needed capture of "this"
  utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
2018-08-28 10:52:20 +01:00
Piotr Sarna
aa2bfc0a71 tests: add multi-column pk test to INSERT JSON case
Refs #3687
Message-Id: <6ba1328549ed701691ca7cbdacc7d6fa72f2c3de.1534171422.git.sarna@scylladb.com>
2018-08-28 11:34:13 +03:00
Piotr Sarna
fa72422baa cql3: fix handling multi-column partition key in INSERT JSON
Multiple column partition keys were previously handled incorrectly,
now the implementation is based on from_exploded instead of
from_singular.

Fixes #3687
Message-Id: <09e0bdb0f1c18d49b9e67c21777d93ba1545a13c.1534171422.git.sarna@scylladb.com>
2018-08-28 11:34:11 +03:00
Avi Kivity
1fd9974b6b Merge "tests/loading_cache_test: Fix flakiness" from Duarte
"
Fix loading_cache_test flakiness by retrying assertions.

Tests: unit(loading_cache_test(debug, release))

Fixes #3723
"

* 'loading-cache-test-flake/v4' of https://github.com/duarten/scylla:
  tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
  tests/loading_cache_test: Use eventually() instead of open-coding it
  tests/mutation_reader_test: Extract eventually_true() to eventually.hh
  tests/cql_test_env: Lift eventually() to its own header file
2018-08-28 09:35:09 +03:00
Takuya ASADA
4a5157857a dist/debian: support package renaming on build script
To automatically rename packages on enterprise release, added package name
prefix as a variable on build_deb.sh.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180828010445.11920-1-syuu@scylladb.com>
2018-08-28 09:25:07 +03:00
Avi Kivity
22396d57c2 Update seastar submodule
* seastar 9bb1611...12f18ce (17):
  > correctly configure I/O Scheduler for usage with the YAML file
  > Added support for user-defined signal handlers
  > Added reactor method to modify blocked_reactor_notify_ms
  > configure.py: Use the user-specified compiler for dialect detection
  > seastar-addr2line: clear current trace when omitting already seen trace
  > seastar-addr2line: fix redirecting output to a file
  > seastar-addr2line: don't require a space before the addresses
  > tests: Ensure test thread is always joined
  > README.md: Add cute badges
  > iotune: adjust num-io-queues recommendation
  > dns: add SRV record lookup
  > reactor: define max_aio_per_queue for C++14
  > reactor,alien: silence GCC warnings
  > core,json,net: silence GCC warnings
  > fstream: "using data_sink_impl::put" to silence gcc warning
  > Merge 'Ensure Seastar compiles in C++14 mode' from Jesse
  > Revert "foreign_ptr: allow waiting for the destruction of the managed ptr"
2018-08-28 09:10:14 +03:00
Tomasz Grabiec
75cde85349 Merge "Support reading range tombstones" from Piotr and Vladimir
Implement and test support for reading range tombstones in SSTables 3.

Does not yet support reads which are using slicing or fast forwarding.

From github.com/scylladb/seastar-dev.git haaawk/sstables3/tombstones_v11:

Piotr Jastrzebski (5):
  sstables: Add consumer_m::consume_range_tombstone
  sstables: Support null columns in ck
  sstables: Support reading range_tombstones
  sstables: Test reading range_tombstones
  sstables: Add test for RT with non-full key

Vladimir Krivopalov (2):
  sstables: Add operator<< overload for bound_kind_m.
  keys: Add clustering_key_prefix::make_full helper.
2018-08-27 20:43:38 +02:00
Duarte Nunes
40044c0460 tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
The `loading_cache_test::test_loading_cache_loading_reloading` test
case is flaky, and fails in both debug and release mode. In an
over-provisioned environment, it's possible that when the reactor
runs, the timers for the `sleep()` and for reloading the
`loading_cache` are both expired, and continuations are scheduled with
an arbitrary order, causing the test to fail.

Fixes #3723

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00