Commit Graph

308 Commits

Author SHA1 Message Date
Benny Halevy
5f513ed28b view_builder: consumer: flush_fragments: close reader on error
Make sure to close the reader created by flush_fragments
if an exception occurs before it's moved to `populate_views`.

Note that it is also ok to close the reader _after_ it has been
moved, in case populate_views itself throws after closing the
reader that was moved it.  For conveience flat_mutation_reader::close
supports close-after-move.

Fixes #9479

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024164138.1100304-1-bhalevy@scylladb.com>
2021-10-24 19:53:31 +03:00
Benny Halevy
cddd16f22d db: view: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:55:50 +03:00
Pavel Emelyanov
c504361c15 view_builder: Accept view_build_statuses
The code itself is already in relevant .cc file, not move it to the
relevant class.

The only significant change is where to get token metadata from.
In its old location tokens were provided by the storage service
itself, now when it's in the view builder there's no "native" place
to get them from, however the rest of the view building code gets
tokens from global storage proxy, so do the same here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:11:40 +03:00
Pavel Emelyanov
3b6e8c7d93 storage_service: Move view_build_statuses code
This code belongs to view builder, so put it into its .cc. No changes,
just move. This needs some ugly namespace breakage, but they will
be patched away with the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:11:29 +03:00
Avi Kivity
369afe3124 treewide: use coroutine::maybe_yield() instead of co_await make_ready_future()
The dedicated API shows the intent, and may be a tiny bit faster.

Closes #9382
2021-09-23 12:28:56 +02:00
Pavel Emelyanov
598841a5dd code: Expell gossiper.hh from other headers
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
0de69136d4 view_update_generator: Register staging sstables in constructor
First, it's to fix the discarded future during the register. The
future is not actually such, as it's always the no-op ready one as
at that stage the view_update_generator is neither aborted nor is
in throttling state.

Second, this change is to keep database start-up code in main
shorter and cleaner. Registering staging sstables belongs to the
view_update_generator start code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Avi Kivity
115d6d8d4c system_keyspace: prepare forward-declared members
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.

The result isn't lovely, but can be adjusted later.
2021-09-13 15:11:26 +03:00
Avi Kivity
c1028de22a Merge 'Introduce native reversed format' from Botond Dénes
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass along them a reversed schema and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.

This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
  reversed streams, this layer will create a reversed reader to handle
  the transition.
* The mutation source proper: which will obtain the underlying schema
  and will emit the data in reverse order.

Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.

Refs: #1413

Tests: unit(dev)

TODO:
* v2
* more testing

Also on: https://github.com/denesb/scylla.git reverse-reads/v3

Changelog

v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
  in the schema itself, instead make reversing deterministic w.r.t.
  schema version, such that:
  `s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
  `reconcilable_results` sent to the coordinator during read repair
  still use the old reverse format;

v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
    - `schema::underlying_type()`: return this when unengaged;
    - `schema::make_transformed()`: noop when applying the same
      transformation again;
* Generalize reversed into transformation. Add support to transferring
  to remote nodes and shards by way of making `schema_tables` aware of
  the transformation;
* Use reverse schema everywhere in reverse reader;

Closes #9184

* github.com:scylladb/scylla:
  range_tombstone_accumulator: drop _reversed flag
  test/boost/mutation_test: add test for mutation::consume() monotonicity
  test/boost/flat_mutation_reader_test: more reversed reader tests
  flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
  flat_mutation_reader: make_reversing_reader(): take ownership of the reader
  test/lib/mutation_source_test: add consistent log to all methods
  mutation: introduce reverse()
  mutation_rebuilder: make it standalone
  mutation: make copy constructor compatible with mutation_opt
  treewide: switch to native reversed format for reverse reads
  mutation: consume(): add native reverse order
  mutation: consume(): don't include dummy rows
  query: add slice reversing functions
  partition_slice_builder: add range mutating methods
  partition_slice_builder: add constructor with slice
  query: specific_ranges: add non-const ranges accessor
  range_tombstone: add reverse()
  clustering_bounds_comparator: add reverse_kind()
  schema: introduce make_reversed()
  schema: add a transforming copy constructor
  utils: UUID_gen: introduce negate()
  types: add reversed(data_type)
  docs: design-notes: add reverse-reads.md
2021-09-09 15:50:22 +03:00
Botond Dénes
f02632aeb0 range_tombstone_accumulator: drop _reversed flag 2021-09-09 15:42:15 +03:00
Piotr Sarna
5d7c765422 db,view: split stopping view builder to drain+stop
In order to be able to avoid a deadlock when CQL server cannot be started,
the view builder shutdown procedure is now split to two parts -
- drain and stop. Drain is performed before storage proxy shutdown,
but stop() will be called even before drain is scheduled.
The deadlock is as follows:
 - view builder creates a reader permit in order to be able
   to read from system tables
 - CQL server fails to start, shutdown procedure begins
 - view builder stop() is not called (because it was not scheduled
   yet), so it holds onto its reader permit
 - database shutdown procedure waits for all permits to be destroyed,
   and it hangs indefinitely because view builder keeps holding
   its permit.
2021-09-08 10:52:40 +02:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
fe479aca1d reader_permit: add timeout member
To replace the timeout parameter passed
to flat_mutation_reader methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
e9aff2426e everywhere: make deferred actions noexcept
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).

Test: unit(dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:52 +03:00
Benny Halevy
4439e5c132 everywhere: cleanup defer.hh includes
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:39 +03:00
Piotr Sarna
58196e8ea6 db,view: avoid ignoring failed future in background view updates
The code for handling background view updates used to propagate
exceptions unconditionally, which leads to "exceptional future
ignored" warnings if the update was put to background.
From now on, the exception is only propagated if its future
is actually waited on.

Fixes #6187

Tested manually, the warning was not observed after the patch

Closes #9179
2021-08-12 17:32:35 +03:00
Pavel Emelyanov
689a4c1e54 view: Use proxy to get token metadata from
The mutate_MV() call needs token metadata and it gets them from
global storage service. Fixing it not to use globals is a huge
refactoring, so for now just get the tokens from global storage
proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-29 05:12:36 +03:00
Nadav Har'El
5183e0cbe9 Merge 'Fix artificial view update size limit' from Piotr Sarna
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong - this limit is configurable for users,
and, what's more important, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.

The series comes with a cql-pytest which failed before the fix and succeeds now. This bug is also  covered by `wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view` dtest, but it needs over a minute to run, as opposed to cql-pytest's <1 second.

Fixes #9047

Tests: unit(release), dtest(wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view)

Closes #9048

* github.com:scylladb/scylla:
  cql-pytest: add a materialized views suite with first cases
  db,view: drop the artificial limit on view update mutation size
2021-07-15 17:03:07 +03:00
Piotr Sarna
697e2fc66d db,view: drop the artificial limit on view update mutation size
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong - this limit is configurable for users,
and, what's more important, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.

Tests: unit(release), dtest(wide_rows_test)
2021-07-15 14:09:37 +02:00
Botond Dénes
1b7eea0f52 reader_concurrency_semaphore: admission: flip the switch
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.

(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this admission is now waited on when creating the permit.

(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
requires something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).

We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.

Another change made to accommodate this switch is the replacement of the
replica side read execution stages which the reader concurrency
semaphore as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other any-more. One read executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
2021-07-14 17:19:02 +03:00
Botond Dénes
7bfa40a2f1 treewide: use make_tracking_only_permit()
For all those reads that don't (won't or can't) pass through admission
currently.
2021-07-14 17:19:02 +03:00
Botond Dénes
f28b5018f2 view/view_update_generator: use obtain_reader_permit() 2021-07-14 16:48:43 +03:00
Piotr Sarna
a1813c9b34 db,view,table: drop unneeded time point parameter
Now that restriction checking is translated to the partition-slice-style
interface, checking the partition/clustering key restrictions for views
can be performed without the time point parameter.
The parameter is dropped from all relevant call sites.
2021-07-13 10:40:08 +02:00
Piotr Sarna
37fc3f4b5b db,view: migrate key checks from the deprecated is_satisfied_by
Last two users of the mutation-based is_satisfied_by function
were in the partition/clustering key checks. These functions are now
translated to use the original API.
2021-07-13 10:40:07 +02:00
Piotr Sarna
d6b0a8338a db,view: migrate checking view filter to new is_satisfied_by
In order to unify the interfaces, the is_satisfied_by flavor
for mutations is getting deprecated. In order to be able to remove it,
one of its biggest users, the matches_view_filter() function,
is translated to the other variant.
2021-07-13 10:04:03 +02:00
Piotr Sarna
786db7e9a8 db,view: add a helper result builder class
In order to migrate from mutation-based restriction checks,
code in view.cc needs to have a way of translating results
to partition-slice-based representation.
A slightly simplified builder from multishard_mutation_query.cc
is injected into the view code.
2021-07-13 10:04:03 +02:00
Piotr Sarna
32d87837b1 db,view: move make_partition_slice helper function up
No functional changes, it will be needed for a future patch.
2021-07-13 10:04:02 +02:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.

Closes #8992
2021-07-08 17:52:34 +03:00
Piotr Sarna
bf0777e97a view: generate view updates in smaller parts
In order to avoid large allocations and too large mutations
generated from large view updates, granularity of the process
is broken down from per-partition to smaller chunks.
The view update builder now produces partial updates, no more
than 100 view rows at a time.
2021-07-08 11:17:27 +02:00
Piotr Sarna
679dc4d824 db,view: move view_update_builder to the header
The builder is going to be used directly by the callers,
which requires making its definition public.
No semantic changes were intended.
2021-07-08 11:17:27 +02:00
Pavel Emelyanov
64bb16af8a view_update_generator: Remove unused struct sstable_with_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Piotr Sarna
1fb831c8c1 db,view: limit the number of simultaneous view update futures
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited amount of these futures
at a time.
2021-06-17 10:20:52 +02:00
Piotr Sarna
a7f7716ecf db,view: use chunked_vector for view updates
The number of view updates can grow large, especially in corner
cases like removing large base partitions. Chunked vector
prevents large allocations.
2021-06-17 10:15:17 +02:00
Piotr Sarna
f832a30388 db,view,table: futurize calculating affected ranges
In order to avoid stalls on large inputs, calculating
affected ranges is now able to yield.
2021-06-16 09:51:31 +02:00
Piotr Sarna
3592d9b36e db,view: use chunked vector for view affected ranges
There were large allocation reportsa from vectors used for
calculating affected ranges. In order to reduce the pressure
on the allocator, chunked vector is used for storing intermediate
results.
2021-06-15 10:30:27 +02:00
Piotr Sarna
8a049c9116 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-14 09:36:10 +02:00
Piotr Sarna
7cdbb7951a db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
2021-06-14 09:34:40 +02:00
Piotr Sarna
88d4a66e90 db,view: pass base token by value to mutate_MV
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-14 09:30:38 +02:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Emelyanov
1ce0682821 view: Get database from stprage_proxy
The db::view code already uses proxy rather actively, so instead of
depending on the storage service to be at hands it's better to make
db::view require the proxy. For now -- via global instance.

There's one dependency on storage service left after this patch --
to get the tokens. This piece is to be fixed later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:09:32 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Tomasz Grabiec
abe3d7d7d3 Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.

This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.

Test results (perf_simple_query --smp 1 --task-quota-ms 10):

```
before: median 184810.98 tps ( 91.1 allocs/op,  20.1 tasks/op,   54564 insns/op)
after:  median 192125.99 tps ( 87.1 allocs/op,  20.1 tasks/op,   53673 insns/op)
```

4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).

The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.

I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.

Closes #8592

* github.com:scylladb/scylla:
  storage_proxy, treewide: use utils::small_vector inet_address_vector:s
  storage_proxy, treewide: introduce names for vectors of inet_address
  utils: small_vector: add print operator for std::ostream
  hints: messages.hh: add missing #include
2021-05-06 18:00:54 +02:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Nadav Har'El
64a4e5e059 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
2021-05-05 15:24:25 +03:00
Nadav Har'El
58e275e362 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
2021-05-05 13:23:00 +03:00
Benny Halevy
dad6c94476 view_builder: stop: close all build_step readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
02d74e1530 view_update_generator: start: close staging_sstable_reader when done
The staging_sstable_reader has to be closed before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
1e1c8ea824 view: build_progress_virtual_reader: implement close method
Close underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2d8b00f2d8 view: generate_view_updates: close builder readers when done
Make sure to close the builder's _updates and optional _existings
readers before they are destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00