930 Commits

Author SHA1 Message Date
Raphael S. Carvalho
3a1cf3aa88 database: document database::get_keyspace_local_ranges()
Documentation was extracted from abstract_replication_strategy::get_ranges(),
which says:
    // get_ranges() returns the list of ranges held by the given endpoint.
    // The list is sorted, and its elements are non overlapping and non wrap-around.

That's important because users of get_keyspace_local_ranges() expect
that the returned list is both sorted and non overlapping, so let's
document it to prevent someone from removing any of these properties.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210805140628.537368-1-raphaelsc@scylladb.com>
2021-08-17 21:44:24 +03:00
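
To illustrate why callers of get_keyspace_local_ranges() care about the documented properties, here is a minimal sketch. The `token_range` struct and both helper functions are hypothetical stand-ins (half-open integer intervals), not Scylla types. It shows a binary search that gives correct answers only when the list is sorted and non-overlapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a token range: half-open interval [start, end).
struct token_range {
    std::int64_t start;
    std::int64_t end;
};

// True if the list is sorted by start, has no overlaps and no wrap-around.
bool is_sorted_and_disjoint(const std::vector<token_range>& ranges) {
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (ranges[i].start >= ranges[i].end) {
            return false; // empty or wrap-around range
        }
        if (i > 0 && ranges[i - 1].end > ranges[i].start) {
            return false; // out of order or overlapping
        }
    }
    return true;
}

// Binary search; only correct when is_sorted_and_disjoint() holds.
bool owns_token(const std::vector<token_range>& ranges, std::int64_t token) {
    assert(is_sorted_and_disjoint(ranges));
    // Find the first range that starts after the token, then step back one.
    auto it = std::upper_bound(ranges.begin(), ranges.end(), token,
        [](std::int64_t t, const token_range& r) { return t < r.start; });
    if (it == ranges.begin()) {
        return false;
    }
    --it;
    return token < it->end;
}
```

If either property were dropped, the lookup would silently return wrong answers rather than fail, which is the kind of regression the documentation is meant to prevent.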
Asias He
cc44edb4e2 database: Detemplate run_async
I initially tried to use a noncopyable_function to avoid the unnecessary
template usage.

However, since database::apply_in_memory is a hot function, it is better
to use with_gate directly. The run_async function does nothing but call
with_gate anyway.

Closes #9160
2021-08-12 07:53:10 +03:00
Asias He
4ae6eae00a table: Get rid of table::run_compaction helper
The table::run_compaction is a trivial wrapper for
table::compact_sstables.

We have lots of similar {start, trigger, run}_compaction functions.
Dropping the run_compaction wrapper to reduce confusion.

Closes #9161
2021-08-09 14:02:54 +03:00
Asias He
6350a19f73 compaction: Move compaction_strategy.hh to compaction dir
The top dir is a mess. Move compaction_strategy.hh and
compaction_strategy_type.hh to the new home.
2021-08-07 08:06:37 +08:00
Michael Livshin
64dca1fef9 memtables: count read row tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Michael Livshin
2ee9f1b951 memtables: add metric and accounter for range tombstone reads
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Raphael S. Carvalho
c399601833 table: kill move_sstables_from_staging()
Not used anywhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210728175403.86867-1-raphaelsc@scylladb.com>
2021-07-29 10:42:36 +03:00
Tomasz Grabiec
b044db863f Merge 'db/virtual_table: Streaming tables for large data + describe_ring example table' from Juliusz Stasiewicz
This is the 2nd PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). This one introduces a new implementation of the virtual tables, the streaming tables, which are suitable for large amounts of data.

This PR was created by @jul-stas and @StarostaGit

Closes #8961

* github.com:scylladb/scylla:
  test/boost: run_mutation_source_tests on streaming virtual table
  system_keyspace: Introduce describe_ring table as virtual_table
  storage_service: Pass the reference down to system_keyspace
  endpoint_details: store `_host` as `gms::inet_address`
  queue_reader: implement next_partition()
  virtual_tables: Introduce streaming_virtual_table
  flat_mutation_reader: Add a new filtering reader factory method
2021-07-23 18:05:51 +02:00
Raphael S. Carvalho
e4eb7df1a1 table: Make correctness of concurrent sstable list update robust
Today, table relies on row_cache::invalidate() serialization for
concurrent sstable list updates to produce correct results.
That's very error prone because table is relying on an implementation
detail of invalidate() to get things right.
Instead, let's make table itself take care of serialization on
concurrent updates.
To achieve that, sstable_list_builder is introduced. Only one
builder can be alive for a given table, so serialization is guaranteed
as long as the builder is kept alive throughout the update procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721001716.210281-1-raphaelsc@scylladb.com>
2021-07-21 16:45:30 +03:00
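
The "only one builder alive" idea from e4eb7df1a1 can be sketched with a small RAII guard. This is a simplified illustration under stated assumptions: `table_sketch` and `list_builder` are hypothetical names, sstables are plain strings, and a second builder is rejected with an exception, whereas the real code would presumably wait for the first one to finish.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified table: sstables are just names here.
class table_sketch {
    bool _builder_alive = false;
    std::vector<std::string> _sstables;
public:
    // While a builder lives, no other sstable list update can start.
    class list_builder {
        table_sketch& _t;
    public:
        explicit list_builder(table_sketch& t) : _t(t) {
            if (_t._builder_alive) {
                // Assumption: a real implementation would wait instead.
                throw std::logic_error("concurrent sstable list update");
            }
            _t._builder_alive = true;
        }
        ~list_builder() {
            assert(_t._builder_alive);
            _t._builder_alive = false;
        }
        list_builder(const list_builder&) = delete;
        list_builder& operator=(const list_builder&) = delete;

        // Build the new list from the current one, then publish it.
        void add_and_publish(std::string name) {
            auto next = _t._sstables;
            next.push_back(std::move(name));
            _t._sstables = std::move(next);
        }
    };

    const std::vector<std::string>& sstables() const { return _sstables; }
};
```

Because the guard is owned by the table rather than by the cache, correctness no longer depends on how invalidate() happens to be implemented.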
Raphael S. Carvalho
aad72289e2 table: Kill load_sstable()
That function is dangerously used by distributed loader, as the latter
was responsible for invalidating cache for new sstable.
load_sstable() is an unsafe alternative to
add_sstable_and_update_cache() that should never have been used by
the outside world. Instead, let's kill it and make loader use
the safe alternative instead.
This will also make it easier to make sure that all concurrent updates
to sstable set are properly serialized.

Additionally, this may potentially reduce the amount of data evicted
from the cache, when the sstables being imported have a narrow range,
like high level sstables imported from a LCS table. Unlikely but
possible.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721131949.26899-1-raphaelsc@scylladb.com>
2021-07-21 16:21:42 +03:00
Juliusz Stasiewicz
f8067d938d storage_service: Pass the reference down to system_keyspace
According to the policy of avoiding globals.
2021-07-20 14:18:24 +02:00
Botond Dénes
27fbca84f6 reader_concurrency_semaphore: remove prethrow_action
The semaphore accepts a functor in its constructor which is run just
before throwing on wait queue overload. This is used exclusively to bump
a counter in the database::stats, which counts queue overloads. However,
there is now an identical counter in
reader_concurrency_semaphore::stats, so the database can just use that
directly and we can retire the now unused prethrow action.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210716111105.237492-1-bdenes@scylladb.com>
2021-07-19 15:47:37 +03:00
Raphael S. Carvalho
841e9227f9 table: Document the serialization requirement on sstable set rebuild
To avoid data loss bugs that could arise from a lack of
serialization when using the preemptable build_new_sstable_list(),
let's document the serialization requirement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210714201301.188622-1-raphaelsc@scylladb.com>
2021-07-17 18:09:00 +03:00
Pavel Emelyanov
1ed582304d memtable_list: Shorten flush coalescing codeflow
The memtable_list::flush() maintains a shared_promise object
to coalesce the flushers until the get_flush_permit() resolves.
Also it needs to keep the extraneous flushes counter bumped
while doing the flush itself.

All this can be coded in a shorter form and without the need
to carry shared_promise<> around.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210716164237.10993-1-xemul@scylladb.com>
2021-07-17 00:42:20 +02:00
Botond Dénes
ae4df99e6b database: remove now unused query execution stages 2021-07-14 17:19:02 +03:00
Botond Dénes
16d3cb4777 mutation_reader: remove now unused restricting_reader
Move the now orphaned new_reader_base_cost constant to
database.hh/table.cc, as its main user is now
`table::estimate_read_memory_cost()`.
2021-07-14 17:19:02 +03:00
Botond Dénes
01a4bb33de database: increase semaphore max queue size
Queued reads haven't taken 10KB (not even 1KB) for years now. But the real
motivation of this patch is that due to a soon-to-come change to
admission we expect larger queues especially in tests, so be more
forgiving with queue sizes.
2021-07-14 17:19:02 +03:00
Botond Dénes
7f2813e3fa database: mutation_query(): handle querier lookup/save on the database level
Instead of passing down the querier_cache_ctx to table::mutation_query(),
handle the querier lookup/save on the level where the cache exists.

The real motivation behind this change however is that we need to move
the lookup outside the execution stage, because the current execution
stage will soon be replaced by the one provided by the semaphore and to
use that properly we need to know if we have a saved permit or not.
2021-07-14 16:48:43 +03:00
Botond Dénes
d2f5393a43 database: query(): handle querier lookup/save on the database level
Instead of passing down the querier_cache_ctx to table::query(),
handle the querier lookup/save on the level where the cache exists.

The real motivation behind this change however is that we need to move
the lookup outside the execution stage, because the current execution
stage will soon be replaced by the one provided by the semaphore and to
use that properly we need to know if we have a saved permit or not.
2021-07-14 16:48:43 +03:00
Botond Dénes
97a03f9027 database: make_multishard_streaming_reader: use external permit
As a preparation for up-front admission, add a permit parameter to
`make_multishard_streaming_reader()`, which will be the admitted permit
once we switch to up-front admission. For now it has to be a
non-admitted permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic
"multishard-streaming-reader" one.
2021-07-14 16:48:43 +03:00
Botond Dénes
999169e535 database: make_streaming_reader(): require permit
As a preparation for up-front admission, add a permit parameter to
`make_streaming_reader()`, which will be the admitted permit once we
switch to up-front admission. For now it has to be a non-admitted
permit.
A nice side-effect of this patch is that now permits will have a
use-case specific description, instead of the generic "streaming" one.
2021-07-14 16:48:43 +03:00
Botond Dénes
3ec149222d database: add obtain_reader_permit()
A convenience method for obtaining an admitted permit for a read on a
given table.
For now it uses the nowait semaphore obtaining method, as all normal
reads still use the old admission method. Migrating reads to this method
will make the switch easier, as there will be one central place to
replace the nowait method with the proper one.
2021-07-14 16:48:43 +03:00
Botond Dénes
a6b59f0d89 table: add estimate_read_memory_cost()
To be used for determining the base cost of reads used in admission. For
now it just returns the already used constant. This is a forward looking
change, to when this will be a real estimation, not just a hardcoded
number.
2021-07-14 16:48:43 +03:00
Piotr Sarna
a1813c9b34 db,view,table: drop unneeded time point parameter
Now that restriction checking is translated to the partition-slice-style
interface, checking the partition/clustering key restrictions for views
can be performed without the time point parameter.
The parameter is dropped from all relevant call sites.
2021-07-13 10:40:08 +02:00
Raphael S. Carvalho
88119a5c81 distributed_loader: Kill table's _sstables_opened_but_not_loaded
_sstables_opened_but_not_loaded was needed because the old loader would
open sstables from all shards before loading them.
In the new loader, introduced with reshape, make_sstables_available()
is called on each shard after resharding and reshape finished, so
there's no need whatsoever for that mess.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Calle Wilund
373fa3fa07 table: ensure memtable is actually in memtable list before erasing
Fixes #8749

If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this before
erasing. Otherwise we get random memory corruption via
std::vector::erase.

v2:
* Make interface more set-like (tolerate non-existence in erase).

Closes #8904
2021-06-22 15:58:56 +02:00
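
The set-like erase described in 373fa3fa07 can be sketched as follows. `memtable_sketch` and `erase_if_present` are hypothetical names for illustration. The point is that std::vector::erase must only receive a valid, dereferenceable iterator, so the lookup result has to be checked first.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

struct memtable_sketch {};
using memtable_ptr = std::shared_ptr<memtable_sketch>;

// Set-like erase: removing an element that is not present is a no-op.
// Returns true if something was actually removed.
bool erase_if_present(std::vector<memtable_ptr>& list, const memtable_ptr& m) {
    auto it = std::find(list.begin(), list.end(), m);
    if (it == list.end()) {
        return false; // already gone, e.g. the list was cleared meanwhile
    }
    list.erase(it);   // safe: `it` refers to an existing element
    return true;
}
```

Passing `list.end()` to erase is undefined behavior, which matches the "random memory corruption" symptom in the commit message.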
Pavel Emelyanov
ab4fc41f25 database: Remove unused forward declarations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Piotr Sarna
e3fa0246a1 table: coroutinize do_push_view_replica_updates
Makes the code cleaner, but more importantly it will make it easier
to futurize calculate_affected_clustering_ranges in the near future.
2021-06-16 09:51:30 +02:00
Benny Halevy
5a8531c4c8 repair: get_sharder_for_tables: throw no_such_column_family
Instead of std::runtime_error with a message that
resembles no_such_column_family, throw a
no_such_column_family given the keyspace and table uuid.

The latter can be explicitly caught and handled if needed.

Refs #8612

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210608113605.91292-1-bhalevy@scylladb.com>
2021-06-08 14:45:44 +03:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Benny Halevy
f081e651b3 memtable_list: rename request_flush to just flush
Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
4f20cd3bea memtable_list: rename seal_active_memtable_immediate to seal_active_memtable
Now that there's no more seal_active_memtable_delayed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
ba65b90b34 memtable_list: get rid of seal_active_memtable_delayed
This path is unused since e5be3352cf.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
52fd2b71b7 table: snapshot: add skip_flush option
skip_flush is false by default.

Also, log a debug message when starting the snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 17:20:21 +03:00
Benny Halevy
1c0769d789 table: clear: make exception safe
It is currently possible that _memtables->add_memtable()
will throw after _memtables->clear(), leaving the memtables
list completely empty.  However, we do rely on always
having at least one allocated in the memtables list
as active_memtable() references a lw_shared_ptr<memtable>
at the back of the memtables vector, and it is expected
to always be allocated via add_memtable() upon construction
and after clear().

This change moves the implementation of this convention
to memtable_list::clear() and makes the latter exception safe
by first allocating the to-be-added empty memtable and
only then clearing the vector.

Refs #8749

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210530100232.2104051-1-bhalevy@scylladb.com>
2021-05-30 13:22:52 +03:00
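
The allocate-first ordering from 1c0769d789 can be sketched like this. All names are hypothetical, and the sketch uses a temporary vector plus swap() as one way to get the same guarantee. The commit itself only says that the new memtable is allocated before the vector is cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct memtable_sketch {
    bool empty = true;
};

class memtable_list_sketch {
public:
    using factory = std::function<std::shared_ptr<memtable_sketch>()>;

    explicit memtable_list_sketch(factory f) : _new_memtable(std::move(f)) {
        _memtables.push_back(_new_memtable());
    }

    // Exception safe: everything that can throw happens on a temporary.
    void clear() {
        std::vector<std::shared_ptr<memtable_sketch>> fresh;
        fresh.push_back(_new_memtable()); // may throw, list is untouched
        _memtables.swap(fresh);           // does not throw
    }

    // Relies on the invariant that the list is never empty.
    memtable_sketch& active_memtable() {
        assert(!_memtables.empty());
        return *_memtables.back();
    }

    std::size_t size() const { return _memtables.size(); }

private:
    factory _new_memtable;
    std::vector<std::shared_ptr<memtable_sketch>> _memtables;
};
```

If the factory throws, the old contents survive and active_memtable() stays valid. With the clear-then-add ordering, the same failure would leave the list empty.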
Asias He
72cc596842 repair: Wire off-strategy compaction for regular repair
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.

This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for some time, off-strategy compaction
will be triggered for that table automatically.

Fixes #8677

Closes #8678
2021-05-26 11:41:27 +03:00
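
The timeout-based trigger from 72cc596842 can be pictured as a timer that is re-armed by every repair request and fires once the table has been idle long enough. The sketch below is an illustration only: the class name, the polling interface and the timeout value used by callers are assumptions, and the real mechanism is presumably driven by a timer callback rather than polling.

```cpp
#include <cassert>
#include <chrono>

class offstrategy_trigger_sketch {
public:
    using clock = std::chrono::steady_clock;

    explicit offstrategy_trigger_sketch(clock::duration idle_timeout)
        : _idle_timeout(idle_timeout) {
        assert(idle_timeout > clock::duration::zero());
    }

    // Called whenever a repair touches the table: (re)arm the timer.
    void on_repair_activity(clock::time_point now) {
        _deadline = now + _idle_timeout;
        _armed = true;
    }

    // Returns true at most once per idle period, when the deadline passed.
    bool should_trigger(clock::time_point now) {
        if (_armed && now >= _deadline) {
            _armed = false;
            return true;
        }
        return false;
    }

private:
    clock::duration _idle_timeout;
    clock::time_point _deadline{};
    bool _armed = false;
};
```

Each new repair request pushes the deadline forward, so a sequence of requests covering a few token ranges each results in a single off-strategy compaction after the last one.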
Benny Halevy
3ad0f156b9 memtable_list: request_flush: wait on pending flushes also when empty()
In https://github.com/scylladb/scylla/issues/8609,
table::stop() that is called from database::drop_column_family
is expected to wait on outstanding flushes by calling
_memtable->request_flush(), but the memtable_list is considered
empty() at this point as it has a single empty memtable,
so request_flush() returns a ready future, without waiting
on outstanding flushes. This change replaces the call to
request_flush with flush().

Fix that by either returning _flush_coalescing future
that resolves when the memtable is sealed, if available,
or go through the get_flush_permit and
_dirty_memory_manager->flush_one song and dance, even though
the memtable is empty(), as the latter waits on pending flushes.

Fixes #8609

Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524143438.1056014-1-bhalevy@scylladb.com>
2021-05-25 11:19:51 +02:00
Benny Halevy
c0dafa75d9 utils: phased_barrier: advance_and_await: make noexcept
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.

In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().

Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.

Fixes #8636

A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
2021-05-12 01:36:11 +02:00
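
The pattern behind c0dafa75d9 (a future-returning function that reports failures through the future instead of throwing) can be sketched without Seastar. `ready_future` below is a hypothetical, minimal stand-in for a future that is already resolved, and the other names are likewise invented for illustration.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for an already-resolved future<void>.
struct ready_future {
    std::exception_ptr error; // null means success

    bool failed() const noexcept { return static_cast<bool>(error); }
    void get() const {
        if (error) {
            std::rethrow_exception(error);
        }
    }
};

// Hypothetical operation that may throw synchronously.
void start_new_phase(bool fail) {
    if (fail) {
        throw std::runtime_error("allocation failed");
    }
}

// Never throws: a synchronous failure is delivered through the future.
ready_future advance_and_await_sketch(bool fail) noexcept {
    try {
        start_new_phase(fail);
        return ready_future{};
    } catch (...) {
        return ready_future{std::current_exception()};
    }
}

// Caller: cleanup always runs because nothing escapes as a thrown exception.
bool stop_sketch(bool fail, bool& cleaned_up) noexcept {
    ready_future f = advance_and_await_sketch(fail);
    cleaned_up = true; // plays the role of the .finally() clause
    return !f.failed();
}
```

With a throwing version, the exception would unwind past the caller before the cleanup step was ever attached, which is the short-circuit the commit describes.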
Botond Dénes
992819b188 database: add get_unlimited_query_max_result_size()
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
2021-05-05 13:30:42 +03:00
Avi Kivity
3e6232bb92 Merge "Wire offstrategy compaction to repair-based removenode" from Raphael
"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reduce both latency and operation time.

Refs #5226.
"

* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
  repair: Wire offstrategy compaction to repair-based removenode
  table: introduce trigger_offstrategy_compaction()
  repair/row_level: make operations_supported static const
2021-04-28 12:02:07 +03:00
Botond Dénes
4c3454dd07 database: get_reader_concurrency_semaphore(): make the user semaphore the catch-all
Currently said method uses the system semaphore as a catch-all for all
scheduling groups it doesn't know about. This is incompatible with the
recent forward-porting of the service-level infrastructure as it means
that all service level related scheduling groups will fall back to the
system scheduling group, which causes two problems:
* They will experience much more limited concurrency, as the system semaphore
  is assigned far fewer count units, to match the much more limited
  internal traffic.
* They compete with internal reads, severely impacting the respective
  internal processes, potentially causing extreme slowdown, or even
  deadlock in the case of an internal query executed on behalf of a
  user query being blocked on the latter.

Even if we don't have any custom service level scheduling groups at the
moment, it is better to change this such that unknown scheduling groups
fall-back to using the user semaphore. We don't expect any new internal
scheduling group to pop up any time soon (and if they do we can adjust
get_reader_concurrency_semaphore() accordingly), but we do expect user
scheduling groups to be created in the future, even dynamically.

To minimize the chance of the wrong workload being associated with the
user semaphore, all statically created scheduling groups are now
explicitly listed in `get_reader_concurrency_semaphore()`, to make their
association with the respective semaphore explicit and documented.
Added a unit test which also checks the correct association for all
these scheduling groups.

Fixes: #8508

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210420105156.94002-1-bdenes@scylladb.com>
2021-04-20 14:06:25 +03:00
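
The catch-all change in 4c3454dd07 amounts to listing the known internal scheduling groups explicitly and letting everything else fall through to the user semaphore. The sketch below is illustrative only: the group names and the `semaphore_kind` enum are assumptions, and the authoritative list is the one in `get_reader_concurrency_semaphore()`.

```cpp
#include <cassert>
#include <string>

enum class semaphore_kind { user, system, streaming };

// Group names here are illustrative assumptions, not the real list.
semaphore_kind select_semaphore(const std::string& scheduling_group) {
    // Statically created internal groups are listed explicitly.
    if (scheduling_group == "streaming") {
        return semaphore_kind::streaming;
    }
    if (scheduling_group == "compaction" ||
        scheduling_group == "memtable" ||
        scheduling_group == "gossip") {
        return semaphore_kind::system;
    }
    // Catch-all: unknown groups, including dynamically created
    // service-level groups, are treated as user workloads.
    return semaphore_kind::user;
}
```

With the previous default, a dynamically created group such as a service-level one would have landed on the system semaphore and competed with internal reads.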
Pavel Emelyanov
5ecbc33be5 database.*: Remove unused headers
The database.hh is the central recursive-headers knot -- it has ~50
includes. This patch leaves only 34 (it remains the champion though).
Similar thing for database.cc.
Both changes help the latter compile ~4% faster :)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
2021-04-18 14:03:17 +03:00
Raphael S. Carvalho
5c630f405a table: introduce trigger_offstrategy_compaction()
This function will be used on repair-based operation completion,
to notify table about the need to start offstrategy compaction
process on the maintenance sstables produced by the operation.
Function which notifies about bootstrap and replace completion
is changed to use this new function.
Removenode and decommission will reuse this function.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-09 14:53:14 -03:00
Botond Dénes
80a03826e3 database: mutation_query(): use table::mutation_query()
Instead of `mutation_query()` from `mutation_query.hh`. The latter is
about to be retired as we want to migrate all users to
`table::mutation_query()`.
As part of this change, move away from `mutation_query_stage` too. This
brings the code paths of the two query variants closer together, as they
both have an execution stage declared in `database`.
2021-04-09 13:40:27 +03:00
Botond Dénes
5c8f142fe5 table: add mutation_query()
We want to migrate `database::mutation_query()` off `mutation_query()`
to use `table::mutation_query()` instead. The reason is the same as for
making `table::query()` standalone: the `mutation_query()`
implementation increasingly became specific to how tables are queried
and is about to become even more specific due to impending changes to
how permits are obtained. As no-one in the codebase is doing generic
mutation queries on generic mutation sources we can just make this a
member of table.
This patch just adds `table::mutation_query()`, no user exists yet.
`table::mutation_query()` is identical to `mutation_query()`, except
that it is a coroutine.
2021-04-09 13:40:27 +03:00
Raphael S. Carvalho
c45d2e1d27 table: extend add_sstable_and_update_cache() for off-strategy
Function is extended to add sstable to maintenance set if requested
by the caller.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
e0e5bf8285 table: Introduce off-strategy compaction on maintenance sstable set
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in maintenance set is disjoint, so
the effect on read amplification is avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
439e9b6fab table: change build_new_sstable_list() to accept other sstable sets
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00